---
license: mit
tags:
  - multimodal
  - hate-speech-detection
  - content-moderation
  - clip
  - vision-language
  - image-text
  - classification
  - pytorch
  - transformers
  - multi-label-classification
  - social-media
  - meme-classification
  - text-image
  - late-fusion
  - gated-attention
datasets:
  - mmhs150k
language:
  - en
metrics:
  - f1
  - roc_auc
  - precision
  - recall
library_name: transformers
pipeline_tag: image-text-to-text
base_model:
  - openai/clip-vit-base-patch32
model-index:
  - name: clip-vit-base-mmhs150k-fusion
    results:
      - task:
          type: multi-label-classification
          name: Hate Speech Detection
        dataset:
          name: MMHS150K
          type: mmhs150k
        metrics:
          - type: f1_macro
            value: 0.566
            name: F1 Macro
          - type: f1_micro
            value: 0.635
            name: F1 Micro
          - type: roc_auc
            value: 0.783
            name: ROC-AUC Macro
---

# CLIP-ViT-Base Fusion Model for Multi-Modal Hate Speech Detection


A PyTorch-based multi-modal (image + text) hateful-content classification model that uses a CLIP encoder with a late fusion architecture, trained on the MMHS150K dataset to detect hate speech in social media memes and posts.

## 🎯 Model Description

This model implements a **late fusion architecture with a gated attention mechanism** for detecting hateful content in social media memes and posts. It combines visual and textual features using OpenAI's CLIP (ViT-Base-Patch32) as the backbone encoder.

The model performs **multi-label classification** across 5 hate speech categories, so it can detect multiple types of hate in a single post (e.g., content that is both racist and sexist).

### πŸ—οΈ Architecture

```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚    Image    β”‚        β”‚    Text     β”‚
β”‚   Encoder   β”‚        β”‚   Encoder   β”‚
β”‚ (CLIP ViT)  β”‚        β”‚ (CLIP Text) β”‚
β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜        β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜
       β”‚                      β”‚
       β–Ό                      β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Projection  β”‚        β”‚ Projection  β”‚
β”‚  (Linear)   β”‚        β”‚  (Linear)   β”‚
β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜        β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜
       β”‚                      β”‚
       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                  β”‚
                  β–Ό
          β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
          β”‚ Gated Fusion│◄── Modality presence flags
          β”‚   Module    β”‚    (handles missing modalities)
          β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜
                 β”‚
                 β–Ό
      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
      β”‚  Interaction Features  β”‚
      β”‚  β€’ Fused embedding     β”‚
      β”‚  β€’ Text embedding      β”‚
      β”‚  β€’ Visual embedding    β”‚
      β”‚  β€’ |text - visual|     β”‚
      β”‚  β€’ text βŠ™ visual       β”‚
      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                  β”‚
                  β–Ό
          β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
          β”‚Classificationβ”‚
          β”‚  Head (MLP)  β”‚
          β”‚ β†’ 5 classes  β”‚
          β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
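As a rough illustration of the gated-fusion and interaction-feature blocks in the diagram above, here is a minimal PyTorch sketch. The module and argument names (`GatedFusion`, the `*_present` flags) are illustrative assumptions, not the repository's actual code.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Illustrative gated fusion over projected text/image embeddings."""

    def __init__(self, dim: int = 512, num_classes: int = 5):
        super().__init__()
        # Learnable gate conditioned on both modalities plus presence flags
        self.gate = nn.Sequential(nn.Linear(2 * dim + 2, dim), nn.Sigmoid())
        # Classifier over [fused, text, visual, |t - v|, t * v] -> 5 * dim input
        self.head = nn.Sequential(
            nn.Linear(5 * dim, dim), nn.ReLU(), nn.Linear(dim, num_classes)
        )

    def forward(self, text, visual, text_present, image_present):
        # Zero out an absent modality so the gate can learn to compensate
        text = text * text_present.unsqueeze(-1)
        visual = visual * image_present.unsqueeze(-1)
        flags = torch.stack([text_present, image_present], dim=-1)
        g = self.gate(torch.cat([text, visual, flags], dim=-1))
        fused = g * text + (1.0 - g) * visual
        feats = torch.cat(
            [fused, text, visual, (text - visual).abs(), text * visual], dim=-1
        )
        return self.head(feats)  # multi-label logits, one per hate category

batch, dim = 2, 512
logits = GatedFusion(dim)(
    torch.randn(batch, dim), torch.randn(batch, dim),
    torch.ones(batch), torch.ones(batch),
)
print(logits.shape)  # torch.Size([2, 5])
```

The logits are passed through a sigmoid (not a softmax) at inference time, since the five categories are not mutually exclusive.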
### πŸ”‘ Key Features

| Feature | Description |
|---------|-------------|
| **Backbone** | `openai/clip-vit-base-patch32`, a pre-trained CLIP model |
| **Fusion Dimension** | 512 |
| **Max Text Length** | 77 tokens |
| **Multi-label Output** | 5 hate speech categories |
| **Gated Attention** | Modality-aware fusion with learnable gates |
| **Interaction Features** | Rich feature interactions (concatenation, element-wise product, absolute difference) |
| **Missing Modality Handling** | Can handle text-only or image-only inputs |

### 🏷️ Output Classes

| Class ID | Category | Description | Prior Probability |
|----------|----------|-------------|-------------------|
| 0 | **Racist** | Racist content targeting race/ethnicity | 32.6% |
| 1 | **Sexist** | Sexist content targeting gender | 12.0% |
| 2 | **Homophobe** | Homophobic content targeting sexual orientation | 7.6% |
| 3 | **Religion** | Religion-based hate speech | 1.5% |
| 4 | **OtherHate** | Other types of hate speech | 15.6% |

---

## πŸ“Š Evaluation Results

### Test Set Performance

| Metric | Value |
|--------|-------|
| **F1 Macro** | 0.566 |
| **F1 Micro** | 0.635 |
| **ROC-AUC Macro** | 0.783 |
| **Test Loss** | 1.516 |
| **Throughput** | 381.5 samples/sec |

### Per-Class Performance (Validation Set)

| Class | Precision | Recall | F1-Score | Support |
|-------|-----------|--------|----------|---------|
| **Racist** | 0.576 | 0.843 | 0.684 | 1,994 |
| **Sexist** | 0.587 | 0.646 | 0.615 | 875 |
| **Homophobe** | 0.804 | 0.709 | 0.753 | 612 |
| **Religion** | 0.435 | 0.209 | 0.283 | 129 |
| **OtherHate** | 0.541 | 0.700 | 0.611 | 1,195 |
| **Micro Avg** | 0.588 | 0.737 | 0.654 | 4,805 |
| **Macro Avg** | 0.589 | 0.621 | 0.589 | 4,805 |

### βš™οΈ Optimized Thresholds

The model uses **per-class calibrated thresholds** for optimal performance (instead of the default 0.5):

| Class | Threshold |
|-------|-----------|
| Racist | 0.35 |
| Sexist | 0.70 |
| Homophobe | 0.75 |
| Religion | 0.30 |
| OtherHate | 0.60 |
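A common way to obtain per-class thresholds like these (a sketch of the general technique; the exact calibration procedure used for this model may differ) is to sweep a grid of candidate thresholds per class on validation-set probabilities and keep the F1-maximizing value:

```python
import numpy as np

def f1(y_true, y_pred):
    """Binary F1 without external dependencies."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def calibrate_thresholds(probs, labels, grid=None):
    """For each class independently, keep the threshold maximizing F1.
    probs, labels: arrays of shape (n_samples, n_classes)."""
    grid = np.linspace(0.05, 0.95, 19) if grid is None else grid
    return np.array([
        grid[int(np.argmax([f1(labels[:, c], probs[:, c] >= t) for t in grid]))]
        for c in range(probs.shape[1])
    ])

# Toy data: positives score in [0.6, 1.0), negatives in [0.0, 0.4)
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=(200, 5))
probs = labels * 0.6 + rng.random((200, 5)) * 0.4
th = calibrate_thresholds(probs, labels)
print(th)
```

Calibrating on the validation set (never the test set) keeps the reported test metrics honest.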
### πŸ“ˆ Model Comparison

| Model | F1 Macro | F1 Micro | ROC-AUC Macro | Throughput |
|-------|----------|----------|---------------|------------|
| **CLIP Fusion** (this model) | 0.566 | 0.635 | 0.783 | 381.5 |
| CLIP MTL | **0.569** | **0.644** | 0.783 | 390.9 |
| SigLIP Fusion | 0.507 | 0.610 | 0.774 | 236.3 |
| CLIP Fusion (Weighted Sampling) | 0.557 | 0.636 | 0.772 | 266.4 |
| CLIP Fusion (Bigger Batch) | 0.515 | 0.517 | **0.804** | **400.9** |

---

## πŸŽ“ Training Data

### MMHS150K Dataset

The model was trained on **MMHS150K** (Multi-Modal Hate Speech), a large-scale dataset of 150,000 tweet-image pairs collected from Twitter and annotated for hate speech detection.

| Property | Value |
|----------|-------|
| **Source** | Twitter |
| **Total Samples** | ~150,000 |
| **Modalities** | Image + Text |
| **Annotation** | Multi-label (5 hate categories) |
| **Language** | English |

#### Dataset Splits

| Split | Samples |
|-------|---------|
| Train | ~112,500 |
| Validation | ~15,000 |
| Test | ~22,500 |

#### Dataset Reference

> **Paper**: ["Exploring Hate Speech Detection in Multimodal Publications"](https://gombru.github.io/2019/10/09/MMHS/) (WACV 2020)
> **Authors**: Raul Gomez, Jaume Gibert, Lluis Gomez, Dimosthenis Karatzas

---

## πŸ”§ Training Procedure

### Training Configuration

```yaml
# Model Configuration
backend: clip
head: fusion
encoder_name: openai/clip-vit-base-patch32
fusion_dim: 512
max_text_length: 77
freeze_text: false
freeze_image: false

# Training Configuration
num_train_epochs: 6
per_device_train_batch_size: 32
per_device_eval_batch_size: 64
gradient_accumulation_steps: 2

# Learning Rates (Differential)
lr_encoder: 1.0e-5
lr_head: 5.0e-4

# Regularization
weight_decay: 0.02
max_grad_norm: 1.0

# Scheduler
warmup_ratio: 0.05
lr_scheduler_type: cosine

# Loss
loss_type: bce
use_logit_adjustment: false

# Precision
precision: fp16

# Data Augmentation
augment: true
aug_scale_min: 0.8
aug_scale_max: 1.0
horizontal_flip: true
color_jitter: true

# Early Stopping
early_stopping_patience: 3
metric_for_best_model: roc_macro
```
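Differential learning rates such as `lr_encoder` and `lr_head` in the config above are typically wired up through optimizer parameter groups; a minimal sketch with stand-in modules (the real model's encoder and head are of course larger):

```python
import torch
import torch.nn as nn

# Stand-ins for the CLIP encoder and the fusion/classification head
encoder = nn.Linear(512, 512)
head = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 5))

# One parameter group per learning rate, matching the config values
optimizer = torch.optim.AdamW(
    [
        {"params": encoder.parameters(), "lr": 1e-5},  # lr_encoder
        {"params": head.parameters(), "lr": 5e-4},     # lr_head
    ],
    weight_decay=0.02,  # weight_decay from the config
)
print([g["lr"] for g in optimizer.param_groups])  # [1e-05, 0.0005]
```

A smaller rate for the pre-trained encoder avoids catastrophically overwriting CLIP's representations while the randomly initialized head trains quickly.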
### Training Highlights

- **Differential Learning Rates**: encoder (1e-5) vs. classification head (5e-4)
- **Mixed Precision**: FP16 training for efficiency
- **Data Augmentation**: random scaling, horizontal flip, color jitter
- **Threshold Calibration**: per-class threshold optimization on the validation set
- **Early Stopping**: patience of 3 epochs based on ROC-AUC macro
- **Best Checkpoint**: selected by validation ROC-AUC macro score

### Computational Resources

- **Training Duration**: 6 epochs
- **Best Checkpoint**: step 33,708
- **Hardware**: GPU with FP16 support

---

## πŸš€ How to Use

### Quick Start (Recommended)

```python
import torch
from PIL import Image
from transformers import AutoModel, CLIPProcessor

# Load the model (custom architecture requires trust_remote_code=True)
model = AutoModel.from_pretrained(
    "Amirhossein75/clip-vit-base-mmhs150k-fusion",
    trust_remote_code=True,
)
model.eval()

# Load the CLIP processor
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Prepare inputs
image = Image.open("path/to/image.jpg").convert("RGB")
text = "sample text from the meme"

inputs = processor(
    text=[text],
    images=[image],
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=77,
)

# Run inference
with torch.no_grad():
    result = model.predict(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        pixel_values=inputs["pixel_values"],
    )

print(result)
# {'predictions': {'racist': False, 'sexist': True, 'homophobe': False, 'religion': False, 'otherhate': False},
#  'probabilities': {'racist': 0.12, 'sexist': 0.78, 'homophobe': 0.05, 'religion': 0.02, 'otherhate': 0.15}}
```

### Batch Inference

```python
import torch
from PIL import Image
from transformers import AutoModel, CLIPProcessor

# Load model
model = AutoModel.from_pretrained(
    "Amirhossein75/clip-vit-base-mmhs150k-fusion",
    trust_remote_code=True,
)
model.eval()

processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Prepare a batch
images = [
    Image.open("image1.jpg").convert("RGB"),
    Image.open("image2.jpg").convert("RGB"),
]
texts = ["text for image 1", "text for image 2"]

inputs = processor(
    text=texts,
    images=images,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=77,
)

# Get raw logits and probabilities
with torch.no_grad():
    outputs = model(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        pixel_values=inputs["pixel_values"],
    )
probabilities = torch.sigmoid(outputs["logits"])

# Apply the optimized per-class thresholds
thresholds = torch.tensor([0.35, 0.70, 0.75, 0.30, 0.60])
predictions = (probabilities > thresholds).int()
print(predictions)
```

### Using with GPU

```python
import torch
from transformers import AutoModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

model = AutoModel.from_pretrained(
    "Amirhossein75/clip-vit-base-mmhs150k-fusion",
    trust_remote_code=True,
).to(device)
model.eval()

processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# ... prepare inputs ...
inputs = {k: v.to(device) for k, v in inputs.items()}

with torch.no_grad():
    result = model.predict(**inputs)
```

### Download Config Files Only

```python
from huggingface_hub import hf_hub_download
import json

# Download the inference config
config_path = hf_hub_download(
    repo_id="Amirhossein75/clip-vit-base-mmhs150k-fusion",
    filename="inference_config.json",
)

with open(config_path) as f:
    config = json.load(f)

print(f"Classes: {config['class_names']}")
print(f"Thresholds: {config['thresholds']}")
print(f"Encoder: {config['encoder_name']}")
```

---

## πŸ“ Model Files

| File | Description |
|------|-------------|
| `checkpoint-33708/model.safetensors` | Model weights in safetensors format (617 MB) |
| `modeling_clip_fusion.py` | Model architecture code (auto-downloaded with `trust_remote_code`) |
| `config.json` | Model architecture configuration |
| `inference_config.json` | Inference settings with thresholds and class names |
| `label_map.json` | Label name mapping |
| `test_metrics.json` | Test set evaluation metrics |
| `val_report.json` | Detailed validation classification report |

---

## ☁️ AWS SageMaker Deployment

This model is compatible with AWS SageMaker for cloud deployment:

```python
import base64

from sagemaker.pytorch import PyTorchModel
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

model = PyTorchModel(
    model_data="s3://your-bucket/model.tar.gz",
    role=role,
    entry_point="inference.py",
    source_dir="sagemaker",
    framework_version="2.1.0",
    py_version="py310",
)

predictor = model.deploy(
    instance_type="ml.g4dn.xlarge",
    initial_instance_count=1,
)

# Make a prediction
with open("image.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = predictor.predict({
    "instances": [{
        "text": "Sample text content",
        "image_base64": image_b64,
    }]
})
```

See the [SageMaker documentation](https://github.com/amirhossein-yousefi/multimodal-content-moderation/tree/main/sagemaker) for the full deployment
guide.

---

## ⚠️ Intended Uses & Limitations

### βœ… Intended Uses

- **Content moderation** for social media platforms
- **Detecting hateful memes** and posts
- **Research** in multi-modal hate speech detection
- **Building content safety systems**
- **Pre-filtering** potentially harmful content for human review

### ⚠️ Limitations

| Limitation | Description |
|------------|-------------|
| **Language** | Trained only on English content |
| **Domain** | Twitter-specific; may not generalize to other platforms |
| **Class Imbalance** | Lower performance on rare categories (Religion: F1 = 0.283) |
| **Cultural Context** | May miss culturally specific hate speech |
| **Sarcasm/Irony** | May struggle with subtle or ironic hateful content |
| **Image-only Hate** | The text encoder carries much of the signal; purely visual hate may be missed |

### ❌ Out-of-Scope Uses

- **NOT** for making final moderation decisions without human review
- **NOT** suitable for legal or compliance purposes without additional validation
- **NOT** for censorship or suppression of legitimate speech
- **NOT** for targeting or profiling individuals

### πŸ›‘οΈ Ethical Considerations

- This model should be used as a **tool to assist** human moderators, not to replace them
- False positives may incorrectly flag legitimate content
- False negatives may miss harmful content
- Regular evaluation and bias auditing are recommended
- Consider cultural and contextual factors in deployment

---

## πŸ“ Citation

If you use this model, please cite:

```bibtex
@misc{yousefi2024multimodal,
  title={Multi-Modal Hateful Content Classification with CLIP Fusion},
  author={Yousefi, Amirhossein},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/Amirhossein75/clip-vit-base-mmhs150k-fusion}
}
```

### Dataset Citation

```bibtex
@inproceedings{gomez2020exploring,
  title={Exploring Hate Speech Detection in Multimodal Publications},
  author={Gomez, Raul and Gibert, Jaume and Gomez, Lluis and Karatzas, Dimosthenis},
  booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
  pages={1470--1478},
  year={2020}
}
```

### CLIP Citation

```bibtex
@inproceedings{radford2021learning,
  title={Learning Transferable Visual Models From Natural Language Supervision},
  author={Radford, Alec and Kim, Jong Wook and Hallacy, Chris and others},
  booktitle={International Conference on Machine Learning},
  pages={8748--8763},
  year={2021}
}
```

---

## πŸ”— Links

| Resource | Link |
|----------|------|
| **GitHub Repository** | [multimodal-content-moderation](https://github.com/amirhossein-yousefi/multimodal-content-moderation) |
| **Base Model** | [openai/clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32) |
| **MMHS150K Dataset** | [Official Page](https://gombru.github.io/2019/10/09/MMHS/) |
| **CLIP Paper** | [arXiv](https://arxiv.org/abs/2103.00020) |

---

## πŸ“„ License

This project is licensed under the **MIT License**; see the [LICENSE](https://github.com/amirhossein-yousefi/multimodal-content-moderation/blob/main/LICENSE) file for details.

---

## 🀝 Contributing

Contributions are welcome! Please see the [GitHub repository](https://github.com/amirhossein-yousefi/multimodal-content-moderation) for contribution guidelines.