---
license: mit
tags:
- multimodal
- hate-speech-detection
- content-moderation
- clip
- vision-language
- image-text
- classification
- pytorch
- transformers
- multi-label-classification
- social-media
- meme-classification
- text-image
- late-fusion
- gated-attention
datasets:
- mmhs150k
language:
- en
metrics:
- f1
- roc_auc
- precision
- recall
library_name: transformers
pipeline_tag: image-text-to-text
base_model:
- openai/clip-vit-base-patch32
model-index:
- name: clip-vit-base-mmhs150k-fusion
  results:
  - task:
      type: multi-label-classification
      name: Hate Speech Detection
    dataset:
      name: MMHS150K
      type: mmhs150k
    metrics:
    - type: f1_macro
      value: 0.566
      name: F1 Macro
    - type: f1_micro
      value: 0.635
      name: F1 Micro
    - type: roc_auc
      value: 0.783
      name: ROC-AUC Macro
---
# CLIP-ViT-Base Fusion Model for Multi-Modal Hate Speech Detection
A PyTorch multi-modal (image + text) hateful-content classifier that uses a CLIP encoder with a late-fusion architecture. It was trained on the MMHS150K dataset to detect hate speech in social media memes and posts.
## 🎯 Model Description
This model implements a **late fusion architecture with gated attention mechanism** for detecting hateful content in social media memes and posts. It combines visual and textual features using OpenAI's CLIP (ViT-Base-Patch32) as the backbone encoder.
The model performs **multi-label classification** across 5 hate speech categories, making it capable of detecting multiple types of hate in a single post (e.g., content that is both racist and sexist).
### 🏗️ Architecture
```
┌─────────────┐   ┌─────────────┐
│    Image    │   │    Text     │
│   Encoder   │   │   Encoder   │
│ (CLIP ViT)  │   │ (CLIP Text) │
└──────┬──────┘   └──────┬──────┘
       │                 │
       ▼                 ▼
┌─────────────┐   ┌─────────────┐
│ Projection  │   │ Projection  │
│  (Linear)   │   │  (Linear)   │
└──────┬──────┘   └──────┬──────┘
       │                 │
       └────────┬────────┘
                │
                ▼
        ┌─────────────┐
        │ Gated Fusion│◄── Modality presence flags
        │   Module    │    (handles missing modalities)
        └──────┬──────┘
               │
               ▼
   ┌───────────────────────┐
   │ Interaction Features  │
   │ • Fused embedding     │
   │ • Text embedding      │
   │ • Visual embedding    │
   │ • |text - visual|     │
   │ • text ⊙ visual       │
   └───────────┬───────────┘
               │
               ▼
       ┌──────────────┐
       │Classification│
       │  Head (MLP)  │
       │ → 5 classes  │
       └──────────────┘
```
### 🔑 Key Features
| Feature | Description |
|---------|-------------|
| **Backbone** | `openai/clip-vit-base-patch32` - Pre-trained CLIP model |
| **Fusion Dimension** | 512 |
| **Max Text Length** | 77 tokens |
| **Multi-label Output** | 5 hate speech categories |
| **Gated Attention** | Modality-aware fusion with learnable gates |
| **Interaction Features** | Rich feature interactions (concatenation, element-wise product, absolute difference) |
| **Missing Modality Handling** | Can handle text-only or image-only inputs |
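The gated-fusion stage and the interaction features listed above can be sketched in PyTorch. This is an illustrative re-implementation, not the code shipped in `modeling_clip_fusion.py`: the module name, the gate wiring, and the two presence flags are assumptions made for the sketch.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Illustrative gated fusion: a learnable gate mixes the two modalities,
    then interaction features are concatenated for the classifier head."""

    def __init__(self, dim: int = 512, num_classes: int = 5):
        super().__init__()
        # Gate conditioned on both embeddings plus two modality-presence flags
        self.gate = nn.Sequential(nn.Linear(2 * dim + 2, dim), nn.Sigmoid())
        # Head input: fused + text + visual + |text - visual| + text ⊙ visual
        self.head = nn.Sequential(
            nn.Linear(5 * dim, dim), nn.ReLU(), nn.Linear(dim, num_classes)
        )

    def forward(self, text_emb, vis_emb, text_present, img_present):
        flags = torch.stack([text_present, img_present], dim=-1).float()
        g = self.gate(torch.cat([text_emb, vis_emb, flags], dim=-1))
        fused = g * text_emb + (1 - g) * vis_emb
        feats = torch.cat(
            [fused, text_emb, vis_emb,
             (text_emb - vis_emb).abs(), text_emb * vis_emb],
            dim=-1,
        )
        return self.head(feats)  # (batch, num_classes) logits

# Shape check with random projected embeddings
t = torch.randn(4, 512)
v = torch.randn(4, 512)
logits = GatedFusion()(t, v, torch.ones(4), torch.ones(4))
print(logits.shape)  # torch.Size([4, 5])
```

The gate produces a per-dimension mixing weight in [0, 1]; when one modality is absent, the presence flags let the gate learn to lean on the other.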
### 🏷️ Output Classes
| Class ID | Category | Description | Prior Probability |
|----------|----------|-------------|-------------------|
| 0 | **Racist** | Content targeting race or ethnicity | 32.6% |
| 1 | **Sexist** | Content targeting gender | 12.0% |
| 2 | **Homophobe** | Content targeting sexual orientation | 7.6% |
| 3 | **Religion** | Religion-based hate speech | 1.5% |
| 4 | **OtherHate** | Other types of hate speech | 15.6% |
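Because the task is multi-label, each post's annotation is a multi-hot vector over these five categories; a minimal sketch (the class order follows the table above; the helper name is mine, not from the repo):

```python
# Class order follows the table above.
CLASS_NAMES = ["racist", "sexist", "homophobe", "religion", "otherhate"]

def to_multi_hot(active: list[str]) -> list[float]:
    """Encode the active hate categories of one post as a 5-dim target."""
    return [1.0 if name in active else 0.0 for name in CLASS_NAMES]

# A post annotated as both racist and sexist:
print(to_multi_hot(["racist", "sexist"]))  # [1.0, 1.0, 0.0, 0.0, 0.0]
```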
---
## 📊 Evaluation Results
### Test Set Performance
| Metric | Value |
|--------|-------|
| **F1 Macro** | 0.566 |
| **F1 Micro** | 0.635 |
| **ROC-AUC Macro** | 0.783 |
| **Test Loss** | 1.516 |
| **Throughput** | 381.5 samples/sec |
### Per-Class Performance (Validation Set)
| Class | Precision | Recall | F1-Score | Support |
|-------|-----------|--------|----------|---------|
| **Racist** | 0.576 | 0.843 | 0.684 | 1,994 |
| **Sexist** | 0.587 | 0.646 | 0.615 | 875 |
| **Homophobe** | 0.804 | 0.709 | 0.753 | 612 |
| **Religion** | 0.435 | 0.209 | 0.283 | 129 |
| **OtherHate** | 0.541 | 0.700 | 0.611 | 1,195 |
| **Micro Avg** | 0.588 | 0.737 | 0.654 | 4,805 |
| **Macro Avg** | 0.589 | 0.621 | 0.589 | 4,805 |
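The macro/micro aggregates above follow the standard multi-label definitions. A toy reproduction with scikit-learn (the arrays below are made up for illustration, not the model's outputs):

```python
import numpy as np
from sklearn.metrics import f1_score

# Made-up multi-hot ground truth and predictions (rows: samples, cols: 5 classes)
y_true = np.array([[1, 0, 0, 0, 0],
                   [1, 1, 0, 0, 0],
                   [0, 0, 1, 0, 1],
                   [0, 0, 0, 1, 0]])
y_pred = np.array([[1, 0, 0, 0, 0],
                   [1, 0, 0, 0, 0],
                   [0, 0, 1, 0, 1],
                   [0, 0, 0, 0, 0]])

# Macro: F1 per class, then unweighted mean; micro: TP/FP/FN pooled across classes
print(f1_score(y_true, y_pred, average="macro", zero_division=0))  # 0.6
print(f1_score(y_true, y_pred, average="micro", zero_division=0))  # 0.8
```

ROC-AUC is computed analogously with `roc_auc_score`, but from the sigmoid probabilities rather than the thresholded labels.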
### ⚖️ Optimized Thresholds
The model uses **per-class calibrated thresholds** (instead of the default 0.5) to turn probabilities into labels:
| Class | Threshold |
|-------|-----------|
| Racist | 0.35 |
| Sexist | 0.70 |
| Homophobe | 0.75 |
| Religion | 0.30 |
| OtherHate | 0.60 |
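One common way to obtain such thresholds is a per-class sweep over validation probabilities, keeping the value that maximizes F1. A rough sketch of the idea on made-up data (the exact calibration procedure used for this model is not documented here):

```python
import numpy as np

def calibrate_threshold(y_true: np.ndarray, y_prob: np.ndarray) -> float:
    """Sweep candidate thresholds for one class; keep the F1-maximizing one."""
    best_t, best_f1 = 0.5, -1.0
    for t in np.arange(0.05, 0.96, 0.05):
        pred = (y_prob >= t).astype(int)
        tp = int(((pred == 1) & (y_true == 1)).sum())
        fp = int(((pred == 1) & (y_true == 0)).sum())
        fn = int(((pred == 0) & (y_true == 1)).sum())
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = round(float(t), 2), f1
    return best_t

# Made-up validation labels and sigmoid scores for one class
y_true = np.array([1, 1, 0, 0, 1, 0])
y_prob = np.array([0.9, 0.4, 0.3, 0.1, 0.8, 0.6])
print(calibrate_threshold(y_true, y_prob))  # 0.35
```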
### 📈 Model Comparison
| Model | F1 Macro | F1 Micro | ROC-AUC Macro | Throughput |
|-------|----------|----------|---------------|------------|
| **CLIP Fusion** (this model) | 0.566 | 0.635 | 0.783 | 381.5 |
| CLIP MTL | **0.569** | **0.644** | 0.783 | 390.9 |
| SigLIP Fusion | 0.507 | 0.610 | 0.774 | 236.3 |
| CLIP Fusion (Weighted Sampling) | 0.557 | 0.636 | 0.772 | 266.4 |
| CLIP Fusion (Bigger Batch) | 0.515 | 0.517 | **0.804** | **400.9** |
---
## 📚 Training Data
### MMHS150K Dataset
The model was trained on **MMHS150K** (Multi-Modal Hate Speech), a large-scale dataset of 150,000 tweet-image pairs collected from Twitter and annotated for hate speech detection.
| Property | Value |
|----------|-------|
| **Source** | Twitter |
| **Total Samples** | ~150,000 |
| **Modalities** | Image + Text |
| **Annotation** | Multi-label (5 hate categories) |
| **Language** | English |
#### Dataset Splits
| Split | Samples |
|-------|---------|
| Train | ~112,500 |
| Validation | ~15,000 |
| Test | ~22,500 |
#### Dataset Reference
> **Paper**: ["Exploring Hate Speech Detection in Multimodal Publications"](https://gombru.github.io/2019/10/09/MMHS/) (WACV 2020)
> **Authors**: Raul Gomez, Jaume Gibert, Lluis Gomez, Dimosthenis Karatzas
---
## 🔧 Training Procedure
### Training Configuration
```yaml
# Model Configuration
backend: clip
head: fusion
encoder_name: openai/clip-vit-base-patch32
fusion_dim: 512
max_text_length: 77
freeze_text: false
freeze_image: false
# Training Configuration
num_train_epochs: 6
per_device_train_batch_size: 32
per_device_eval_batch_size: 64
gradient_accumulation_steps: 2
# Learning Rates (Differential)
lr_encoder: 1.0e-5
lr_head: 5.0e-4
# Regularization
weight_decay: 0.02
max_grad_norm: 1.0
# Scheduler
warmup_ratio: 0.05
lr_scheduler_type: cosine
# Loss
loss_type: bce
use_logit_adjustment: false
# Precision
precision: fp16
# Data Augmentation
augment: true
aug_scale_min: 0.8
aug_scale_max: 1.0
horizontal_flip: true
color_jitter: true
# Early Stopping
early_stopping_patience: 3
metric_for_best_model: roc_macro
```
### Training Highlights
- **Differential Learning Rates**: Encoder (1e-5) vs Classification Head (5e-4)
- **Mixed Precision**: FP16 training for efficiency
- **Data Augmentation**: Random scaling, horizontal flip, color jitter
- **Threshold Calibration**: Per-class threshold optimization on validation set
- **Early Stopping**: Patience of 3 epochs based on ROC-AUC macro
- **Best Checkpoint**: Selected based on validation ROC-AUC macro score
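The differential learning rates above translate directly into two optimizer parameter groups; a minimal sketch with stand-in modules (`encoder` and `head` here are placeholders, not the repo's actual attribute names):

```python
import torch
from torch.optim import AdamW

# Stand-ins for the CLIP encoder and the classification head
encoder = torch.nn.Linear(512, 512)
head = torch.nn.Linear(512, 5)

optimizer = AdamW(
    [
        {"params": encoder.parameters(), "lr": 1e-5},  # lr_encoder
        {"params": head.parameters(), "lr": 5e-4},     # lr_head
    ],
    weight_decay=0.02,  # matches the config above
)
print([g["lr"] for g in optimizer.param_groups])  # [1e-05, 0.0005]
```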
### Computational Resources
- **Training Length**: 6 epochs
- **Best Checkpoint**: Step 33,708
- **Hardware**: GPU with FP16 support
---
## 🚀 How to Use
### Quick Start (Recommended)
```python
import torch
from PIL import Image
from transformers import AutoModel, CLIPProcessor

# Load model with trust_remote_code=True
model = AutoModel.from_pretrained(
    "Amirhossein75/clip-vit-base-mmhs150k-fusion",
    trust_remote_code=True
)
model.eval()

# Load CLIP processor
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Prepare inputs
image = Image.open("path/to/image.jpg").convert("RGB")
text = "sample text from the meme"
inputs = processor(
    text=[text],
    images=[image],
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=77
)

# Run inference
with torch.no_grad():
    result = model.predict(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        pixel_values=inputs["pixel_values"]
    )

print(result)
# {'predictions': {'racist': False, 'sexist': True, 'homophobe': False, 'religion': False, 'otherhate': False},
#  'probabilities': {'racist': 0.12, 'sexist': 0.78, 'homophobe': 0.05, 'religion': 0.02, 'otherhate': 0.15}}
```
### Batch Inference
```python
import torch
from PIL import Image
from transformers import AutoModel, CLIPProcessor

# Load model
model = AutoModel.from_pretrained(
    "Amirhossein75/clip-vit-base-mmhs150k-fusion",
    trust_remote_code=True
)
model.eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Prepare batch
images = [Image.open("image1.jpg").convert("RGB"), Image.open("image2.jpg").convert("RGB")]
texts = ["text for image 1", "text for image 2"]
inputs = processor(
    text=texts,
    images=images,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=77
)

# Get raw logits and probabilities
with torch.no_grad():
    outputs = model(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        pixel_values=inputs["pixel_values"]
    )
probabilities = torch.sigmoid(outputs["logits"])

# Apply optimized thresholds (class order: racist, sexist, homophobe, religion, otherhate)
thresholds = torch.tensor([0.35, 0.70, 0.75, 0.30, 0.60])
predictions = (probabilities > thresholds).int()
print(predictions)
```
### Using with GPU
```python
import torch
from transformers import AutoModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

model = AutoModel.from_pretrained(
    "Amirhossein75/clip-vit-base-mmhs150k-fusion",
    trust_remote_code=True
).to(device)
model.eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# ... prepare inputs ...
inputs = {k: v.to(device) for k, v in inputs.items()}

with torch.no_grad():
    result = model.predict(**inputs)
```
### Download Config Files Only
```python
from huggingface_hub import hf_hub_download
import json

# Download inference config
config_path = hf_hub_download(
    repo_id="Amirhossein75/clip-vit-base-mmhs150k-fusion",
    filename="inference_config.json"
)
with open(config_path) as f:
    config = json.load(f)

print(f"Classes: {config['class_names']}")
print(f"Thresholds: {config['thresholds']}")
print(f"Encoder: {config['encoder_name']}")
```
---
## 📁 Model Files
| File | Description |
|------|-------------|
| `checkpoint-33708/model.safetensors` | Model weights in safetensors format (617MB) |
| `modeling_clip_fusion.py` | Model architecture code (auto-downloaded with trust_remote_code) |
| `config.json` | Model architecture configuration |
| `inference_config.json` | Inference settings with thresholds and class names |
| `label_map.json` | Label name mapping |
| `test_metrics.json` | Test set evaluation metrics |
| `val_report.json` | Detailed validation classification report |
---
## ☁️ AWS SageMaker Deployment
This model is compatible with AWS SageMaker for cloud deployment:
```python
import base64

from sagemaker.pytorch import PyTorchModel
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

model = PyTorchModel(
    model_data="s3://your-bucket/model.tar.gz",
    role=role,
    entry_point="inference.py",
    source_dir="sagemaker",
    framework_version="2.1.0",
    py_version="py310",
)

predictor = model.deploy(
    instance_type="ml.g4dn.xlarge",
    initial_instance_count=1,
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)

# Make a prediction
with open("image.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = predictor.predict({
    "instances": [{
        "text": "Sample text content",
        "image_base64": image_b64,
    }]
})
```
See the [SageMaker documentation](https://github.com/amirhossein-yousefi/multimodal-content-moderation/tree/main/sagemaker) for the full deployment guide.
---
## ⚠️ Intended Uses & Limitations
### ✅ Intended Uses
- **Content moderation** for social media platforms
- **Detecting hateful memes** and posts
- **Research** in multi-modal hate speech detection
- **Building content safety systems**
- **Pre-filtering** potentially harmful content for human review
### ⚠️ Limitations
| Limitation | Description |
|------------|-------------|
| **Language** | Trained only on English content |
| **Domain** | Twitter-specific; may not generalize to other platforms |
| **Class Imbalance** | Lower performance on rare categories (Religion: F1=0.283) |
| **Cultural Context** | May miss culturally-specific hate speech |
| **Sarcasm/Irony** | May struggle with subtle or ironic hateful content |
| **Image-only Hate** | Relies heavily on the text signal; hate conveyed purely through imagery may be missed |
### ❌ Out-of-Scope Uses
- **NOT** for making final moderation decisions without human review
- **NOT** suitable for legal or compliance purposes without additional validation
- **NOT** for censorship or suppression of legitimate speech
- **NOT** for targeting or profiling individuals
### 🛡️ Ethical Considerations
- This model should be used as a **tool to assist** human moderators, not replace them
- False positives may incorrectly flag legitimate content
- False negatives may miss harmful content
- Regular evaluation and bias auditing are recommended
- Consider the cultural and contextual factors in deployment
---
## 📝 Citation
If you use this model, please cite:
```bibtex
@misc{yousefi2024multimodal,
title={Multi-Modal Hateful Content Classification with CLIP Fusion},
author={Yousefi, Amirhossein},
year={2024},
publisher={Hugging Face},
url={https://huggingface.co/Amirhossein75/clip-vit-base-mmhs150k-fusion}
}
```
### Dataset Citation
```bibtex
@inproceedings{gomez2020exploring,
title={Exploring Hate Speech Detection in Multimodal Publications},
author={Gomez, Raul and Gibert, Jaume and Gomez, Lluis and Karatzas, Dimosthenis},
booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
pages={1470--1478},
year={2020}
}
```
### CLIP Citation
```bibtex
@inproceedings{radford2021learning,
title={Learning Transferable Visual Models From Natural Language Supervision},
author={Radford, Alec and Kim, Jong Wook and Hallacy, Chris and others},
booktitle={International Conference on Machine Learning},
pages={8748--8763},
year={2021}
}
```
---
## 🔗 Links
| Resource | Link |
|----------|------|
| **GitHub Repository** | [multimodal-content-moderation](https://github.com/amirhossein-yousefi/multimodal-content-moderation) |
| **Base Model** | [openai/clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32) |
| **MMHS150K Dataset** | [Official Page](https://gombru.github.io/2019/10/09/MMHS/) |
| **CLIP Paper** | [arXiv](https://arxiv.org/abs/2103.00020) |
---
## 📄 License
This project is licensed under the **MIT License** - see the [LICENSE](https://github.com/amirhossein-yousefi/multimodal-content-moderation/blob/main/LICENSE) file for details.
---
## 🤝 Contributing
Contributions are welcome! Please see the [GitHub repository](https://github.com/amirhossein-yousefi/multimodal-content-moderation) for contribution guidelines.