---
license: mit
tags:
  - multimodal
  - hate-speech-detection
  - content-moderation
  - clip
  - vision-language
  - image-text
  - classification
  - pytorch
  - transformers
  - multi-label-classification
  - social-media
  - meme-classification
  - text-image
  - late-fusion
  - gated-attention
datasets:
  - mmhs150k
language:
  - en
metrics:
  - f1
  - roc_auc
  - precision
  - recall
library_name: transformers
pipeline_tag: image-text-to-text
base_model:
  - openai/clip-vit-base-patch32
model-index:
  - name: clip-vit-base-mmhs150k-fusion
    results:
      - task:
          type: multi-label-classification
          name: Hate Speech Detection
        dataset:
          name: MMHS150K
          type: mmhs150k
        metrics:
          - type: f1_macro
            value: 0.566
            name: F1 Macro
          - type: f1_micro
            value: 0.635
            name: F1 Micro
          - type: roc_auc
            value: 0.783
            name: ROC-AUC Macro
---

# CLIP-ViT-Base Fusion Model for Multi-Modal Hate Speech Detection


A PyTorch-based multi-modal (image + text) hateful-content classification model that uses a CLIP encoder with a late fusion architecture, trained on the MMHS150K dataset to detect hate speech in social media memes and posts.

## 🎯 Model Description

This model implements a **late fusion architecture with a gated attention mechanism** for detecting hateful content in social media memes and posts. It combines visual and textual features using OpenAI's CLIP (ViT-Base-Patch32) as the backbone encoder.

The model performs **multi-label classification** across 5 hate speech categories, so it can detect multiple types of hate in a single post (e.g., content that is both racist and sexist).

### πŸ—οΈ Architecture

```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚    Image    β”‚        β”‚    Text     β”‚
β”‚   Encoder   β”‚        β”‚   Encoder   β”‚
β”‚ (CLIP ViT)  β”‚        β”‚ (CLIP Text) β”‚
β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜        β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜
       β”‚                      β”‚
       β–Ό                      β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Projection  β”‚        β”‚ Projection  β”‚
β”‚  (Linear)   β”‚        β”‚  (Linear)   β”‚
β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜        β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜
       β”‚                      β”‚
       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                  β”‚
                  β–Ό
          β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
          β”‚ Gated Fusion│◄── Modality presence flags
          β”‚   Module    β”‚    (handles missing modalities)
          β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜
                 β”‚
                 β–Ό
      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
      β”‚  Interaction Features  β”‚
      β”‚  β€’ Fused embedding     β”‚
      β”‚  β€’ Text embedding      β”‚
      β”‚  β€’ Visual embedding    β”‚
      β”‚  β€’ |text - visual|     β”‚
      β”‚  β€’ text βŠ™ visual       β”‚
      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                  β”‚
                  β–Ό
          β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
          β”‚Classificationβ”‚
          β”‚  Head (MLP)  β”‚
          β”‚ β†’ 5 classes  β”‚
          β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
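As a rough illustration of the gated-fusion and interaction-feature blocks in the diagram above, here is a minimal PyTorch sketch. The module and argument names (`GatedFusion`, the `*_present` flags) are illustrative assumptions, not the repository's actual code.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Illustrative gated fusion over projected text/image embeddings."""

    def __init__(self, dim: int = 512, num_classes: int = 5):
        super().__init__()
        # Learnable gate conditioned on both modalities plus presence flags
        self.gate = nn.Sequential(nn.Linear(2 * dim + 2, dim), nn.Sigmoid())
        # Classifier over [fused, text, visual, |t - v|, t * v] -> 5 * dim input
        self.head = nn.Sequential(
            nn.Linear(5 * dim, dim), nn.ReLU(), nn.Linear(dim, num_classes)
        )

    def forward(self, text, visual, text_present, image_present):
        # Zero out an absent modality so the gate can learn to compensate
        text = text * text_present.unsqueeze(-1)
        visual = visual * image_present.unsqueeze(-1)
        flags = torch.stack([text_present, image_present], dim=-1)
        g = self.gate(torch.cat([text, visual, flags], dim=-1))
        fused = g * text + (1.0 - g) * visual
        feats = torch.cat(
            [fused, text, visual, (text - visual).abs(), text * visual], dim=-1
        )
        return self.head(feats)  # multi-label logits, one per hate category

batch, dim = 2, 512
logits = GatedFusion(dim)(
    torch.randn(batch, dim), torch.randn(batch, dim),
    torch.ones(batch), torch.ones(batch),
)
print(logits.shape)  # torch.Size([2, 5])
```

The logits are passed through a sigmoid (not a softmax) at inference time, since the five categories are not mutually exclusive.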
### πŸ”‘ Key Features

| Feature | Description |
|---------|-------------|
| **Backbone** | `openai/clip-vit-base-patch32`, a pre-trained CLIP model |
| **Fusion Dimension** | 512 |
| **Max Text Length** | 77 tokens |
| **Multi-label Output** | 5 hate speech categories |
| **Gated Attention** | Modality-aware fusion with learnable gates |
| **Interaction Features** | Rich feature interactions (concatenation, element-wise product, absolute difference) |
| **Missing Modality Handling** | Can handle text-only or image-only inputs |

### 🏷️ Output Classes

| Class ID | Category | Description | Prior Probability |
|----------|----------|-------------|-------------------|
| 0 | **Racist** | Racist content targeting race/ethnicity | 32.6% |
| 1 | **Sexist** | Sexist content targeting gender | 12.0% |
| 2 | **Homophobe** | Homophobic content targeting sexual orientation | 7.6% |
| 3 | **Religion** | Religion-based hate speech | 1.5% |
| 4 | **OtherHate** | Other types of hate speech | 15.6% |

---

## πŸ“Š Evaluation Results

### Test Set Performance

| Metric | Value |
|--------|-------|
| **F1 Macro** | 0.566 |
| **F1 Micro** | 0.635 |
| **ROC-AUC Macro** | 0.783 |
| **Test Loss** | 1.516 |
| **Throughput** | 381.5 samples/sec |

### Per-Class Performance (Validation Set)

| Class | Precision | Recall | F1-Score | Support |
|-------|-----------|--------|----------|---------|
| **Racist** | 0.576 | 0.843 | 0.684 | 1,994 |
| **Sexist** | 0.587 | 0.646 | 0.615 | 875 |
| **Homophobe** | 0.804 | 0.709 | 0.753 | 612 |
| **Religion** | 0.435 | 0.209 | 0.283 | 129 |
| **OtherHate** | 0.541 | 0.700 | 0.611 | 1,195 |
| **Micro Avg** | 0.588 | 0.737 | 0.654 | 4,805 |
| **Macro Avg** | 0.589 | 0.621 | 0.589 | 4,805 |

### βš™οΈ Optimized Thresholds

The model uses **per-class calibrated thresholds** for optimal performance (instead of the default 0.5):

| Class | Threshold |
|-------|-----------|
| Racist | 0.35 |
| Sexist | 0.70 |
| Homophobe | 0.75 |
| Religion | 0.30 |
| OtherHate | 0.60 |
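A common way to obtain per-class thresholds like these (a sketch of the general technique; the exact calibration procedure used for this model may differ) is to sweep a grid of candidate thresholds per class on validation-set probabilities and keep the F1-maximizing value:

```python
import numpy as np

def f1(y_true, y_pred):
    """Binary F1 without external dependencies."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def calibrate_thresholds(probs, labels, grid=None):
    """For each class independently, keep the threshold maximizing F1.
    probs, labels: arrays of shape (n_samples, n_classes)."""
    grid = np.linspace(0.05, 0.95, 19) if grid is None else grid
    return np.array([
        grid[int(np.argmax([f1(labels[:, c], probs[:, c] >= t) for t in grid]))]
        for c in range(probs.shape[1])
    ])

# Toy data: positives score in [0.6, 1.0), negatives in [0.0, 0.4)
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=(200, 5))
probs = labels * 0.6 + rng.random((200, 5)) * 0.4
th = calibrate_thresholds(probs, labels)
print(th)
```

Calibrating on the validation set (never the test set) keeps the reported test metrics honest.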
### πŸ“ˆ Model Comparison

| Model | F1 Macro | F1 Micro | ROC-AUC Macro | Throughput |
|-------|----------|----------|---------------|------------|
| **CLIP Fusion** (this model) | 0.566 | 0.635 | 0.783 | 381.5 |
| CLIP MTL | **0.569** | **0.644** | 0.783 | 390.9 |
| SigLIP Fusion | 0.507 | 0.610 | 0.774 | 236.3 |
| CLIP Fusion (Weighted Sampling) | 0.557 | 0.636 | 0.772 | 266.4 |
| CLIP Fusion (Bigger Batch) | 0.515 | 0.517 | **0.804** | **400.9** |

---

## πŸŽ“ Training Data

### MMHS150K Dataset

The model was trained on **MMHS150K** (Multi-Modal Hate Speech), a large-scale dataset of 150,000 tweet-image pairs collected from Twitter and annotated for hate speech detection.

| Property | Value |
|----------|-------|
| **Source** | Twitter |
| **Total Samples** | ~150,000 |
| **Modalities** | Image + Text |
| **Annotation** | Multi-label (5 hate categories) |
| **Language** | English |

#### Dataset Splits

| Split | Samples |
|-------|---------|
| Train | ~112,500 |
| Validation | ~15,000 |
| Test | ~22,500 |

#### Dataset Reference

> **Paper**: ["Exploring Hate Speech Detection in Multimodal Publications"](https://gombru.github.io/2019/10/09/MMHS/) (WACV 2020)
> **Authors**: Raul Gomez, Jaume Gibert, Lluis Gomez, Dimosthenis Karatzas

---

## πŸ”§ Training Procedure

### Training Configuration

```yaml
# Model Configuration
backend: clip
head: fusion
encoder_name: openai/clip-vit-base-patch32
fusion_dim: 512
max_text_length: 77
freeze_text: false
freeze_image: false

# Training Configuration
num_train_epochs: 6
per_device_train_batch_size: 32
per_device_eval_batch_size: 64
gradient_accumulation_steps: 2

# Learning Rates (Differential)
lr_encoder: 1.0e-5
lr_head: 5.0e-4

# Regularization
weight_decay: 0.02
max_grad_norm: 1.0

# Scheduler
warmup_ratio: 0.05
lr_scheduler_type: cosine

# Loss
loss_type: bce
use_logit_adjustment: false

# Precision
precision: fp16

# Data Augmentation
augment: true
aug_scale_min: 0.8
aug_scale_max: 1.0
horizontal_flip: true
color_jitter: true

# Early Stopping
early_stopping_patience: 3
metric_for_best_model: roc_macro
```
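Differential learning rates such as `lr_encoder` and `lr_head` in the config above are typically wired up through optimizer parameter groups; a minimal sketch with stand-in modules (the real model's encoder and head are of course larger):

```python
import torch
import torch.nn as nn

# Stand-ins for the CLIP encoder and the fusion/classification head
encoder = nn.Linear(512, 512)
head = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 5))

# One parameter group per learning rate, matching the config values
optimizer = torch.optim.AdamW(
    [
        {"params": encoder.parameters(), "lr": 1e-5},  # lr_encoder
        {"params": head.parameters(), "lr": 5e-4},     # lr_head
    ],
    weight_decay=0.02,  # weight_decay from the config
)
print([g["lr"] for g in optimizer.param_groups])  # [1e-05, 0.0005]
```

A smaller rate for the pre-trained encoder avoids catastrophically overwriting CLIP's representations while the randomly initialized head trains quickly.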
### Training Highlights

- **Differential Learning Rates**: encoder (1e-5) vs. classification head (5e-4)
- **Mixed Precision**: FP16 training for efficiency
- **Data Augmentation**: random scaling, horizontal flip, color jitter
- **Threshold Calibration**: per-class threshold optimization on the validation set
- **Early Stopping**: patience of 3 epochs based on ROC-AUC macro
- **Best Checkpoint**: selected by validation ROC-AUC macro score

### Computational Resources

- **Training Duration**: 6 epochs
- **Best Checkpoint**: step 33,708
- **Hardware**: GPU with FP16 support

---

## πŸš€ How to Use

### Quick Start (Recommended)

```python
import torch
from PIL import Image
from transformers import AutoModel, CLIPProcessor

# Load the model (custom architecture requires trust_remote_code=True)
model = AutoModel.from_pretrained(
    "Amirhossein75/clip-vit-base-mmhs150k-fusion",
    trust_remote_code=True,
)
model.eval()

# Load the CLIP processor
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Prepare inputs
image = Image.open("path/to/image.jpg").convert("RGB")
text = "sample text from the meme"

inputs = processor(
    text=[text],
    images=[image],
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=77,
)

# Run inference
with torch.no_grad():
    result = model.predict(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        pixel_values=inputs["pixel_values"],
    )

print(result)
# {'predictions': {'racist': False, 'sexist': True, 'homophobe': False, 'religion': False, 'otherhate': False},
#  'probabilities': {'racist': 0.12, 'sexist': 0.78, 'homophobe': 0.05, 'religion': 0.02, 'otherhate': 0.15}}
```

### Batch Inference

```python
import torch
from PIL import Image
from transformers import AutoModel, CLIPProcessor

# Load model
model = AutoModel.from_pretrained(
    "Amirhossein75/clip-vit-base-mmhs150k-fusion",
    trust_remote_code=True,
)
model.eval()

processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Prepare a batch
images = [
    Image.open("image1.jpg").convert("RGB"),
    Image.open("image2.jpg").convert("RGB"),
]
texts = ["text for image 1", "text for image 2"]

inputs = processor(
    text=texts,
    images=images,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=77,
)

# Get raw logits and probabilities
with torch.no_grad():
    outputs = model(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        pixel_values=inputs["pixel_values"],
    )
probabilities = torch.sigmoid(outputs["logits"])

# Apply the optimized per-class thresholds
thresholds = torch.tensor([0.35, 0.70, 0.75, 0.30, 0.60])
predictions = (probabilities > thresholds).int()
print(predictions)
```

### Using with GPU

```python
import torch
from transformers import AutoModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

model = AutoModel.from_pretrained(
    "Amirhossein75/clip-vit-base-mmhs150k-fusion",
    trust_remote_code=True,
).to(device)
model.eval()

processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# ... prepare inputs ...
inputs = {k: v.to(device) for k, v in inputs.items()}

with torch.no_grad():
    result = model.predict(**inputs)
```

### Download Config Files Only

```python
from huggingface_hub import hf_hub_download
import json

# Download the inference config
config_path = hf_hub_download(
    repo_id="Amirhossein75/clip-vit-base-mmhs150k-fusion",
    filename="inference_config.json",
)

with open(config_path) as f:
    config = json.load(f)

print(f"Classes: {config['class_names']}")
print(f"Thresholds: {config['thresholds']}")
print(f"Encoder: {config['encoder_name']}")
```

---

## πŸ“ Model Files

| File | Description |
|------|-------------|
| `checkpoint-33708/model.safetensors` | Model weights in safetensors format (617 MB) |
| `modeling_clip_fusion.py` | Model architecture code (auto-downloaded with `trust_remote_code`) |
| `config.json` | Model architecture configuration |
| `inference_config.json` | Inference settings with thresholds and class names |
| `label_map.json` | Label name mapping |
| `test_metrics.json` | Test set evaluation metrics |
| `val_report.json` | Detailed validation classification report |

---

## ☁️ AWS SageMaker Deployment

This model is compatible with AWS SageMaker for cloud deployment:

```python
import base64

from sagemaker.pytorch import PyTorchModel
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

model = PyTorchModel(
    model_data="s3://your-bucket/model.tar.gz",
    role=role,
    entry_point="inference.py",
    source_dir="sagemaker",
    framework_version="2.1.0",
    py_version="py310",
)

predictor = model.deploy(
    instance_type="ml.g4dn.xlarge",
    initial_instance_count=1,
)

# Make a prediction
with open("image.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = predictor.predict({
    "instances": [{
        "text": "Sample text content",
        "image_base64": image_b64,
    }]
})
```

See the [SageMaker documentation](https://github.com/amirhossein-yousefi/multimodal-content-moderation/tree/main/sagemaker) for the full deployment
guide.

---

## ⚠️ Intended Uses & Limitations

### βœ… Intended Uses

- **Content moderation** for social media platforms
- **Detecting hateful memes** and posts
- **Research** in multi-modal hate speech detection
- **Building content safety systems**
- **Pre-filtering** potentially harmful content for human review

### ⚠️ Limitations

| Limitation | Description |
|------------|-------------|
| **Language** | Trained only on English content |
| **Domain** | Twitter-specific; may not generalize to other platforms |
| **Class Imbalance** | Lower performance on rare categories (Religion: F1 = 0.283) |
| **Cultural Context** | May miss culturally specific hate speech |
| **Sarcasm/Irony** | May struggle with subtle or ironic hateful content |
| **Image-only Hate** | The text encoder carries much of the signal; purely visual hate may be missed |

### ❌ Out-of-Scope Uses

- **NOT** for making final moderation decisions without human review
- **NOT** suitable for legal or compliance purposes without additional validation
- **NOT** for censorship or suppression of legitimate speech
- **NOT** for targeting or profiling individuals

### πŸ›‘οΈ Ethical Considerations

- This model should be used as a **tool to assist** human moderators, not to replace them
- False positives may incorrectly flag legitimate content
- False negatives may miss harmful content
- Regular evaluation and bias auditing are recommended
- Consider cultural and contextual factors in deployment

---

## πŸ“ Citation

If you use this model, please cite:

```bibtex
@misc{yousefi2024multimodal,
  title={Multi-Modal Hateful Content Classification with CLIP Fusion},
  author={Yousefi, Amirhossein},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/Amirhossein75/clip-vit-base-mmhs150k-fusion}
}
```

### Dataset Citation

```bibtex
@inproceedings{gomez2020exploring,
  title={Exploring Hate Speech Detection in Multimodal Publications},
  author={Gomez, Raul and Gibert, Jaume and Gomez, Lluis and Karatzas, Dimosthenis},
  booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
  pages={1470--1478},
  year={2020}
}
```

### CLIP Citation

```bibtex
@inproceedings{radford2021learning,
  title={Learning Transferable Visual Models From Natural Language Supervision},
  author={Radford, Alec and Kim, Jong Wook and Hallacy, Chris and others},
  booktitle={International Conference on Machine Learning},
  pages={8748--8763},
  year={2021}
}
```

---

## πŸ”— Links

| Resource | Link |
|----------|------|
| **GitHub Repository** | [multimodal-content-moderation](https://github.com/amirhossein-yousefi/multimodal-content-moderation) |
| **Base Model** | [openai/clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32) |
| **MMHS150K Dataset** | [Official Page](https://gombru.github.io/2019/10/09/MMHS/) |
| **CLIP Paper** | [arXiv](https://arxiv.org/abs/2103.00020) |

---

## πŸ“„ License

This project is licensed under the **MIT License**; see the [LICENSE](https://github.com/amirhossein-yousefi/multimodal-content-moderation/blob/main/LICENSE) file for details.

---

## 🀝 Contributing

Contributions are welcome! Please see the [GitHub repository](https://github.com/amirhossein-yousefi/multimodal-content-moderation) for contribution guidelines.