---
license: mit
tags:
- multimodal
- hate-speech-detection
- content-moderation
- clip
- vision-language
- image-text
- classification
- pytorch
- transformers
datasets:
- mmhs150k
language:
- en
metrics:
- f1
- roc_auc
library_name: transformers
pipeline_tag: image-text-to-text
base_model:
- openai/clip-vit-base-patch32
---

# CLIP-ViT-Base Fusion Model for Multi-Modal Hate Speech Detection

A PyTorch-based multi-modal (image + text) hateful-content classification model that uses a CLIP encoder with a late-fusion architecture, trained on the MMHS150K dataset.

## Model Description

This model implements a **late-fusion architecture with a gated attention mechanism** for detecting hateful content in social media memes and posts. It combines visual and textual features using OpenAI's CLIP (ViT-Base-Patch32) as the backbone encoder.

### Architecture

```
┌─────────────┐    ┌─────────────┐
│    Image    │    │    Text     │
│   Encoder   │    │   Encoder   │
│ (CLIP ViT)  │    │ (CLIP Text) │
└──────┬──────┘    └──────┬──────┘
       │                  │
       ▼                  ▼
┌─────────────┐    ┌─────────────┐
│ Projection  │    │ Projection  │
└──────┬──────┘    └──────┬──────┘
       │                  │
       └────────┬─────────┘
                │
                ▼
        ┌──────────────┐
        │ Gated Fusion │◄── Modality presence flags
        └──────┬───────┘
               │
               ▼
   ┌───────────────────────┐
   │  Interaction Features │
   │ [fused, t, v, |t-v|,  │
   │        t*v]           │
   └───────────┬───────────┘
               │
               ▼
       ┌──────────────┐
       │Classification│
       │  Head (MLP)  │
       └──────────────┘
```

### Key Features

- **Backbone**: `openai/clip-vit-base-patch32`
- **Fusion Dimension**: 512
- **Multi-label Classification**: 5 hate-speech categories
- **Gated Attention**: Modality-aware fusion with learnable gates
- **Interaction Features**: Rich feature interactions, including the element-wise product and absolute difference of the text and image embeddings

## Intended Uses & Limitations

### Intended Uses

- Content moderation for social media platforms
- Detecting hateful memes and posts
- Research in multi-modal hate speech detection
- Building content-safety systems

### Limitations

- Trained only on English content from Twitter (MMHS150K dataset)
- May not generalize well to other languages or platforms
- Performance varies across hate categories (see metrics below)
- Should be used as part of a larger content-moderation pipeline, not as a standalone solution

### Out-of-Scope Uses

- This model should not be used to make final decisions without human review
- Not suitable for legal or compliance purposes without additional validation

## Training Data

### MMHS150K Dataset

The model was trained on **MMHS150K** (Multi-Modal Hate Speech), a large-scale multi-modal hate speech dataset collected from Twitter, containing 150,000 tweet-image pairs annotated for hate speech detection.

**Paper**: ["Exploring Hate Speech Detection in Multimodal Publications"](https://gombru.github.io/2019/10/09/MMHS/) (WACV 2020)

**Authors**: Raul Gomez, Jaume Gibert, Lluis Gomez, Dimosthenis Karatzas

#### Dataset Statistics

| Split | Samples |
|-------|---------|
| Train | ~112,500 |
| Validation | ~15,000 |
| Test | ~22,500 |

#### Label Categories

| Label ID | Category | Description |
|----------|----------|-------------|
| 0 | Racist | Racist content |
| 1 | Sexist | Sexist content |
| 2 | Homophobe | Homophobic content |
| 3 | Religion | Religion-based hate |
| 4 | OtherHate | Other types of hate speech |

## Training Procedure

### Training Hyperparameters

- **Epochs**: 6
- **Batch Size**: 32
- **Encoder Learning Rate**: 1e-5
- **Classification Head Learning Rate**: 5e-4
- **Weight Decay**: 0.02
- **Warmup Ratio**: 0.05
- **Loss Function**: Binary Cross-Entropy (BCE)
- **Optimizer**: AdamW

### Hardware

- Trained on GPU with mixed precision (FP16)

## Evaluation Results

### Test Set Performance

| Metric | Value |
|--------|-------|
| **F1 Macro** | 0.566 |
| **F1 Micro** | 0.635 |
| **ROC-AUC Macro** | 0.783 |
| **Throughput** | 381.5 samples/sec |

### Per-Class Performance (Validation Set)

| Class | Precision | Recall | F1-Score | Support |
|-------|-----------|--------|----------|---------|
| Racist | 0.576 | 0.843 | 0.684 | 1,994 |
| Sexist | 0.587 | 0.646 | 0.615 | 875 |
| Homophobe | 0.804 | 0.709 | 0.753 | 612 |
| Religion | 0.435 | 0.209 | 0.283 | 129 |
| OtherHate | 0.541 | 0.700 | 0.611 | 1,195 |

### Optimized Thresholds

The model uses per-class decision thresholds, tuned for best performance:

| Class | Threshold |
|-------|-----------|
| Racist | 0.35 |
| Sexist | 0.70 |
| Homophobe | 0.75 |
| Religion | 0.30 |
| OtherHate | 0.60 |

## How to Use

### Installation

```bash
# Clone the training repository
git clone https://github.com/amirhossein-yousefi/multimodal-content-moderation.git
cd multimodal-content-moderation

# Install dependencies
pip install -r requirements.txt
pip install -e .
```

### Inference

```python
from src.models import MultiModalFusionClassifier
from src.utils import load_model_for_inference

# Load the trained model
model, processor, tokenizer = load_model_for_inference(
    checkpoint_path="path/to/checkpoint-33708"
)

# Predict
prediction = model.predict(image_path="path/to/image.jpg", text="sample text")
```

### Model Files

- `model.safetensors` - Model weights in safetensors format
- `inference_config.json` - Inference configuration with thresholds and class names
- `label_map.json` - Label mapping
- `test_metrics.json` - Test set evaluation metrics
- `val_report.json` - Detailed validation classification report

## Model Comparison

| Model | F1 Macro | F1 Micro | ROC-AUC Macro | Throughput (samples/sec) |
|-------|----------|----------|---------------|--------------------------|
| **CLIP Fusion** (this model) | 0.566 | 0.635 | 0.783 | 381.5 |
| CLIP MTL | 0.569 | 0.644 | 0.783 | 390.9 |
| SigLIP Fusion | 0.507 | 0.610 | 0.774 | 236.3 |

## Citation

If you use this model, please cite:

```bibtex
@misc{multimodal_hate_detection,
  title={Multi-Modal Hateful Content Classification},
  author={Amirhossein Yousefi},
  year={2024},
  url={https://github.com/amirhossein-yousefi/multimodal-content-moderation}
}
```

### Dataset Citation

```bibtex
@inproceedings{gomez2020exploring,
  title={Exploring hate speech detection in multimodal publications},
  author={Gomez, Raul and Gibert, Jaume and Gomez, Lluis and Karatzas, Dimosthenis},
  booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision},
  pages={1470--1478},
  year={2020}
}
```

## License

This project is licensed under the MIT License.

## Links

- **GitHub Repository**: [multimodal-content-moderation](https://github.com/amirhossein-yousefi/multimodal-content-moderation)
- **Base Model**: [openai/clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32)
- **MMHS150K Dataset**: [Official Page](https://gombru.github.io/2019/10/09/MMHS/)
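
## Appendix: Gated Fusion Sketch

The gated fusion step and interaction features described in the Architecture section can be sketched as below. This is a minimal illustration, not the repository's actual implementation: the module name `GatedFusionHead`, the gate parameterization, and the MLP sizes are assumptions; only the fusion dimension (512), the number of labels (5), and the interaction-feature layout `[fused, t, v, |t-v|, t*v]` come from the model card.

```python
import torch
import torch.nn as nn


class GatedFusionHead(nn.Module):
    """Illustrative sketch of gated late fusion over projected CLIP
    text (t) and image (v) embeddings; names and sizes are assumptions."""

    def __init__(self, dim: int = 512, num_labels: int = 5):
        super().__init__()
        # Learnable gate conditioned on both embeddings plus two
        # modality-presence flags (text present, image present)
        self.gate = nn.Sequential(nn.Linear(2 * dim + 2, dim), nn.Sigmoid())
        # Classification head over [fused, t, v, |t - v|, t * v]
        self.classifier = nn.Sequential(
            nn.Linear(5 * dim, dim), nn.ReLU(), nn.Linear(dim, num_labels)
        )

    def forward(self, t, v, presence):
        # presence: (batch, 2) flags marking which modalities are available
        g = self.gate(torch.cat([t, v, presence], dim=-1))
        fused = g * t + (1 - g) * v          # gated convex combination
        feats = torch.cat([fused, t, v, (t - v).abs(), t * v], dim=-1)
        return self.classifier(feats)        # raw logits, one per category


head = GatedFusionHead()
t = torch.randn(4, 512)        # projected CLIP text embeddings
v = torch.randn(4, 512)        # projected CLIP image embeddings
presence = torch.ones(4, 2)    # both modalities present
logits = head(t, v, presence)
print(logits.shape)            # torch.Size([4, 5])
```

The multi-label setup pairs naturally with the BCE loss listed under Training Hyperparameters: each of the 5 logits is passed through a sigmoid independently rather than through a softmax.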
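
## Appendix: Applying the Per-Class Thresholds

Because the optimized thresholds differ per class (0.30 for Religion vs. 0.75 for Homophobe), a uniform 0.5 cutoff would not reproduce the reported metrics. A minimal sketch of turning sigmoid probabilities into label predictions, using the threshold values from the table above; the function name and the dict-based interface are illustrative, not the repository's API (in practice the thresholds would be read from `inference_config.json`):

```python
# Per-class thresholds from the "Optimized Thresholds" table
THRESHOLDS = {
    "Racist": 0.35,
    "Sexist": 0.70,
    "Homophobe": 0.75,
    "Religion": 0.30,
    "OtherHate": 0.60,
}


def apply_thresholds(probs: dict[str, float]) -> list[str]:
    """Return the labels whose sigmoid probability meets its tuned threshold."""
    return [label for label, p in probs.items() if p >= THRESHOLDS[label]]


# Example: a flat 0.5 cutoff would miss Racist (0.40) and wrongly
# flag Sexist (0.55); the tuned thresholds handle both correctly.
probs = {
    "Racist": 0.40,
    "Sexist": 0.55,
    "Homophobe": 0.10,
    "Religion": 0.05,
    "OtherHate": 0.65,
}
print(apply_thresholds(probs))  # ['Racist', 'OtherHate']
```

Since this is a multi-label model, any subset of the five categories (including the empty set) is a valid prediction.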