---
license: mit
tags:
- multimodal
- hate-speech-detection
- content-moderation
- clip
- vision-language
- image-text
- classification
- pytorch
- transformers
datasets:
- mmhs150k
language:
- en
metrics:
- f1
- roc_auc
library_name: transformers
pipeline_tag: image-text-to-text
base_model:
- openai/clip-vit-base-patch32
---

# CLIP-ViT-Base Fusion Model for Multi-Modal Hate Speech Detection

A PyTorch-based multi-modal (image + text) hateful-content classification model that uses a CLIP encoder with a late-fusion architecture, trained on the MMHS150K dataset.

## Model Description

This model implements a **late-fusion architecture with a gated attention mechanism** for detecting hateful content in social media memes and posts. It combines visual and textual features using OpenAI's CLIP (ViT-Base-Patch32) as the backbone encoder.

### Architecture

```
┌─────────────┐    ┌─────────────┐
│    Image    │    │    Text     │
│   Encoder   │    │   Encoder   │
│ (CLIP ViT)  │    │ (CLIP Text) │
└──────┬──────┘    └──────┬──────┘
       │                  │
       ▼                  ▼
┌─────────────┐    ┌─────────────┐
│ Projection  │    │ Projection  │
└──────┬──────┘    └──────┬──────┘
       │                  │
       └────────┬─────────┘
                │
                ▼
        ┌──────────────┐
        │ Gated Fusion │◄── Modality presence flags
        └──────┬───────┘
               │
               ▼
   ┌───────────────────────┐
   │  Interaction Features │
   │ [fused, t, v, |t-v|,  │
   │        t*v]           │
   └───────────┬───────────┘
               │
               ▼
       ┌──────────────┐
       │Classification│
       │  Head (MLP)  │
       └──────────────┘
```

### Key Features

- **Backbone**: `openai/clip-vit-base-patch32`
- **Fusion Dimension**: 512
- **Multi-label Classification**: 5 hate-speech categories
- **Gated Attention**: Modality-aware fusion with learnable gates
- **Interaction Features**: Rich feature interactions, including the element-wise product and absolute difference of the text and image embeddings

## Intended Uses & Limitations

### Intended Uses

- Content moderation for social media platforms
- Detecting hateful memes and posts
- Research in multi-modal hate speech detection
- Building content-safety systems

### Limitations

- Trained only on English content from Twitter (MMHS150K dataset)
- May not generalize well to other languages or platforms
- Performance varies across hate categories (see metrics below)
- Should be used as part of a larger content-moderation pipeline, not as a standalone solution

### Out-of-Scope Uses

- This model should not be used to make final decisions without human review
- Not suitable for legal or compliance purposes without additional validation

## Training Data

### MMHS150K Dataset

The model was trained on **MMHS150K** (Multi-Modal Hate Speech), a large-scale multi-modal hate speech dataset collected from Twitter, containing 150,000 tweet-image pairs annotated for hate speech detection.

**Paper**: ["Exploring Hate Speech Detection in Multimodal Publications"](https://gombru.github.io/2019/10/09/MMHS/) (WACV 2020)

**Authors**: Raul Gomez, Jaume Gibert, Lluis Gomez, Dimosthenis Karatzas

#### Dataset Statistics

| Split | Samples |
|-------|---------|
| Train | ~112,500 |
| Validation | ~15,000 |
| Test | ~22,500 |

#### Label Categories

| Label ID | Category | Description |
|----------|----------|-------------|
| 0 | Racist | Racist content |
| 1 | Sexist | Sexist content |
| 2 | Homophobe | Homophobic content |
| 3 | Religion | Religion-based hate |
| 4 | OtherHate | Other types of hate speech |

## Training Procedure

### Training Hyperparameters

- **Epochs**: 6
- **Batch Size**: 32
- **Encoder Learning Rate**: 1e-5
- **Classification Head Learning Rate**: 5e-4
- **Weight Decay**: 0.02
- **Warmup Ratio**: 0.05
- **Loss Function**: Binary Cross-Entropy (BCE)
- **Optimizer**: AdamW

### Hardware

- Trained on GPU with mixed precision (FP16)

## Evaluation Results

### Test Set Performance

| Metric | Value |
|--------|-------|
| **F1 Macro** | 0.566 |
| **F1 Micro** | 0.635 |
| **ROC-AUC Macro** | 0.783 |
| **Throughput** | 381.5 samples/sec |

### Per-Class Performance (Validation Set)

| Class | Precision | Recall | F1-Score | Support |
|-------|-----------|--------|----------|---------|
| Racist | 0.576 | 0.843 | 0.684 | 1,994 |
| Sexist | 0.587 | 0.646 | 0.615 | 875 |
| Homophobe | 0.804 | 0.709 | 0.753 | 612 |
| Religion | 0.435 | 0.209 | 0.283 | 129 |
| OtherHate | 0.541 | 0.700 | 0.611 | 1,195 |

### Optimized Thresholds

The model uses per-class decision thresholds, tuned for best performance:

| Class | Threshold |
|-------|-----------|
| Racist | 0.35 |
| Sexist | 0.70 |
| Homophobe | 0.75 |
| Religion | 0.30 |
| OtherHate | 0.60 |

## How to Use

### Installation

```bash
# Clone the training repository
git clone https://github.com/amirhossein-yousefi/multimodal-content-moderation.git
cd multimodal-content-moderation

# Install dependencies
pip install -r requirements.txt
pip install -e .
```

### Inference

```python
from src.models import MultiModalFusionClassifier
from src.utils import load_model_for_inference

# Load the trained model
model, processor, tokenizer = load_model_for_inference(
    checkpoint_path="path/to/checkpoint-33708"
)

# Predict
prediction = model.predict(image_path="path/to/image.jpg", text="sample text")
```

### Model Files

- `model.safetensors` - Model weights in safetensors format
- `inference_config.json` - Inference configuration with thresholds and class names
- `label_map.json` - Label mapping
- `test_metrics.json` - Test set evaluation metrics
- `val_report.json` - Detailed validation classification report

## Model Comparison

| Model | F1 Macro | F1 Micro | ROC-AUC Macro | Throughput (samples/sec) |
|-------|----------|----------|---------------|--------------------------|
| **CLIP Fusion** (this model) | 0.566 | 0.635 | 0.783 | 381.5 |
| CLIP MTL | 0.569 | 0.644 | 0.783 | 390.9 |
| SigLIP Fusion | 0.507 | 0.610 | 0.774 | 236.3 |

## Citation

If you use this model, please cite:

```bibtex
@misc{multimodal_hate_detection,
  title={Multi-Modal Hateful Content Classification},
  author={Amirhossein Yousefi},
  year={2024},
  url={https://github.com/amirhossein-yousefi/multimodal-content-moderation}
}
```

### Dataset Citation

```bibtex
@inproceedings{gomez2020exploring,
  title={Exploring hate speech detection in multimodal publications},
  author={Gomez, Raul and Gibert, Jaume and Gomez, Lluis and Karatzas, Dimosthenis},
  booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision},
  pages={1470--1478},
  year={2020}
}
```

## License

This project is licensed under the MIT License.

## Links

- **GitHub Repository**: [multimodal-content-moderation](https://github.com/amirhossein-yousefi/multimodal-content-moderation)
- **Base Model**: [openai/clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32)
- **MMHS150K Dataset**: [Official Page](https://gombru.github.io/2019/10/09/MMHS/)
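
## Appendix: Gated Fusion Sketch

The gated fusion step and interaction features described in the Architecture section can be sketched as below. This is a minimal illustration, not the repository's actual implementation: the module name `GatedFusionHead`, the gate parameterization, and the MLP sizes are assumptions; only the fusion dimension (512), the number of labels (5), and the interaction-feature layout `[fused, t, v, |t-v|, t*v]` come from the model card.

```python
import torch
import torch.nn as nn


class GatedFusionHead(nn.Module):
    """Illustrative sketch of gated late fusion over projected CLIP
    text (t) and image (v) embeddings; names and sizes are assumptions."""

    def __init__(self, dim: int = 512, num_labels: int = 5):
        super().__init__()
        # Learnable gate conditioned on both embeddings plus two
        # modality-presence flags (text present, image present)
        self.gate = nn.Sequential(nn.Linear(2 * dim + 2, dim), nn.Sigmoid())
        # Classification head over [fused, t, v, |t - v|, t * v]
        self.classifier = nn.Sequential(
            nn.Linear(5 * dim, dim), nn.ReLU(), nn.Linear(dim, num_labels)
        )

    def forward(self, t, v, presence):
        # presence: (batch, 2) flags marking which modalities are available
        g = self.gate(torch.cat([t, v, presence], dim=-1))
        fused = g * t + (1 - g) * v          # gated convex combination
        feats = torch.cat([fused, t, v, (t - v).abs(), t * v], dim=-1)
        return self.classifier(feats)        # raw logits, one per category


head = GatedFusionHead()
t = torch.randn(4, 512)        # projected CLIP text embeddings
v = torch.randn(4, 512)        # projected CLIP image embeddings
presence = torch.ones(4, 2)    # both modalities present
logits = head(t, v, presence)
print(logits.shape)            # torch.Size([4, 5])
```

The multi-label setup pairs naturally with the BCE loss listed under Training Hyperparameters: each of the 5 logits is passed through a sigmoid independently rather than through a softmax.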
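
## Appendix: Applying the Per-Class Thresholds

Because the optimized thresholds differ per class (0.30 for Religion vs. 0.75 for Homophobe), a uniform 0.5 cutoff would not reproduce the reported metrics. A minimal sketch of turning sigmoid probabilities into label predictions, using the threshold values from the table above; the function name and the dict-based interface are illustrative, not the repository's API (in practice the thresholds would be read from `inference_config.json`):

```python
# Per-class thresholds from the "Optimized Thresholds" table
THRESHOLDS = {
    "Racist": 0.35,
    "Sexist": 0.70,
    "Homophobe": 0.75,
    "Religion": 0.30,
    "OtherHate": 0.60,
}


def apply_thresholds(probs: dict[str, float]) -> list[str]:
    """Return the labels whose sigmoid probability meets its tuned threshold."""
    return [label for label, p in probs.items() if p >= THRESHOLDS[label]]


# Example: a flat 0.5 cutoff would miss Racist (0.40) and wrongly
# flag Sexist (0.55); the tuned thresholds handle both correctly.
probs = {
    "Racist": 0.40,
    "Sexist": 0.55,
    "Homophobe": 0.10,
    "Religion": 0.05,
    "OtherHate": 0.65,
}
print(apply_thresholds(probs))  # ['Racist', 'OtherHate']
```

Since this is a multi-label model, any subset of the five categories (including the empty set) is a valid prediction.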