Part of the MoM (Mixture of Models) family for vLLM Semantic Router.
This is the merged (ready-to-use) version of mmbert32k-modality-router-lora. LoRA weights have been merged into the mmbert-32k-yarn base model for easy deployment without the PEFT dependency.
A text classifier based on ModernBERT (307M params, 32K context, 1800+ languages) that determines the appropriate response modality for user prompts:
| Label | Description | Routed To | Example |
|---|---|---|---|
| AR | Text-only response | Autoregressive LLM (e.g., Llama, Qwen) | "What is the capital of France?" |
| DIFFUSION | Image generation | Diffusion model (e.g., Flux, SDXL) | "A cyberpunk city at night, neon lights" |
| BOTH | Text + image response | Both AR + Diffusion pipeline | "Explain photosynthesis and show a diagram" |
```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="llm-semantic-router/mmbert32k-modality-router-merged",
)

results = classifier([
    "What are the benefits of exercise?",
    "A serene Japanese garden with cherry blossoms, watercolor style",
    "Explain how neural networks work and generate a diagram",
])

for r in results:
    print(f"{r['label']}: {r['score']:.3f}")
# AR: 0.995
# DIFFUSION: 0.717
# BOTH: 0.978
```
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "llm-semantic-router/mmbert32k-modality-router-merged"
)
tokenizer = AutoTokenizer.from_pretrained(
    "llm-semantic-router/mmbert32k-modality-router-merged"
)

prompts = [
    "Summarize the key points of quantum computing",
    "portrait of a woman in renaissance style, oil painting, dramatic lighting",
    "Write a blog post about climate change and include relevant charts",
]

model.eval()
inputs = tokenizer(prompts, return_tensors="pt", truncation=True, padding=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)

predictions = torch.argmax(outputs.logits, dim=-1)
labels = model.config.id2label

for prompt, pred_id in zip(prompts, predictions):
    print(f"{labels[pred_id.item()]}: {prompt[:60]}...")
# AR: Summarize the key points of quantum computing...
# DIFFUSION: portrait of a woman in renaissance style, oil painting, d...
# BOTH: Write a blog post about climate change and include releva...
```
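`argmax` over the raw logits gives the predicted class but no confidence score. A softmax over the same logits yields per-class probabilities; here is a minimal sketch with illustrative logits (in practice these come from `outputs.logits`):

```python
import torch

# Illustrative logits for three prompts over 3 classes (AR, DIFFUSION, BOTH);
# real values come from outputs.logits in the snippet above.
logits = torch.tensor([
    [4.2, -1.0, -0.5],   # strongly AR
    [-0.8,  2.1,  0.3],  # DIFFUSION
    [-1.2,  0.4,  3.0],  # BOTH
])

probs = torch.softmax(logits, dim=-1)    # each row sums to 1
confidences, preds = probs.max(dim=-1)   # top-1 probability and class id

for p, c in zip(preds.tolist(), confidences.tolist()):
    print(p, round(c, 3))
```

The top-1 probability is what the `pipeline` API reports as `score`.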
```python
# Example: route requests to different model backends
def route_request(prompt: str, classifier) -> str:
    """Route a user prompt to the appropriate model backend."""
    result = classifier(prompt)[0]
    modality = result["label"]
    confidence = result["score"]

    if modality == "AR":
        return call_llm_backend(prompt)        # e.g., Llama, Qwen
    elif modality == "DIFFUSION":
        return call_diffusion_backend(prompt)  # e.g., Flux, SDXL
    else:  # BOTH
        text = call_llm_backend(prompt)
        image = call_diffusion_backend(prompt)
        return combine_response(text, image)
```
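The `score` field can also gate the routing decision: below a threshold, fall back to the text backend as the safest default. A minimal sketch with a stubbed classifier; the threshold value and the stub are illustrative, not part of the model:

```python
CONFIDENCE_THRESHOLD = 0.6  # illustrative value; tune on held-out traffic

def route_with_fallback(prompt: str, classifier, threshold: float = CONFIDENCE_THRESHOLD) -> str:
    """Return the modality label, falling back to text-only when unsure."""
    result = classifier(prompt)[0]
    if result["score"] < threshold:
        return "AR"  # low confidence: default to the text backend
    return result["label"]

# Stub standing in for the transformers pipeline above.
def stub_classifier(prompt: str):
    table = {
        "draw a cat": [{"label": "DIFFUSION", "score": 0.95}],
        "hmm":        [{"label": "BOTH", "score": 0.40}],
    }
    return table[prompt]

print(route_with_fallback("draw a cat", stub_classifier))  # DIFFUSION
print(route_with_fallback("hmm", stub_classifier))         # AR (below threshold)
```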
The base model (mmbert-32k-yarn) supports ONNX export for sub-5ms inference on AMD MI300X GPUs.
| Property | Value |
|---|---|
| Base model | llm-semantic-router/mmbert-32k-yarn (307M params) |
| Architecture | ModernBERT + YaRN RoPE scaling |
| Context length | 32,768 tokens |
| Languages | 1800+ (Gemma 2 tokenizer, 256K vocab) |
| Fine-tuning | LoRA (rank=16, alpha=32) merged into base weights |
| Classes | 3 (AR, DIFFUSION, BOTH) |
| Model size | ~1.23 GB (safetensors) |
| Parameter | Value |
|---|---|
| Epochs | 10 |
| Batch size | 32 |
| Learning rate | 2e-5 |
| Weight decay | 0.15 (adaptive) |
| Loss function | Focal Loss (gamma=2.0) |
| Class weighting | Inverse-frequency (sqrt-dampened) |
| Minority oversampling | Yes |
| LoRA target modules | attn.Wqkv, attn.Wo, mlp.Wi, mlp.Wo |
| Hardware | AMD Instinct MI300X (192GB VRAM) |
| Training time | ~2 minutes |
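The loss setup in the table (focal loss with gamma=2.0 plus sqrt-dampened inverse-frequency class weights) can be sketched as follows. This is an illustrative reimplementation with hypothetical class counts, not the actual training code:

```python
import torch
import torch.nn.functional as F

def class_weights(counts):
    """Inverse-frequency weights, sqrt-dampened and normalized to mean 1."""
    w = 1.0 / torch.sqrt(torch.tensor(counts, dtype=torch.float))
    return w * len(counts) / w.sum()

def focal_loss(logits, targets, weights, gamma=2.0):
    """Focal loss: down-weights easy examples by (1 - p_t)^gamma."""
    log_probs = F.log_softmax(logits, dim=-1)
    ce = F.nll_loss(log_probs, targets, weight=weights, reduction="none")
    pt = log_probs.exp().gather(1, targets.unsqueeze(1)).squeeze(1)
    return ((1.0 - pt) ** gamma * ce).mean()

w = class_weights([5000, 3000, 1000])  # hypothetical per-class example counts
logits = torch.tensor([[2.0, 0.1, -1.0], [0.2, 0.1, 0.3]])
targets = torch.tensor([0, 2])
print(focal_loss(logits, targets, w))
```

The sqrt dampening keeps the minority-class weight from dominating the loss the way raw inverse frequency would.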
Trained on a curated combination of 10 public datasets covering diverse prompt styles.
| Metric | Value |
|---|---|
| Accuracy | 0.9686 |
| F1 (weighted) | 0.9686 |
| Eval Loss | 0.0435 |
| Class | Precision | Recall | F1-Score |
|---|---|---|---|
| AR | 0.956 | 0.967 | 0.962 |
| DIFFUSION | 0.974 | 0.979 | 0.977 |
| BOTH | 0.983 | 0.951 | 0.967 |
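The F1 column follows from precision and recall via F1 = 2PR / (P + R); recomputing from the (rounded) table values agrees with the reported scores to within rounding:

```python
# (precision, recall, reported F1) per class, from the table above
rows = {
    "AR":        (0.956, 0.967, 0.962),
    "DIFFUSION": (0.974, 0.979, 0.977),
    "BOTH":      (0.983, 0.951, 0.967),
}

for name, (p, r, f1_reported) in rows.items():
    f1 = 2 * p * r / (p + r)  # harmonic mean of precision and recall
    print(f"{name}: recomputed F1 = {f1:.3f}")
```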
| Prompt | Predicted | Confidence |
|---|---|---|
| "What is the capital of France?" | AR | 0.995 |
| "A serene Japanese garden with cherry blossoms, watercolor style" | DIFFUSION | 0.717 |
| "Explain how neural networks work and generate a diagram" | BOTH | 0.978 |
| "Write me a poem about autumn" | AR | 0.864 |
| "cyberpunk cityscape, 4k, artstation, trending" | DIFFUSION | 0.971 |
| "Create a travel guide for Tokyo with photos of each location" | BOTH | 0.935 |
This model is designed for routing LLM requests in multi-model serving systems. Related models in the MoM family:
| Model | Description |
|---|---|
| mmbert32k-modality-router-lora | LoRA adapter version (for further fine-tuning) |
| mmbert-32k-yarn | Base model (307M, 32K context, 1800+ languages) |
| mmbert32k-intent-classifier-merged | Intent classifier (MoM family) |
| mmbert32k-jailbreak-detector-merged | Jailbreak detector (MoM family) |
| mmbert32k-pii-detector-merged | PII detector (MoM family) |
```bibtex
@misc{modality-router-2025,
  title={Modality Router: Smart Output Modality Selection for Multi-Model Serving},
  author={vLLM Semantic Router Team},
  year={2025},
  url={https://huggingface.co/llm-semantic-router/mmbert32k-modality-router-merged}
}
```
Upstream base model: jhu-clsp/mmBERT-base