mmBERT-32K Jailbreak Detector (LoRA)

LoRA adapter for jailbreak/prompt injection detection based on mmBERT-32K-YaRN.

Model Details

  • Base Model: llm-semantic-router/mmbert-32k-yarn
  • LoRA Rank: 48
  • LoRA Alpha: 96
  • Training: 8 epochs with heavy short-pattern augmentation

Performance

  • Validation Accuracy: 98.16%
  • F1 Score: 98.15%
  • Precision: 98.36%
  • Recall: 97.95%

Key Improvements

This model includes heavy oversampling of short jailbreak patterns to improve generalization:

  • Detects short patterns like "DAN", "jailbreak", "Developer mode" with 100% confidence
  • Properly handles both short and long jailbreak attempts

Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import PeftModel

base_model = "llm-semantic-router/mmbert-32k-yarn"
lora_path = "llm-semantic-router/mmbert32k-jailbreak-detector-lora"

tokenizer = AutoTokenizer.from_pretrained(lora_path)
base = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=2)
model = PeftModel.from_pretrained(base, lora_path)
Downloads last month
86
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for llm-semantic-router/mmbert32k-jailbreak-detector-lora

Adapter
(5)
this model

Datasets used to train llm-semantic-router/mmbert32k-jailbreak-detector-lora

Collection including llm-semantic-router/mmbert32k-jailbreak-detector-lora