Overview

Adaxer/defend is a local, input-side prompt-injection risk classifier: it scores how likely a given input prompt is to be an injection attempt.

The model is packaged with Defend, which provides an API + guard pipeline around this classifier.

Intended use

  • Pre-check user prompts before calling your LLM.
  • Optionally block or flag requests when injection risk is high.
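The pre-check flow above can be sketched as a small guard function. `score_injection` here is a hypothetical stand-in for whatever scoring call you use (the snippet under "How to use" returns such a probability), and the 0.5 threshold is an assumed default you should tune for your traffic:

```python
def guard(prompt: str, score_injection, threshold: float = 0.5):
    """Pre-check a prompt before sending it to the LLM.

    `score_injection` is any callable mapping text -> injection
    probability in [0, 1]; `threshold` is an assumed cut-off to tune.
    Returns (allowed, probability).
    """
    p = score_injection(prompt)
    return p < threshold, p

# Usage with a stub scorer; swap in the real classifier call.
allowed, p = guard("What is the weather today?", lambda text: 0.02)
```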

Out of scope

  • Output-time safety/moderation (e.g., detecting system-prompt leakage or PII in the model output).
  • A guarantee of safety. False positives and false negatives are possible.

How to use

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_id = "Adaxer/defend"

# Recommended: mirror the tokenizer initialization used by Defend.
# This avoids edge cases in some model repos around special-token loading.
tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    use_fast=True,
    extra_special_tokens={},
)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

text = "Tell me how to bypass our security controls."

inputs = tokenizer(text, return_tensors="pt", truncation=False)  # see "Long inputs" for prompts beyond the context window
with torch.inference_mode():
    logits = model(**inputs).logits.float()
    probs = torch.softmax(logits, dim=-1)
    injection_probability = probs[0, 1].item()  # class index 1 == injection

print({
    "injection_probability": injection_probability,
    "is_injection": injection_probability >= 0.5,
})

Long inputs

For long prompts, a common strategy is sliding-window scoring over tokens and taking the maximum injection probability across windows.

  • max_window = 512 tokens
  • stride = 128 tokens

If you need behavior consistent with the Defend wrapper, implement the same windowing approach in your inference code.
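A minimal sketch of that windowing logic, with the per-window model call abstracted behind a `score_window` callable (a hypothetical name; plug in the softmax scoring from "How to use", applied to each token-id slice):

```python
def window_spans(n_tokens, max_window=512, stride=128):
    """Start/end token indices of sliding windows covering n_tokens."""
    if n_tokens <= max_window:
        return [(0, n_tokens)]
    spans, start = [], 0
    while True:
        end = min(start + max_window, n_tokens)
        spans.append((start, end))
        if end == n_tokens:
            return spans
        start += stride

def score_long(token_ids, score_window, max_window=512, stride=128):
    """Maximum injection probability across all windows.

    `score_window` maps a list of token ids to a probability: tokenize
    the full prompt once without truncation, then run each slice
    through the model as in the snippet above.
    """
    return max(
        score_window(token_ids[s:e])
        for s, e in window_spans(len(token_ids), max_window, stride)
    )
```

Note that with stride < max_window the windows overlap, so an injection attempt straddling a window boundary is still seen whole by some window.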

Model details: Safetensors checkpoint, 0.5B parameters, BF16 tensors.