Qwen2.5-7B PII Detection — Merged Model
Standalone merged model: no PEFT library required. Load and run directly with transformers.
For the lightweight adapter version (160MB vs 15GB): vineeth453/qwen25-7b-pii-detection-lora
LoRA adapter weights merged into Qwen/Qwen2.5-7B-Instruct after fine-tuning on ai4privacy/pii-masking-200k. Extracts 56 types of personally identifiable information across 4 languages and returns structured JSON output.
Built as the PII Detection component of a Phase 1 Input Guardrail gateway for an enterprise LLM security system.
Evaluation Results
Evaluated on 10,464 held-out samples (5% split from ai4privacy/pii-masking-200k).
| Metric | Score |
|---|---|
| Micro F1 | 0.967 |
| Macro F1 | 0.961 |
| Micro Precision | 0.967 |
| Micro Recall | 0.968 |
| Malformed JSON outputs | 0 / 500 (0.0%) |
| Val Loss (final) | 0.0033 |
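The card does not include the evaluation script, so the following is only a minimal sketch of how micro and macro F1 are commonly computed for this kind of output. It assumes exact matching on `(text, label)` pairs per sample; the function names `prf` and `micro_macro_f1` are hypothetical and the author's matching rules may differ.

```python
from collections import Counter


def prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 from raw counts (0.0 when undefined)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


def micro_macro_f1(gold, pred):
    """gold, pred: one list per sample, each holding (text, label) tuples."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        g_set, p_set = set(g), set(p)
        for _, label in g_set & p_set:   # exact text + label match
            tp[label] += 1
        for _, label in p_set - g_set:   # predicted but not in gold
            fp[label] += 1
        for _, label in g_set - p_set:   # in gold but missed or mislabeled
            fn[label] += 1
    labels = set(tp) | set(fp) | set(fn)
    # Micro: pool counts over all labels, then compute F1 once.
    micro = prf(sum(tp.values()), sum(fp.values()), sum(fn.values()))[2]
    # Macro: compute F1 per label, then take the unweighted mean.
    macro = sum(prf(tp[l], fp[l], fn[l])[2] for l in labels) / len(labels) if labels else 0.0
    return micro, macro
```

Under this scheme a correctly located address tagged `IPV4` instead of `IP` counts as one false positive for `IPV4` and one false negative for `IP`, which is why label confusion lowers both rows in the per-entity table.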
Per-Entity F1 Scores
| Label | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| ACCOUNTNAME | 1.000 | 1.000 | 1.000 | 28 |
| ACCOUNTNUMBER | 1.000 | 1.000 | 1.000 | 41 |
| AGE | 1.000 | 1.000 | 1.000 | 27 |
| AMOUNT | 1.000 | 1.000 | 1.000 | 36 |
| BIC | 1.000 | 1.000 | 1.000 | 7 |
| BITCOINADDRESS | 0.923 | 1.000 | 0.960 | 24 |
| BUILDINGNUMBER | 0.968 | 0.968 | 0.968 | 31 |
| CITY | 1.000 | 0.963 | 0.981 | 27 |
| COMPANYNAME | 1.000 | 1.000 | 1.000 | 35 |
| COUNTY | 1.000 | 1.000 | 1.000 | 29 |
| CREDITCARDCVV | 1.000 | 1.000 | 1.000 | 10 |
| CREDITCARDISSUER | 1.000 | 1.000 | 1.000 | 16 |
| CREDITCARDNUMBER | 0.829 | 0.935 | 0.879 | 31 |
| CURRENCY | 0.909 | 0.870 | 0.889 | 23 |
| CURRENCYCODE | 1.000 | 1.000 | 1.000 | 8 |
| CURRENCYNAME | 0.667 | 0.750 | 0.706 | 8 |
| CURRENCYSYMBOL | 1.000 | 1.000 | 1.000 | 20 |
| DATE | 0.884 | 0.974 | 0.927 | 39 |
| DOB | 0.955 | 0.808 | 0.875 | 26 |
| EMAIL | 1.000 | 1.000 | 1.000 | 42 |
| ETHEREUMADDRESS | 1.000 | 1.000 | 1.000 | 11 |
| EYECOLOR | 1.000 | 1.000 | 1.000 | 10 |
| FIRSTNAME | 0.994 | 0.994 | 0.994 | 158 |
| GENDER | 1.000 | 1.000 | 1.000 | 35 |
| HEIGHT | 1.000 | 1.000 | 1.000 | 7 |
| IBAN | 1.000 | 1.000 | 1.000 | 29 |
| IP | 0.727 | 0.267 | 0.390 | 30 |
| IPV4 | 0.732 | 0.909 | 0.811 | 33 |
| IPV6 | 0.711 | 1.000 | 0.831 | 27 |
| JOBAREA | 1.000 | 1.000 | 1.000 | 40 |
| JOBTITLE | 1.000 | 1.000 | 1.000 | 37 |
| JOBTYPE | 1.000 | 1.000 | 1.000 | 31 |
| LASTNAME | 1.000 | 1.000 | 1.000 | 47 |
| LITECOINADDRESS | 1.000 | 0.714 | 0.833 | 7 |
| MAC | 1.000 | 1.000 | 1.000 | 12 |
| MASKEDNUMBER | 0.923 | 0.800 | 0.857 | 30 |
| MIDDLENAME | 0.944 | 1.000 | 0.971 | 34 |
| NEARBYGPSCOORDINATE | 1.000 | 1.000 | 1.000 | 17 |
| ORDINALDIRECTION | 1.000 | 1.000 | 1.000 | 17 |
| PASSWORD | 1.000 | 1.000 | 1.000 | 31 |
| PHONEIMEI | 1.000 | 1.000 | 1.000 | 19 |
| PHONENUMBER | 1.000 | 1.000 | 1.000 | 21 |
| PIN | 1.000 | 1.000 | 1.000 | 6 |
| PREFIX | 1.000 | 1.000 | 1.000 | 29 |
| SECONDARYADDRESS | 1.000 | 1.000 | 1.000 | 31 |
| SEX | 1.000 | 1.000 | 1.000 | 26 |
| SSN | 1.000 | 1.000 | 1.000 | 16 |
| STATE | 1.000 | 1.000 | 1.000 | 31 |
| STREET | 1.000 | 1.000 | 1.000 | 39 |
| TIME | 1.000 | 1.000 | 1.000 | 20 |
| URL | 1.000 | 1.000 | 1.000 | 29 |
| USERAGENT | 1.000 | 1.000 | 1.000 | 33 |
| USERNAME | 1.000 | 1.000 | 1.000 | 30 |
| VEHICLEVIN | 1.000 | 1.000 | 1.000 | 13 |
| VEHICLEVRM | 1.000 | 1.000 | 1.000 | 15 |
| ZIPCODE | 0.970 | 0.970 | 0.970 | 33 |
Note on IP label (F1=0.390): The dataset contains three overlapping IP labels (IP, IPV4, IPV6). The low recall on IP is due to the model correctly identifying the address but tagging it as IPV4 or IPV6. This reflects a label ambiguity in the dataset, not a detection failure.
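One way to sidestep the overlap in downstream code is to ignore which of the three labels the model chose and derive the version from the address itself. The sketch below uses Python's standard `ipaddress` module; the helper name `normalize_ip_labels` is hypothetical, and relabeling every parseable address as `IPV4` or `IPV6` is an assumption about what a consumer wants, not part of the model.

```python
import ipaddress

IP_LABELS = {"IP", "IPV4", "IPV6"}


def normalize_ip_labels(entities: list[dict]) -> list[dict]:
    """Relabel IP-family entities by parsing the address text."""
    out = []
    for ent in entities:
        if ent.get("label") in IP_LABELS:
            try:
                version = ipaddress.ip_address(ent["text"].strip()).version
                ent = {**ent, "label": f"IPV{version}"}
            except ValueError:
                # Text is not a parseable address: fall back to the generic label.
                ent = {**ent, "label": "IP"}
        out.append(ent)
    return out
```

Applying the same normalization to both gold and predicted labels before scoring would be one way to check how much of the IP gap comes from label choice alone.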
How to Get Started
Installation
pip install transformers accelerate torch
# No PEFT required for this merged model
Load and Run Inference
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch, json

model = AutoModelForCausalLM.from_pretrained(
    "vineeth453/qwen25-7b-pii-detection",
    device_map="auto",
    torch_dtype=torch.bfloat16,  # use bfloat16 for A100/H100; float16 for older GPUs
)
tokenizer = AutoTokenizer.from_pretrained("vineeth453/qwen25-7b-pii-detection")
model.eval()

def detect_pii(text: str) -> dict:
    # Build the ChatML prompt by hand, matching the format used in fine-tuning.
    prompt = (
        "<|im_start|>system\n"
        "You are a PII detection system. Extract all personally identifiable information.\n"
        'Return ONLY valid JSON: {"entities":[{"text":"...","label":"..."}]}\n'
        "<|im_end|>\n"
        "<|im_start|>user\n"
        f"{text}\n"
        "<|im_end|>\n"
        "<|im_start|>assistant\n"
    )
    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=200,
            do_sample=False,  # greedy decoding for deterministic output
            pad_token_id=tokenizer.eos_token_id
        )
    # Decode only the newly generated tokens, not the prompt.
    response = tokenizer.decode(
        out[0][inputs["input_ids"].shape[1]:],
        skip_special_tokens=True
    ).strip().replace("<|im_end|>", "")
    return json.loads(response)

# English
print(detect_pii("Contact John Smith at john@example.com or call +1-555-867-5309"))
# {"entities": [{"text": "John", "label": "FIRSTNAME"}, {"text": "Smith", "label": "LASTNAME"},
# {"text": "john@example.com", "label": "EMAIL"}, {"text": "+1-555-867-5309", "label": "PHONENUMBER"}]}
# German
print(detect_pii("Patient Lena Müller, born 14.03.1987, lives at Hauptstraße 22, Berlin."))
# {"entities": [{"text": "Lena", "label": "FIRSTNAME"}, {"text": "Müller", "label": "LASTNAME"},
# {"text": "14.03.1987", "label": "DOB"}, {"text": "Hauptstraße", "label": "STREET"},
# {"text": "22", "label": "BUILDINGNUMBER"}, {"text": "Berlin", "label": "STATE"}]}
# French
print(detect_pii("Merci de contacter Marie Dupont à marie.dupont@societe.fr avant le 30 mars."))
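`detect_pii` above calls `json.loads` directly, so a truncated generation (for example, an input with more entities than fit in `max_new_tokens=200`) would raise an exception. In a gateway it may be preferable to fail closed with an empty result. A minimal sketch of a defensive parser follows; `parse_entities` is a hypothetical helper, not part of this repo.

```python
import json


def parse_entities(response: str) -> list[dict]:
    """Return well-formed entities from raw model text, or [] if unparseable."""
    # Keep only the outermost {...} span in case extra text surrounds the JSON.
    start, end = response.find("{"), response.rfind("}")
    if start == -1 or end <= start:
        return []
    try:
        payload = json.loads(response[start:end + 1])
    except json.JSONDecodeError:
        return []
    if not isinstance(payload, dict):
        return []
    entities = payload.get("entities", [])
    if not isinstance(entities, list):
        return []
    # Drop items that do not follow the {"text": str, "label": str} schema.
    return [
        e for e in entities
        if isinstance(e, dict)
        and isinstance(e.get("text"), str)
        and isinstance(e.get("label"), str)
    ]
```

Whether an empty list should be treated as "no PII" or as "detection failed, block the request" is a policy decision for the surrounding system.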
Memory Requirements
| Precision | VRAM Required |
|---|---|
| bfloat16 (default) | ~15GB |
| 4-bit quantized (use adapter repo instead) | ~5GB |
For GPU-constrained environments, use the adapter version with 4-bit quantization instead.
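As a rough check on these figures, weight memory scales with parameter count times bits per parameter. The sketch below uses the 7.6B parameter count quoted in this card; it covers weights only, and the assumption is that the gap between the computed 4-bit figure and ~5GB comes from quantization constants, activations, and the KV cache.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 7.6e9  # parameter count quoted in this card

bf16_gb = weight_memory_gb(N_PARAMS, 16)
nf4_gb = weight_memory_gb(N_PARAMS, 4)

print(f"bf16 weights:  {bf16_gb:.1f} GB")   # 15.2 GB
print(f"4-bit weights: {nf4_gb:.1f} GB")    # 3.8 GB
```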
Training Details
Training Data
- Dataset: ai4privacy/pii-masking-200k
- Size: 209,261 samples (198,797 train / 10,464 val, 95/5 split)
- Languages: English (43k), French (62k), German (53k), Italian (51k)
- Entity types: 56 PII categories
Training Hyperparameters
| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen2.5-7B-Instruct |
| Method | QLoRA → merged |
| Quantization during training | 4-bit NF4 + double quantization |
| Compute dtype | bfloat16 |
| LoRA rank (r) | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| LoRA target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Trainable parameters | 40,370,176 (0.53% of 7.6B) |
| Epochs | 1 |
| Per-device batch size | 4 |
| Gradient accumulation | 8 (effective batch = 32) |
| Learning rate | 2e-4 |
| LR scheduler | Cosine decay |
| Warmup steps | 186 |
| Weight decay | 0.01 |
| Optimizer | paged_adamw_8bit |
| Max sequence length | 512 |
| Max grad norm | 1.0 |
| Hardware | NVIDIA A100 40GB |
| Training time | 10.7 hours |
| Final train loss | 0.00517 |
| Best val loss | 0.00330 |
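The trainable-parameter count in the table can be reproduced from the LoRA rank and the shapes of the targeted projections. The layer dimensions below are assumptions taken from the public Qwen2.5-7B configuration (hidden size 3584, MLP size 18944, 28 layers, 4 key/value heads of dimension 128), not from this card.

```python
# Assumed Qwen2.5-7B dimensions (from the public model config, not this card).
HIDDEN = 3584
INTERMEDIATE = 18944
NUM_LAYERS = 28
KV_DIM = 4 * 128  # 4 key/value heads x 128 head dim (grouped-query attention)
RANK = 16         # LoRA rank from the table above

# (in_features, out_features) of each targeted projection in one decoder layer.
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(rank: int, targets: dict, num_layers: int) -> int:
    # Each adapted matrix adds A (rank x in) and B (out x rank).
    per_layer = sum(rank * (i + o) for i, o in targets.values())
    return per_layer * num_layers


total = lora_params(RANK, TARGETS, NUM_LAYERS)
print(f"{total:,}")                      # 40,370,176
print(f"{100 * total / 7.6e9:.2f}%")     # 0.53%
```

The MLP projections account for three quarters of the adapter parameters, so restricting `target_modules` to the attention projections would shrink the adapter considerably at this rank.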
Uses
Direct Use
Enterprise input guardrail systems for detecting and redacting PII from user queries before they reach an LLM. Suitable for HR, legal, healthcare, and financial applications.
Downstream Use
- PII redaction pipelines
- Compliance auditing tools (GDPR, CCPA, HIPAA)
- Data anonymization workflows
- Pre-processing layer in RAG or LLM gateway systems
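For the redaction use case, the model's `{"entities": [...]}` output can be applied back onto the input text. The model returns entity strings without character offsets, so the sketch below falls back to string matching, which replaces every occurrence of a detected string; `redact` and the `[LABEL]` placeholder format are illustrative choices, not part of this repo.

```python
import re


def redact(text: str, entities: list[dict]) -> str:
    """Replace every occurrence of each detected entity with a [LABEL] tag."""
    label_for = {e["text"]: e["label"] for e in entities if e.get("text")}
    if not label_for:
        return text
    # Longest strings first so a short entity never clips part of a longer one.
    alternatives = sorted(label_for, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(t) for t in alternatives))
    # A single pass means inserted placeholders are never matched again.
    return pattern.sub(lambda m: f"[{label_for[m.group(0)]}]", text)


entities = [
    {"text": "John", "label": "FIRSTNAME"},
    {"text": "Smith", "label": "LASTNAME"},
    {"text": "john@example.com", "label": "EMAIL"},
]
print(redact("Contact John Smith at john@example.com", entities))
# Contact [FIRSTNAME] [LASTNAME] at [EMAIL]
```

Short entities such as a building number will also be replaced wherever else the same digits appear, so a production pipeline may want to restrict matches to word boundaries or review short spans separately.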
Out-of-Scope Use
- Real-time or high-throughput inference without a GPU (the ~15GB model is too slow on CPU)
- Languages outside EN/FR/DE/IT without further fine-tuning
- Should not be the sole PII detection mechanism in high-stakes settings without human review
Bias, Risks, and Limitations
- IP label ambiguity: Model occasionally routes IP addresses to IPV4/IPV6 instead of IP. Post-processing regex validation recommended.
- CREDITCARDNUMBER vs PHONEIMEI: 16-digit numeric strings can be confused between these labels (F1=0.879). Luhn algorithm post-processing can mitigate this.
- Low-support labels: Labels with <10 samples (e.g., CURRENCYNAME) have less reliable F1 estimates.
- Language coverage: EN/FR/DE/IT only. Other languages may degrade.
- Merged model note: LoRA weights are merged into bf16 base weights. This model is ~15GB and does not support 4-bit quantization natively. Use the adapter repo for memory-constrained inference.
Environmental Impact
- Hardware: NVIDIA A100 40GB (Google Colab Pro)
- Training time: ~10.7 hours
- Cloud provider: Google Cloud (Colab)
- Compute region: US
Model Card Authors
Vineeth — Masters project, Enterprise Guardrails System
Adapter repo: vineeth453/qwen25-7b-pii-detection-lora