Qwen2.5-7B PII Detection — Merged Model

Standalone merged model — no PEFT library required. Load and run directly with transformers.
For the lightweight adapter version (160MB vs 15GB): vineeth453/qwen25-7b-pii-detection-lora

LoRA adapter weights merged into Qwen/Qwen2.5-7B-Instruct after fine-tuning on ai4privacy/pii-masking-200k. Extracts 56 types of personally identifiable information across 4 languages and returns structured JSON output.

Built as the PII Detection component of a Phase 1 Input Guardrail gateway for an enterprise LLM security system.

Evaluation Results

Evaluated on 10,464 held-out samples (5% split from ai4privacy/pii-masking-200k).

Metric	Score
Micro F1	0.967
Macro F1	0.961
Micro Precision	0.967
Micro Recall	0.968
Malformed JSON outputs	0 / 500 (0.0%)
Val Loss (final)	0.0033
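The card does not spell out the matching rule behind these scores. The sketch below shows one way entity-level micro and macro F1 could be computed, assuming a prediction counts as correct only when both the text and the label match a gold entity in the same sample. The function names and that matching rule are assumptions, not the author's evaluation script.

```python
from collections import Counter

def prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 from raw counts (0.0 when undefined)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def score(gold: list, pred: list) -> dict:
    """gold/pred: one list of (text, label) tuples per sample."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        g_count, p_count = Counter(g), Counter(p)
        matched = g_count & p_count                  # exact (text, label) matches
        for (_, label), n in matched.items():
            tp[label] += n
        for (_, label), n in (p_count - matched).items():
            fp[label] += n                           # predicted but not in gold
        for (_, label), n in (g_count - matched).items():
            fn[label] += n                           # in gold but not predicted
    labels = set(tp) | set(fp) | set(fn)
    per_label = {l: prf(tp[l], fp[l], fn[l]) for l in labels}
    micro = prf(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro_f1 = sum(f for _, _, f in per_label.values()) / len(labels) if labels else 0.0
    return {"micro": micro, "macro_f1": macro_f1, "per_label": per_label}

gold = [[("John", "FIRSTNAME"), ("1.2.3.4", "IP")]]
pred = [[("John", "FIRSTNAME"), ("1.2.3.4", "IPV4")]]
result = score(gold, pred)
print(result["micro"], result["macro_f1"])
```

Under this rule a correct span with the wrong label costs one false positive and one false negative, which is why label confusion lowers both precision and recall.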

Per-Entity F1 Scores

Label	Precision	Recall	F1	Support
ACCOUNTNAME	1.000	1.000	1.000	28
ACCOUNTNUMBER	1.000	1.000	1.000	41
AGE	1.000	1.000	1.000	27
AMOUNT	1.000	1.000	1.000	36
BIC	1.000	1.000	1.000	7
BITCOINADDRESS	0.923	1.000	0.960	24
BUILDINGNUMBER	0.968	0.968	0.968	31
CITY	1.000	0.963	0.981	27
COMPANYNAME	1.000	1.000	1.000	35
COUNTY	1.000	1.000	1.000	29
CREDITCARDCVV	1.000	1.000	1.000	10
CREDITCARDISSUER	1.000	1.000	1.000	16
CREDITCARDNUMBER	0.829	0.935	0.879	31
CURRENCY	0.909	0.870	0.889	23
CURRENCYCODE	1.000	1.000	1.000	8
CURRENCYNAME	0.667	0.750	0.706	8
CURRENCYSYMBOL	1.000	1.000	1.000	20
DATE	0.884	0.974	0.927	39
DOB	0.955	0.808	0.875	26
EMAIL	1.000	1.000	1.000	42
ETHEREUMADDRESS	1.000	1.000	1.000	11
EYECOLOR	1.000	1.000	1.000	10
FIRSTNAME	0.994	0.994	0.994	158
GENDER	1.000	1.000	1.000	35
HEIGHT	1.000	1.000	1.000	7
IBAN	1.000	1.000	1.000	29
IP	0.727	0.267	0.390	30
IPV4	0.732	0.909	0.811	33
IPV6	0.711	1.000	0.831	27
JOBAREA	1.000	1.000	1.000	40
JOBTITLE	1.000	1.000	1.000	37
JOBTYPE	1.000	1.000	1.000	31
LASTNAME	1.000	1.000	1.000	47
LITECOINADDRESS	1.000	0.714	0.833	7
MAC	1.000	1.000	1.000	12
MASKEDNUMBER	0.923	0.800	0.857	30
MIDDLENAME	0.944	1.000	0.971	34
NEARBYGPSCOORDINATE	1.000	1.000	1.000	17
ORDINALDIRECTION	1.000	1.000	1.000	17
PASSWORD	1.000	1.000	1.000	31
PHONEIMEI	1.000	1.000	1.000	19
PHONENUMBER	1.000	1.000	1.000	21
PIN	1.000	1.000	1.000	6
PREFIX	1.000	1.000	1.000	29
SECONDARYADDRESS	1.000	1.000	1.000	31
SEX	1.000	1.000	1.000	26
SSN	1.000	1.000	1.000	16
STATE	1.000	1.000	1.000	31
STREET	1.000	1.000	1.000	39
TIME	1.000	1.000	1.000	20
URL	1.000	1.000	1.000	29
USERAGENT	1.000	1.000	1.000	33
USERNAME	1.000	1.000	1.000	30
VEHICLEVIN	1.000	1.000	1.000	13
VEHICLEVRM	1.000	1.000	1.000	15
ZIPCODE	0.970	0.970	0.970	33

Note on IP label (F1=0.390): The dataset contains three overlapping IP labels (IP, IPV4, IPV6). The low recall on IP is due to the model correctly identifying the address but tagging it as IPV4 or IPV6 — a label ambiguity in the dataset, not a detection failure.
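One way to sidestep the overlap downstream is to re-derive the label from the address itself. The sketch below uses Python's standard `ipaddress` module. The helper name `normalize_ip_label` and the choice to map everything to IPV4/IPV6 are assumptions for illustration, not part of the model or dataset.

```python
import ipaddress

IP_LABELS = {"IP", "IPV4", "IPV6"}

def normalize_ip_label(entity: dict) -> dict:
    """Relabel an IP-family entity as IPV4 or IPV6 based on its parsed text.

    Entities with other labels, or whose text does not parse as an
    IP address, are returned unchanged.
    """
    if entity.get("label") not in IP_LABELS:
        return entity
    try:
        addr = ipaddress.ip_address(entity["text"].strip())
    except ValueError:
        return entity  # not a parseable address; leave the model's label alone
    return {**entity, "label": f"IPV{addr.version}"}

print(normalize_ip_label({"text": "192.168.0.1", "label": "IP"}))
print(normalize_ip_label({"text": "2001:db8::1", "label": "IPV4"}))
```

If the distinction between the three labels does not matter for redaction, collapsing them all to a single label is an equally reasonable choice.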

How to Get Started

Installation

pip install transformers accelerate torch
# No PEFT required for this merged model

Load and Run Inference

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch, json

model = AutoModelForCausalLM.from_pretrained(
    "vineeth453/qwen25-7b-pii-detection",
    device_map="auto",
    torch_dtype=torch.bfloat16,   # use bfloat16 for A100/H100; float16 for older GPUs
)
tokenizer = AutoTokenizer.from_pretrained("vineeth453/qwen25-7b-pii-detection")
model.eval()

def detect_pii(text: str) -> dict:
    prompt = (
        "<|im_start|>system\n"
        "You are a PII detection system. Extract all personally identifiable information.\n"
        'Return ONLY valid JSON: {"entities":[{"text":"...","label":"..."}]}\n'
        "<|im_end|>\n"
        "<|im_start|>user\n"
        f"{text}\n"
        "<|im_end|>\n"
        "<|im_start|>assistant\n"
    )
    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=200,
            do_sample=False,
            pad_token_id=tokenizer.eos_token_id
        )
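    response = tokenizer.decode(
        out[0][inputs["input_ids"].shape[1]:],   # keep only the newly generated tokens
        skip_special_tokens=True
    ).replace("<|im_end|>", "").strip()          # drop any leftover end marker, then trim whitespace
    return json.loads(response)                  # raises json.JSONDecodeError on truncated or malformed output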

# English
print(detect_pii("Contact John Smith at john@example.com or call +1-555-867-5309"))
# {"entities": [{"text": "John", "label": "FIRSTNAME"}, {"text": "Smith", "label": "LASTNAME"},
#               {"text": "john@example.com", "label": "EMAIL"}, {"text": "+1-555-867-5309", "label": "PHONENUMBER"}]}

# English text with German names, date format, and address
print(detect_pii("Patient Lena Müller, born 14.03.1987, lives at Hauptstraße 22, Berlin."))
# {"entities": [{"text": "Lena", "label": "FIRSTNAME"}, {"text": "Müller", "label": "LASTNAME"},
#               {"text": "14.03.1987", "label": "DOB"}, {"text": "Hauptstraße", "label": "STREET"},
#               {"text": "22", "label": "BUILDINGNUMBER"}, {"text": "Berlin", "label": "STATE"}]}

# French
print(detect_pii("Merci de contacter Marie Dupont à marie.dupont@societe.fr avant le 30 mars."))
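`detect_pii` above calls `json.loads` directly, so a response truncated by `max_new_tokens` or otherwise malformed raises an exception. A gateway may prefer to validate the response explicitly. The following is a minimal sketch, assuming the `{"entities": [{"text": ..., "label": ...}]}` shape from the system prompt. `parse_entities` is a hypothetical helper, not part of this repository.

```python
import json
from typing import Optional

def parse_entities(response: str) -> Optional[list]:
    """Return a validated entity list, or None if the response is unusable.

    None (rather than an empty list) lets the caller distinguish
    "no PII found" from "model output could not be parsed".
    """
    try:
        data = json.loads(response)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not isinstance(data.get("entities"), list):
        return None
    # Keep only well-formed items with string text and label fields.
    return [
        e for e in data["entities"]
        if isinstance(e, dict)
        and isinstance(e.get("text"), str)
        and isinstance(e.get("label"), str)
    ]

print(parse_entities('{"entities": [{"text": "John", "label": "FIRSTNAME"}]}'))
print(parse_entities('{"entities": [{"text": "John", "label": "FIRST'))
```

In a guardrail setting, treating `None` as "block or retry" rather than "no PII" avoids failing open on a bad generation.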

Memory Requirements

Precision	VRAM Required
bfloat16 (default)	~15GB
4-bit quantized (use adapter repo instead)	~5GB

For GPU-constrained environments, use the adapter version with 4-bit quantization instead.
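The figures in the table are roughly what the parameter count implies. The arithmetic below is a back-of-the-envelope estimate for the weights only, using the 7.6B parameter count from the training table. It ignores activations, the KV cache, and quantization metadata, which presumably account for the gap between the 4-bit weight size and the ~5GB listed.

```python
PARAMS = 7.6e9  # total parameters, as stated in the hyperparameter table

def weights_gb(params: float, bits_per_param: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return params * bits_per_param / 8 / 1e9

print(f"bfloat16: {weights_gb(PARAMS, 16):.1f} GB")  # weights only
print(f"4-bit:    {weights_gb(PARAMS, 4):.1f} GB")   # weights only, before overhead
```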

Training Details

Training Data

Dataset: ai4privacy/pii-masking-200k
Size: 209,261 samples (198,797 train / 10,464 val, 95/5 split)
Languages: English (43k), French (62k), German (53k), Italian (51k)
Entity types: 56 PII categories

Training Hyperparameters

Parameter	Value
Base model	Qwen/Qwen2.5-7B-Instruct
Method	QLoRA → merged
Quantization during training	4-bit NF4 + double quantization
Compute dtype	bfloat16
LoRA rank (r)	16
LoRA alpha	32
LoRA dropout	0.05
LoRA target modules	q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Trainable parameters	40,370,176 (0.53% of 7.6B)
Epochs	1
Per-device batch size	4
Gradient accumulation	8 (effective batch = 32)
Learning rate	2e-4
LR scheduler	Cosine decay
Warmup steps	186
Weight decay	0.01
Optimizer	paged_adamw_8bit
Max sequence length	512
Max grad norm	1.0
Hardware	NVIDIA A100 40GB
Training time	10.7 hours
Final train loss	0.00517
Best val loss	0.00330
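The trainable-parameter count and the warmup length can be cross-checked with a little arithmetic. A LoRA adapter of rank r on a linear layer with shape (in, out) adds r × (in + out) parameters. The layer dimensions below are assumed values for Qwen2.5-7B and are not stated in this card, so verify them against the base model's config.json. The step count is a simple estimate and may differ by one from what a given trainer reports.

```python
import math

# Assumed Qwen2.5-7B dimensions (not given in this card).
HIDDEN = 3584          # hidden_size
INTERMEDIATE = 18944   # intermediate_size
KV_DIM = 512           # num_key_value_heads (4) * head_dim (128)
LAYERS = 28            # num_hidden_layers
RANK = 16              # LoRA rank from the table above

# (in_features, out_features) of each targeted projection.
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(rank: int, targets: dict, layers: int) -> int:
    """Each adapter adds A (rank x in) and B (out x rank): rank * (in + out)."""
    per_layer = sum(rank * (i + o) for i, o in targets.values())
    return per_layer * layers

trainable = lora_params(RANK, TARGETS, LAYERS)
print(f"{trainable:,}")                      # 40,370,176
print(f"{trainable / 7.6e9:.2%} of 7.6B")    # 0.53% of 7.6B

# One epoch over 198,797 training samples at effective batch size 4 * 8 = 32.
steps = math.ceil(198_797 / (4 * 8))
print(steps, f"{186 / steps:.1%}")           # 6213 3.0%
```

Under these assumptions the result matches the 40,370,176 figure in the table exactly, and the 186 warmup steps come to about 3% of the optimizer steps in the single epoch.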

Uses

Direct Use

Enterprise input guardrail systems for detecting and redacting PII from user queries before they reach an LLM. Suitable for HR, legal, healthcare, and financial applications.

Downstream Use

PII redaction pipelines
Compliance auditing tools (GDPR, CCPA, HIPAA)
Data anonymization workflows
Pre-processing layer in RAG or LLM gateway systems
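For the redaction use case, the entity list returned by the model can be applied back to the input text. The sketch below replaces each detected string with a `[LABEL]` placeholder in a single regex pass. The `redact` helper and the placeholder format are illustrative assumptions. Since the model returns strings rather than character offsets, every occurrence of a detected string is replaced.

```python
import re

def redact(text: str, entities: list) -> str:
    """Replace every occurrence of each entity's text with [LABEL]."""
    label_for = {}
    for ent in entities:
        if ent.get("text"):
            label_for.setdefault(ent["text"], ent["label"])  # first label wins
    if not label_for:
        return text
    # Longest strings first so a longer entity is preferred over a substring of it.
    alternatives = sorted(label_for, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(t) for t in alternatives))
    return pattern.sub(lambda m: f"[{label_for[m.group(0)]}]", text)

entities = [
    {"text": "John", "label": "FIRSTNAME"},
    {"text": "Smith", "label": "LASTNAME"},
    {"text": "john@example.com", "label": "EMAIL"},
    {"text": "+1-555-867-5309", "label": "PHONENUMBER"},
]
print(redact("Contact John Smith at john@example.com or call +1-555-867-5309", entities))
# Contact [FIRSTNAME] [LASTNAME] at [EMAIL] or call [PHONENUMBER]
```

A single pass avoids re-scanning placeholders that were just inserted, which a chain of `str.replace` calls would not guarantee.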

Out-of-Scope Use

Real-time or high-throughput inference without a GPU (the 15GB model is too slow on CPU)
Languages outside EN/FR/DE/IT without further fine-tuning
Use as the sole PII detection mechanism in high-stakes settings without human review

Bias, Risks, and Limitations

IP label ambiguity: Model occasionally routes IP addresses to IPV4/IPV6 instead of IP. Post-processing regex validation recommended.
CREDITCARDNUMBER vs PHONEIMEI: 16-digit numeric strings can be confused between these labels (F1=0.879). Luhn algorithm post-processing can mitigate this.
Low-support labels: Labels with fewer than 10 evaluation samples (e.g., CURRENCYNAME, support 8) have less reliable F1 estimates.
Language coverage: EN/FR/DE/IT only. Performance on other languages may degrade.
Merged model note: LoRA weights are merged into bf16 base weights. This model is ~15GB and does not support 4-bit quantization natively — use the adapter repo for memory-constrained inference.
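The Luhn post-processing suggested above can be sketched as follows. This is a generic checksum implementation, not code from this project, and `flag_suspect_cards` is a hypothetical helper. One caveat: IMEI numbers also end in a Luhn check digit, so the checksum alone cannot separate the two labels, and digit count (an IMEI has 15 digits) is a useful second signal.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def flag_suspect_cards(entities: list) -> list:
    """Return CREDITCARDNUMBER entities whose text fails the Luhn check."""
    return [
        e for e in entities
        if e.get("label") == "CREDITCARDNUMBER" and not luhn_valid(e.get("text", ""))
    ]

print(luhn_valid("4111 1111 1111 1111"))  # True (widely used test card number)
print(luhn_valid("4111 1111 1111 1112"))  # False
print(luhn_valid("490154203237518"))      # True (a 15-digit IMEI-style number also passes)
```

Flagged entities are better routed to review or a fallback label than dropped, since a failed checksum does not mean the string is harmless.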

Environmental Impact

Hardware: NVIDIA A100 40GB (Google Colab Pro)
Training time: ~10.7 hours
Cloud provider: Google Cloud (Colab)
Compute region: US

Model Card Authors

Vineeth — Masters project, Enterprise Guardrails System
Adapter repo: vineeth453/qwen25-7b-pii-detection-lora
