# OpenMed Privacy Filter (Nemotron) – MLX 8-bit
A native MLX port of `OpenMed/privacy-filter-nemotron`, affine-quantized to 8-bit for fast on-device PII detection on Apple Silicon. For the unquantized BF16 reference, see `OpenMed/privacy-filter-nemotron-mlx`.
**Family at a glance.** Same architecture and training data, three runtimes:

- PyTorch – `OpenMed/privacy-filter-nemotron` – CPU + CUDA.
- MLX BF16 – `OpenMed/privacy-filter-nemotron-mlx` – Apple Silicon, full precision (~2.6 GB).
- MLX 8-bit (this repo) – Apple Silicon, ~1.4 GB, ~1.7× faster than BF16.
## Why 8-bit?

| | BF16 sibling | This repo (Q8) |
|---|---|---|
| `weights.safetensors` size | 2.6 GB | 1.4 GB (-47%) |
| Forward pass (10-token PII sample) | ~14 ms | ~1.7× faster |
| Argmax agreement vs. BF16 | (reference) | 100% on every test sample |
| Entity-group preservation | (reference) | identical on every test sample |
Numbers above are from `scripts/export/verify_privacy_filter_nemotron_mlx.py` over 10 golden PII samples (email, phone, ssn, credit card, name, ipv4, address, date_of_birth, url, mixed). Q8 with `group_size=64` was validated against BF16: argmax matched on 100% of tokens, and all entity-group sets matched exactly.
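For intuition, the agreement metric is per-token argmax equality between the two checkpoints. A minimal sketch (not the actual verification script; the function name is illustrative):

```python
import mlx.core as mx

def argmax_agreement(logits_q8: mx.array, logits_bf16: mx.array) -> float:
    """Fraction of token positions where Q8 and BF16 pick the same label."""
    same = mx.argmax(logits_q8, axis=-1) == mx.argmax(logits_bf16, axis=-1)
    return float(mx.mean(same.astype(mx.float32)))
```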
## What it does

The model is a token classifier built on OpenAI's open Privacy Filter architecture (the same `openai_privacy_filter` model type used by `openai/privacy-filter`). It tags each token with a BIOES label across 55 PII span classes, and a Viterbi pass over the BIOES grammar then yields clean entity spans (see the sketch after the list below). Detected categories include:
- Personal identifiers – `first_name`, `last_name`, `user_name`, `gender`, `age`, `date_of_birth`
- Contact – `email`, `phone_number`, `fax_number`, `street_address`, `city`, `state`, `country`, `county`, `postcode`, `coordinate`
- Government / legal IDs – `ssn`, `national_id`, `tax_id`, `certificate_license_number`
- Financial – `account_number`, `bank_routing_number`, `credit_debit_card`, `cvv`, `pin`, `swift_bic`
- Medical – `medical_record_number`, `health_plan_beneficiary_number`, `blood_type`
- Workplace – `company_name`, `occupation`, `employee_id`, `customer_id`, `employment_status`, `education_level`
- Online – `url`, `ipv4`, `ipv6`, `mac_address`, `http_cookie`, `api_key`, `password`, `device_identifier`
- Demographic – `race_ethnicity`, `religious_belief`, `political_view`, `sexuality`, `language`
- Vehicles – `license_plate`, `vehicle_identifier`
- Time – `date`, `date_time`, `time`
- Misc – `biometric_identifier`, `unique_id`
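To make the BIOES grammar concrete, here is a minimal greedy span-grouping sketch (illustrative only; the shipped pipeline runs Viterbi over the same grammar before grouping, so ill-formed tag sequences such as an `I-` without a preceding `B-` cannot reach this step):

```python
def bioes_to_spans(tags: list[str]) -> list[tuple[str, int, int]]:
    """Return (entity_group, start_token, end_token) triples, end exclusive."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "O":
            start = None
            continue
        prefix, cls = tag.split("-", 1)
        if prefix == "S":                          # single-token entity
            spans.append((cls, i, i + 1))
        elif prefix == "B":                        # span opens
            start = i
        elif prefix == "E" and start is not None:  # span closes
            spans.append((cls, start, i + 1))
            start = None
    return spans

print(bioes_to_spans(["O", "B-email", "I-email", "E-email", "S-phone_number"]))
# [('email', 1, 4), ('phone_number', 4, 5)]
```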
### Full label schema (221 labels)

The output space is `O` plus `B-`, `I-`, `E-`, `S-` prefixes for each of the 55 span classes (4 × 55 + 1 = 221). The runtime `PrivacyFilterMLXPipeline` runs Viterbi over this BIOES grammar, so the consumer sees clean grouped entities rather than raw token tags.
The full `id2label.json` is shipped alongside the weights in this repo. For per-label accuracy, the training recipe, and dataset details, see the base PyTorch checkpoint.
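A quick way to sanity-check the label space from the shipped file (assuming the usual HF-style mapping of stringified integer IDs to label strings):

```python
import json

with open("id2label.json") as f:
    id2label = {int(k): v for k, v in json.load(f).items()}

assert len(id2label) == 221                     # 4 * 55 + 1
span_classes = {tag.split("-", 1)[1] for tag in id2label.values() if tag != "O"}
assert len(span_classes) == 55                  # one per PII class
```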
## Architecture

| Field | Value |
|---|---|
| Source model type | `openai_privacy_filter` |
| Source architecture | `OpenAIPrivacyFilterForTokenClassification` |
| Hidden size | 640 |
| Transformer layers | 8 |
| Attention | Grouped-Query (14 query heads / 2 KV heads, `head_dim=64`) with attention sinks |
| FFN | Sparse Mixture-of-Experts: 128 experts, top-4 routing, SwiGLU |
| Position encoding | YaRN-scaled RoPE (`rope_theta=150_000`, factor 32) |
| Context length | 131,072 tokens (initial 4,096) |
| Tokenizer | `o200k_base` (tiktoken), vocab 200,064 |
| Output head | `Linear(640 → 221)` with bias |
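The grouped-query shapes follow directly from the table; a small arithmetic check using only the values above:

```python
# Head arithmetic implied by the architecture table.
n_q_heads, n_kv_heads, head_dim = 14, 2, 64
assert n_q_heads % n_kv_heads == 0   # each KV head serves 7 query heads
q_width = n_q_heads * head_dim       # 896-wide query projection
kv_width = n_kv_heads * head_dim     # 128-wide key/value projections
print(q_width, kv_width)
```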
## Quantization

| Field | Value |
|---|---|
| Bits | 8 |
| Group size | 64 |
| Mode | affine (MLX `mx.quantize`, weight-only) |
| Quantized modules | embedding, attention qkv & out, MoE gate, expert swiglu & out, unembedding |
| Non-quantized modules | RMSNorms, attention sinks (kept in BF16) |

Expert tensors are stored in MLX's packed transposed layout and run through `mx.gather_qmm` at inference time. RMSNorm scales and attention sinks remain BF16 because their parameter count is negligible relative to the rest of the model.
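As a minimal illustration of what affine weight-only quantization does to a single matrix (sketch only; the shape here is arbitrary):

```python
import mlx.core as mx

# Each row is split into groups of 64 values; every group stores its own
# scale and bias, and the 8-bit codes are packed into uint32 words.
w = mx.random.normal((640, 640))
w_q, scales, biases = mx.quantize(w, group_size=64, bits=8)

# Dequantize to inspect the (small) round-trip error; at inference MLX
# consumes the packed form directly via quantized matmuls.
w_hat = mx.dequantize(w_q, scales, biases, group_size=64, bits=8)
print(mx.abs(w - w_hat).max())
```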
## File set

| File | Size | Purpose |
|---|---|---|
| `weights.safetensors` | 1.4 GB | Q8 packed weights + scales/biases (uint32-packed for quantized modules, BF16 for norms/sinks) |
| `config.json` | 20 KB | Model + MLX runtime config (with `_mlx_quantization` block) |
| `id2label.json` | 5.4 KB | Numeric ID → BIOES label string |
| `openmed-mlx.json` | 0.8 KB | OpenMed MLX manifest with `quantization: {bits: 8, group_size: 64, mode: affine}` |
| `tokenizer.json`, `tokenizer_config.json` | 27 MB | Source tokenizer files (kept for reference) |
The MLX runtime tokenizes with tiktoken's `o200k_base` encoding directly; `tokenizer.json` is kept so consumers can inspect or re-tokenize via `transformers` if desired.
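For reference, getting the same token IDs the runtime sees is a one-liner with tiktoken:

```python
import tiktoken

enc = tiktoken.get_encoding("o200k_base")
ids = enc.encode("Email me at alice.smith@example.com after 5pm.")
print(len(ids), ids[:5])
```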
## Quick start

### With OpenMed (recommended)

OpenMed gives you a single `extract_pii()` / `deidentify()` API that auto-selects MLX on Apple Silicon and PyTorch elsewhere, so the same code runs on every host.

```bash
pip install -U "openmed[mlx]"
```
```python
from openmed import extract_pii, deidentify

text = (
    "Patient Sarah Johnson (DOB 03/15/1985), MRN 4872910, "
    "phone 415-555-0123, email sarah.johnson@example.com."
)

# Extract grouped entity spans (runs on MLX 8-bit here, PyTorch fallback elsewhere)
result = extract_pii(text, model_name="OpenMed/privacy-filter-nemotron-mlx-8bit")
for ent in result.entities:
    print(f"{ent.label:30s} {ent.text!r} conf={ent.confidence:.2f}")

# De-identify
masked = deidentify(text, method="mask",
                    model_name="OpenMed/privacy-filter-nemotron-mlx-8bit")
fake = deidentify(
    text,
    method="replace",
    model_name="OpenMed/privacy-filter-nemotron-mlx-8bit",
    consistent=True,
    seed=42,  # deterministic locale-aware Faker surrogates
)
```
When MLX isn't available (Linux, Windows, Intel Mac, or a missing `mlx` package), the same call automatically falls back to the PyTorch checkpoint `OpenMed/privacy-filter-nemotron` with a one-time warning. Fallback is family-aware: a Nemotron MLX request never substitutes the unrelated `openai/privacy-filter` baseline.
### Direct MLX usage (lower-level)

```python
from huggingface_hub import snapshot_download
from openmed.mlx.inference import PrivacyFilterMLXPipeline

model_path = snapshot_download("OpenMed/privacy-filter-nemotron-mlx-8bit")
pipe = PrivacyFilterMLXPipeline(model_path)
print(pipe("Email me at alice.smith@example.com after 5pm."))
# [{'entity_group': 'email',
#   'score': 0.92,
#   'word': 'alice.smith@example.com',
#   'start': 12,
#   'end': 35}]
```
The pipeline returns a list of dicts with `entity_group`, `score`, `word`, `start`, and `end` (character offsets into the input string).
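Because `start`/`end` are character offsets, simple masking needs no extra tooling (a sketch reusing the `pipe` object from above):

```python
text = "Email me at alice.smith@example.com after 5pm."
ents = pipe(text)
# Replace right-to-left so earlier offsets stay valid while editing.
for ent in sorted(ents, key=lambda e: e["start"], reverse=True):
    text = text[:ent["start"]] + f"[{ent['entity_group'].upper()}]" + text[ent["end"]:]
print(text)  # "Email me at [EMAIL] after 5pm."
```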
### Loading from a local snapshot

```python
import mlx.core as mx

from openmed.mlx.models import load_model

model = load_model("/path/to/privacy-filter-nemotron-mlx-8bit")
ids = mx.array([[1, 100, 200, 300]], dtype=mx.int32)
mask = mx.ones((1, 4), dtype=mx.bool_)
logits = model(ids, attention_mask=mask)  # shape (1, 4, 221)
```
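From there, raw logits map back to BIOES tags via the shipped `id2label.json` (a greedy-argmax sketch; the full pipeline additionally runs Viterbi before grouping):

```python
import json

with open("/path/to/privacy-filter-nemotron-mlx-8bit/id2label.json") as f:
    id2label = {int(k): v for k, v in json.load(f).items()}

tag_ids = mx.argmax(logits, axis=-1)            # shape (1, 4)
print([id2label[int(i)] for i in tag_ids[0].tolist()])
```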
## Hardware notes

- Designed for Apple Silicon (M-series GPUs); CPU inference works but is slower.
- Tested on macOS with `mlx>=0.18`.
- Q8 inference is ~1.7× faster than the BF16 sibling on the same hardware while preserving 100% argmax agreement on the test set (a rough timing recipe follows below).
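To reproduce the timing on your own machine, remember that MLX evaluates lazily, so force evaluation with `mx.eval` (a sketch reusing the `model` loaded above; the token IDs are arbitrary):

```python
import time

import mlx.core as mx

ids = mx.array([[1] * 10], dtype=mx.int32)
mask = mx.ones((1, 10), dtype=mx.bool_)
mx.eval(model(ids, attention_mask=mask))   # warm-up
t0 = time.perf_counter()
for _ in range(100):
    mx.eval(model(ids, attention_mask=mask))
print(f"{(time.perf_counter() - t0) / 100 * 1e3:.2f} ms per forward pass")
```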
## Credits & Acknowledgements

This model wouldn't exist without two open-source releases; sincere thanks to both teams:

- OpenAI for open-sourcing the Privacy Filter (architecture, modeling code, and the `opf` training/eval CLI). The 8-bit MLX port in this repo runs that same architecture under Apple's MLX framework with affine weight-only quantization.
- NVIDIA for releasing the Nemotron-PII dataset used to fine-tune the source PyTorch checkpoint.

Additional thanks to Apple for MLX and the Hugging Face team for the model-distribution ecosystem.
## License

Apache 2.0 (matches the source checkpoint).