gliner-opf-ptbr-pii-v1

A fine-tune of openai/privacy-filter on Brazilian-Portuguese PII, trained with a 9-round chunked schedule (3 epochs × 3 saves per epoch) on 914,452 rows of natural-text upstream data (arthrod/oai-pf-ptbr-chunked-v2), and evaluated on the same 5,000-row PT-BR val set used for the GLiNER models.

Best checkpoint: e3_c3 (final pass), with detection.span typed F1 0.885 (P 0.894 / R 0.876).

Headline performance

opf eval --eval-mode typed on the 5,000-row natural val:

  • detection.span typed F1 = 0.885 (P=0.894 R=0.876)

Apples-to-apples vs the GLiNER series (same val, same 24 PT-BR labels, nervaluate)

| Model | partial P | partial R | partial F1 | exact F1 |
|---|---|---|---|---|
| gliner-opf-ptbr-pii-v1 (this) | 0.917 | 0.879 | 0.897 | 0.853 |
| mmBERT-small × 3 (41400) | 0.951 | 0.832 | 0.888 | 0.870 |
| ettin-68m easter-egg | 0.893 | 0.761 | 0.822 | 0.789 |
| ettin-32m easter-egg | 0.905 | 0.729 | 0.808 | 0.769 |

This model wins partial F1 by +0.009; mmBERT wins exact F1 by +0.017. Roughly tied overall, with different strengths:

  • opf wins on free-text sensitive descriptors (medical +0.16, organizational +0.34, political +0.24, sexual +0.14, religious +0.04, ethnicity +0.03)
  • mmBERT wins on structured PII + names (first/middle/last names by 0.04–0.16, locations by 0.05–0.14, full address)
  • Both are near-perfect on cpf/rg/pis/credit_card/phone/email/zip (F1 ≥ 0.99)
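
The partial/exact columns above are nervaluate's span-matching schemes. A minimal scoring sketch, assuming gold and predicted spans in nervaluate's list-of-dicts format and a truncated tag list (the real eval passes all 24 PT-BR labels):

```python
from nervaluate import Evaluator

# One document: gold vs. predicted spans in nervaluate's dict format.
true = [[{"label": "cpf_document_number", "start": 13, "end": 27}]]
pred = [[{"label": "cpf_document_number", "start": 13, "end": 27}]]

# Truncated tag list; the real eval passes all 24 PT-BR labels.
evaluator = Evaluator(true, pred, tags=["cpf_document_number", "first_name"])
results = evaluator.evaluate()[0]  # overall results; per-tag breakdowns follow it

for scheme in ("partial", "exact"):
    p, r = results[scheme]["precision"], results[scheme]["recall"]
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    print(f"{scheme} F1: {f1:.3f}")
```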

Learning curve (9-round chunked schedule)

| ckpt | P | R | F1 |
|---|---|---|---|
| baseline (untyped) | 0.633 | 0.466 | 0.537 |
| e1_c1 | 0.746 | 0.749 | 0.748 |
| e1_c2 | 0.831 | 0.808 | 0.819 |
| e1_c3 (epoch 1) | 0.852 | 0.836 | 0.844 |
| e2_c1 | 0.876 | 0.835 | 0.855 |
| e2_c2 | 0.889 | 0.844 | 0.866 |
| e2_c3 (epoch 2) | 0.895 | 0.851 | 0.872 |
| e3_c1 | 0.896 | 0.860 | 0.878 |
| e3_c2 | 0.880 | 0.880 | 0.880 |
| e3_c3 (final, released) | 0.894 | 0.876 | 0.885 |
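
Each row is one of nine sequential opf train --epochs 1 rounds, resuming from the previous save on a deterministic 1/3 chunk. A hypothetical driver loop: only the --epochs flag is documented in this card, so --data, --resume-from, and --output below are assumptions to be swapped for whatever the opf CLI actually exposes:

```python
import subprocess

ckpt = "openai/privacy-filter"  # round 1 starts from the upstream backbone
for epoch in (1, 2, 3):
    for chunk in (1, 2, 3):
        out = f"checkpoints/e{epoch}_c{chunk}"  # names match the table above
        subprocess.run(
            ["opf", "train", "--epochs", "1",
             "--data", f"chunks/chunk{chunk}",  # deterministic 1/3 split (assumed flag)
             "--resume-from", ckpt,             # resume from previous round (assumed flag)
             "--output", out],                  # (assumed flag)
            check=True,
        )
        ckpt = out
```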

Per-entity F1 (e3_c3, span-typed, top entities)

| label | P | R | F1 |
|---|---|---|---|
| cpf_document_number | 0.998 | 1.000 | 0.999 |
| rg_document_number | 0.999 | 0.999 | 0.999 |
| phone_number | 0.997 | 0.999 | 0.998 |
| pis_document_number | 0.999 | 0.996 | 0.997 |
| dob | 0.995 | 0.999 | 0.997 |
| email_address | 0.996 | 0.996 | 0.996 |
| location_zip | 0.990 | 1.000 | 0.995 |
| credit_card | 1.000 | 0.990 | 0.995 |
| location_building_number | 0.967 | 0.972 | 0.969 |
| last_name | 0.949 | 0.965 | 0.957 |
| location_street | 0.926 | 0.954 | 0.940 |
| location_state_abbreviation | 0.921 | 0.861 | 0.890 |
| first_name | 0.881 | 0.874 | 0.878 |
| personal_description_of_ethnicity | 0.847 | 0.842 | 0.844 |
| personal_description_of_religious_convictions | 0.846 | 0.805 | 0.825 |
| personal_description_of_sexual_information | 0.853 | 0.788 | 0.819 |
| personal_description_of_political_opinion | 0.821 | 0.815 | 0.818 |
| personal_description_of_organizational_affiliation | 0.800 | 0.794 | 0.797 |
| personal_description_of_medical_conditions | 0.816 | 0.755 | 0.784 |
| location_state | 0.775 | 0.763 | 0.769 |
| location_neighborhood | 0.792 | 0.636 | 0.705 |
| location_city | 0.666 | 0.591 | 0.626 |
| middle_name | 0.600 | 0.530 | 0.563 |

(Entries with zero gold in val are omitted.)

Training recipe

  • Backbone: openai/privacy-filter (8-layer MoE transformer, 128 experts, ~2.7B-equivalent params via top-4 routing)
  • Schedule: 3 epochs × 3 saves per epoch (9 sequential opf train --epochs 1 invocations, each on a deterministic 1/3 chunk, resuming from the previous checkpoint)
  • Optimizer: AdamW, LR 1e-5, weight decay 0.01, max grad norm 1.0
  • Batch: 32 windows × 4 grad-accum = effective 128
  • Context: n-ctx 256
  • Precision: bf16 weights, fp32 accumulators
  • Loss: standard CE on BIESO token labels (1 + 72 entities × 4 = 289 token labels)
  • Decoding: constrained Viterbi (see the sketch after this list)
  • Hardware: AMD MI300X single-GPU partition, ROCm 7.2
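
The decoding bullet above does not spell out the constraint set. Below is a minimal numpy sketch of constrained Viterbi over a BIESO label space, with the entity list truncated to two types (the released model uses 72, giving the 289 labels in the loss bullet); the constraint is the standard BIESO one, so B/I may only continue into I/E of the same entity, and spans may not start or end mid-entity:

```python
import numpy as np

# "O" plus B/I/E/S per entity type; 1 + 72 * 4 = 289 in the released model.
entities = ["cpf_document_number", "first_name"]  # truncated for the demo
labels = ["O"] + [f"{tag}-{ent}" for ent in entities for tag in "BIES"]

def allowed(prev: str, nxt: str) -> bool:
    """BIESO constraint: B/I must continue into I/E of the same entity;
    O/E/S may be followed by O, or by B/S of any entity."""
    p_tag, p_ent = (prev.split("-", 1) + [""])[:2]
    n_tag, n_ent = (nxt.split("-", 1) + [""])[:2]
    if p_tag in ("B", "I"):
        return n_tag in ("I", "E") and n_ent == p_ent
    return n_tag in ("O", "B", "S")

MASK = np.array([[allowed(p, n) for n in labels] for p in labels])

def viterbi(log_probs: np.ndarray) -> list[str]:
    """Best label path under the hard transition mask.
    log_probs: (seq_len, n_labels) per-token log-probabilities."""
    T, L = log_probs.shape
    neg_inf = -1e9
    trans = np.where(MASK, 0.0, neg_inf)             # hard constraints only
    start_ok = np.array([l[0] in "OBS" for l in labels])
    score = np.where(start_ok, log_probs[0], neg_inf)
    back = np.zeros((T, L), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + trans                # (prev, next) scores
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_probs[t]
    end_ok = np.array([l[0] in "OES" for l in labels])
    score = np.where(end_ok, score, neg_inf)         # no dangling entities
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return [labels[i] for i in reversed(path)]
```

For a three-token CPF mention, this forces a coherent B, I, E path within one entity type instead of, say, a B followed directly by an O.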

Dataset

  • Train: arthrod/oai-pf-ptbr-chunked-v2 (private), 914,452 rows, 100% upstream raw text
    • 99.8% from ai4privacy/open-pii-masking-500k-ai4privacy
    • 100% from ai4privacy/pii-masking-400k
    • 84.6% from arthrod/gliner2-pii-ptbr-reward-split
    • 93.4% from nvidia/Nemotron-PII
    • 4 small spam/phishing sources at 100% (negative evidence)
    • 3 sources dropped entirely (schema mismatches, ~18.6k rows)
  • Val: the same 5,000 PT-BR rows used for the GLiNER models, enabling a direct head-to-head comparison

Usage

```python
import opf
# CLI:
# opf redact --checkpoint <download_dir> "text com cpf 123.456.789-09 e telefone (11) 91234-5678"
```
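
Only import opf and the redact CLI are documented above. If the package exposes a Python mirror of the CLI, usage would look something like the sketch below; opf.redact and its checkpoint keyword are assumptions, not a confirmed API:

```python
import opf

# Hypothetical mirror of the CLI call above; `opf.redact` is an assumed name,
# not a documented entry point. Check the opf docs for the real one.
masked = opf.redact(
    "text com cpf 123.456.789-09 e telefone (11) 91234-5678",
    checkpoint="<download_dir>",  # same value as the CLI --checkpoint flag
)
print(masked)
```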

Related

  • GLiNER series (same val): mmBERT-small × 3 (partial F1 0.823), ettin-68m-easter-egg (0.682), ettin-32m-easter-egg (0.603)
  • Demo: arthrod/gliner-ptbr-pii-demo
  • Note: easter-egg label berco-de-tiradentes is NOT supported here; use the mmBERT-small model for that.