---
license: apache-2.0
datasets:
- beki/privy
- gretelai/synthetic_pii_finance_multilingual
- eriktks/conll2003
language:
- en
base_model:
- prajjwal1/bert-small
pipeline_tag: token-classification
---

**A more accurate PII detector** fine-tuned from [`prajjwal1/bert-small`](https://huggingface.co/prajjwal1/bert-small) on the datasets listed in the metadata above.

### About the dataset

We combined several datasets to cover a wide range of document formats, such as:

1. JSON
2. HTML
3. XML
4. SQL
5. Documents

### Label Set

```
AGE, COORDINATE, CREDIT_CARD, DATE_TIME, EMAIL_ADDRESS, FINANCIAL,
IBAN_CODE, IMEI, IP_ADDRESS, LOCATION, MAC_ADDRESS, NRP,
ORGANIZATION, PASSWORD, PERSON, PHONE_NUMBER, TITLE, URL,
US_BANK_NUMBER, US_DRIVER_LICENSE, US_ITIN, US_LICENSE_PLATE, US_PASSPORT, US_SSN
```

## How to Use

### Quick start (pipeline)

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load the tokenizer and the fine-tuned token-classification model from the Hub.
repo = "gravitee-io/bert-small-pii-detection"
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForTokenClassification.from_pretrained(repo)

# "simple" aggregation merges word pieces into one span per detected entity.
pipe = pipeline("token-classification", model=model, tokenizer=tok, aggregation_strategy="simple")

text = ""  # replace with the text to scan
pipe(text)
```

## Evaluation

**Metric:** precision / recall / F1 per entity, with micro, macro, and weighted averages

| Entity             | Precision | Recall | F1-score | Support |
|--------------------|-----------|--------|----------|---------|
| AGE                | 0.9898    | 0.8858 | 0.9349   | 219     |
| COORDINATE         | 0.9627    | 0.8738 | 0.9161   | 325     |
| CREDIT_CARD        | 0.9273    | 0.8870 | 0.9067   | 115     |
| DATE_TIME          | 0.8598    | 0.7364 | 0.7933   | 3255    |
| EMAIL_ADDRESS      | 0.9428    | 0.8941 | 0.9178   | 387     |
| FINANCIAL          | 0.9862    | 0.9565 | 0.9711   | 299     |
| IBAN_CODE          | 0.9577    | 0.9252 | 0.9412   | 147     |
| IMEI               | 0.9885    | 0.9663 | 0.9773   | 89      |
| IP_ADDRESS         | 0.9338    | 0.8812 | 0.9068   | 160     |
| LOCATION           | 0.8849    | 0.8222 | 0.8524   | 4264    |
| MAC_ADDRESS        | 0.9889    | 1.0000 | 0.9944   | 89      |
| NRP                | 1.0000    | 0.9818 | 0.9908   | 494     |
| ORGANIZATION       | 0.7454    | 0.6688 | 0.7051   | 3551    |
| PASSWORD           | 0.8384    | 0.8137 | 0.8259   | 102     |
| PERSON             | 0.9123    | 0.8826 | 0.8972   | 4454    |
| PHONE_NUMBER       | 0.9462    | 0.8199 | 0.8785   | 322     |
| TITLE              | 0.9887    | 0.9734 | 0.9810   | 451     |
| URL                | 1.0000    | 0.9787 | 0.9892   | 188     |
| US_BANK_NUMBER     | 1.0000    | 0.9579 | 0.9785   | 95      |
| US_DRIVER_LICENSE  | 0.9167    | 0.9167 | 0.9167   | 120     |
| US_ITIN            | 0.9659    | 0.8763 | 0.9189   | 97      |
| US_LICENSE_PLATE   | 1.0000    | 0.9000 | 0.9474   | 90      |
| US_PASSPORT        | 0.9200    | 0.9200 | 0.9200   | 100     |
| US_SSN             | 0.9744    | 0.9580 | 0.9661   | 119     |
| **micro avg**      | 0.8804    | 0.8141 | 0.8460   | 19532   |
| **macro avg**      | 0.9429    | 0.8948 | 0.9178   | 19532   |
| **weighted avg**   | 0.8785    | 0.8141 | 0.8446   | 19532   |

## Intended Uses & Limitations

**Use this model for:**

* **Low-resource environments**
* Redacting PII in customer support logs, dev/test environments, API traces, and articles
* Real-time hints in form fields or data entry systems

**Limitations:**

* English-focused; accuracy will degrade on other languages
* Domain drift is real: audit the model on your own data

---

## Citation

If you use this model, please consider citing the following works:

```
@misc{bhargava2021generalization,
  title         = {Generalization in NLI: Ways (Not) To Go Beyond Simple Heuristics},
  author        = {Prajjwal Bhargava and Aleksandr Drozd and Anna Rogers},
  year          = {2021},
  eprint        = {2110.01518},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL}
}

@article{DBLP:journals/corr/abs-1908-08962,
  author     = {Iulia Turc and Ming{-}Wei Chang and Kenton Lee and Kristina Toutanova},
  title      = {Well-Read Students Learn Better: The Impact of Student Initialization on Knowledge Distillation},
  journal    = {CoRR},
  volume     = {abs/1908.08962},
  year       = {2019},
  url        = {http://arxiv.org/abs/1908.08962},
  eprinttype = {arXiv},
  eprint     = {1908.08962},
  timestamp  = {Thu, 29 Aug 2019 16:32:34 +0200},
  biburl     = {https://dblp.org/rec/journals/corr/abs-1908-08962.bib},
  bibsource  = {dblp computer science bibliography, https://dblp.org}
}

@online{WinNT,
  author = {Benjamin Kilimnik},
  title  = {{Privy} Synthetic PII Protocol Trace Dataset},
  year   = 2022,
  url    = {https://huggingface.co/datasets/beki/privy},
}

@online{gretel2023,
  author = {Gretel.ai},
  title  = {{Synthetic PII Finance Multilingual Dataset}},
  year   = 2023,
  url    = {https://huggingface.co/datasets/gretelai/synthetic_pii_finance_multilingual},
}

@inproceedings{tjong-kim-sang-de-meulder-2003-introduction,
  title     = "Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition",
  author    = "Tjong Kim Sang, Erik F. and De Meulder, Fien",
  booktitle = "Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003",
  year      = "2003",
  url       = "https://aclanthology.org/W03-0419",
}
```
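
## Example: redacting detected spans

For the redaction use case listed under Intended Uses, the pipeline output can be turned into masked text. The following is a minimal sketch, assuming the entity dicts that a `transformers` token-classification pipeline returns with `aggregation_strategy="simple"` (keys `entity_group`, `score`, `start`, `end`). The `redact` helper, the `min_score` threshold, and the sample entities are illustrative assumptions and are not part of the model repository. The sample entities are hand-written in the pipeline's output shape and are not real predictions from this model. Overlapping spans are not handled.

```python
def redact(text, entities, min_score=0.5):
    """Replace each detected span with a <LABEL> placeholder (hypothetical helper)."""
    # Keep confident predictions that carry character offsets.
    spans = [
        e for e in entities
        if e["score"] >= min_score
        and e.get("start") is not None
        and e.get("end") is not None
    ]
    # Replace from the end of the string so earlier offsets stay valid.
    for e in sorted(spans, key=lambda e: e["start"], reverse=True):
        text = text[: e["start"]] + f"<{e['entity_group']}>" + text[e["end"]:]
    return text


# Hand-written entities in the pipeline's output shape (NOT real model output).
sample_text = "Contact Jane Doe at jane@example.com"
sample_entities = [
    {"entity_group": "PERSON", "score": 0.98, "word": "jane doe", "start": 8, "end": 16},
    {"entity_group": "EMAIL_ADDRESS", "score": 0.97, "word": "jane@example.com", "start": 20, "end": 36},
]

print(redact(sample_text, sample_entities))
# prints: Contact <PERSON> at <EMAIL_ADDRESS>
```

With a loaded pipeline, the same helper could be called as `redact(text, pipe(text))`. The `min_score` value is a tuning choice: raising it trades recall for precision, which may matter for lower-scoring labels such as `ORGANIZATION` and `DATE_TIME` in the evaluation table.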