EU PII Safeguard

Multilingual PII Detection Model for European Languages

A multilingual model for detecting Personally Identifiable Information (PII) across 26 European languages: the 24 official EU languages plus Russian and Ukrainian. It is designed for GDPR compliance, privacy-preserving AI applications, and secure handling of sensitive data in multilingual settings. The model lets enterprises, researchers, and data protection teams identify and safeguard PII with a self-reported global F1 score of 97.02% across diverse European contexts.

🎯 Model Performance

Global F1 Score: 97.02%
26 Languages Supported
42 PII Entity Types
Consistent 95%+ F1 across all languages

🌍 Supported Languages

🇧🇬 Bulgarian • 🇨🇿 Czech • 🇩🇰 Danish • 🇩🇪 German • 🇬🇷 Greek • 🇬🇧 English • 🇪🇸 Spanish • 🇪🇪 Estonian • 🇫🇮 Finnish • 🇫🇷 French • 🇮🇪 Irish • 🇭🇷 Croatian • 🇭🇺 Hungarian • 🇮🇹 Italian • 🇱🇹 Lithuanian • 🇱🇻 Latvian • 🇲🇹 Maltese • 🇳🇱 Dutch • 🇵🇱 Polish • 🇵🇹 Portuguese • 🇷🇴 Romanian • 🇷🇺 Russian • 🇸🇰 Slovak • 🇸🇮 Slovenian • 🇸🇪 Swedish • 🇺🇦 Ukrainian

🔍 Detected PII Types

Personal: First/Last/Middle Names, Age, Gender, Ethnicity
Contact: Email, Phone, Address, City, Country, Postal Code
Financial: Credit Card, IBAN, Account Numbers, Salary
Identity: National ID, Passport, Driver License, Tax ID
Health: Medical Conditions, Health Insurance ID
Digital: IP Address, MAC Address, URL, Username, Password
And more: 42 total entity types
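
The full set of entity types can be read from the model's label map rather than from this summary. A minimal sketch, assuming the labels follow the usual `B-TYPE` / `I-TYPE` / `O` convention; the `sample_id2label` dictionary and its label names are hypothetical and stand in for `model.config.id2label`:

```python
def entity_types(id2label):
    """Return the sorted list of entity types found in a BIO label map."""
    types = set()
    for label in id2label.values():
        if label == "O":
            continue
        # Strip a leading "B-" / "I-" prefix; labels without one are kept whole.
        prefix, sep, rest = label.partition("-")
        types.add(rest if sep and prefix in ("B", "I") else label)
    return sorted(types)


# Hypothetical label map for illustration only.
sample_id2label = {
    0: "O",
    1: "B-EMAIL",
    2: "I-EMAIL",
    3: "B-FIRSTNAME",
    4: "I-FIRSTNAME",
}
print(entity_types(sample_id2label))
```

Calling `entity_types(model.config.id2label)` on the loaded model should list the entity types it was actually trained on.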

🚀 Quick Start

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model and tokenizer
model_name = "tabularisai/eu-pii-safeguard"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Example text (French); the email address is a placeholder
text = "Bonjour, je suis Marie Dubois, email: marie.dubois@example.com"

# Tokenize (capped at the model's 256-token maximum) and predict
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=-1)

# Get predictions
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
predicted_labels = [model.config.id2label[pred.item()] for pred in predictions[0]]

print("Detected PII:")
for token, label in zip(tokens, predicted_labels):
    if label != "O":
        print(f"  {label}: {token}")
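
The loop above prints one line per subword token, so a single name or email address can be spread over several lines. A minimal sketch of merging those pieces into whole entities, assuming B-/I- labels and SentencePiece tokens in which "▁" (U+2581) marks the start of a word; `group_entities` is a hypothetical helper, and the example tokens and label names are illustrative rather than real model output:

```python
def group_entities(tokens, labels):
    """Merge consecutive B-/I- tagged subword tokens into (entity_type, text) pairs."""
    entities = []
    current_type, pieces = None, []

    def flush():
        nonlocal current_type, pieces
        if current_type is not None:
            # U+2581 marks a word boundary in SentencePiece vocabularies.
            text = "".join(pieces).replace("\u2581", " ").strip()
            entities.append((current_type, text))
        current_type, pieces = None, []

    for token, label in zip(tokens, labels):
        if label == "O" or token in ("<s>", "</s>", "<pad>"):
            flush()
            continue
        tag, _, etype = label.partition("-")
        # A B- tag or a change of entity type starts a new entity.
        if tag == "B" or etype != current_type:
            flush()
            current_type = etype
        pieces.append(token)
    flush()
    return entities


# Illustrative tokens and labels (hypothetical, not actual model output).
tokens = ["<s>", "\u2581je", "\u2581suis", "\u2581Marie", "\u2581Dub", "ois", "</s>"]
labels = ["O", "O", "O", "B-FIRSTNAME", "B-LASTNAME", "I-LASTNAME", "O"]
print(group_entities(tokens, labels))
```

With the variables from the Quick Start, the call would be `group_entities(tokens, predicted_labels)`. The `transformers` token-classification pipeline offers comparable built-in grouping through its `aggregation_strategy` argument, which may be the simpler choice in practice.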

📊 Performance by Language

Language	F1 Score	Language	F1 Score
Irish (ga)	97.98%	Dutch (nl)	97.24%
Bulgarian (bg)	97.80%	Slovak (sk)	97.21%
Italian (it)	97.68%	Swedish (sv)	97.09%
Portuguese (pt)	97.61%	Russian (ru)	97.04%
Slovenian (sl)	97.51%	Croatian (hr)	96.93%
Czech (cs)	97.51%	Polish (pl)	96.63%
Hungarian (hu)	97.50%	French (fr)	96.59%
Estonian (et)	97.41%	Romanian (ro)	96.54%
Latvian (lv)	97.40%	Danish (da)	96.36%
English (en)	97.36%	German (de)	96.22%
Spanish (es)	97.34%	Ukrainian (uk)	96.09%
Finnish (fi)	97.30%	Maltese (mt)	95.78%
Lithuanian (lt)	97.24%	Greek (el)	95.42%
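
As a quick sanity check on the table, the per-language scores can be averaged and compared with the headline numbers. The sketch below only re-does arithmetic on the values listed above; how the global F1 is aggregated is not stated here, so a small gap between the unweighted mean and the reported 97.02% is to be expected.

```python
# Per-language F1 scores (%) copied from the table above.
f1_by_language = {
    "ga": 97.98, "bg": 97.80, "it": 97.68, "pt": 97.61, "sl": 97.51,
    "cs": 97.51, "hu": 97.50, "et": 97.41, "lv": 97.40, "en": 97.36,
    "es": 97.34, "fi": 97.30, "lt": 97.24, "nl": 97.24, "sk": 97.21,
    "sv": 97.09, "ru": 97.04, "hr": 96.93, "pl": 96.63, "fr": 96.59,
    "ro": 96.54, "da": 96.36, "de": 96.22, "uk": 96.09, "mt": 95.78,
    "el": 95.42,
}

mean_f1 = sum(f1_by_language.values()) / len(f1_by_language)
lowest = min(f1_by_language, key=f1_by_language.get)
highest = max(f1_by_language, key=f1_by_language.get)

print(f"languages: {len(f1_by_language)}")                        # 26
print(f"unweighted mean F1: {mean_f1:.2f}%")                      # 97.03%
print(f"lowest: {lowest} ({f1_by_language[lowest]:.2f}%)")        # el (95.42%)
print(f"highest: {highest} ({f1_by_language[highest]:.2f}%)")     # ga (97.98%)
```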

💼 Use Cases

🔒 Data Privacy: Automatically detect and anonymize PII before processing
⚖️ GDPR Compliance: Ensure regulatory compliance across EU markets
🛡️ Security: Prevent data breaches by identifying sensitive information
📊 Data Governance: Audit and catalog personal data in multilingual datasets
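
For the anonymization use case, detected entities still have to be replaced in the original text. A minimal sketch, assuming the detections are already available as non-overlapping `(start, end, label)` character spans; `redact` is a hypothetical helper and the label names are illustrative:

```python
def redact(text, spans):
    """Replace each (start, end, label) character span with a [LABEL] placeholder."""
    # Work right to left so earlier offsets stay valid after each replacement.
    for start, end, label in sorted(spans, key=lambda s: s[0], reverse=True):
        text = text[:start] + f"[{label}]" + text[end:]
    return text


text = "Bonjour, je suis Marie Dubois."
# Illustrative spans (hypothetical, not actual model output).
spans = [(17, 22, "FIRSTNAME"), (23, 29, "LASTNAME")]
print(redact(text, spans))  # Bonjour, je suis [FIRSTNAME] [LASTNAME].
```

Character offsets of this kind can typically be obtained from a fast tokenizer by passing `return_offsets_mapping=True`, then combining the offsets of the tokens that belong to each entity.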

🏗️ Model Architecture

Base Model: XLM-RoBERTa-large
Task: Token Classification
Labels: 74 (B-/I- format for 42 entity types)
Max Length: 256 tokens
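
Because inputs are limited to 256 tokens, longer documents need to be split before inference. One common approach is overlapping windows, so that an entity cut at one window boundary is seen whole in the next. A minimal sketch over a plain list of token ids; `sliding_windows` and its defaults are hypothetical, and `stride` here means the number of tokens shared by consecutive windows:

```python
def sliding_windows(token_ids, max_len=256, stride=64):
    """Split token ids into overlapping windows of at most max_len ids.

    Returns a list of (start_index, window) pairs.
    """
    if not 0 <= stride < max_len:
        raise ValueError("stride must satisfy 0 <= stride < max_len")
    step = max_len - stride
    windows = []
    start = 0
    while True:
        windows.append((start, token_ids[start:start + max_len]))
        if start + max_len >= len(token_ids):
            break
        start += step
    return windows


# 600 dummy ids stand in for a tokenized long document.
ids = list(range(600))
for start, window in sliding_windows(ids):
    print(start, len(window))
```

If the 256-token limit includes the special start and end tokens, `max_len` should be lowered by two to leave room for them. The `transformers` tokenizers can produce similar windows directly via `return_overflowing_tokens=True` together with `stride`.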

🔄 Community Feedback

We're actively seeking feedback from the community! Please:

🐛 Report issues or edge cases
💡 Suggest improvements
🧪 Share your use cases and results
📊 Contribute evaluation on new datasets

🏢 About Tabularis AI

Developed by Tabularis AI - Building privacy-preserving AI solutions for enterprise data protection.

For questions, collaborations, or licensing inquiries: [email protected]

Format: Safetensors
Model size: 0.6B params
Tensor type: F32
Base model: FacebookAI/xlm-roberta-large

Evaluation results (self-reported)

F1 Score: 0.970
Precision: 0.970
Recall: 0.970