ai4privacy Collection

PII Masking 2M European Release

The largest open-source collection of synthetic PII datasets for European languages. 2M+ examples across 32 locales and 98 entity types, purpose-built for training privacy-preserving NLP models.


Token Classification | NER | Text Generation | Multilingual | Synthetic | GDPR | EU AI Act

2M+ examples | 32 European locales | 98 entity types | 10M+ annotations

Collection Contents (6 component datasets)

OpenPII 1M

pii-masking-openpii-1m

Open

The core open-source component, with 1.43M examples across 23 European languages and 19 core PII entity types.

1.43M rows | 23 languages | 19 entities | CC-BY-4.0

Health / PHI

pii-masking-health-phi-200k

Enterprise

Personal Health Information with 24 medical-specific labels including diagnoses, medications, test results, and allergies.

200K rows | 24 entities | Commercial License

Financial / PFI

pii-masking-financial-pfi-200k

Enterprise

Personal Financial Information covering finance- and insurance-specific PII entities for banking and fintech applications.

200K rows | Commercial License

Digital / PDI

pii-masking-digital-pdi-200k

Enterprise

Personal Digital Information for tech platforms, covering digital identifiers, usernames, IPs, and online activity data.

200K rows | Commercial License

Work / PWI

pii-masking-work-pwi-200k

Enterprise

Personal Work Information for HR and employment, including employee IDs, salary data, performance reviews, and contracts.

200K rows | Commercial License

Location / PLI

pii-masking-location-pli-200k

Enterprise

Personal Location Information with fine-grained geographic and address entities across all 32 European locales.

200K rows | Commercial License

Data Structure (schema)

example.json

{
  "source_text": "Dear John Smith, your appointment at St. Mary's Hospital...",
  "masked_text": "Dear [GIVENNAME_1] [SURNAME_1], your appointment at [HOSPITALNAME_1]...",
  "privacy_mask": [
    {
      "value": "[REDACTED]",
      "start": 5,
      "end": 9,
      "label": "GIVENNAME",
      "label_index": 1
    }
  ],
  "language": "en",
  "region": "GB",
  "mbert_token_classes": ["O", "B-GIVENNAME", "B-SURNAME", "O", ...]
}
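
The offsets in `privacy_mask` can be used to rebuild `masked_text` from `source_text`. The following is a minimal Python sketch, assuming that `start` and `end` are character offsets with an exclusive `end` (consistent with 5 to 9 covering "John" in the example above) and that placeholders follow the `[LABEL_index]` pattern. The function name `apply_privacy_mask`, the sample sentence, and the second span are illustrative and not taken from the dataset.

```python
def apply_privacy_mask(source_text, privacy_mask):
    """Rebuild a masked string from character-offset spans.

    Assumes `end` is exclusive and spans do not overlap.
    """
    masked = source_text
    # Replace from right to left so earlier offsets stay valid.
    for span in sorted(privacy_mask, key=lambda s: s["start"], reverse=True):
        placeholder = f"[{span['label']}_{span['label_index']}]"
        masked = masked[:span["start"]] + placeholder + masked[span["end"]:]
    return masked


# Hypothetical record shaped like example.json (values invented for illustration).
record = {
    "source_text": "Dear John Smith, your appointment is confirmed.",
    "privacy_mask": [
        {"start": 5, "end": 9, "label": "GIVENNAME", "label_index": 1},
        {"start": 10, "end": 15, "label": "SURNAME", "label_index": 1},
    ],
}

print(apply_privacy_mask(record["source_text"], record["privacy_mask"]))
# Dear [GIVENNAME_1] [SURNAME_1], your appointment is confirmed.
```

Comparing the result with the record's own `masked_text` is one way to sanity-check offsets after any preprocessing that might shift characters.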

Entity Types (19 core + 79 industry-specific)

Core PII Labels (Open)

DATE GIVENNAME SURNAME EMAIL CITY TITLE TELEPHONENUM AGE STREET BUILDINGNUM ZIPCODE IDCARDNUM CREDITCARDNUMBER DRIVERLICENSENUM GENDER TAXNUM SEX SOCIALNUM PASSPORTNUM
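
For token classification, the 19 core labels are typically expanded into a BIO tag set. The sketch below shows one way to build the label maps. The id ordering here is an assumption made for illustration and may not match the class ordering used in the dataset's own `mbert_token_classes` field.

```python
CORE_LABELS = [
    "DATE", "GIVENNAME", "SURNAME", "EMAIL", "CITY", "TITLE", "TELEPHONENUM",
    "AGE", "STREET", "BUILDINGNUM", "ZIPCODE", "IDCARDNUM", "CREDITCARDNUMBER",
    "DRIVERLICENSENUM", "GENDER", "TAXNUM", "SEX", "SOCIALNUM", "PASSPORTNUM",
]


def build_bio_label_maps(entity_labels):
    """Return (label2id, id2label) for a BIO scheme: "O" plus B-/I- per entity."""
    labels = ["O"]
    for name in entity_labels:
        labels.extend([f"B-{name}", f"I-{name}"])
    label2id = {label: idx for idx, label in enumerate(labels)}
    id2label = {idx: label for label, idx in label2id.items()}
    return label2id, id2label


label2id, id2label = build_bio_label_maps(CORE_LABELS)
print(len(label2id))  # 39 classes: 1 "O" tag + 2 tags for each of the 19 labels
```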

Industry-Specific Labels (Enterprise)

DIAGNOSES MEDICATION TESTRESULTS ALLERGIES HOSPITALNAME ACCOUNTNUM IBAN EMPLOYEEID IPADDRESS USERNAME + 69 more

Language Coverage (32 locales)

[Figure: European language coverage map showing 32 locales. Caption: Language distribution across the dataset]

Label Distribution (annotations)

[Figure: Distribution of PII entity labels across the dataset]

Use Cases (tasks supported)

Named Entity Recognition

Train NER models to detect and classify PII entities with pre-computed mBERT-compatible BIO labels.
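
To illustrate how character spans relate to BIO labels, the sketch below tags whitespace-delimited tokens from a list of spans. Whitespace tokenization is a simplification standing in for the mBERT subword tokenizer, so the output will not line up with the dataset's pre-computed `mbert_token_classes`. The function name `spans_to_bio` and the sample sentence are hypothetical.

```python
import re


def spans_to_bio(text, spans):
    """Tag whitespace-delimited tokens with BIO labels from character spans.

    Each span is a dict with "start", "end" (exclusive), and "label".
    """
    tokens, tags = [], []
    open_span = None  # (start, end) of the span the previous token belonged to
    for match in re.finditer(r"\S+", text):
        tok_start, tok_end = match.span()
        tag, current = "O", None
        for span in spans:
            # A token belongs to a span when the two character ranges overlap.
            if tok_start < span["end"] and tok_end > span["start"]:
                current = (span["start"], span["end"])
                prefix = "I" if current == open_span else "B"
                tag = f"{prefix}-{span['label']}"
                break
        open_span = current
        tokens.append(match.group())
        tags.append(tag)
    return tokens, tags


text = "Contact Anna Maria Keller today"
spans = [
    {"start": 8, "end": 18, "label": "GIVENNAME"},  # "Anna Maria"
    {"start": 19, "end": 25, "label": "SURNAME"},   # "Keller"
]
tokens, tags = spans_to_bio(text, spans)
print(list(zip(tokens, tags)))
```

With a subword tokenizer, the same overlap test can be applied to each token's character offsets instead of the whitespace matches.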

Data Anonymization

Build production-grade anonymization pipelines compliant with GDPR and the EU AI Act.

LLM Fine-tuning

Fine-tune large language models for privacy-aware text generation and redaction tasks.
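
One possible way to prepare records for redaction fine-tuning is to turn each `source_text` / `masked_text` pair into a prompt and a completion. The sketch below is an assumption about formatting, not a recipe from the dataset authors. The instruction wording, the `prompt` / `completion` keys, and the function name `to_redaction_example` are all illustrative.

```python
import json

DEFAULT_INSTRUCTION = "Replace all personal data in the text with placeholder tags."


def to_redaction_example(record, instruction=DEFAULT_INSTRUCTION):
    """Map one dataset record to a prompt/completion pair for supervised fine-tuning."""
    return {
        "prompt": f"{instruction}\n\nText: {record['source_text']}\nRedacted:",
        "completion": " " + record["masked_text"],
    }


# Hypothetical record using the two text fields from example.json.
sample = {
    "source_text": "Dear John Smith, your appointment is confirmed.",
    "masked_text": "Dear [GIVENNAME_1] [SURNAME_1], your appointment is confirmed.",
}

# One JSON object per line is a common input format for fine-tuning tools.
print(json.dumps(to_redaction_example(sample), ensure_ascii=False))
```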

Need enterprise-grade data?

Get access to the full 2M-example dataset, including all industry-specific components, with commercial licensing for your organization.

