AI4Privacy Collection

PII Masking 2M European Release

The largest open-source collection of synthetic PII datasets for European languages. 2M+ examples across 32 locales and 98 entity types, purpose-built for training privacy-preserving NLP models.

Tags: Token Classification, NER, Text Generation, Multilingual, Synthetic, GDPR, EU AI Act

2M+ Examples
32 European Locales
98 Entity Types
10M+ Annotations

6 component datasets Collection Contents

Schema Data Structure

example.json
{
  "source_text": "Dear John Smith, your appointment at St. Mary's Hospital...",
  "masked_text": "Dear [GIVENNAME_1] [SURNAME_1], your appointment at [HOSPITALNAME_1]...",
  "privacy_mask": [
    {
      "value": "[REDACTED]",
      "start": 5,
      "end": 9,
      "label": "GIVENNAME",
      "label_index": 1
    }
  ],
  "language": "en",
  "region": "GB",
  "mbert_token_classes": ["O", "B-GIVENNAME", "B-SURNAME", "O", ...]
}
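
The span fields in the record above can be read back against the text. The following is a minimal sketch, assuming `start` and `end` are 0-based character offsets into `source_text` with an exclusive end (which is consistent with the example, where 5 to 9 covers the given name). The helper name `span_texts` is hypothetical and not part of the dataset tooling, and the record is a trimmed copy of the example with the elided parts left out.

```python
import json

# Trimmed copy of the example record; fields not needed here are omitted.
RECORD_JSON = """
{
  "source_text": "Dear John Smith, your appointment at St. Mary's Hospital...",
  "privacy_mask": [
    {"start": 5, "end": 9, "label": "GIVENNAME", "label_index": 1}
  ]
}
"""


def span_texts(record):
    """Return (label, label_index, covered_text) for each privacy_mask entry.

    Assumption: offsets are 0-based character positions with an exclusive end.
    """
    text = record["source_text"]
    out = []
    for span in record["privacy_mask"]:
        start, end = span["start"], span["end"]
        # Reject spans that fall outside the text or are empty.
        if not (0 <= start < end <= len(text)):
            raise ValueError(f"span out of range: {span}")
        out.append((span["label"], span["label_index"], text[start:end]))
    return out


record = json.loads(RECORD_JSON)
print(span_texts(record))
```

A check like this is a cheap way to confirm the offset convention before building anything on top of the spans.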

19 core + 79 industry-specific Entity Types

Core PII Labels (Open)

DATE GIVENNAME SURNAME EMAIL CITY TITLE TELEPHONENUM AGE STREET BUILDINGNUM ZIPCODE IDCARDNUM CREDITCARDNUMBER DRIVERLICENSENUM GENDER TAXNUM SEX SOCIALNUM PASSPORTNUM

Industry-Specific Labels (Enterprise)

DIAGNOSES MEDICATION TESTRESULTS ALLERGIES HOSPITALNAME ACCOUNTNUM IBAN EMPLOYEEID IPADDRESS USERNAME + 69 more
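
The split between core and industry-specific labels can be expressed as a simple filter. This is a sketch only: `CORE_LABELS` copies the 19 core labels listed above, while the helper name `keep_core` and the sample spans are hypothetical.

```python
# The 19 core labels, copied from the list above.
CORE_LABELS = frozenset({
    "DATE", "GIVENNAME", "SURNAME", "EMAIL", "CITY", "TITLE", "TELEPHONENUM",
    "AGE", "STREET", "BUILDINGNUM", "ZIPCODE", "IDCARDNUM", "CREDITCARDNUMBER",
    "DRIVERLICENSENUM", "GENDER", "TAXNUM", "SEX", "SOCIALNUM", "PASSPORTNUM",
})


def keep_core(privacy_mask):
    """Keep only the spans whose label belongs to the core set."""
    return [span for span in privacy_mask if span["label"] in CORE_LABELS]


# Hypothetical spans: one core label, one industry-specific label.
spans = [{"label": "GIVENNAME"}, {"label": "HOSPITALNAME"}]
core_only = keep_core(spans)
print(core_only)
```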

32 locales Language Coverage

Figure: European language coverage map showing 32 locales.
Figure: Language distribution across the dataset.

Annotations Label Distribution

Figure: Distribution of PII entity labels across the dataset.

Tasks supported Use Cases

Named Entity Recognition

Train NER models to detect and classify PII entities with pre-computed mBERT-compatible BIO labels.
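
To illustrate how character spans relate to BIO labels, here is a minimal sketch that tags simple regex tokens rather than mBERT word pieces. The tokenizer, the function name `bio_tags`, and the second span are assumptions for illustration; the dataset's own pre-computed `mbert_token_classes` follow the mBERT tokenization instead.

```python
import re


def bio_tags(text, spans):
    """Tag regex tokens with BIO labels derived from character spans.

    Assumptions: spans are dicts with a 0-based `start`, an exclusive `end`,
    and a `label`, and spans do not overlap.
    """
    tokens, tags = [], []
    # Split into runs of word characters and single punctuation marks.
    for m in re.finditer(r"\w+|[^\w\s]", text):
        tag = "O"
        for span in spans:
            if m.start() >= span["start"] and m.end() <= span["end"]:
                # The first token of a span gets B-, later ones get I-.
                prefix = "B" if m.start() == span["start"] else "I"
                tag = f"{prefix}-{span['label']}"
                break
        tokens.append(m.group())
        tags.append(tag)
    return tokens, tags


text = "Dear John Smith, your appointment"
spans = [
    {"start": 5, "end": 9, "label": "GIVENNAME"},
    {"start": 10, "end": 15, "label": "SURNAME"},  # hypothetical second span
]
tokens, tags = bio_tags(text, spans)
print(list(zip(tokens, tags)))
```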

Data Anonymization

Build production-grade anonymization pipelines that support compliance with GDPR and the EU AI Act.
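
The masking step of such a pipeline can be sketched as follows. This assumes spans use 0-based character offsets with an exclusive end and do not overlap, and that placeholders take the `[LABEL_index]` form seen in the `masked_text` field. The function name `mask_text` and the second span are hypothetical.

```python
def mask_text(text, spans):
    """Replace each span in `text` with a [LABEL_index] placeholder.

    Spans are applied from right to left so that earlier offsets stay valid
    while the string changes length.
    """
    masked = text
    for span in sorted(spans, key=lambda s: s["start"], reverse=True):
        placeholder = f"[{span['label']}_{span['label_index']}]"
        masked = masked[:span["start"]] + placeholder + masked[span["end"]:]
    return masked


text = "Dear John Smith, your appointment"
spans = [
    {"start": 5, "end": 9, "label": "GIVENNAME", "label_index": 1},
    {"start": 10, "end": 15, "label": "SURNAME", "label_index": 1},  # hypothetical
]
print(mask_text(text, spans))
```

In a real pipeline the spans would come from a trained detection model rather than from gold annotations.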

LLM Fine-tuning

Fine-tune large language models for privacy-aware text generation and redaction tasks.
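
One way to prepare records for redaction fine-tuning is to pair `source_text` with `masked_text`. The sketch below is an assumption about how such pairs might be built: the instruction wording, the `prompt`/`completion` keys, and the function name `to_training_pair` are hypothetical and should be adapted to the target training framework.

```python
import json

# Hypothetical instruction prefix; adjust to the model's expected format.
INSTRUCTION = "Redact all personal data in the following text.\n\n"


def to_training_pair(record):
    """Build a prompt/completion pair from one dataset record."""
    return {
        "prompt": INSTRUCTION + record["source_text"],
        "completion": record["masked_text"],
    }


record = {
    "source_text": "Dear John Smith, your appointment at St. Mary's Hospital...",
    "masked_text": "Dear [GIVENNAME_1] [SURNAME_1], your appointment at [HOSPITALNAME_1]...",
}

# One JSON object per line (JSONL) is a common input format for fine-tuning.
line = json.dumps(to_training_pair(record), ensure_ascii=False)
print(line)
```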

Need enterprise-grade data?

Get access to the full 2M+ example dataset, including all industry-specific components, with commercial licensing for your organization.