AI4Privacy Collection

OpenPII 1M

A large-scale, multilingual collection of 1,428,143 synthetic text examples with fine-grained PII annotations, spanning 23 European languages and 19 entity types. Released under the CC-BY-4.0 license.

Tags: Token Classification, NER, Text Generation, Multilingual, Synthetic, CC-BY-4.0
Examples: 1.43M+
Languages: 23
Entity types: 19
Annotations: 10M+

Data Structure (schema)

example.json
{
  "source_text": "John Smith lives at 42 Rue de Rivoli, 75001 Paris.",
  "masked_text": "[GIVENNAME_1] [SURNAME_1] lives at [BUILDINGNUM_1] [STREET_1], [ZIPCODE_1] [CITY_1].",
  "privacy_mask": [
    {
      "value": "John",
      "start": 0,
      "end": 4,
      "label": "GIVENNAME",
      "label_index": 1
    }
  ],
  "language": "fr",
  "region": "FR",
  "mbert_token_classes": ["O", "B-GIVENNAME", "B-SURNAME", "O", ...]
}
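The relationship between `source_text`, `privacy_mask`, and `masked_text` can be sketched in a few lines of Python. This is a minimal illustration, not the dataset's own tooling: it assumes `start`/`end` are character offsets with an exclusive end (as the `0`..`4` span for "John" suggests), and the five spans beyond the first are filled in by hand here, since the example above shows only one. The helper name `apply_privacy_mask` is hypothetical.

```python
def apply_privacy_mask(source_text, privacy_mask):
    """Replace each annotated span with a [LABEL_index] placeholder."""
    pieces = []
    cursor = 0
    # Walk the spans left to right, copying the unmasked text between them.
    for span in sorted(privacy_mask, key=lambda s: s["start"]):
        pieces.append(source_text[cursor:span["start"]])
        pieces.append(f"[{span['label']}_{span['label_index']}]")
        cursor = span["end"]  # assumed exclusive end offset
    pieces.append(source_text[cursor:])
    return "".join(pieces)


# Record modeled on example.json; spans after the first are illustrative.
record = {
    "source_text": "John Smith lives at 42 Rue de Rivoli, 75001 Paris.",
    "masked_text": "[GIVENNAME_1] [SURNAME_1] lives at [BUILDINGNUM_1] "
                   "[STREET_1], [ZIPCODE_1] [CITY_1].",
    "privacy_mask": [
        {"value": "John", "start": 0, "end": 4, "label": "GIVENNAME", "label_index": 1},
        {"value": "Smith", "start": 5, "end": 10, "label": "SURNAME", "label_index": 1},
        {"value": "42", "start": 20, "end": 22, "label": "BUILDINGNUM", "label_index": 1},
        {"value": "Rue de Rivoli", "start": 23, "end": 36, "label": "STREET", "label_index": 1},
        {"value": "75001", "start": 38, "end": 43, "label": "ZIPCODE", "label_index": 1},
        {"value": "Paris", "start": 44, "end": 49, "label": "CITY", "label_index": 1},
    ],
}

masked = apply_privacy_mask(record["source_text"], record["privacy_mask"])
print(masked)
```

Comparing the rebuilt string against the stored `masked_text` is a cheap sanity check to run on a sample of records before training.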

Entity Types (19 core labels)

DATE, GIVENNAME, SURNAME, EMAIL, CITY, TITLE, TELEPHONENUM, AGE, STREET, BUILDINGNUM, ZIPCODE, IDCARDNUM, CREDITCARDNUMBER, DRIVERLICENSENUM, GENDER, TAXNUM, SEX, SOCIALNUM, PASSPORTNUM
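For token classification, these labels are usually expanded into a BIO tag set and mapped to integer ids. The sketch below shows one way to do that. It assumes a standard `B-`/`I-` scheme with a single `O` class; the example record only shows `B-` tags, so the dataset's actual tag inventory and id order may differ. The names `build_bio_labels`, `label2id`, and `id2label` are illustrative.

```python
ENTITY_TYPES = [
    "DATE", "GIVENNAME", "SURNAME", "EMAIL", "CITY", "TITLE", "TELEPHONENUM",
    "AGE", "STREET", "BUILDINGNUM", "ZIPCODE", "IDCARDNUM", "CREDITCARDNUMBER",
    "DRIVERLICENSENUM", "GENDER", "TAXNUM", "SEX", "SOCIALNUM", "PASSPORTNUM",
]


def build_bio_labels(entity_types):
    """Return ["O", "B-X", "I-X", ...] for every entity type X."""
    labels = ["O"]
    for entity in entity_types:
        labels.append(f"B-{entity}")
        labels.append(f"I-{entity}")
    return labels


bio_labels = build_bio_labels(ENTITY_TYPES)
label2id = {label: idx for idx, label in enumerate(bio_labels)}
id2label = {idx: label for label, idx in label2id.items()}

print(len(bio_labels), "classes")
```

Under these assumptions the 19 entity types yield 39 classes, which is the size a classification head would need.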

Language Coverage (23 languages)

[Figure: European language coverage map showing 23 languages]
[Figure: Language distribution across the dataset]

Use Cases (tasks supported)

Named Entity Recognition

Train NER models to detect and classify PII entities with pre-computed mBERT-compatible BIO labels.
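When a different tokenizer is used, the pre-computed labels no longer line up and BIO tags have to be derived from the character spans instead. The following is a rough sketch of that alignment. It uses a simple regex tokenizer as a stand-in (not mBERT's WordPiece), assumes exclusive end offsets, and the helper name `spans_to_bio` is hypothetical.

```python
import re


def spans_to_bio(text, spans):
    """Tag each regex token with B-/I-/O based on the character spans."""
    tokens, tags = [], []
    # Words and single punctuation marks become separate tokens.
    for match in re.finditer(r"\w+|[^\w\s]", text):
        tag = "O"
        for span in spans:
            # A token belongs to a span only if it lies fully inside it.
            if match.start() >= span["start"] and match.end() <= span["end"]:
                prefix = "B" if match.start() == span["start"] else "I"
                tag = f"{prefix}-{span['label']}"
                break
        tokens.append(match.group())
        tags.append(tag)
    return tokens, tags


text = "John Smith lives at 42 Rue de Rivoli, 75001 Paris."
spans = [
    {"start": 0, "end": 4, "label": "GIVENNAME"},
    {"start": 5, "end": 10, "label": "SURNAME"},
    {"start": 20, "end": 22, "label": "BUILDINGNUM"},
    {"start": 23, "end": 36, "label": "STREET"},
    {"start": 38, "end": 43, "label": "ZIPCODE"},
    {"start": 44, "end": 49, "label": "CITY"},
]

tokens, tags = spans_to_bio(text, spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

With a subword tokenizer the same idea applies, except that the token offsets come from the tokenizer's offset mapping rather than from a regex.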

Data Anonymization

Build production-grade anonymization pipelines to support compliance with GDPR and the EU AI Act.
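One building block of such a pipeline is reversible masking: replace detected spans with placeholders and keep a mapping so that authorized systems can restore the text. The sketch below illustrates the idea under a few assumptions: spans arrive in the dataset's `privacy_mask` layout (from annotations or from a trained detector), end offsets are exclusive, and the function names `pseudonymize` and `restore` are made up for this example.

```python
import re


def pseudonymize(text, spans):
    """Return (masked_text, mapping) where mapping is placeholder -> original."""
    mapping = {}
    pieces, cursor = [], 0
    for span in sorted(spans, key=lambda s: s["start"]):
        placeholder = f"[{span['label']}_{span['label_index']}]"
        mapping[placeholder] = text[span["start"]:span["end"]]
        pieces.append(text[cursor:span["start"]])
        pieces.append(placeholder)
        cursor = span["end"]
    pieces.append(text[cursor:])
    return "".join(pieces), mapping


def restore(masked_text, mapping):
    """Substitute known placeholders back; unknown ones are left untouched."""
    return re.sub(
        r"\[[A-Z]+_\d+\]",
        lambda m: mapping.get(m.group(0), m.group(0)),
        masked_text,
    )


text = "John Smith lives at 42 Rue de Rivoli, 75001 Paris."
spans = [
    {"start": 0, "end": 4, "label": "GIVENNAME", "label_index": 1},
    {"start": 5, "end": 10, "label": "SURNAME", "label_index": 1},
    {"start": 44, "end": 49, "label": "CITY", "label_index": 1},
]

masked, mapping = pseudonymize(text, spans)
print(masked)
print(restore(masked, mapping))
```

The mapping holds the original values, so it has to be stored and access-controlled separately from the masked text.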

LLM Fine-tuning

Fine-tune large language models for privacy-aware text generation and redaction tasks.
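For redaction-style fine-tuning, each record's `source_text` and `masked_text` can be turned into a prompt/completion pair. The snippet below is one possible formatting, offered as a sketch: the instruction wording, the `prompt`/`completion` keys, and the helper name `to_sft_example` are assumptions, and the right layout depends on the training framework in use.

```python
import json

# Hypothetical instruction; adjust the wording to the target model.
INSTRUCTION = (
    "Replace every piece of personal information in the text "
    "with a placeholder such as [GIVENNAME_1]."
)


def to_sft_example(record, instruction=INSTRUCTION):
    """Build a prompt/completion pair from one dataset record."""
    prompt = f"{instruction}\n\nText: {record['source_text']}\nMasked:"
    return {"prompt": prompt, "completion": " " + record["masked_text"]}


records = [
    {
        "source_text": "John Smith lives at 42 Rue de Rivoli, 75001 Paris.",
        "masked_text": "[GIVENNAME_1] [SURNAME_1] lives at [BUILDINGNUM_1] "
                       "[STREET_1], [ZIPCODE_1] [CITY_1].",
    }
]

# One JSON object per line (JSONL); ensure_ascii=False keeps accented text readable.
jsonl_lines = [json.dumps(to_sft_example(r), ensure_ascii=False) for r in records]
print(jsonl_lines[0])
```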

Get started with OpenPII 1M

Download the full dataset with 1.4M+ examples and 10M+ annotations, freely available under CC-BY-4.0.