ai4privacy Collection

OpenPII 1M

A large-scale, multilingual collection of 1,428,143 synthetic text examples with fine-grained PII annotations, spanning 23 European languages and 19 entity types. Released under the CC-BY-4.0 license.


Tags: Token Classification, NER, Text Generation, Multilingual, Synthetic, CC-BY-4.0

1.43M+ examples
23 languages
19 entity types
10M+ annotations

Data Structure (schema)

example.json

{
  "source_text": "John Smith lives at 42 Rue de Rivoli, 75001 Paris.",
  "masked_text": "[GIVENNAME_1] [SURNAME_1] lives at [BUILDINGNUM_1] [STREET_1], [ZIPCODE_1] [CITY_1].",
  "privacy_mask": [
    {
      "value": "John",
      "start": 0,
      "end": 4,
      "label": "GIVENNAME",
      "label_index": 1
    }
  ],
  "language": "fr",
  "region": "FR",
  "mbert_token_classes": ["O", "B-GIVENNAME", "B-SURNAME", "O", ...]
}
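The relationship between `source_text`, `privacy_mask`, and `masked_text` can be illustrated with a minimal sketch. It assumes that each span's `start` and `end` are character offsets into `source_text` (end exclusive) and that placeholders follow the `[LABEL_index]` pattern seen in the example above. The function name `apply_privacy_mask` is hypothetical, and every span after the first is filled in here for illustration only; the example record above lists just the `GIVENNAME` span.

```python
# Illustrative record: only the GIVENNAME span comes from the example above;
# the remaining spans are assumed offsets added so the sketch is complete.
record = {
    "source_text": "John Smith lives at 42 Rue de Rivoli, 75001 Paris.",
    "masked_text": "[GIVENNAME_1] [SURNAME_1] lives at [BUILDINGNUM_1] "
                   "[STREET_1], [ZIPCODE_1] [CITY_1].",
    "privacy_mask": [
        {"start": 0, "end": 4, "label": "GIVENNAME", "label_index": 1},
        {"start": 5, "end": 10, "label": "SURNAME", "label_index": 1},
        {"start": 20, "end": 22, "label": "BUILDINGNUM", "label_index": 1},
        {"start": 23, "end": 36, "label": "STREET", "label_index": 1},
        {"start": 38, "end": 43, "label": "ZIPCODE", "label_index": 1},
        {"start": 44, "end": 49, "label": "CITY", "label_index": 1},
    ],
}


def apply_privacy_mask(source_text, privacy_mask):
    """Replace each annotated span with a [LABEL_index] placeholder."""
    parts = []
    cursor = 0
    # Walk the spans left to right, copying the untouched text between them.
    for span in sorted(privacy_mask, key=lambda s: s["start"]):
        parts.append(source_text[cursor:span["start"]])
        parts.append(f"[{span['label']}_{span['label_index']}]")
        cursor = span["end"]
    parts.append(source_text[cursor:])  # trailing text after the last span
    return "".join(parts)


masked = apply_privacy_mask(record["source_text"], record["privacy_mask"])
print(masked)
```

Under these assumptions, the rebuilt string matches the record's `masked_text`, which also makes this a simple consistency check for individual rows.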

Entity Types (19 core labels)

DATE, GIVENNAME, SURNAME, EMAIL, CITY, TITLE, TELEPHONENUM, AGE, STREET, BUILDINGNUM, ZIPCODE, IDCARDNUM, CREDITCARDNUMBER, DRIVERLICENSENUM, GENDER, TAXNUM, SEX, SOCIALNUM, PASSPORTNUM

Language Coverage (23 languages)

[Figure: European language coverage map showing the distribution of the 23 languages across the dataset]

Use Cases (tasks supported)

Named Entity Recognition

Train NER models to detect and classify PII entities with pre-computed mBERT-compatible BIO labels.
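To show how character-level spans relate to BIO tags, here is a simplified sketch. It is not the mBERT tokenization behind the dataset's `mbert_token_classes` field: it splits text into words and punctuation with a regular expression, so the output will not line up with subword tokens. The function name `spans_to_bio` and the span offsets are assumptions made for illustration.

```python
import re

# Illustrative inputs; the span offsets are assumed, not taken from the dataset.
source_text = "John Smith lives at 42 Rue de Rivoli, 75001 Paris."
privacy_mask = [
    {"start": 0, "end": 4, "label": "GIVENNAME"},
    {"start": 5, "end": 10, "label": "SURNAME"},
    {"start": 20, "end": 22, "label": "BUILDINGNUM"},
    {"start": 23, "end": 36, "label": "STREET"},
    {"start": 38, "end": 43, "label": "ZIPCODE"},
    {"start": 44, "end": 49, "label": "CITY"},
]


def spans_to_bio(text, spans):
    """Tag word-level tokens with B-/I-/O labels from character spans."""
    tokens, tags = [], []
    # Words and single punctuation marks become separate tokens.
    for match in re.finditer(r"\w+|[^\w\s]", text):
        tag = "O"
        for span in spans:
            # A token is tagged only if it lies fully inside a span.
            if span["start"] <= match.start() and match.end() <= span["end"]:
                prefix = "B" if match.start() == span["start"] else "I"
                tag = f"{prefix}-{span['label']}"
                break
        tokens.append(match.group())
        tags.append(tag)
    return tokens, tags


tokens, tags = spans_to_bio(source_text, privacy_mask)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

For real training runs, the pre-computed labels shipped with the dataset are likely the better starting point, since they are already aligned to the mBERT tokenizer.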

Data Anonymization

Build production-grade anonymization pipelines that support compliance with GDPR and the EU AI Act.

LLM Fine-tuning

Fine-tune large language models for privacy-aware text generation and redaction tasks.
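One possible way to prepare records for redaction fine-tuning is to pair each `source_text` with its `masked_text`. The sketch below assumes a generic prompt/completion JSONL layout; the instruction wording, the `prompt` and `completion` keys, and the function name `to_redaction_pair` are all hypothetical and should be adapted to whatever format the chosen training framework expects.

```python
import json

# Illustrative record using the two text fields from the schema above.
record = {
    "source_text": "John Smith lives at 42 Rue de Rivoli, 75001 Paris.",
    "masked_text": "[GIVENNAME_1] [SURNAME_1] lives at [BUILDINGNUM_1] "
                   "[STREET_1], [ZIPCODE_1] [CITY_1].",
}


def to_redaction_pair(rec):
    """Turn one dataset record into a prompt/completion training pair."""
    prompt = (
        "Replace all personal information in the text with placeholders.\n\n"
        f"Text: {rec['source_text']}"
    )
    return {"prompt": prompt, "completion": rec["masked_text"]}


pair = to_redaction_pair(record)
# One JSON object per line is a common layout for fine-tuning files.
jsonl_line = json.dumps(pair, ensure_ascii=False)
print(jsonl_line)
```

Keeping `ensure_ascii=False` preserves accented characters as written, which matters for a dataset covering 23 languages.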

Get started with OpenPII 1M

Download the full dataset with 1.43M+ examples and 10M+ annotations, freely available under CC-BY-4.0.
