AI4Privacy (ai4privacy) Collection

PII Masking 400K and Below

The foundational dataset series that started open-source PII masking research. Five progressively larger releases covering 6 languages and up to 54 entity types.

Tags: Token Classification, NER, Text Generation, Multilingual, Synthetic, GDPR
400K+ examples (largest release)
6 languages
54 entity types
5 releases

Collection Contents (5 dataset releases)

Data Structure (schema)

example.json
{
  "source_text": "My child faozzsd379223 (DOB: May/58) will undergo treatment...",
  "masked_text": "My child [USERNAME_2] (DOB: [DATEOFBIRTH_1]) will undergo treatment...",
  "privacy_mask": [
    {
      "value": "[REDACTED]",
      "start": 9,
      "end": 22,
      "label": "USERNAME",
      "label_index": 2
    }
  ],
  "language": "en",
  "locale": "US",
  "mbert_token_classes": ["O", "O", "B-USERNAME", "O", ...]
}
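To make the field relationships concrete, here is a minimal sketch of reading one record and checking its `privacy_mask` offsets against `source_text`. The record below is hypothetical, not taken from the dataset, and the `PiiSpan` class and `parse_spans` helper are illustrative names. The sketch assumes `end` is exclusive, as in Python slicing, which matches the span length in the example above.

```python
import json
from dataclasses import dataclass


@dataclass
class PiiSpan:
    start: int
    end: int  # assumed exclusive, like a Python slice bound
    label: str
    label_index: int


def parse_spans(record: dict) -> list[PiiSpan]:
    """Read privacy_mask entries and check each offset pair against source_text."""
    text = record["source_text"]
    spans = []
    for entry in record["privacy_mask"]:
        span = PiiSpan(entry["start"], entry["end"], entry["label"], entry["label_index"])
        if not 0 <= span.start < span.end <= len(text):
            raise ValueError(f"span out of range: {span}")
        spans.append(span)
    return sorted(spans, key=lambda s: s.start)


# Hypothetical record in the same shape as example.json.
raw = """
{
  "source_text": "Contact jdoe42 today",
  "masked_text": "Contact [USERNAME_1] today",
  "privacy_mask": [
    {"value": "jdoe42", "start": 8, "end": 14, "label": "USERNAME", "label_index": 1}
  ],
  "language": "en",
  "locale": "US"
}
"""

record = json.loads(raw)
spans = parse_spans(record)
# The text each span covers, recovered from the offsets alone.
covered = [record["source_text"][s.start:s.end] for s in spans]
```

Comparing `covered` with each entry's `value` field is a cheap sanity check before training on a release.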

Entity Types (17 public classes in 400K / 54 extended in 200K)

Core Labels (17)

USERNAME GIVENNAME SURNAME DATEOFBIRTH EMAIL TELEPHONENUM STREET CITY ZIPCODE CREDITCARDNUMBER IDCARDNUM DRIVERLICENSENUM BUILDINGNUM SOCIALNUM PASSWORD ACCOUNTNUM TAXNUM
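For token classification, the core labels are typically expanded into BIO tags. The sketch below builds one possible tag-to-id mapping from the 17 labels listed above. It assumes every label has both a `B-` and an `I-` tag and puts `O` at index 0; the id order used by the published releases may differ, so treat `build_bio_label_map` as a hypothetical helper.

```python
CORE_LABELS = [
    "USERNAME", "GIVENNAME", "SURNAME", "DATEOFBIRTH", "EMAIL",
    "TELEPHONENUM", "STREET", "CITY", "ZIPCODE", "CREDITCARDNUMBER",
    "IDCARDNUM", "DRIVERLICENSENUM", "BUILDINGNUM", "SOCIALNUM",
    "PASSWORD", "ACCOUNTNUM", "TAXNUM",
]


def build_bio_label_map(labels):
    """Return a tag -> id dict with "O" first, then B-/I- pairs per label."""
    tags = ["O"]
    for label in labels:
        tags.append(f"B-{label}")
        tags.append(f"I-{label}")
    return {tag: i for i, tag in enumerate(tags)}


label2id = build_bio_label_map(CORE_LABELS)
id2label = {i: tag for tag, i in label2id.items()}
```

Under these assumptions the 17 core labels yield 35 classes (2 × 17 + 1).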

Extended Labels (200K+)

IPV4 IPV6 MAC URL USERAGENT BITCOINADDRESS ETHEREUMADDRESS LITECOINADDRESS IBAN VEHICLEVIN VEHICLEVRM PIN JOBTITLE HEIGHT EYECOLOR + 20 more

Use Cases (tasks supported)

Named Entity Recognition

Train NER models to detect and classify PII entities with pre-computed mBERT-compatible BIO labels.
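As an illustration of how BIO labels map back to entities, the following sketch groups tagged tokens into labeled spans. The tokens are hypothetical and are not real mBERT tokenizer output. A stray `I-` tag with no matching open entity is treated as outside, which is one design choice among several.

```python
def decode_bio(tokens, tags):
    """Group B-/I- tagged tokens into (label, [tokens]) entities."""
    entities = []
    current_label, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; close any entity that was open.
            if current_label is not None:
                entities.append((current_label, current_tokens))
            current_label, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # Continuation of the open entity.
            current_tokens.append(token)
        else:
            # "O" or an unmatched I- tag: close the open entity, if any.
            if current_label is not None:
                entities.append((current_label, current_tokens))
            current_label, current_tokens = None, []
    if current_label is not None:
        entities.append((current_label, current_tokens))
    return entities


# Hypothetical tokens and tags for illustration.
tokens = ["My", "child", "jdoe", "##42", "called"]
tags = ["O", "O", "B-USERNAME", "I-USERNAME", "O"]
entities = decode_bio(tokens, tags)
```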

Data Anonymization

Build anonymization pipelines designed to support GDPR and EU AI Act compliance.
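The core step of such a pipeline can be sketched as replacing each detected span with a `[LABEL_index]` placeholder, the format used in `masked_text`. This is a minimal sketch, assuming spans carry `start`, `end` (exclusive), `label`, and `label_index` and do not overlap; `apply_privacy_mask` and the sample text are hypothetical.

```python
def apply_privacy_mask(source_text, privacy_mask):
    """Replace each [start, end) span with a [LABEL_index] placeholder."""
    out = source_text
    # Work right to left so earlier offsets stay valid after each replacement.
    for span in sorted(privacy_mask, key=lambda s: s["start"], reverse=True):
        placeholder = f"[{span['label']}_{span['label_index']}]"
        out = out[:span["start"]] + placeholder + out[span["end"]:]
    return out


# Hypothetical input for illustration.
text = "Contact jdoe42 at jd@example.com today"
mask = [
    {"start": 8, "end": 14, "label": "USERNAME", "label_index": 1},
    {"start": 18, "end": 32, "label": "EMAIL", "label_index": 1},
]
masked = apply_privacy_mask(text, mask)
```

In a real pipeline the spans would come from a trained detector rather than from gold annotations.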

LLM Fine-tuning

Fine-tune large language models for privacy-aware text generation and redaction tasks.
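One way to prepare redaction data for fine-tuning is to pair `source_text` with `masked_text` as prompt and completion. The sketch below is an assumption about a reasonable format, not a prescribed one: the instruction wording, the `prompt`/`completion` keys, and the `to_training_pair` helper are all hypothetical.

```python
import json


def to_training_pair(record, instruction="Mask all personal data in the following text."):
    """Turn one dataset record into a prompt/completion pair."""
    return {
        "prompt": f"{instruction}\n\n{record['source_text']}",
        "completion": record["masked_text"],
    }


# Hypothetical record for illustration.
record = {
    "source_text": "Contact jdoe42 today",
    "masked_text": "Contact [USERNAME_1] today",
}
pair = to_training_pair(record)
# One JSON object per line is a common layout for fine-tuning files.
jsonl_line = json.dumps(pair, ensure_ascii=False)
```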

Ready to scale up?

The 400K series is great for research and prototyping. For production workloads, explore our 1M and 2M European collections.