ai4privacy Collection

PII Masking 400K and Below

The foundational dataset series that started open-source PII masking research. Five progressively larger releases covering 6 languages and up to 54 entity types.


Tags: Token Classification, NER, Text Generation, Multilingual, Synthetic, GDPR

400K+ examples (largest release)
6 languages
54 entity types
5 releases

Collection Contents (5 dataset releases)

PII Masking 400K

pii-masking-400k

Latest

The largest release with 406,896 entries, 20M+ tokens, and 2.3M PII tokens across 17 public entity types.

406K rows, 6 languages, 17 entities, Custom License

PII Masking 300K

pii-masking-300k

Open

Extended dataset with 27 PII classes targeting 749 discussion subjects across education, health, and psychology.

225K rows, 6 languages, 27 entities, Custom License

PII Masking 200K

pii-masking-200k

Open

Broad coverage release with 54 PII classes across 229 discussion topics. Includes crypto addresses, IPs, and device identifiers.

209K rows, 4 languages, 54 entities, Custom License

PII Masking 65K

pii-masking-65k

Open

Early research release covering 4 languages and 40+ entity types, intended for privacy masking evaluation.

21.6K rows, 4 languages, 40+ entities, Custom License

PII Masking 43K

pii-masking-43k

Original

The original PII masking dataset released alongside a fine-tuned DistilBERT model. 5.6M tokens with human-in-the-loop validation.

43K rows, English, 54 entities, Custom License

Data Structure (Schema)

example.json

{
  "source_text": "My child faozzsd379223 (DOB: May/58) will undergo treatment...",
  "masked_text": "My child [USERNAME_2] (DOB: [DATEOFBIRTH_1]) will undergo treatment...",
  "privacy_mask": [
    {
      "value": "[REDACTED]",
      "start": 12,
      "end": 25,
      "label": "USERNAME",
      "label_index": 2
    }
  ],
  "language": "en",
  "locale": "US",
  "mbert_token_classes": ["O", "O", "B-USERNAME", "O", ...]
}
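
The relationship between `source_text`, `privacy_mask`, and `masked_text` can be sketched in a few lines. This is a minimal illustration rather than the collection's own tooling: it assumes `start` and `end` are character offsets into `source_text` with an exclusive `end`, and that placeholders follow the `[LABEL_index]` pattern shown in `masked_text`. The function name `apply_privacy_mask` and the sample record are hypothetical.

```python
from typing import Dict, List


def apply_privacy_mask(source_text: str, privacy_mask: List[Dict]) -> str:
    """Replace each annotated span with a [LABEL_index] placeholder.

    Assumption: `start`/`end` are character offsets with an exclusive end.
    """
    masked = source_text
    # Work right to left so earlier offsets stay valid after each replacement.
    for span in sorted(privacy_mask, key=lambda s: s["start"], reverse=True):
        placeholder = f"[{span['label']}_{span['label_index']}]"
        masked = masked[: span["start"]] + placeholder + masked[span["end"]:]
    return masked


# Hypothetical record, constructed for illustration only.
record = {
    "source_text": "Contact jdoe42 at jdoe@example.com today.",
    "privacy_mask": [
        {"start": 8, "end": 14, "label": "USERNAME", "label_index": 1},
        {"start": 18, "end": 34, "label": "EMAIL", "label_index": 1},
    ],
}

masked_text = apply_privacy_mask(record["source_text"], record["privacy_mask"])
print(masked_text)
```

Replacing from the rightmost span first avoids recomputing offsets, since placeholders rarely have the same length as the text they replace.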

Entity Types (17 public classes in 400K, 54 extended in 200K)

Core Labels (17)

USERNAME GIVENNAME SURNAME DATEOFBIRTH EMAIL TELEPHONENUM STREET CITY ZIPCODE CREDITCARDNUMBER IDCARDNUM DRIVERLICENSENUM BUILDINGNUM SOCIALNUM PASSWORD ACCOUNTNUM TAXNUM

Extended Labels (200K+)

IPV4 IPV6 MAC URL USERAGENT BITCOINADDRESS ETHEREUMADDRESS LITECOINADDRESS IBAN VEHICLEVIN VEHICLEVRM PIN JOBTITLE HEIGHT EYECOLOR + 20 more
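
When mixing releases, it can help to separate spans that use the 17 core labels from those that use extended labels. The sketch below assumes records carry a `privacy_mask` list with a `label` field, as in the schema example; the names `CORE_LABELS` and `split_by_label_set` and the sample spans are hypothetical.

```python
from typing import Dict, List, Tuple

# The 17 core labels as listed for the 400K release.
CORE_LABELS = frozenset(
    "USERNAME GIVENNAME SURNAME DATEOFBIRTH EMAIL TELEPHONENUM STREET CITY "
    "ZIPCODE CREDITCARDNUMBER IDCARDNUM DRIVERLICENSENUM BUILDINGNUM "
    "SOCIALNUM PASSWORD ACCOUNTNUM TAXNUM".split()
)


def split_by_label_set(privacy_mask: List[Dict]) -> Tuple[List[Dict], List[Dict]]:
    """Partition spans into (core, extended) according to their label."""
    core = [s for s in privacy_mask if s["label"] in CORE_LABELS]
    extended = [s for s in privacy_mask if s["label"] not in CORE_LABELS]
    return core, extended


# Hypothetical spans, for illustration only.
spans = [
    {"label": "USERNAME", "start": 0, "end": 6},
    {"label": "IPV4", "start": 10, "end": 21},
    {"label": "IBAN", "start": 30, "end": 52},
]
core_spans, extended_spans = split_by_label_set(spans)
print(len(core_spans), len(extended_spans))
```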

Use Cases (tasks supported)

Named Entity Recognition

Train NER models to detect and classify PII entities with pre-computed mBERT-compatible BIO labels.
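
As a rough illustration of working with BIO labels such as those in `mbert_token_classes`, the sketch below groups a tag sequence into entity spans. It assumes standard BIO conventions (`B-` opens an entity, `I-` continues it, `O` is outside) and reports token positions, not character offsets. The function name `bio_to_spans` and the sample tags are hypothetical.

```python
from typing import List, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Group BIO tags into (label, first_token, end_token_exclusive) spans."""
    spans = []
    current_label = None
    start = 0
    for i, tag in enumerate(tags):
        # A span continues only on an I- tag carrying the same label.
        if tag.startswith("I-") and tag[2:] == current_label:
            continue
        # Anything else closes the span that was open, if any.
        if current_label is not None:
            spans.append((current_label, start, i))
            current_label = None
        if tag.startswith("B-") or tag.startswith("I-"):
            # A stray I- tag is treated as the start of a new span.
            current_label, start = tag[2:], i
    if current_label is not None:
        spans.append((current_label, start, len(tags)))
    return spans


# Hypothetical tag sequence, for illustration only.
tags = ["O", "O", "B-USERNAME", "I-USERNAME", "O", "B-EMAIL"]
spans = bio_to_spans(tags)
print(spans)
```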

Data Anonymization

Build anonymization pipelines designed to support compliance with GDPR and the EU AI Act.

LLM Fine-tuning

Fine-tune large language models for privacy-aware text generation and redaction tasks.
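
One possible way to prepare redaction fine-tuning data is to pair each `source_text` with its `masked_text`. The sketch below is an assumption about format, not a prescribed recipe: the instruction wording, the `prompt`/`completion` field names, and the helper names `to_finetune_example` and `to_jsonl` are all hypothetical and should be adapted to the training framework in use.

```python
import json
from typing import Dict, Iterable

# Hypothetical instruction text; adjust to the target model and task.
INSTRUCTION = "Replace all personal information in the text with placeholders."


def to_finetune_example(record: Dict) -> Dict:
    """Turn one dataset record into a prompt/completion pair."""
    return {
        "prompt": f"{INSTRUCTION}\n\n{record['source_text']}",
        "completion": record["masked_text"],
    }


def to_jsonl(records: Iterable[Dict]) -> str:
    """Serialize records as JSON Lines, one training example per line."""
    return "\n".join(
        json.dumps(to_finetune_example(r), ensure_ascii=False) for r in records
    )


# Hypothetical records, for illustration only.
records = [
    {"source_text": "Call Ana at 555-0100.",
     "masked_text": "Call [GIVENNAME_1] at [TELEPHONENUM_1]."},
    {"source_text": "Nothing sensitive here.",
     "masked_text": "Nothing sensitive here."},
]
jsonl = to_jsonl(records)
print(jsonl)
```

Including some records without PII, as in the second sample, gives the model examples where the correct output leaves the text unchanged.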

Ready to scale up?

The 400K series is great for research and prototyping. For production workloads, explore our 1M and 2M European collections.

