Company
Founded in Switzerland.
Artificial Intelligence Suisse SA, PO 280, Delemont, Switzerland.
ai4privacy
Collection
The foundational dataset series that started open-source PII masking research. Five progressively larger releases covering 6 languages and up to 54 entity types.
pii-masking-400k
The largest release, with 406,896 entries, 20M+ tokens, and 2.3M PII tokens across 17 public entity types.
pii-masking-300k
Extended dataset with 27 PII classes targeting 749 discussion subjects across education, health, and psychology.
pii-masking-200k
Broad coverage release with 54 PII classes across 229 discussion topics. Includes crypto addresses, IPs, and device identifiers.
pii-masking-65k
Early research release with multilingual support and comprehensive entity coverage for privacy masking evaluation.
pii-masking-43k
The original PII masking dataset released alongside a fine-tuned DistilBERT model. 5.6M tokens with human-in-the-loop validation.
{
  "source_text": "My child faozzsd379223 (DOB: May/58) will undergo treatment...",
  "masked_text": "My child [USERNAME_2] (DOB: [DATEOFBIRTH_1]) will undergo treatment...",
  "privacy_mask": [
    {
      "value": "[REDACTED]",
      "start": 12,
      "end": 25,
      "label": "USERNAME",
      "label_index": 2
    }
  ],
  "language": "en",
  "locale": "US",
  "mbert_token_classes": ["O", "O", "B-USERNAME", "O", ...]
}
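The record above pairs a source text with character spans and a masked version. As a minimal sketch of how those fields might relate, the snippet below rebuilds a masked string from `privacy_mask` entries. It assumes `start` and `end` are character offsets with an exclusive end (the sample span 12 to 25 has the same length as the 13-character username, which is consistent with that reading but does not confirm it). The function name `apply_privacy_mask` and the record used here are hypothetical and not taken from the dataset.

```python
def apply_privacy_mask(source_text, privacy_mask):
    """Replace each annotated span with a [LABEL_index] placeholder.

    Assumes character offsets with an exclusive `end` (an assumption,
    not something the dataset card states).
    """
    masked = source_text
    # Work from the last span to the first so earlier offsets stay valid.
    for span in sorted(privacy_mask, key=lambda s: s["start"], reverse=True):
        placeholder = f"[{span['label']}_{span['label_index']}]"
        masked = masked[:span["start"]] + placeholder + masked[span["end"]:]
    return masked


# Hypothetical record in the same shape as the sample; not real dataset content.
record = {
    "source_text": "Contact jdoe42 before Friday.",
    "privacy_mask": [
        {"value": "jdoe42", "start": 8, "end": 14,
         "label": "USERNAME", "label_index": 1},
    ],
}

masked_text = apply_privacy_mask(record["source_text"], record["privacy_mask"])
print(masked_text)
```

Replacing spans from right to left avoids recomputing offsets after each substitution, since a placeholder is rarely the same length as the value it replaces.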
Train NER models to detect and classify PII entities with pre-computed mBERT-compatible BIO labels.
Build production-grade anonymization pipelines compliant with GDPR and the EU AI Act.
Fine-tune large language models for privacy-aware text generation and redaction tasks.
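For the NER use case, BIO labels such as those in `mbert_token_classes` usually need to be grouped back into entity spans when evaluating a model. The following is a rough sketch of that decoding step in plain Python. The helper name `bio_to_spans` and the label sequence are hypothetical, and the token indices refer to positions in the label list, not to character offsets in the text.

```python
def bio_to_spans(labels):
    """Group BIO tags into (entity_label, first_token, end_token_exclusive) tuples."""
    spans = []
    current_label, start = None, None
    for i, tag in enumerate(labels):
        # An "I-X" tag continues the open entity only if the label matches.
        if tag.startswith("I-") and current_label == tag[2:]:
            continue
        # Any other tag closes the entity that was open, if there was one.
        if current_label is not None:
            spans.append((current_label, start, i))
            current_label, start = None, None
        # "B-X" opens a new entity; a stray "I-X" is treated the same way.
        if tag != "O":
            current_label, start = tag[2:], i
    # Close an entity that runs to the end of the sequence.
    if current_label is not None:
        spans.append((current_label, start, len(labels)))
    return spans


# Hypothetical label sequence in the style of `mbert_token_classes`.
labels = ["O", "O", "B-USERNAME", "I-USERNAME", "O", "B-DATEOFBIRTH", "O"]
spans = bio_to_spans(labels)
print(spans)
```

Treating a stray `I-` tag as the start of a new entity is one lenient convention among several; a stricter evaluation could discard such tags instead.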
The 400K series is great for research and prototyping. For production workloads, explore our 1M and 2M European collections.