Study: ~250 malicious documents can implant backdoors in LLMs
Researchers (Anthropic, UK AI Security Institute, Alan Turing Institute) show that inserting roughly 250 poisoned documents into pretraining corpora can implant backdoors in large language models regardless of model size. The paper (arXiv) demonstrates the attack and explains why the absolute number of poisoned samples matters more than their percentage of the dataset.
Key points
- ~250 poisoned docs (~420k tokens) can trigger a backdoor across models from 600M to 13B parameters.
- The attack can be triggered by a specific phrase and causes the model to produce nonsense or attacker-chosen outputs.
- Because much training data is scraped from the web, adversaries could plant such poisoned texts.
Mitigations (suggested)
- Stricter dataset provenance and curation; blocklisting suspicious sources.
- Data sanitization and anomaly detection on training corpora.
- Robust training procedures, e.g., differential privacy, fine-tuning with adversarial defense.
- Continued research into provable defenses and dataset auditing tools.
Read the paper: arXiv:2510.07192
