Study: ~250 malicious docs can backdoor LLMs — research summary

Study: ~250 malicious documents can implant backdoors in LLMs

Researchers (Anthropic, UK AI Security Institute, Alan Turing Institute) show that inserting roughly 250 poisoned documents into pretraining corpora can implant backdoors in large language models regardless of model size. The paper (arXiv) demonstrates the attack and explains why the absolute number of poisoned samples matters more than their percentage of the dataset.

Key points

  • ~250 poisoned docs (~420k tokens) can trigger a backdoor across models from 600M to 13B parameters.
  • The attack can be triggered by a specific phrase and causes the model to produce nonsense or attacker-chosen outputs.
  • Because much training data is scraped from the web, adversaries could plant such poisoned texts.

Mitigations (suggested)

  • Stricter dataset provenance and curation; blocklisting suspicious sources.
  • Data sanitization and anomaly detection on training corpora.
  • Robust training procedures, e.g., differential privacy, fine-tuning with adversarial defense.
  • Continued research into provable defenses and dataset auditing tools.

Read the paper: arXiv:2510.07192

Leave a Reply

Your email address will not be published. Required fields are marked *

Diese Seite verwendet Cookies, um die Nutzerfreundlichkeit zu verbessern. Mit der weiteren Verwendung stimmst du dem zu.

Datenschutzerklärung