Anthropic: 250 Malicious Documents Can Backdoor LLMs — Summary & Implications
Anthropic: 250 Malicious Documents Can Backdoor LLMs — Summary & Implications Anthropic, in collaboration with the UK AI Security Institute and the Alan Turing Institute, released a report showing that a surprisingly small number of malicious documents can poison large language models during pretraining. Key points: Core finding: Injecting as few as 250 malicious documents into pretraining data can backdoor LLMs across model sizes (tested from ~600M to 13B parameters). Why it matters: The result challenges the assumption that attackers need to control a large percentage of training data to succeed. Small, targeted data injections can induce dangerous or unwanted behaviors. Evidence: Anthropic’s report and accompanying PDF detail experiments and results. An arXiv version is also available. Collaborators: UK AI Security Institute; Alan Turing Institute. Links for more information: Anthropic…
