What Is Data Poisoning?

What Is Data Poisoning?

Data poisoning is a type of security vulnerability that targets the training process of machine learning and large language models (LLMs). Instead of attacking the system, an attacker corrupts the training data itself. By inserting malicious data, they can shape how the model behaves.

This is dangerous because once the model is deployed, the malicious training data can cause it to make biased decisions, misclassify certain inputs, or even include backdoors that attackers can exploit later via indirect prompt injection.

For example, if a model is trained on user submitted reviews, an attacker might insert a small number of crafted reviews that contain instructions for the LLM to run an undesired action when a trigger word is inputted.

How Data Poisoning Works

  1. The attacker hides malicious content into a data source that will be used for the model’s training

  2. The model trains on the poisoned dataset

  3. When the model is deployed, it behaves incorrectly under certain conditions

Why It Matters

Any model trained on data from external or user-generated sources is vulnerable to poisoning. Poisoned training data can lead to models that:

  • Leak sensitive information

  • Deliver biased or unreliable outputs

  • Misclassify critical inputs (e.g., in fraud detection or content moderation)

  • Create exploitable backdoors

TPRM and security teams need to assess whether the model being used by an application has been poisoned, or if the training of their model on customer data puts it at risk of future poisoning. If an application relies on poisoned trained models, an attacker can undermine an AI system before it ever reaches production.