Data poisoning is a type of security vulnerability that targets the training process of machine learning and large language models (LLMs). Instead of attacking the system, an attacker corrupts the training data itself. By inserting malicious data, they can shape how the model behaves.
This is dangerous because once the model is deployed, the malicious training data can cause it to make biased decisions, misclassify certain inputs, or even include backdoors that attackers can exploit later via indirect prompt injection.
For example, if a model is trained on user submitted reviews, an attacker might insert a small number of crafted reviews that contain instructions for the LLM to run an undesired action when a trigger word is inputted.
How Data Poisoning Works
The attacker hides malicious content into a data source that will be used for the model’s training
The model trains on the poisoned dataset
When the model is deployed, it behaves incorrectly under certain conditions
Why It Matters
Any model trained on data from external or user-generated sources is vulnerable to poisoning. Poisoned training data can lead to models that:
Leak sensitive information
Deliver biased or unreliable outputs
Misclassify critical inputs (e.g., in fraud detection or content moderation)
Create exploitable backdoors
TPRM and security teams need to assess whether the model being used by an application has been poisoned, or if the training of their model on customer data puts it at risk of future poisoning. If an application relies on poisoned trained models, an attacker can undermine an AI system before it ever reaches production.