What is OWASP LLM04 data and model poisoning and how do I detect poisoned training or fine-tuning data?

Question

Accepted Answer

OWASP LLM04 Data and Model Poisoning refers to the manipulation of training, fine-tuning, or RAG-corpus data to introduce backdoors, biases, or vulnerabilities into an AI model. This can lead to compromised behavior in the model.

To detect poisoned training or fine-tuning data, consider the following controls:

Data-source vetting & integrity checks: Ensure the reliability and trustworthiness of data sources and verify the integrity of the data itself. This aligns with ISO/IEC 42001 A.7 Data for AI systems, which emphasizes data governance, provenance, and quality.
Anomaly detection on training data: Implement automated systems to identify unusual patterns or outliers in incoming training data that could indicate poisoning.
Signed datasets with provenance: Utilize signed datasets to establish and verify the origin and history of the data, ensuring its authenticity and preventing tampering. This also relates to ISO/IEC 42001 A.7 Data for AI systems.
Isolation between training data sources and production retrieval corpora: Maintain separation between different data sources to prevent contamination from one to another.
Red-team testing for backdoor triggers post-training: Conduct adversarial testing after training to actively search for and identify any embedded backdoors or vulnerabilities. This can involve techniques like those used in "abliteration" to test the robustness of safeguards.
Post-deployment monitoring: Continuously monitor the model's behavior after deployment for any signs of malicious activity or unintended behaviors that might stem from poisoned data. Advanced techniques like "Mechanistic Interpretability" can be used to catch hidden behaviors within the model that traditional output-based testing might miss.

What is OWASP LLM04 data and model poisoning and how do I detect poisoned training or fine-tuning data?

How does your AI agent score?

Related questions