What is OWASP LLM04 data and model poisoning and how do I detect poisoned training or fine-tuning data?
OWASP LLM04, Data and Model Poisoning, covers the manipulation of training, fine-tuning, or RAG-corpus data to introduce backdoors, biases, or vulnerabilities into an AI model. A poisoned model can produce degraded or biased output, or behave normally until a specific trigger input activates a hidden backdoor.
To prevent and detect poisoned training or fine-tuning data, consider the following controls:
- Data-source vetting & integrity checks: Ensure the reliability and trustworthiness of data sources and verify the integrity of the data itself. This aligns with ISO/IEC 42001 A.7 Data for AI systems, which emphasizes data governance, provenance, and quality.
- Anomaly detection on training data: Implement automated systems to identify unusual patterns or outliers in incoming training data that could indicate poisoning.
- Signed datasets with provenance: Utilize signed datasets to establish and verify the origin and history of the data, ensuring its authenticity and preventing tampering. This also relates to ISO/IEC 42001 A.7 Data for AI systems.
- Isolation between training data sources and production retrieval corpora: Keep these stores separate so that content injected into one (for example, a document planted in a retrieval corpus) cannot contaminate the other.
- Red-team testing for backdoor triggers post-training: Conduct adversarial testing after training to search for embedded backdoors or vulnerabilities. This can include probing how easily safeguards can be removed, as "abliteration" does when it strips refusal behavior from an open model, to test how robust those safeguards are.
- Post-deployment monitoring: Continuously monitor the model's behavior after deployment for signs of malicious activity or unintended behavior that might stem from poisoned data. Techniques such as mechanistic interpretability, which examines a model's internals rather than only its outputs, can catch hidden behaviors that output-based testing might miss.
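The integrity-check and signed-dataset controls above can be illustrated with a minimal sketch. It assumes the dataset is a set of named files held in memory and that a shared secret key is available; the function names `build_manifest` and `verify_manifest` are hypothetical, and HMAC is used only because it is in the Python standard library. A real provenance scheme would more likely use asymmetric signatures so that verifiers cannot forge manifests.

```python
import hashlib
import hmac
import json


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def build_manifest(files: dict[str, bytes], key: bytes) -> dict:
    """Record a SHA-256 digest per file and sign the digest table with HMAC-SHA256."""
    digests = {name: sha256_hex(content) for name, content in sorted(files.items())}
    payload = json.dumps(digests, sort_keys=True).encode("utf-8")
    signature = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return {"digests": digests, "signature": signature}


def verify_manifest(files: dict[str, bytes], manifest: dict, key: bytes) -> list[str]:
    """Return a list of problems; an empty list means the files match the manifest."""
    payload = json.dumps(manifest["digests"], sort_keys=True).encode("utf-8")
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    # Constant-time comparison; stop early if the manifest itself was altered.
    if not hmac.compare_digest(expected, manifest["signature"]):
        return ["manifest signature invalid"]
    problems = []
    for name, digest in manifest["digests"].items():
        if name not in files:
            problems.append(f"missing file: {name}")
        elif sha256_hex(files[name]) != digest:
            problems.append(f"digest mismatch: {name}")
    for name in files:
        if name not in manifest["digests"]:
            problems.append(f"unlisted file: {name}")
    return problems
```

Running `verify_manifest` before every training or fine-tuning job turns silent tampering (a changed, missing, or added file) into an explicit failure that can block the run.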
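As one possible form of anomaly detection on training data, the sketch below flags tokens whose presence almost perfectly predicts a single label, which is a common signature of a planted backdoor trigger in labeled fine-tuning data. The function name `suspicious_tokens` and the thresholds (`min_count`, `min_purity`, `min_lift`) are illustrative assumptions, not values from OWASP or ISO/IEC 42001. Tokenization is naive whitespace splitting.

```python
from collections import Counter, defaultdict


def suspicious_tokens(examples, min_count=5, min_purity=0.95, min_lift=2.0):
    """Flag tokens whose presence almost perfectly predicts one label.

    examples: iterable of (text, label) pairs.
    Returns (token, label, count, purity) tuples, most frequent first.
    """
    examples = list(examples)
    if not examples:
        return []
    label_totals = Counter(label for _, label in examples)
    per_token = defaultdict(Counter)
    for text, label in examples:
        # Count each token at most once per example.
        for token in set(text.lower().split()):
            per_token[token][label] += 1

    flagged = []
    for token, labels in per_token.items():
        count = sum(labels.values())
        if count < min_count:
            continue  # too rare to judge
        top_label, top_count = labels.most_common(1)[0]
        purity = top_count / count
        base_rate = label_totals[top_label] / len(examples)
        # Require the token to predict the label far better than chance.
        if purity >= min_purity and purity / base_rate >= min_lift:
            flagged.append((token, top_label, count, purity))
    return sorted(flagged, key=lambda item: (-item[2], item[0]))
```

This is a triage heuristic rather than a detector: legitimately predictive words will also be flagged, so the output is a shortlist for human review, not proof of poisoning.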
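Red-team testing for backdoor triggers can start with a simple measurement: how often does appending a candidate trigger string change the model's prediction on otherwise clean inputs? The sketch below assumes a hypothetical `predict` callable that maps a text to a label; the function name `trigger_flip_rates` and the idea of appending the trigger with a space are illustrative choices, and real triggers may be positional, multi-token, or semantic.

```python
def trigger_flip_rates(predict, prompts, candidate_triggers):
    """For each candidate trigger, return the fraction of prompts whose
    prediction changes when the trigger is appended."""
    prompts = list(prompts)
    baseline = [predict(p) for p in prompts]
    rates = {}
    for trigger in candidate_triggers:
        flips = sum(
            predict(f"{p} {trigger}") != base
            for p, base in zip(prompts, baseline)
        )
        rates[trigger] = flips / len(prompts) if prompts else 0.0
    return rates
```

A candidate with a flip rate near 1.0 across unrelated prompts deserves closer investigation, while innocuous strings should leave most predictions unchanged.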
Grounded in
- Designing Agentic AI Systems with the ORCHIDEAS Framework
- OWASP LLM Top 10
- Mechanistic Interpretability of Claude Mythos: Inside Anthropic’s Groundbreaking Work
- 100 Refusals to 9: How Cheap It Is to Decensor an Open Model — and Why That’s a Policy Problem
- ISO/IEC 42001
- The Computational Wall: Why the Defense Trilemma and the NP-Hardness of Reward Hacking Detection Demand a New Security Posture for AI
This AI-generated answer is for guidance only — not a certification, audit, or penetration test. Grounded in the NIST AI RMF, OWASP LLM Top 10, and ISO/IEC 42001 control text; verify applicability to your environment.