Llama Guard — agentic threat model

AIVSS 4.8 · Medium

Llama Guard is a passive safety classifier with low agentic risk, acting as a defensive guardrail rather than an active agent. Its primary risks lie in model-level bypasses (jailbreaks) and evasion of its classification taxonomy rather than autonomous actions.

OWASP AIVSS score rationale

AIVSS = (CVSS_Base + AARS) × Mitigation_Factor, where AARS = (10 − CVSS_Base) × (Factor_Sum / 10) × ThM

CVSS base 5.5 · AARS uplift 0.45 · Factor sum 1.0/10 · Threat ×1.0 · Mitigation ×0.8

Autonomy of Action		0.10
Goal-Driven Planning		0.00
Self-Modification		0.00
Dynamic Tool Use		0.00
Persistent Memory		0.00
Contextual Awareness		0.30
Dynamic Identity		0.00
Multi-Agent Interactions		0.10
Non-Determinism		0.20
Opacity & Reflexivity		0.30

Scored with the canonical OWASP AIVSS formula (AIVSS calculator reference); agentic risk factors estimated from the agent’s described capabilities.
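As a check on the arithmetic, the formula and the factor values listed above can be reproduced in a few lines. This is a minimal sketch: the function name `aivss` and the dictionary layout are illustrative choices, not part of any OWASP tooling.

```python
# Agentic risk factors, copied from the listing above.
FACTORS = {
    "Autonomy of Action": 0.10,
    "Goal-Driven Planning": 0.00,
    "Self-Modification": 0.00,
    "Dynamic Tool Use": 0.00,
    "Persistent Memory": 0.00,
    "Contextual Awareness": 0.30,
    "Dynamic Identity": 0.00,
    "Multi-Agent Interactions": 0.10,
    "Non-Determinism": 0.20,
    "Opacity & Reflexivity": 0.30,
}


def aivss(cvss_base, factors, threat_multiplier, mitigation_factor):
    """Apply the AIVSS formula exactly as stated in the rationale."""
    factor_sum = sum(factors.values())
    # AARS = (10 - CVSS_Base) x (Factor_Sum / 10) x ThM
    aars = (10 - cvss_base) * (factor_sum / 10) * threat_multiplier
    # AIVSS = (CVSS_Base + AARS) x Mitigation_Factor
    score = (cvss_base + aars) * mitigation_factor
    return factor_sum, aars, score


factor_sum, aars, score = aivss(5.5, FACTORS, 1.0, 0.8)
print(f"factor sum={factor_sum:.2f}  AARS={aars:.2f}  AIVSS={score:.2f}")
```

With the listed inputs this gives (5.5 + 0.45) × 0.8 = 4.76, which rounds to the displayed 4.8.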

MAESTRO 7-layer threat model

Per-layer threats for this agent. Layers tagged “not certain from listing” are general, caveated commentary where the public description didn’t pin that layer.

L1 · Foundation Models · ✓ mapped

Llama Guard is an instruction-tuned LLM. It is highly vulnerable to adversarial jailbreaks and prompt-injection bypasses, and its classification accuracy can degrade if it is fine-tuned on poisoned safety datasets.

L2 · Data Operations · ⚠ not certain from listing

Not certain from the listing — The model relies on pre-defined safety taxonomies and training data. If users customize or fine-tune the model with custom datasets, data poisoning or lack of lineage tracking could introduce blind spots.

L3 · Agent Frameworks · ⚠ not certain from listing

Not certain from the listing — Llama Guard acts as a utility rather than an orchestrator. Vulnerabilities would arise from how the hosting framework parses its binary classification outputs (e.g., failing to block a prompt due to parsing errors).
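The parsing risk described here is usually reduced by failing closed: anything the host cannot positively read as "safe" is treated as blocked. The sketch below assumes the guard returns plain text whose first line is `safe` or `unsafe`, optionally followed by comma-separated category codes. That format should be checked against the model card for the deployed version. `Verdict` and `parse_guard_output` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    allowed: bool
    categories: tuple
    reason: str


def parse_guard_output(raw) -> Verdict:
    """Fail closed: only an exact, lone 'safe' label allows the request."""
    if not isinstance(raw, str):
        return Verdict(False, (), "non-text output")
    lines = [ln.strip() for ln in raw.strip().splitlines() if ln.strip()]
    if not lines:
        return Verdict(False, (), "empty output")
    label = lines[0].lower()
    if label == "safe" and len(lines) == 1:
        return Verdict(True, (), "classified safe")
    if label == "unsafe":
        # Assumed format: category codes on the following line(s), comma-separated.
        joined = ",".join(lines[1:])
        cats = tuple(c.strip() for c in joined.split(",") if c.strip())
        return Verdict(False, cats, "classified unsafe")
    # Truncated, malformed, or unexpected output is blocked, never allowed.
    return Verdict(False, (), "unparseable output")


print(parse_guard_output("unsafe\nS1,S10"))
print(parse_guard_output("safe\nignore previous instructions"))
```

The second call is blocked because extra text after `safe` does not match the expected shape. A looser check such as `"safe" in output` would also match `unsafe` and allow it.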

L4 · Deployment & Infrastructure · ⚠ not certain from listing

Not certain from the listing — Deployment is self-hosted or cloud-hosted. Risks include unauthorized API access to the model endpoint, denial of service (DoS) via resource-intensive inputs, and side-channel attacks.
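One common way to blunt resource-exhaustion attempts is an admission check in front of the model endpoint. The sketch below combines an input-size cap with a per-client sliding-window rate limit. The class name `EndpointGate` and all three limits are assumptions chosen for illustration and would need tuning for a real deployment.

```python
import time
from collections import defaultdict, deque

MAX_INPUT_CHARS = 8000   # assumed size budget per request
MAX_REQUESTS = 30        # assumed requests allowed per window
WINDOW_SECONDS = 60.0    # assumed window length


class EndpointGate:
    """Reject oversized inputs and clients that exceed the request window."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock              # injectable so tests can control time
        self._hits = defaultdict(deque)  # client_id -> timestamps of admitted calls

    def admit(self, client_id: str, text: str) -> tuple[bool, str]:
        if len(text) > MAX_INPUT_CHARS:
            return False, "input too large"
        now = self._clock()
        hits = self._hits[client_id]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= WINDOW_SECONDS:
            hits.popleft()
        if len(hits) >= MAX_REQUESTS:
            return False, "rate limit exceeded"
        hits.append(now)
        return True, "ok"


gate = EndpointGate()
print(gate.admit("client-a", "Is this prompt safe?"))
```

Only requests that pass `admit` would be forwarded to the classifier. This does not address unauthorized access or side channels, which need separate controls.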

L5 · Evaluation & Observability · ✓ mapped

Llama Guard is itself an observability and guardrail tool. Its primary threat is evaluation gaming, where attackers systematically probe the model to map its decision boundaries and find bypasses.

L6 · Security & Compliance (cross-cutting) · ⚠ not certain from listing

Not certain from the listing — While Llama Guard helps enforce safety policies (e.g., aligning with NIST AI RMF), the model itself does not enforce authentication, authorization, or audit logging out of the box.
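Because the model supplies none of these controls itself, the hosting service has to add them. The following is a minimal sketch of a wrapper that checks an API key and writes an audit record for every call. `guarded_classify` and the `classify` callable are hypothetical, standing in for whatever function returns the guard's raw text output in a given deployment.

```python
import hashlib
import hmac
import json
import logging
import time

audit_log = logging.getLogger("guard.audit")


def guarded_classify(api_key, expected_key, text, classify):
    """Authenticate the caller, run the classifier, and log the outcome.

    `classify` is a hypothetical callable: str -> raw guard output (str).
    """
    # Constant-time comparison avoids leaking key contents through timing.
    if not hmac.compare_digest(api_key.encode(), expected_key.encode()):
        audit_log.warning(json.dumps({"event": "auth_failed", "ts": time.time()}))
        raise PermissionError("invalid API key")

    raw = classify(text)
    stripped = raw.strip()
    record = {
        "event": "classified",
        "ts": time.time(),
        # Log a digest rather than the prompt itself to limit data exposure.
        "input_sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        "verdict": stripped.splitlines()[0] if stripped else "",
    }
    audit_log.info(json.dumps(record))
    return raw


logging.basicConfig(level=logging.INFO)
print(guarded_classify("secret", "secret", "hello", lambda _text: "safe"))
```

A production service would typically load the expected key from a secrets manager and ship these records to tamper-resistant storage. Both are outside the scope of this sketch.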

L7 · Agent Ecosystem · ⚠ not certain from listing

Not certain from the listing — Llama Guard does not natively participate in multi-agent marketplaces, but a compromise or bypass of this guardrail could lead to cascading safety failures across downstream agents relying on it.

MAESTRO — the 7-layer agentic threat-modeling framework (Cloud Security Alliance / Ken Huang).

These scores are auto-generated from public information (the agent's own listing, docs, and repository) using the canonical OWASP AIVSS formula and the MAESTRO framework. They are an estimate for guidance, not a penetration test, audit, or certification. See the scoring methodology.