← constitutional-ai (AI-Research-SKILLs)
constitutional-ai (AI-Research-SKILLs) — agentic threat model
This agent focuses on Constitutional AI alignment and training code execution, presenting moderate-to-high risk due to its ability to write and run training code, which could be exploited to execute arbitrary code or poison model training pipelines.
OWASP AIVSS score rationale
| Autonomy of Action | 0.60 | |
| Goal-Driven Planning | 0.50 | |
| Self-Modification | 0.40 | |
| Dynamic Tool Use | 0.70 | |
| Persistent Memory | 0.30 | |
| Contextual Awareness | 0.60 | |
| Dynamic Identity | 0.20 | |
| Multi-Agent Interactions | 0.30 | |
| Non-Determinism | 0.50 | |
| Opacity & Reflexivity | 0.40 |
Scored with the canonical OWASP AIVSS formula (AIVSS calculator reference); agentic risk factors estimated from the agent’s described capabilities.
MAESTRO 7-layer threat model
Per-layer threats for this agent. Layers tagged “not certain from listing” are general, caveated commentary where the public description didn’t pin that layer.
The agent interacts directly with foundation models to perform self-critique and alignment. The primary threat is adversarial prompt injection that bypasses the constitutional rules, leading to misaligned training outputs or model reprogramming during fine-tuning.
The agent manipulates training and fine-tuning datasets. A key threat is training data poisoning, where malicious inputs are injected into the self-critique loop to systematically bias or backdoor the resulting model.
The agent framework orchestrates the self-critique loop and generates training code. Vulnerabilities here include insecure tool integration, where the code-generation engine can be manipulated into executing malicious payloads under the guise of training scripts.
Because the agent writes and runs training code, the underlying deployment infrastructure is highly exposed. Without strict sandboxing, container escape, privilege escalation, and unauthorized resource consumption during training runs are critical threats.
The agent acts as an evaluation and guardrail mechanism itself. The primary threat is evaluation gaming, where adversarial training samples bypass the constitutional critique phase undetected due to blind spots in the critique prompts.
Not certain from the listing — there is no explicit mention of identity management, access control, or compliance auditing for the code execution environment. General risk of unauthorized users triggering training runs.
The agent is described as a sibling to LlamaGuard/NeMo-Guardrails skills. In a multi-agent ecosystem, a compromise of this agent could allow it to feed poisoned guardrail configurations or corrupted alignment policies to downstream agents, causing cascading failures.
MAESTRO — the 7-layer agentic threat-modeling framework (Cloud Security Alliance / Ken Huang).