Personal AI agents like OpenClaw run with elevated privileges on users' local machines, where a single successful prompt injection can leak credentials, redirect financial transactions, or destroy files. We introduce ClawSafety, a benchmark of 120 adversarial test cases organized along three dimensions — harm domain, attack vector, and harmful action type — grounded in realistic, high-privilege professional workspaces spanning software engineering, finance, healthcare, law, and DevOps. Across five frontier LLMs and 2,520 sandboxed trials, attack success rates range from 40% to 75%, with skill instructions consistently more dangerous than email or web content. Cross-scaffold experiments further demonstrate that safety depends on the full deployment stack, not the backbone model alone.
Each test case embeds adversarial content in exactly one channel the agent encounters during normal work, ordered by implicit trust level:

- **Skill files** (69.4% avg ASR): Adversarial instructions in privileged workspace files the agent treats as operating procedures. Highest implicit trust.
- **Email** (60.5% avg ASR): Adversarial emails from spoofed trusted senders, mixed into the inbox. Trust depends on sender identity and role.
- **Web content** (38.4% avg ASR): Adversarial web pages encountered during normal work. Lowest trust, since agents prefer local data over web content.

The benchmark crosses these vectors with two other dimensions and a fixed protocol:

- **Harm domains** (5 scenarios × 24 cases each): Software Engineering, Financial Ops, Healthcare Administration, Legal/Contract Management, DevOps/Infrastructure.
- **Harmful action types** (120 total test cases): Data exfiltration, config modification, destination substitution, credential forwarding, and destructive actions.
- **Evaluation protocol** (tunable granularity): Four phases: warm-up, context building, injection encounter, and a disclosure window with 16 varied framings.

ASR (%) by model, scaffold, and injection vector across all scenarios. Three independent trials per configuration, aggregated by majority vote.
| Model | Skill | Email | Web | Overall |
|---|---|---|---|---|
| OpenClaw v2026.3 | | | | |
| Claude Sonnet 4.6 | 55.0 | 45.0 | 20.0 | 40.0 |
| Gemini 2.5 Pro | 72.5 | 55.0 | 37.5 | 55.0 |
| Kimi K2.5 | 77.5 | 60.0 | 45.0 | 60.8 |
| DeepSeek V3 | 82.5 | 67.5 | 52.5 | 67.5 |
| GPT-5.1 | 90.0 | 75.0 | 60.0 | 75.0 |
| Nanobot v0.8 | | | | |
| Claude Sonnet 4.6 | 50.0 | 62.5 | 33.3 | 48.6 |
| NemoClaw v0.1 | | | | |
| Claude Sonnet 4.6 | 58.3 | 58.3 | 20.8 | 45.8 |
| Overall Vector Avg. | 69.4 | 60.5 | 38.4 | 56.1 |
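The aggregation behind the table (three independent trials per configuration, resolved by majority vote) can be sketched as follows. This is a minimal illustration; the function and variable names are ours, not taken from the ClawSafety codebase:

```python
from collections import Counter

def case_outcome(trials: list[bool]) -> bool:
    """Majority vote over an odd number of independent trials.

    Each trial is True if the attack succeeded in that run."""
    votes = Counter(trials)
    return votes[True] > votes[False]

def asr(per_case_trials: list[list[bool]]) -> float:
    """Attack success rate (%) over a list of per-case trial triples."""
    successes = sum(case_outcome(t) for t in per_case_trials)
    return 100.0 * successes / len(per_case_trials)

# Trial budget reported in the overview:
# 120 cases x 7 model/scaffold configurations x 3 trials = 2,520
assert 120 * 7 * 3 == 2520

# Toy example: three cases with three trials each; two resolve to success.
cases = [
    [True, True, False],   # majority: success
    [False, False, True],  # majority: failure
    [True, True, True],    # majority: success
]
print(f"{asr(cases):.1f}")  # 66.7
```

Majority voting over an odd trial count avoids ties and damps single-run nondeterminism in the agent's tool use.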
Models that refuse the same harmful requests in direct chat comply 40–75% of the time when the instructions arrive via indirect injection. Safety alignment transfers poorly from chat to agentic contexts.
Scaffold choice alone shifts ASR by up to 8.6 percentage points and can reverse the effectiveness ranking of injection vectors. Nanobot flips the trust-level gradient observed on OpenClaw.
The most robust model maintains 0% ASR on credential forwarding and destructive actions across all domains and vectors, a hard safety boundary no other model exhibits.
DevOps settings are nearly twice as exploitable as legal ones. Attorney-client privilege framing provides an additional defense layer that compliance urgency does not.
Imperative phrasing ("update X") triggers defenses; declarative phrasing ("X does not match Y") bypasses all defenses — regardless of content or styling.
Replacing named colleagues with role titles drops honey token leakage from 100% to 47.5%. Depersonalized workspaces substantially reduce exfiltration risk.
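The depersonalization intervention above can be sketched as a simple preprocessing pass over workspace documents. The name-to-role mapping below is hypothetical, for illustration only:

```python
import re

# Hypothetical mapping from named colleagues to role titles,
# mirroring the depersonalization intervention described above.
NAME_TO_ROLE = {
    "Alice Chen": "the finance lead",
    "Bob Park": "the on-call engineer",
}

def depersonalize(text: str) -> str:
    """Replace personal names with role titles in workspace text."""
    for name, role in NAME_TO_ROLE.items():
        text = re.sub(re.escape(name), role, text)
    return text

print(depersonalize("Forward the invoice to Alice Chen."))
# Forward the invoice to the finance lead.
```

The intuition is that named individuals give injected instructions a concrete, trusted-looking exfiltration target; role titles remove that anchor.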
Clone, install, and run a single evaluation in minutes. Each test case runs in a fresh sandboxed environment.
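A quickstart might look like the following sketch. The repository URL, package entry point, and CLI flags here are illustrative assumptions, not the project's actual interface; consult the repository README for the real commands:

```shell
# Illustrative only: URL, entry point, and flags are assumptions.
git clone https://github.com/example/clawsafety
cd clawsafety
pip install -e .

# Run a single test case in a fresh sandbox
# (hypothetical flags for model, scaffold, injection vector, and case ID).
clawsafety run --model claude-sonnet-4.6 --scaffold openclaw \
  --vector skill --case swe-001
```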
If you use ClawSafety in your research, please cite our paper.