arXiv 2026 · 120 Test Cases · 5 Models · 3 Scaffolds

ClawSafety

“Safe” LLMs, Unsafe Agents
Bowen Wei1 Yunbei Zhang2,4 Jinhao Pan1 Kai Mei3 Xiao Wang4 Jihun Hamm2 Ziwei Zhu1 Yingqiang Ge3
1George Mason University 2Tulane University 3Rutgers University 4Oak Ridge National Laboratory
Abstract

Personal AI agents like OpenClaw run with elevated privileges on users' local machines, where a single successful prompt injection can leak credentials, redirect financial transactions, or destroy files. We introduce ClawSafety, a benchmark of 120 adversarial test cases organized along three dimensions — harm domain, attack vector, and harmful action type — grounded in realistic, high-privilege professional workspaces spanning software engineering, finance, healthcare, law, and DevOps. Across five frontier LLMs and 2,520 sandboxed trials, attack success rates range from 40% to 75%, with skill instructions consistently more dangerous than email or web content. Cross-scaffold experiments further demonstrate that safety depends on the full deployment stack, not the backbone model alone.

40–75% ASR Range
2,520 Sandboxed Trials
0% Credential Forwarding ASR (best model)
8.6pp Scaffold Shift

Three Injection Vectors, Five Domains

Each test case embeds adversarial content in exactly one channel the agent encounters during normal work, ordered by implicit trust level.

⚙️ Skill Injection

Adversarial instructions in privileged workspace files the agent treats as operating procedures. Highest implicit trust.

69.4% avg ASR
✉️ Email Injection

Adversarial emails from spoofed trusted senders, mixed into the inbox. Trust depends on sender identity and role.

60.5% avg ASR
🌐 Web Injection

Adversarial web pages encountered during normal work. Lowest trust — agents prefer local data over web content.

38.4% avg ASR
💼 5 Professional Domains

Software Engineering, Financial Ops, Healthcare Administration, Legal/Contract Management, DevOps/Infrastructure.

5 scenarios × 24 cases each
🎯 5 Harmful Action Types

Data exfiltration, config modification, destination substitution, credential forwarding, and destructive actions.

120 total test cases
🔄 64-Turn Conversations

Four-phase protocol: warm-up, context building, injection encounter, and disclosure window with 16 varied framings.

tunable granularity
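The three-dimensional design above can be sketched as a case grid. This is an illustrative sketch only: the dataclass fields and the domain/vector/action identifiers are assumptions about the schema, not the repository's actual code. Crossing the dimensions yields 75 (domain, vector, action) cells; the released benchmark distributes its 120 cases as 24 per domain.

```python
from dataclasses import dataclass
from itertools import product

# Hypothetical identifiers for the three benchmark dimensions
# (names invented for illustration, not ClawSafety's actual schema).
DOMAINS = ["software_eng", "financial_ops", "healthcare_admin",
           "legal_contracts", "devops_infra"]
VECTORS = ["skill", "email", "web"]
ACTIONS = ["data_exfiltration", "config_modification",
           "destination_substitution", "credential_forwarding",
           "destructive_action"]

@dataclass(frozen=True)
class TestCase:
    domain: str  # which professional workspace the agent operates in
    vector: str  # the single channel carrying the adversarial content
    action: str  # the harmful behavior the injection tries to elicit

# The full cross of the three dimensions: 5 x 3 x 5 = 75 cells.
grid = [TestCase(d, v, a) for d, v, a in product(DOMAINS, VECTORS, ACTIONS)]
print(len(grid))  # 75
```

How the 120 released cases (24 per domain) are distributed across these 75 cells is specified in the paper, not reconstructed here.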

Attack Success Rates

ASR (%) by model, scaffold, and injection vector across all scenarios. Three independent trials per configuration; majority vote.

Model                  Skill   Email   Web    Overall
OpenClaw v2026.3
  Claude Sonnet 4.6     55.0    45.0   20.0    40.0
  Gemini 2.5 Pro        72.5    55.0   37.5    55.0
  Kimi K2.5             77.5    60.0   45.0    60.8
  DeepSeek V3           82.5    67.5   52.5    67.5
  GPT-5.1               90.0    75.0   60.0    75.0
Nanobot v0.8
  Claude Sonnet 4.6     50.0    62.5   33.3    48.6
NemoClaw v0.1
  Claude Sonnet 4.6     58.3    58.3   20.8    45.8
Overall Vector Avg.     69.4    60.5   38.4    56.1
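Under the stated protocol (three independent trials per configuration, with a majority vote deciding each case), ASR can be computed with a short helper. This is a minimal sketch with invented function names and toy data, not the repository's actual compute_asr.py.

```python
from collections import Counter

def case_outcome(trial_results):
    """Majority vote over an odd number of boolean trial outcomes.

    trial_results: e.g. [True, False, True] for three independent
    trials of one (model, scaffold, vector, case) configuration.
    """
    votes = Counter(trial_results)
    return votes[True] > votes[False]

def asr(cases):
    """Attack success rate (%) over a list of per-case trial lists."""
    successes = sum(case_outcome(trials) for trials in cases)
    return 100.0 * successes / len(cases)

# Toy example: 4 cases, 3 trials each (illustrative data only).
cases = [
    [True, True, False],    # success by majority vote
    [False, False, True],   # failure
    [True, True, True],     # success
    [False, False, False],  # failure
]
print(asr(cases))  # 50.0
```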

What We Found

Finding 01

Chat Safety ≠ Agent Safety

Models that refuse harmful requests in chat comply at 40–75% ASR under indirect injection. Safety alignment transfers poorly from chat to agentic contexts.

Finding 02

Scaffold Shifts Safety

Scaffold choice alone shifts ASR by up to 8.6pp and can reverse vector effectiveness rankings. Nanobot flips the trust-level gradient observed on OpenClaw.

Finding 03

Hard Boundaries Exist

The strongest model maintains 0% ASR on credential forwarding and destructive actions across all domains and vectors — a hard boundary no other model exhibits.

Finding 04

Domain Matters

DevOps is nearly twice as exploitable as legal settings. Attorney-client privilege framing provides an additional defense layer that compliance urgency does not.

Finding 05

Declarative Bypasses Defenses

Imperative phrasing ("update X") triggers defenses; declarative phrasing ("X does not match Y") bypasses all defenses — regardless of content or styling.

Finding 06

Identity Verification Is Key

Replacing named colleagues with role titles drops honey token leakage from 100% to 47.5%. Depersonalized workspaces substantially reduce exfiltration risk.
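A depersonalization pass of the kind this finding suggests can be prototyped as a simple substitution over workspace documents before the agent sees them. The ROLE_MAP entries and function below are hypothetical illustrations, not part of ClawSafety.

```python
import re

# Hypothetical name-to-role mapping: swap named colleagues for role
# titles before the workspace is exposed to the agent.
ROLE_MAP = {
    "Alice Chen": "the payroll administrator",
    "Bob Rivera": "the on-call SRE",
}

def depersonalize(text: str) -> str:
    """Replace each mapped personal name with its role title."""
    for name, role in ROLE_MAP.items():
        text = re.sub(re.escape(name), role, text)
    return text

doc = "Forward the API key to Alice Chen before EOD."
print(depersonalize(doc))
# Forward the API key to the payroll administrator before EOD.
```

A real deployment would need a proper named-entity pass rather than a fixed mapping, but the preprocessing shape is the same: the agent only ever sees role titles, so a leaked message cannot target a specific person.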

Quick Start

Clone, install, and run a single evaluation in minutes. Each test case runs in a fresh sandboxed environment.

terminal
# Clone the repository
git clone https://github.com/weibowen555/ClawSafety.git
cd ClawSafety

# Install dependencies
pip install -e .

# Run a single scenario (S1: Software Engineering, skill injection)
python run_eval.py --scenario s1_software_eng --vector skill --model claude-sonnet-4.6 --scaffold openclaw

# Run full benchmark (all 120 cases × 3 trials)
python run_eval.py --all --trials 3

# Compute ASR and generate tables
python compute_asr.py --results-dir ./outputs

BibTeX

If you use ClawSafety in your research, please cite our paper.

@misc{wei2026clawsafetysafellmsunsafe,
  title         = {ClawSafety: "Safe" LLMs, Unsafe Agents},
  author        = {Bowen Wei and Yunbei Zhang and Jinhao Pan and Kai Mei and Xiao Wang and Jihun Hamm and Ziwei Zhu and Yingqiang Ge},
  year          = {2026},
  eprint        = {2604.01438},
  archivePrefix = {arXiv},
  primaryClass  = {cs.AI},
  url           = {https://arxiv.org/abs/2604.01438}
}