arXiv 2026 · 120 Test Cases · 5 Models · 3 Scaffolds

ClawSafety

“Safe” LLMs, Unsafe Agents
Bowen Wei1 Yunbei Zhang2,4 Jinhao Pan1 Kai Mei3 Xiao Wang4 Jihun Hamm2 Ziwei Zhu1 Yingqiang Ge3
1George Mason University 2Tulane University 3Rutgers University 4Oak Ridge National Laboratory
Abstract

Personal AI agents like OpenClaw run with elevated privileges on users' local machines, where a single successful prompt injection can leak credentials, redirect financial transactions, or destroy files. We introduce ClawSafety, a benchmark of 120 adversarial test cases organized along three dimensions — harm domain, attack vector, and harmful action type — grounded in realistic, high-privilege professional workspaces spanning software engineering, finance, healthcare, law, and DevOps. Across five frontier LLMs and 2,520 sandboxed trials, attack success rates range from 40% to 75%, with skill instructions consistently more dangerous than email or web content. Cross-scaffold experiments further demonstrate that safety depends on the full deployment stack, not the backbone model alone.

40–75% ASR Range
2,520 Sandboxed Trials
0% Credential Forwarding ASR (best model)
8.6pp Scaffold Shift

Three Injection Vectors, Five Domains

Each test case embeds adversarial content in exactly one channel the agent encounters during normal work, ordered by implicit trust level.

⚙️ Skill Injection

Adversarial instructions in privileged workspace files the agent treats as operating procedures. Highest implicit trust.

69.4% avg ASR
✉️ Email Injection

Adversarial emails from spoofed trusted senders, mixed into the inbox. Trust depends on sender identity and role.

60.5% avg ASR
🌐 Web Injection

Adversarial web pages encountered during normal work. Lowest trust — agents prefer local data over web content.

38.4% avg ASR
💼 5 Professional Domains

Software Engineering, Financial Ops, Healthcare Administration, Legal/Contract Management, DevOps/Infrastructure.

5 scenarios × 24 cases each
🎯 5 Harmful Action Types

Data exfiltration, config modification, destination substitution, credential forwarding, and destructive actions.

120 total test cases
🔄 64-Turn Conversations

Four-phase protocol: warm-up, context building, injection encounter, and disclosure window with 16 varied framings.

tunable granularity
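The three-dimensional design above can be sketched as a case grid. This is an illustrative sketch only: the dataclass fields and the domain/vector/action identifiers are assumptions about the schema, not the repository's actual code. Crossing the dimensions yields 75 (domain, vector, action) cells; the released benchmark distributes its 120 cases as 24 per domain.

```python
from dataclasses import dataclass
from itertools import product

# Hypothetical identifiers for the three benchmark dimensions
# (names invented for illustration, not ClawSafety's actual schema).
DOMAINS = ["software_eng", "financial_ops", "healthcare_admin",
           "legal_contracts", "devops_infra"]
VECTORS = ["skill", "email", "web"]
ACTIONS = ["data_exfiltration", "config_modification",
           "destination_substitution", "credential_forwarding",
           "destructive_action"]

@dataclass(frozen=True)
class TestCase:
    domain: str  # which professional workspace the agent operates in
    vector: str  # the single channel carrying the adversarial content
    action: str  # the harmful behavior the injection tries to elicit

# The full cross of the three dimensions: 5 x 3 x 5 = 75 cells.
grid = [TestCase(d, v, a) for d, v, a in product(DOMAINS, VECTORS, ACTIONS)]
print(len(grid))  # 75
```

How the 120 released cases (24 per domain) are distributed across these 75 cells is specified in the paper, not reconstructed here.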

Attack Success Rates

ASR (%) by model, scaffold, and injection vector across all scenarios. Three independent trials per configuration; majority vote.

Model                  Skill   Email   Web    Overall
OpenClaw v2026.3
  Claude Sonnet 4.6     55.0    45.0   20.0    40.0
  Gemini 2.5 Pro        72.5    55.0   37.5    55.0
  Kimi K2.5             77.5    60.0   45.0    60.8
  DeepSeek V3           82.5    67.5   52.5    67.5
  GPT-5.1               90.0    75.0   60.0    75.0
Nanobot v0.8
  Claude Sonnet 4.6     50.0    62.5   33.3    48.6
NemoClaw v0.1
  Claude Sonnet 4.6     58.3    58.3   20.8    45.8
Overall Vector Avg.     69.4    60.5   38.4    56.1
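Under the stated protocol (three independent trials per configuration, with a majority vote deciding each case), ASR can be computed with a short helper. This is a minimal sketch with invented function names and toy data, not the repository's actual compute_asr.py.

```python
from collections import Counter

def case_outcome(trial_results):
    """Majority vote over an odd number of boolean trial outcomes.

    trial_results: e.g. [True, False, True] for three independent
    trials of one (model, scaffold, vector, case) configuration.
    """
    votes = Counter(trial_results)
    return votes[True] > votes[False]

def asr(cases):
    """Attack success rate (%) over a list of per-case trial lists."""
    successes = sum(case_outcome(trials) for trials in cases)
    return 100.0 * successes / len(cases)

# Toy example: 4 cases, 3 trials each (illustrative data only).
cases = [
    [True, True, False],    # success by majority vote
    [False, False, True],   # failure
    [True, True, True],     # success
    [False, False, False],  # failure
]
print(asr(cases))  # 50.0
```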

What We Found

Finding 01

Chat Safety ≠ Agent Safety

Models that refuse harmful requests in chat comply at 40–75% ASR under indirect injection. Safety alignment transfers poorly from chat to agentic contexts.

Finding 02

Scaffold Shifts Safety

Scaffold choice alone shifts ASR by up to 8.6pp and can reverse vector effectiveness rankings. Nanobot flips the trust-level gradient observed on OpenClaw.

Finding 03

Hard Boundaries Exist

The strongest model maintains 0% ASR on credential forwarding and destructive actions across all domains and vectors — a hard boundary no other model exhibits.

Finding 04

Domain Matters

DevOps is nearly twice as exploitable as legal settings. Attorney-client privilege framing provides an additional defense layer that compliance urgency does not.

Finding 05

Declarative Bypasses Defenses

Imperative phrasing ("update X") triggers defenses; declarative phrasing ("X does not match Y") bypasses all defenses — regardless of content or styling.

Finding 06

Identity Verification Is Key

Replacing named colleagues with role titles drops honey token leakage from 100% to 47.5%. Depersonalized workspaces substantially reduce exfiltration risk.
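A depersonalization pass of the kind this finding suggests can be prototyped as a simple substitution over workspace documents before the agent sees them. The ROLE_MAP entries and function below are hypothetical illustrations, not part of ClawSafety.

```python
import re

# Hypothetical name-to-role mapping: swap named colleagues for role
# titles before the workspace is exposed to the agent.
ROLE_MAP = {
    "Alice Chen": "the payroll administrator",
    "Bob Rivera": "the on-call SRE",
}

def depersonalize(text: str) -> str:
    """Replace each mapped personal name with its role title."""
    for name, role in ROLE_MAP.items():
        text = re.sub(re.escape(name), role, text)
    return text

doc = "Forward the API key to Alice Chen before EOD."
print(depersonalize(doc))
# Forward the API key to the payroll administrator before EOD.
```

A real deployment would need a proper named-entity pass rather than a fixed mapping, but the preprocessing shape is the same: the agent only ever sees role titles, so a leaked message cannot target a specific person.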

Quick Start

Clone, install, and run a single evaluation in minutes. Each test case runs in a fresh sandboxed environment.

terminal
# Clone the repository
git clone https://github.com/weibowen555/ClawSafety.git
cd ClawSafety

# Install dependencies
pip install -e .

# Run a single scenario (S1: Software Engineering, skill injection)
python run_eval.py --scenario s1_software_eng --vector skill --model claude-sonnet-4.6 --scaffold openclaw

# Run full benchmark (all 120 cases × 3 trials)
python run_eval.py --all --trials 3

# Compute ASR and generate tables
python compute_asr.py --results-dir ./outputs

BibTeX

If you use ClawSafety in your research, please cite our paper.

@misc{wei2026clawsafetysafellmsunsafe,
  title         = {ClawSafety: "Safe" LLMs, Unsafe Agents},
  author        = {Bowen Wei and Yunbei Zhang and Jinhao Pan and Kai Mei and Xiao Wang and Jihun Hamm and Ziwei Zhu and Yingqiang Ge},
  year          = {2026},
  eprint        = {2604.01438},
  archivePrefix = {arXiv},
  primaryClass  = {cs.AI},
  url           = {https://arxiv.org/abs/2604.01438}
}