OpenClaw Scores 2/100 on ZeroLeaks Prompt Security Audit

An independent ZeroLeaks benchmark ran against OpenClaw's default agent configuration and scored it 2 out of 100 on prompt security: 84% system-prompt extraction success, 91% prompt-injection success, and full system prompt leakage on turn 1 of interaction.

CONFIRMED

🔓 SECURITY LEAK📢 PR NIGHTMARE

Incident Brief

ZeroLeaks, a widely-used prompt-security benchmark, evaluated OpenClaw's shipped default agent against its standard test battery. The result — 2/100 — was described by the benchmark's maintainers as the worst score recorded against a mainstream AI framework. 84% of system-prompt extraction attempts succeeded. 91% of prompt-injection attacks caused the agent to execute attacker instructions over operator instructions. In one test, the full system prompt (including tool descriptions and internal policies) was exfiltrated in a single turn with no evasion effort. Because OpenClaw agents routinely held sensitive workflows, API keys in memory files, and proprietary skill configurations, the implication was that every deployed agent was treating its configuration as public by default.

Root Cause

The Actual Culprit

OpenClaw used a single system-prompt layer with no output filtering, no instruction-hierarchy enforcement, and no separation between trusted operator instructions and untrusted user input.

What Was Done

[OK]Instruction-hierarchy enforcement added (operator > user > tool output)

[--]Output filter that detects system-prompt echo before returning to user

[OK]Tool output sandboxed and tagged as untrusted in the context

[--]Public response framing benchmark methodology

Lessons Learned

eye