An OpenClaw bug caused roughly 10% of messages in certain high-tool-use workflows to vanish silently, leaving only a 'typing TTL reached (2m)' log line: no user-facing error, no retry. A Codex AI agent spent roughly 40 minutes, spread across 24 hours, autonomously diagnosing the issue before a human engineer noticed it.
OpenClaw's message pipeline included a 2-minute 'typing TTL': if a message took longer than 2 minutes to finish generating (for example, because of backend slowness or a long tool-call chain), the pipeline timed out and discarded it. The bug was twofold. First, the TTL fired silently, so no error ever reached the user. Second, it fired far more often than intended, because the typing state was not refreshed after tool calls.
In aggregate, about 10% of messages in certain high-tool-use workflows simply vanished. The failure mode was particularly insidious because the only trace was a log line ('typing TTL reached (2m)') buried in the gateway logs.
A Codex AI agent assigned to investigate message-loss reports worked the problem autonomously for roughly 40 minutes in total, spread across a 24-hour window. It eventually identified the log pattern and correlated it with the affected conversations, all before a human engineer noticed the problem.
The Actual Culprit
Three things combined. The typing TTL governed message generation itself, not just the typing-indicator UI. Tool calls did not refresh the typing state. And failures were logged but never surfaced to users.
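To make the missing refresh concrete, here is a minimal simulation, not OpenClaw's actual code. The names `TypingState`, `FakeClock`, and `ttl_reached` are hypothetical, and the only value taken from the incident is the 2-minute TTL. It shows how a chain of tool calls that each finish well under the TTL can still trip it when nothing refreshes the typing state in between.

```python
TYPING_TTL_SECONDS = 120.0  # the 2-minute TTL named in the log line


class FakeClock:
    """Manually advanced clock so the simulation runs instantly."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now

    def advance(self, seconds: float) -> None:
        self.now += seconds


class TypingState:
    """Hypothetical typing-state tracker (an assumption, not OpenClaw's class)."""

    def __init__(self, clock, ttl: float = TYPING_TTL_SECONDS) -> None:
        self.clock = clock
        self.ttl = ttl
        self.last_refresh = clock()

    def refresh(self) -> None:
        # Restart the TTL window from the current time.
        self.last_refresh = self.clock()

    def expired(self) -> bool:
        return self.clock() - self.last_refresh >= self.ttl


def ttl_reached(tool_call_seconds, refresh_after_tool_call: bool) -> bool:
    """Return True if the TTL fires at any point during the tool-call chain."""
    clock = FakeClock()
    state = TypingState(clock)
    for seconds in tool_call_seconds:
        clock.advance(seconds)  # the tool call runs for this long
        if state.expired():
            return True
        if refresh_after_tool_call:
            state.refresh()
    return False


# Three 50-second tool calls: 150 seconds in total, each well under the TTL.
without_refresh = ttl_reached([50, 50, 50], refresh_after_tool_call=False)
with_refresh = ttl_reached([50, 50, 50], refresh_after_tool_call=True)
```

In this sketch the chain trips the TTL only when refreshes are skipped. Note that a single step longer than the TTL would still expire it even with refreshes, which is one reason refreshing alone is not a complete fix.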
A failure that the user never sees is a failure the user can't complain about — and therefore a failure you'll never fix. Every drop must surface somewhere.
If a typing-indicator TTL is causing your messages to vanish, your timer is coupled to the wrong thing. UI state and data delivery are separate concerns.
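One way to picture the separation of concerns is sketched below. It is an illustration under assumptions rather than a description of how OpenClaw was actually fixed: the `Turn` class, its method names, and the error wording are all hypothetical. The typing TTL only switches the indicator off, and every turn ends by sending the user either the reply or an explicit error.

```python
from typing import Callable


class Turn:
    """Hypothetical per-message state; `send` delivers text to the user."""

    def __init__(self, send: Callable[[str], None]) -> None:
        self.send = send
        self.indicator_on = True

    def on_typing_ttl(self) -> None:
        # UI concern only: hide the indicator, leave generation running.
        self.indicator_on = False

    def on_generation_done(self, text: str) -> None:
        self.indicator_on = False
        self.send(text)

    def on_generation_failed(self, reason: str) -> None:
        # A failed turn still produces something the user can see.
        self.indicator_on = False
        self.send(f"[error] Reply could not be generated: {reason}")


# A slow turn: the indicator expires first, but the reply still arrives.
slow_outbox: list[str] = []
slow_turn = Turn(slow_outbox.append)
slow_turn.on_typing_ttl()
slow_turn.on_generation_done("Here is your answer.")

# A failed turn: the user gets an explicit error instead of silence.
failed_outbox: list[str] = []
failed_turn = Turn(failed_outbox.append)
failed_turn.on_generation_failed("backend timeout")
```

The design choice here is that no code path ends a turn without calling `send`, so a drop cannot be silent by construction.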
A Codex agent running in the background correlated the logs before a human on-call engineer noticed the issue. Build this kind of background log correlation into your incident-response workflow.
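The kind of correlation described above can be approximated with a very small script. The sketch below counts occurrences of the 'typing TTL reached (2m)' marker per conversation. The marker string comes from the incident; everything else is an assumption, including the `conv=<id>` field, the overall line layout, and the sample lines, which are made up for illustration and are not real gateway logs.

```python
import re
from collections import Counter
from typing import Iterable

TTL_MARKER = "typing TTL reached (2m)"  # log text reported in the incident
CONV_ID = re.compile(r"conv=(\S+)")     # assumed field; adapt to your log format


def ttl_hits_by_conversation(lines: Iterable[str]) -> Counter:
    """Count TTL-expiry log lines per conversation id."""
    hits: Counter = Counter()
    for line in lines:
        if TTL_MARKER not in line:
            continue
        match = CONV_ID.search(line)
        hits[match.group(1) if match else "unknown"] += 1
    return hits


# Made-up sample lines in the assumed format.
sample_lines = [
    "2024-01-01T10:00:00Z conv=a1 typing TTL reached (2m)",
    "2024-01-01T10:01:00Z conv=b2 message delivered",
    "2024-01-01T10:05:00Z conv=a1 typing TTL reached (2m)",
    "2024-01-01T10:07:00Z typing TTL reached (2m)",
]
hits = ttl_hits_by_conversation(sample_lines)
```

Joining the resulting conversation ids against message-loss reports is the step that turns a buried log line into a diagnosis. Running a check like this on a schedule, or alerting when the count crosses a threshold, is one way to fold it into an on-call workflow.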