Typing TTL Bug Silently Drops 10% of Messages

An OpenClaw bug caused roughly 10% of messages in certain high-tool-use workflows to vanish silently, logged only as 'typing TTL reached (2m)'. There was no user-facing error and no retry. A Codex AI agent autonomously spent about 40 minutes across a 24-hour window diagnosing the issue before a human engineer caught on.

CONFIRMED

💾 DATA LOSS

Incident Brief

OpenClaw's message pipeline included a 2-minute 'typing TTL': if a message took longer than 2 minutes to finish generating (for example, because of backend slowness or a long tool-call chain), the pipeline timed out and discarded the message. The bug was twofold. First, the TTL was applied silently, so no error surfaced to the user. Second, the TTL was reached far more often than intended because the typing state was not refreshed after tool calls. In aggregate, about 10% of messages in certain high-tool-use workflows simply vanished.

The failure mode was particularly insidious because the only trace was a log line ('typing TTL reached (2m)') buried in the gateway logs. A Codex AI agent assigned to investigate message-loss reports worked the problem autonomously for roughly 40 minutes of wall-clock time spread across a 24-hour window. It eventually identified the log pattern and correlated it with affected conversations before a human engineer caught on.

Root Cause

The Actual Culprit

Typing TTL was applied to message generation, not just typing-state UI. Tool calls did not refresh the typing state. Failures were logged but not surfaced to users.
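The coupling described above can be illustrated with a small simulation on a fake clock. This is only a sketch under stated assumptions: `TypingState`, `run_generation`, and the step durations are hypothetical names and values chosen for illustration, not OpenClaw's actual code. The only details taken from the incident are the 2-minute TTL and the log text.

```python
from dataclasses import dataclass

TYPING_TTL_SECONDS = 120  # the 2-minute typing TTL from the incident


@dataclass
class TypingState:
    """Hypothetical typing-state record with a TTL measured from its last refresh."""
    started_at: float
    ttl: float = TYPING_TTL_SECONDS

    def refresh(self, now: float) -> None:
        self.started_at = now

    def expired(self, now: float) -> bool:
        return now - self.started_at >= self.ttl


def run_generation(step_durations, refresh_after_tool_call):
    """Simulate a generation made of tool-call steps (durations in seconds).

    Returns (delivered, log_lines). This mirrors the buggy coupling:
    when the typing TTL expires, the message itself is discarded and
    the only trace is a log line.
    """
    now = 0.0
    typing = TypingState(started_at=now)
    logs = []
    for duration in step_durations:
        now += duration  # a tool call finishes
        if typing.expired(now):
            logs.append("typing TTL reached (2m)")
            return False, logs  # message silently dropped
        if refresh_after_tool_call:
            typing.refresh(now)
    return True, logs


# Three 50-second tool calls: 150 seconds in total, but no single step exceeds the TTL.
dropped = run_generation([50, 50, 50], refresh_after_tool_call=False)
kept = run_generation([50, 50, 50], refresh_after_tool_call=True)
```

In this sketch, the run without refresh is dropped once the cumulative time passes 120 seconds, while the run that refreshes after each tool call is delivered. Refreshing alone does not remove the coupling between the UI timer and delivery; it only makes the timer fire less often.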

What Was Done

[OK] Typing state refreshed after each tool call

[OK] TTL separated from generation; TTL now applies only to UI state

[OK] Dropped messages surface a user-facing error + retry button

[OK] Log-based alerting added for any TTL-triggered drop above baseline
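The log-based alerting item could be sketched roughly as follows. The function names, the per-window message count, and the baseline value are assumptions made for illustration. The source gives only the log text 'typing TTL reached (2m)' and does not describe the real alerting rule.

```python
TTL_DROP_MARKER = "typing TTL reached (2m)"  # log text quoted in the incident


def ttl_drop_rate(log_lines, total_messages):
    """Fraction of messages in a window whose drop was logged with the TTL marker."""
    if total_messages <= 0:
        return 0.0
    drops = sum(1 for line in log_lines if TTL_DROP_MARKER in line)
    return drops / total_messages


def should_alert(log_lines, total_messages, baseline_rate=0.0):
    """True when the observed drop rate exceeds the baseline.

    baseline_rate is a hypothetical tunable; 0.0 means any TTL drop alerts.
    """
    return ttl_drop_rate(log_lines, total_messages) > baseline_rate


# Hypothetical gateway log window: one TTL drop among ten messages.
window = ["gateway: message delivered"] * 9 + ["gateway: typing TTL reached (2m)"]
alert = should_alert(window, total_messages=10, baseline_rate=0.01)
```

Matching on the literal log text keeps the check simple, but it also means the alert goes quiet if the log message is ever reworded. A structured event or counter would be sturdier if the pipeline can emit one.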

Lessons Learned

Silent drops are the worst failure mode

A failure that the user never sees is a failure the user can't complain about — and therefore a failure you'll never fix. Every drop must surface somewhere.

UI-state timers should not gate data operations

If a typing-indicator TTL is causing your messages to vanish, your timer is coupled to the wrong thing. UI state and data delivery are separate concerns.
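One way to picture the separation is a typing indicator whose expiry only hides the indicator, alongside a delivery path that always returns an explicit result. The sketch below assumes hypothetical `TypingIndicator`, `DeliveryResult`, and `deliver` names and is not taken from OpenClaw's code.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class DeliveryResult:
    """Explicit outcome so that a failure is never silent (hypothetical type)."""
    delivered: bool
    text: Optional[str] = None
    error: Optional[str] = None
    can_retry: bool = False


class TypingIndicator:
    """UI-only timer: expiry hides the indicator and touches nothing else."""

    def __init__(self, ttl_seconds: float = 120.0):
        self.ttl_seconds = ttl_seconds
        self.visible = False
        self._shown_at = 0.0

    def show(self, now: float) -> None:
        self.visible = True
        self._shown_at = now

    def tick(self, now: float) -> None:
        if self.visible and now - self._shown_at >= self.ttl_seconds:
            self.visible = False

    def hide(self) -> None:
        self.visible = False


def deliver(generate: Callable[[], str], indicator: TypingIndicator,
            clock: Callable[[], float]) -> DeliveryResult:
    """Run generation; the indicator's state never decides whether text is kept."""
    indicator.show(clock())
    try:
        text = generate()
    except Exception as exc:  # surface the failure instead of dropping it
        indicator.hide()
        return DeliveryResult(False, error=f"Message failed: {exc}", can_retry=True)
    indicator.hide()
    return DeliveryResult(True, text=text)
```

In this arrangement a slow generation may outlive the indicator, which simply disappears, while the message is still delivered. A real failure comes back as an error the UI can show with a retry option.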

AI agents can beat humans to root cause

A Codex agent running in the background correlated logs faster than a human on-call could notice the issue. Build this into your incident-response workflow.
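The kind of log correlation involved might look like the sketch below, which groups TTL drop lines by conversation. The `conversation=<id>` field is a hypothetical log format invented for this example. The source does not show the gateway's real log layout, only the text 'typing TTL reached (2m)'.

```python
import re
from collections import Counter

TTL_DROP_MARKER = "typing TTL reached (2m)"  # log text quoted in the incident
# Hypothetical field: assumes each log line carries "conversation=<id>".
CONVERSATION_ID = re.compile(r"conversation=(\S+)")


def drops_by_conversation(log_lines):
    """Count TTL-triggered drops per conversation ID found in the log lines."""
    counts = Counter()
    for line in log_lines:
        if TTL_DROP_MARKER not in line:
            continue
        match = CONVERSATION_ID.search(line)
        counts[match.group(1) if match else "unknown"] += 1
    return counts


# Hypothetical sample lines for illustration only.
sample = [
    "conversation=a1 typing TTL reached (2m)",
    "conversation=a1 typing TTL reached (2m)",
    "conversation=b2 message delivered",
    "typing TTL reached (2m)",
]
affected = drops_by_conversation(sample)
```

Sorting the result with `affected.most_common()` would list the most affected conversations first, which can then be checked against the message-loss reports.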

Case Info

Case Number: #0069

Severity: ⚠️ P2 HIGH

Date: 2026-03-20

Affected Systems: Message Pipeline