All Glossary Terms
Reliability

What is a dead-letter queue?

A dead-letter queue stores webhook events that failed processing after all retries — for manual inspection or replay later. Without a DLQ, exhausted retries silently drop events or bury them in error logs. With one, failures are durably stored, indexed by event ID, and available for replay once the underlying issue is fixed. Implement as a Postgres table (`dead_letter_events` with `event_id`, `body`, `headers`, `last_error`, `failed_at`), an SQS/Kafka built-in DLQ, or S3-backed JSON files for high-volume cases. Hygiene: alert on growth, build a one-click replay UI, never auto-replay (it loops forever if the bug persists), set 30-90 day retention.

A dead-letter queue (DLQ) is where events go when retries give up. The retry policy says "try N times, then stop." Without a DLQ, those events are lost — silently dropped or buried in error logs. With a DLQ, they're durably stored, indexed by event ID, and available for manual replay once the underlying issue is fixed.

The shape: any time your handler (or your worker, downstream of the handler) fails an event past its retry budget, it writes the event — full body, headers, attempt count, last error — to a DLQ table or queue. The DLQ is read-only to the normal processing path; failures don't loop back.
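This flow can be sketched in a few lines of Python. It is a minimal, illustrative sketch, not a production implementation: `MAX_ATTEMPTS`, `process_with_retries`, and the in-memory `dead_letter_events` list are all hypothetical names standing in for your retry budget, worker loop, and durable DLQ table.

```python
import json
from datetime import datetime, timezone

MAX_ATTEMPTS = 5         # hypothetical retry budget
dead_letter_events = []  # stands in for a durable DLQ table or queue

def process_with_retries(event, handler):
    """Run the handler with retries; on exhaustion, write the event to the DLQ."""
    last_error = None
    for _ in range(MAX_ATTEMPTS):
        try:
            handler(event)
            return True
        except Exception as exc:
            last_error = str(exc)
    # Retry budget exhausted: preserve the full event -- body, headers,
    # attempt count, last error -- for inspection and later replay.
    dead_letter_events.append({
        "event_id": event["event_id"],
        "body": json.dumps(event["body"]),
        "headers": event.get("headers", {}),
        "attempt_count": MAX_ATTEMPTS,
        "last_error": last_error,
        "failed_at": datetime.now(timezone.utc).isoformat(),
    })
    return False
```

Note that the DLQ write happens only after the final failure, and nothing in the normal processing path ever reads from `dead_letter_events` — failures don't loop back.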

Why a DLQ matters: when something breaks, you want the failures preserved for two reasons. Replay: once you fix the bug, you can re-process the failed events and get back to a consistent state. Forensics: the failed events tell you what went wrong. "10,000 events failed with 'invalid currency code'" is a much better debugging signal than "lots of 500s in the logs."
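The forensics angle is just a group-by over the DLQ's `last_error` column. A small sketch (the `failure_summary` helper and the sample rows are hypothetical):

```python
from collections import Counter

def failure_summary(dlq_rows):
    """Group dead-lettered events by their last error, most common first."""
    return Counter(row["last_error"] for row in dlq_rows).most_common()

rows = [
    {"event_id": "evt_1", "last_error": "invalid currency code"},
    {"event_id": "evt_2", "last_error": "invalid currency code"},
    {"event_id": "evt_3", "last_error": "timeout"},
]
failure_summary(rows)
# -> [('invalid currency code', 2), ('timeout', 1)]
```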

Implementation choices:

- Postgres table. `dead_letter_events(event_id, source, body, headers, last_error, failed_at, attempt_count)`. Simple, queryable from the admin UI, no extra infra. Fine until you're at high volume (>100k DLQ events).
- SQS/Kafka/RabbitMQ DLQ feature. Most queue systems have a built-in DLQ — events that exceed max-receive-count automatically move to a designated DLQ. Use this if you're already using these queues.
- Object storage (S3, R2). For very high volume or long retention, write each failed event as a JSON file. Cheap, durable, but not as queryable.
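For the table option, the schema and the kind of admin query it supports look roughly like this. The sketch uses SQLite as a stand-in for Postgres so it runs anywhere; the DDL carries over with minor type changes (`TIMESTAMPTZ`, `JSONB`).

```python
import sqlite3

# SQLite stand-in for the Postgres dead_letter_events table described above.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dead_letter_events (
        event_id      TEXT PRIMARY KEY,
        source        TEXT NOT NULL,
        body          TEXT NOT NULL,
        headers       TEXT NOT NULL,
        last_error    TEXT,
        failed_at     TEXT NOT NULL,
        attempt_count INTEGER NOT NULL
    )
""")
conn.execute(
    "INSERT INTO dead_letter_events VALUES (?, ?, ?, ?, ?, ?, ?)",
    ("evt_123", "payments", '{"amount": 100}', "{}",
     "invalid currency code", "2024-01-01T00:00:00Z", 5),
)
# The kind of query an admin UI runs: what failed, and why?
rows = conn.execute(
    "SELECT event_id, last_error FROM dead_letter_events ORDER BY failed_at"
).fetchall()
```

The `event_id` primary key doubles as idempotency protection: dead-lettering the same event twice fails loudly instead of duplicating rows.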

DLQ hygiene:

- Alert on growth. A DLQ with steady-state size 0 is healthy. A DLQ growing by 100 events/hour is a symptom of a bug; alert before it's 100,000.
- Build a replay UI. Pulling events out of the DLQ and re-pushing them through the handler should be a one-click operation. If replay requires SQL surgery, you'll never replay anything.
- Don't auto-replay blindly. A bug that put events in the DLQ may still be in the code; auto-replaying creates an infinite loop. Replay should be an explicit human-triggered action after the bug is fixed.
- Set retention. DLQ entries past 30-90 days are usually unrecoverable anyway — the source has aged out the original event, the source-side dashboard no longer shows it, the business state has moved on. Age them out.
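The replay path can be sketched as an explicit, human-triggered function. A key design choice shown here: an event leaves the DLQ only when the handler succeeds, so anything still failing stays put with an updated error. (The `replay` function and its DLQ-as-list representation are illustrative assumptions.)

```python
def replay(dlq, handler):
    """Re-run dead-lettered events through the handler after a fix ships.

    Called deliberately by an operator -- never on a timer -- so a
    still-broken handler can't send events around in a loop.
    """
    still_failing = []
    replayed = 0
    for entry in dlq:
        try:
            handler(entry)
            replayed += 1
        except Exception as exc:
            entry["last_error"] = str(exc)  # refresh the error for forensics
            still_failing.append(entry)
    dlq[:] = still_failing  # mutate in place: only successes leave the DLQ
    return replayed
```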

DLQs apply to outbound webhooks too. If you're sending webhooks to customer endpoints and a customer's endpoint is broken for a week, eventually retries exhaust — those events should be in a DLQ the customer can inspect, not silently lost.

See Dead-Letter Queue in real traffic

WebhookWhisper captures every webhook with full headers, body, signature, and timing — so concepts like dead-letter queue stop being abstract and become something you can inspect.

Start Free