If you operate webhooks long enough, retry logic becomes the most consequential part of the stack. It's the difference between "we missed three events that day" and "we missed nothing the customer ever noticed." It's also the part that's least visible until something goes wrong: when retries work, you don't see them; when they fail, you find out from a customer support ticket weeks later.
This post is the consolidated reference: how every major webhook provider retries (with the actual schedule, not vague "exponential backoff"), why the schedules differ, what an idempotent receiver looks like, when to use a dead-letter queue, and the patterns that scale once you're processing more than a few thousand events a day. I run WebhookWhisper, so the receiver-side patterns here are also what we ship in our own forwarding stack.
Why webhooks retry at all
The provider's job is to deliver the event. They don't know whether your handler returned 500 because of a bug, because of a deploy, because of an overload, or because the public internet is briefly broken. To them, all of these look identical: the response wasn't a 2xx. Their only safe move is to retry.
The retry contract every major provider operates under is at-least-once delivery. That phrase has a specific meaning: your handler will receive duplicates under normal operation. It is not a misconfiguration; it is the contract. Exactly-once delivery is impossible over an unreliable network — that's a standard distributed-systems result, not provider laziness. The provider gives you at-least-once and expects you to make your handler idempotent.
Failures that trigger a retry, in roughly the order I see them in production:
- Your handler returned 5xx. Application crash, downstream service down, database deadlock.
- Your handler timed out. Stripe gives 30 seconds, GitHub 10, Shopify 5 — slow handlers fail this constantly.
- Your handler returned 4xx. Most providers retry on 408, 429, and a few others; permanent 4xx like 401 and 404 are usually not retried.
- Network error before the handler responded. TLS handshake failed, DNS, TCP reset.
- Provider-side failure during delivery (rare, but happens — e.g. their queue backed up).
The retry schedules of the major providers
The single most-cited table in this whole post. This is the actual retry behavior in 2026, sourced from each provider's docs at time of writing:
| Provider | Max attempts | Total window | Backoff strategy | Stops on |
|---|---|---|---|---|
| Stripe | ~16 attempts | 3 days | Exponential, increasing intervals | 3 days elapsed; or you disable the endpoint |
| GitHub | 1 attempt | None | No automatic retry; failed deliveries can be redelivered manually or through the REST API | First attempt |
| Shopify | 19 attempts | 48 hours | Increasing intervals (not strictly exponential) | 48 hours; or 19 consecutive failures (removes subscription) |
| Twilio | 11 attempts | 24 hours | Linear-ish (intervals grow more slowly than exponential) | 24 hours; or all attempts exhausted |
| Slack | 3 attempts | 1 hour | Fixed schedule (immediate, 1 min, 5 min) | 3 attempts; Slack does not retry beyond this |
| SendGrid | ~10 attempts | 24 hours | Exponential | 24 hours; or all attempts exhausted |
| PagerDuty | 1 attempt | Immediate | No retry — fire-and-forget | First attempt |
| HubSpot | 10 attempts | 24 hours | Exponential | 24 hours |
| Discord | ~5 attempts | ~10 minutes | Short exponential | 10 minutes |
Two things to notice. First, the spread is enormous: PagerDuty doesn't retry at all (their model is "the next page will come anyway, don't queue stale alerts"); Stripe will retry for 3 days. Second, retry budget tells you something about how the provider thinks about their event semantics. Stripe is slow and patient because payment_intent.succeeded must be processed; your code shipping a product depends on it. Slack is short and aggressive because if a notification didn't deliver in an hour, the user has moved on.
Treat the table above as a reference, not an SLA. Providers update their schedules; the only ground truth is the provider's own docs at the moment you ship.
How exponential backoff actually works
"Exponential backoff" is shorthand for: each retry waits longer than the previous one, with the wait time growing exponentially. The standard formula:
delay = base * (multiplier ^ attempt) + jitter
For Stripe's schedule (approximated):
attempt 1: ~immediate
attempt 2: ~5 minutes
attempt 3: ~10 minutes
attempt 4: ~30 minutes
attempt 5: ~1 hour
attempt 6: ~2 hours
attempt 7: ~4 hours
...
attempt 16: ~3 days (cumulative)
Two design decisions in this:
Jitter. If you didn't add randomness to each delay, every event the provider sent during a downstream outage would retry at exactly the same time. When the outage ended, the provider would slam your recovered handler with hours of queued retries simultaneously, knocking it back over. Jitter spreads them out. Most providers add 10-50% random variance.
Cap. Without an upper bound on the delay, retry intervals grow without limit. The cap is usually somewhere between 1 and 6 hours for major providers — beyond that interval, retrying further isn't useful because if the issue hasn't recovered in that long, it's structural.
If you're building your own retry logic on the receiver side (e.g. your handler enqueues to a job system that retries the work), a minimal implementation in JavaScript:
    // attemptNumber starts at 0 for the first retry.
    function nextDelay(attemptNumber) {
      const base = 1000                 // 1 second
      const multiplier = 2
      const cap = 6 * 60 * 60 * 1000    // 6 hours
      let delay = base * Math.pow(multiplier, attemptNumber)
      delay = Math.min(delay, cap)
      // Full jitter: the interval is random in [0, computed)
      const jittered = Math.random() * delay
      return Math.floor(jittered)
    }
"Full jitter" — picking a uniform random value between 0 and the computed delay — is the simplest scheme that works well. The competing "equal jitter" approach (half the computed delay plus a uniform random between 0 and the other half) is slightly better for some workloads but the difference is usually within noise.
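For comparison, here is a minimal sketch of the equal-jitter variant. The function name `nextDelayEqualJitter` and the injectable `random` parameter are illustrative assumptions, not part of any library; the base, multiplier, and cap simply mirror the full-jitter version.

```javascript
// Equal jitter: keep half of the capped exponential delay and
// randomize the other half, so the wait never collapses to ~0.
// attemptNumber is assumed to start at 0.
function nextDelayEqualJitter(attemptNumber, random = Math.random) {
  const base = 1000                 // 1 second
  const multiplier = 2
  const cap = 6 * 60 * 60 * 1000    // 6 hours

  const computed = Math.min(base * Math.pow(multiplier, attemptNumber), cap)
  const half = computed / 2
  // Result lies in [computed / 2, computed)
  return Math.floor(half + random() * half)
}

for (let attempt = 0; attempt < 5; attempt++) {
  console.log(attempt, nextDelayEqualJitter(attempt))
}
```

The trade-off is a guaranteed minimum wait in exchange for a narrower spread: full jitter can retry almost immediately, equal jitter cannot.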
Idempotency: the receiver's mandatory companion to retries
If your handler is not idempotent, retries cause real bugs: orders fulfilled twice, customers charged twice, emails sent in duplicate. Every webhook integration that has ever shipped without idempotency has eventually had this incident — see duplicate event received for the standard troubleshooting flow. The contract you're working against is at-least-once delivery; the fix is cheap and easy to add on day one; it is painful and requires data cleanup if you add it later.
The minimal pattern: deduplicate on the provider's event ID with an atomic database insert.
    CREATE TABLE processed_webhook_events (
      provider     TEXT NOT NULL,
      event_id     TEXT NOT NULL,
      started_at   TIMESTAMPTZ NOT NULL,
      completed_at TIMESTAMPTZ,
      PRIMARY KEY (provider, event_id)
    );

    // In your handler:
    async function handleEvent(provider, eventId, payload) {
      // Atomic insert-if-not-exists. Race-safe under concurrent retries.
      const result = await db.query(`
        INSERT INTO processed_webhook_events (provider, event_id, started_at)
        VALUES ($1, $2, NOW())
        ON CONFLICT (provider, event_id) DO NOTHING
        RETURNING event_id
      `, [provider, eventId])
      if (result.rowCount === 0) {
        // Another worker is already on it, or it's already done.
        return { received: true, deduplicated: true }
      }
      // First time seeing this event. Do the work.
      try {
        await processEvent(payload)
      } catch (err) {
        // Release the claim so the provider's next retry is not
        // swallowed as a duplicate, then surface the failure.
        await db.query(`
          DELETE FROM processed_webhook_events
          WHERE provider = $1 AND event_id = $2 AND completed_at IS NULL
        `, [provider, eventId])
        throw err
      }
      // A crash before this point leaves completed_at NULL; started_at
      // lets a periodic sweeper find and clear stale claims.
      await db.query(`
        UPDATE processed_webhook_events SET completed_at = NOW()
        WHERE provider = $1 AND event_id = $2
      `, [provider, eventId])
      return { received: true }
    }
Three things people get wrong here:
- Two-step check-then-insert. "If exists return, else insert" is two queries with a race window between them. Two retries arriving at the same instant can both pass the check and both do the work. Always use the atomic INSERT ... ON CONFLICT DO NOTHING.
- Dedupe key choice. Use the provider's event ID as your idempotency key, not your own derived key. Stripe gives you event.id, GitHub gives X-GitHub-Delivery, Shopify gives the id field in the payload, Twilio gives EventSid. Each is unique per event for the lifetime of that event in the provider's system.
- Idempotency vs deduplication scope. Idempotency means "doing the same operation N times produces the same result as doing it once." Deduplication is one way to achieve idempotency, but not the only way: an UPSERT into your business table on a unique constraint is also idempotent. Both are valid; the dedup table is more flexible because it works for any handler.
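The dedupe-key point can be made concrete with a small lookup helper. This is a sketch under stated assumptions: `extractEventId` and `requireValue` are hypothetical names, `headers` is assumed to use lowercased names (as Node's HTTP layer provides), and only the Stripe and GitHub rules are shown. Other providers would need their own cases, taken from their docs.

```javascript
// Returns the provider-assigned identifier to use as the dedupe key.
// `headers`: request headers with lowercased names.
// `body`: the parsed JSON payload.
function extractEventId(provider, headers, body) {
  switch (provider) {
    case 'stripe':
      // Stripe carries the event ID in the payload itself.
      return requireValue(body && body.id, 'Stripe event.id')
    case 'github':
      // GitHub carries the delivery ID in a request header.
      return requireValue(headers['x-github-delivery'], 'X-GitHub-Delivery header')
    default:
      throw new Error(`No dedupe key rule for provider: ${provider}`)
  }
}

// Fails loudly instead of falling back to a derived key.
function requireValue(value, label) {
  if (typeof value !== 'string' || value.length === 0) {
    throw new Error(`Missing ${label}`)
  }
  return value
}

console.log(extractEventId('stripe', {}, { id: 'evt_123' }))
```

Throwing on a missing ID is deliberate: a silently derived key (a payload hash, say) can collapse distinct events or fail to match a retry.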
The retry-induced thundering herd
This is the production failure mode that catches teams off guard. Picture: your handler is healthy and processing events normally. A downstream service (a database, an external API, a Redis instance) goes down for 30 minutes. Your handler returns 500 for every event during the outage. The provider queues retries.
The downstream recovers. Now your handler is healthy again. But every retried event is scheduled to fire roughly at the same time, because the provider's exponential backoff means most of the queued events are scheduled at similar large intervals. They all hit your handler simultaneously, knocking it over again.
This is the herd. The defenses, in order of leverage:
Provider jitter. Stripe and most others already add jitter to their retry schedule. This helps but doesn't eliminate the problem.
Receiver-side rate limiting. If your handler is constrained to N requests per second, the herd flattens automatically. Excess gets retried again on the provider's schedule, with more jitter applied each cycle.
Capture-and-forward inspector with controlled forward rate. A forwarding service in the path can throttle the rate at which events leave it, regardless of how many the provider sends. This decouples the provider's retry storm from your handler's actual throughput. WebhookWhisper does this; Hookdeck does this; building your own version is a few hundred lines plus a queue.
Slow-start after outage. If you can detect that you've recovered from a downtime, deliberately limit your handler's intake for a few minutes — let the queue drain at a controlled rate rather than processing at full throttle.
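Receiver-side rate limiting and slow-start can be sketched together. Everything here is illustrative: `TokenBucket`, `slowStartRate`, and `rateLimit` are hypothetical names, the 10% starting rate and the 30-second Retry-After value are arbitrary assumptions, and the middleware assumes an Express-style `(req, res, next)` signature.

```javascript
// Token bucket: allows bursts up to `capacity`, refills at `ratePerSec`.
class TokenBucket {
  constructor(ratePerSec, capacity, now = Date.now()) {
    this.ratePerSec = ratePerSec
    this.capacity = capacity
    this.tokens = capacity
    this.lastRefill = now
  }

  // Returns true if the request may proceed, false if it should be rejected.
  tryRemove(now = Date.now()) {
    const elapsedSec = (now - this.lastRefill) / 1000
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.ratePerSec)
    this.lastRefill = now
    if (this.tokens < 1) return false
    this.tokens -= 1
    return true
  }
}

// Slow-start: ramp the allowed rate linearly from 10% to 100% of
// `fullRate` over `rampMs` after a detected recovery.
function slowStartRate(msSinceRecovery, fullRate, rampMs) {
  const progress = Math.min(Math.max(msSinceRecovery / rampMs, 0), 1)
  return fullRate * (0.1 + 0.9 * progress)
}

// Express-style middleware: reject excess requests with 429 so the
// provider retries them later on its own schedule.
function rateLimit(bucket) {
  return (req, res, next) => {
    if (bucket.tryRemove()) return next()
    res.status(429).set('Retry-After', '30').end()
  }
}
```

During recovery, a timer could periodically set `bucket.ratePerSec = slowStartRate(...)` so intake climbs gradually. Rejecting with 429 only makes sense for providers that retry on it and that still have retry budget left.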
Dead-letter queues: the safety net for retry exhaustion
Retries stop eventually. Stripe gives up after 3 days. Shopify after 48 hours. Twilio after 24. If your handler is still failing — typically with 504 gateway timeouts or 429 rate-limit responses — when retries exhaust, the event is gone. The provider considers the delivery permanently failed and moves on.
For events that materially matter — payment events, subscription state changes, order fulfilment triggers — silent loss is unacceptable. The fix is a dead-letter queue: when your worker exhausts its own retry budget on an event, write the event to a separate table (or queue) along with the error. Page on inserts.
    CREATE TABLE webhook_dlq (
      id          BIGSERIAL PRIMARY KEY,
      provider    TEXT NOT NULL,
      event_id    TEXT NOT NULL,
      event_type  TEXT,
      payload     BYTEA NOT NULL,
      error       TEXT NOT NULL,
      attempt     INT NOT NULL,
      failed_at   TIMESTAMPTZ NOT NULL DEFAULT NOW(),
      resolved_at TIMESTAMPTZ
    );
    CREATE INDEX idx_webhook_dlq_unresolved
      ON webhook_dlq (failed_at) WHERE resolved_at IS NULL;
The DLQ doesn't replace retries — it complements them. The flow:
- Provider delivers event. Handler signature-verifies, persists, enqueues, acks 200.
- Worker picks up the event, tries to process. On failure, retries internally with its own backoff schedule (3-5 attempts is typical for receiver-side worker retry).
- If the worker's own retry budget exhausts, the event is written to webhook_dlq with the error.
- An alert fires on DLQ insert. A human inspects, classifies (transient vs. real bug), and either re-enqueues or marks resolved.
The point of the DLQ is not to recover automatically. It's to make sure no event is ever silently dropped. Recovery is manual on purpose — by the time something has hit a DLQ, you want a human to look at it before retrying.
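The manual step at the end of that flow might look like the sketch below. `triageDlqEntry` is a hypothetical helper; it assumes a `db.query` that returns `{ rows }` in the style of node-postgres, a `queue.add` like the one used by the worker, and the `webhook_dlq` table defined above.

```javascript
// decision: 'requeue' for a transient failure, 'resolve' for one that
// was handled another way or is safe to drop.
async function triageDlqEntry(db, queue, dlqId, decision) {
  if (decision !== 'requeue' && decision !== 'resolve') {
    throw new Error(`Unknown decision: ${decision}`)
  }
  // Only act on entries nobody has resolved yet.
  const { rows } = await db.query(
    `SELECT event_id FROM webhook_dlq WHERE id = $1 AND resolved_at IS NULL`,
    [dlqId]
  )
  if (rows.length === 0) return { status: 'not-found-or-already-resolved' }

  if (decision === 'requeue') {
    // Hand the event back to the worker queue for another run.
    await queue.add('process-event', { eventId: rows[0].event_id })
  }
  await db.query(
    `UPDATE webhook_dlq SET resolved_at = NOW() WHERE id = $1`,
    [dlqId]
  )
  return { status: decision === 'requeue' ? 'requeued' : 'resolved' }
}
```

A human (or an admin tool acting for one) calls this per entry, for example `await triageDlqEntry(db, queue, 42, 'requeue')`. Nothing here runs automatically, which keeps recovery a deliberate act.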
Receiver-side retry: when the provider isn't enough
Sometimes the provider's retry schedule isn't aligned with your needs. Stripe's 3-day budget might be longer than you can afford to leave an order unfulfilled (the customer will have asked for a refund by then). Slack's 3-attempt budget might be shorter than you need (your handler was down for 90 minutes during a deploy and Slack already gave up).
The fix on the receiver side: don't process directly in the handler. Persist + enqueue + ack 200, then let your own queue retry the work on its own schedule.
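    // Handler: receive, persist, ack. Keep it fast and reliable.
    app.post('/webhooks/stripe',
      express.raw({ type: 'application/json' }),
      async (req, res) => {
        let event
        try {
          event = stripe.webhooks.constructEvent(
            req.body, req.headers['stripe-signature'], secret
          )
        } catch (err) {
          // Bad signature or malformed payload: a retry will not fix it.
          return res.status(400).json({ error: 'invalid signature' })
        }
        try {
          // The insert should ignore duplicates on (provider, eventId),
          // since the provider may redeliver an event you already stored.
          await db.webhookEvents.insert({
            eventId: event.id,
            provider: 'stripe',
            payload: req.body,
            receivedAt: new Date(),
            status: 'received',
          })
          // attempts/backoff are what make the queue retry a failed job.
          await queue.add('process-event', { eventId: event.id },
            { attempts: 5, backoff: { type: 'exponential', delay: 1000 } })
        } catch (err) {
          // Could not persist or enqueue: ask the provider to redeliver.
          return res.status(500).json({ error: 'not persisted' })
        }
        res.json({ received: true })
      }
    )

    // Worker: process with its own retry policy
    queue.process('process-event', async (job) => {
      const event = await db.webhookEvents.findOne({ eventId: job.data.eventId })
      try {
        await processEvent(event.payload)
        await db.webhookEvents.update({ eventId: event.eventId },
          { status: 'completed', completedAt: new Date() })
      } catch (err) {
        // queue.process is Bull's API. Bull increments attemptsMade after a
        // failed run, so the current run is attemptsMade + 1. Check how your
        // queue library counts before copying this.
        const isLastAttempt = job.attemptsMade + 1 >= job.opts.attempts
        if (!isLastAttempt) throw err // the queue will retry
        // Exhausted: move to DLQ
        await db.webhookDlq.insert({
          provider: 'stripe', eventId: event.eventId, payload: event.payload,
          error: err.message, attempt: job.attemptsMade + 1,
        })
        await alerting.page('webhook-dlq-insert', { eventId: event.eventId })
      }
    })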
Now you have two retry layers: the provider's (3 days for Stripe, etc.) and your worker's (5 attempts with your own backoff). The provider's retry catches "my handler was unreachable;" the worker's retry catches "my downstream was briefly slow." They're independent and complementary.
Replay: the third retry layer
Even with two retry layers and a DLQ, there's a category of failure neither catches: a bug shipped that ran for hours before you noticed, returning 200s but doing the wrong thing. The provider thinks delivery succeeded. Your worker thinks processing succeeded. The events are gone from both retry layers.
The recovery is replay. Two paths:
Re-trigger from the provider. Stripe and GitHub let you "Resend" past events one at a time from the dashboard. Useful for a handful of events, useless for thousands.
Replay from a capture-and-forward inspector. If you had an inspector in the path, every event for the last 14+ days is durably stored. Fix the bug, click "Replay all from window X," and the inspector re-fires the original payloads at your fixed handler. This is the only path that scales.
This is the unappreciated argument for an inspector in front of every production webhook from day one. The other reasons (forensics, debugging) are nice. The replay capability is what saves you the day a bug ships and a thousand events processed wrong.
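If you keep your own capture store, a windowed replay is not much code. The sketch below is an assumption-heavy illustration, not any inspector's actual API: `replayWindow` is a hypothetical name, `events` is an in-memory array of `{ eventId, receivedAt, payload }` records, and `send` is whatever function re-delivers one event to the fixed handler.

```javascript
// Re-sends stored events received in [from, to), oldest first,
// at most `perSecond` per second. Failures are collected, not thrown.
async function replayWindow(events, from, to, send,
  { perSecond = 10, sleep = defaultSleep } = {}) {
  const selected = events
    .filter((e) => e.receivedAt >= from && e.receivedAt < to)
    .sort((a, b) => a.receivedAt - b.receivedAt)

  const results = { sent: 0, failed: [] }
  for (const event of selected) {
    try {
      await send(event)
      results.sent += 1
    } catch (err) {
      results.failed.push({ eventId: event.eventId, error: err.message })
    }
    // Pace the replay so it does not become its own retry storm.
    await sleep(1000 / perSecond)
  }
  return results
}

function defaultSleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms))
}
```

One caveat worth planning for: if the handler deduplicates on event ID, the dedupe records for the affected window have to be cleared or bypassed first, otherwise the replayed events are dropped as duplicates.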
The retry observability checklist
You only know if your retry strategy is working if you measure it. The metrics that matter:
- First-attempt success rate per provider. Should be > 99%. If it drops, your handler is unhealthy or middleware is breaking signatures.
- Eventual success rate (after retries) per provider. Should be > 99.9%. The gap between this and first-attempt success is your retry-saved-the-day rate.
- Events ending in DLQ per day. Should be 0 most days. Any non-zero day means a real failure mode that needs investigation.
- P50 / P95 / P99 of attempts-to-success. Most events should succeed on attempt 1. P95 attempts > 2 means you have a systematic flake.
- Time-to-acknowledge. P99 should be well under the provider's timeout — 1-2 seconds for Stripe (timeout 30s), under 1 second for Shopify (timeout 5s).
Plot all of the above per provider per day. Page on DLQ inserts and on first-attempt success rate dropping below 95% for > 5 minutes.
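As a rough sketch of how the first four metrics could be computed from delivery records: `retryMetrics` and `percentile` are hypothetical helpers, and the record shape `{ attempts, succeeded }` is an assumption about what your own logging captures. Call it once per provider per day.

```javascript
// Each record describes one event: `attempts` is how many delivery or
// processing attempts were made, `succeeded` whether any of them worked.
function retryMetrics(records) {
  const total = records.length
  if (total === 0) return null

  const succeeded = records.filter((r) => r.succeeded)
  const firstAttempt = succeeded.filter((r) => r.attempts === 1).length
  const attempts = succeeded.map((r) => r.attempts).sort((a, b) => a - b)

  return {
    firstAttemptSuccessRate: firstAttempt / total,
    eventualSuccessRate: succeeded.length / total,
    // The gap between the two rates: events that only succeeded on a retry.
    retrySavedRate: (succeeded.length - firstAttempt) / total,
    p50Attempts: percentile(attempts, 50),
    p95Attempts: percentile(attempts, 95),
    p99Attempts: percentile(attempts, 99),
  }
}

// Nearest-rank percentile on an ascending-sorted array.
function percentile(sorted, p) {
  if (sorted.length === 0) return null
  const rank = Math.ceil((p / 100) * sorted.length)
  return sorted[Math.max(rank, 1) - 1]
}
```

Attempts-to-success percentiles are computed over successful events only; events that never succeeded show up in the eventual success rate and in the DLQ count instead.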
Frequently asked questions
Should I implement my own retry logic if I'm just receiving the provider's webhooks?
You don't implement retry logic to replace the provider's — you implement it on the worker side (the code that processes events out of your durable queue) to retry your downstream operations. The provider retries delivery to your endpoint; your worker retries processing within your system. Both are needed; they solve different problems.
How is "exponential backoff" different from "linear backoff"?
Linear: each retry waits a fixed additional amount (1s, 2s, 3s, 4s). Exponential: each retry waits a multiple of the previous (1s, 2s, 4s, 8s, 16s). Exponential is preferred because it handles longer outages efficiently — you don't waste retry budget hammering a service that's been down for an hour. Linear backoff is occasionally appropriate for short, high-frequency operations where retries are cheap and outages are bounded.
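The two schedules are easy to compare side by side. A small sketch; `linearDelays` and `exponentialDelays` are illustrative helpers, and jitter and caps are left out to keep the sequences readable.

```javascript
// Linear: the delay grows by a fixed step on each retry.
function linearDelays(retries, stepMs = 1000) {
  return Array.from({ length: retries }, (_, i) => stepMs * (i + 1))
}

// Exponential: the delay is multiplied on each retry.
function exponentialDelays(retries, baseMs = 1000, multiplier = 2) {
  return Array.from({ length: retries }, (_, i) => baseMs * Math.pow(multiplier, i))
}

console.log(linearDelays(4))       // [ 1000, 2000, 3000, 4000 ]
console.log(exponentialDelays(5))  // [ 1000, 2000, 4000, 8000, 16000 ]
```

After ten retries the linear schedule has waited 55 seconds in total, while the exponential one has waited 1,023 seconds (about 17 minutes), which is why exponential backoff spends far fewer attempts on a long outage.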
Why does Stripe retry for 3 days but Slack only for 1 hour?
The answer is event semantics. A Stripe payment_intent.succeeded event must be processed eventually — your fulfilment depends on it, and the longer Stripe retries, the more likely your handler comes back online and the customer gets their order. A Slack notification that's been queued for 2 hours is stale; the user has moved on, and re-delivering it now would be confusing. Retry duration tracks "how long is this event still useful." It's a product decision more than a technical one.
What's the right number of internal retry attempts in my worker?
3 to 5 with exponential backoff is a reasonable default. The math: with 2x backoff and a 1s base, five waits of 1, 2, 4, 8, and 16 seconds add up to 31 seconds of total delay (that is five retries after the initial attempt; five attempts in total give four waits, or 15 seconds). That's enough to ride out most transient downstream issues (a brief network blip, a database failover, a 10-second deploy). Beyond 5, you should be moving to a DLQ because the issue isn't transient.
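The arithmetic can be checked with a short sketch. `totalBackoffMs` is an illustrative helper, not a queue-library function, and jitter is left out so the sums are exact.

```javascript
// Sum of `waits` consecutive backoff delays: base, base*m, base*m^2, ...
function totalBackoffMs(waits, baseMs = 1000, multiplier = 2) {
  let total = 0
  for (let i = 0; i < waits; i++) {
    total += baseMs * Math.pow(multiplier, i)
  }
  return total
}

console.log(totalBackoffMs(5)) // 31000: five waits of 1, 2, 4, 8, 16 seconds
console.log(totalBackoffMs(4)) // 15000: the four waits between five attempts
```

Whether your budget is 15 or 31 seconds depends on whether the configured number counts attempts or retries, so it is worth checking which one your queue library means.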
How do I prevent retry storms after a downstream outage?
Three layers. First, ensure provider jitter is engaged — most do this for you. Second, rate-limit at your handler so excess retries get rejected and re-retried on the provider's schedule rather than overloading you. Third, use a capture-and-forward inspector that throttles its forward rate independent of provider retry rate. The combination flattens almost any post-outage spike.
What HTTP status codes do providers consider "retry-worthy"?
The exact list varies, but the common pattern: 5xx is always retried, timeouts and connection errors are always retried, 408 (Request Timeout) and 429 (Too Many Requests) are retried by most providers, and 4xx other than these are usually not retried (the provider treats them as "your endpoint actively rejected this; retrying won't help"). Some providers will deactivate a webhook endpoint entirely after a streak of 4xx responses — GitHub disables hooks that return 4xx many times in a row, on the theory that the endpoint is broken or no longer wanted. If your handler ever needs to signal "I rejected this event but please try again later," return 503 with a Retry-After header rather than a generic 4xx.
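On the receiver side, that advice reduces to a small mapping from failure type to response. A sketch under assumptions: `TransientError`, `PermanentError`, and `responseForFailure` are hypothetical names, and the 60-second default is arbitrary. Whether a given provider honors Retry-After is something to confirm in its docs.

```javascript
// Hypothetical error classes for illustration.
class TransientError extends Error {}
class PermanentError extends Error {}

// Maps a handler failure to the HTTP response that signals the
// intended retry behavior to the provider.
function responseForFailure(err, retryAfterSeconds = 60) {
  if (err instanceof PermanentError) {
    // Retrying will not help (malformed payload, bad signature).
    return { status: 400, headers: {} }
  }
  // Transient or unknown failure: ask for a retry later.
  return { status: 503, headers: { 'Retry-After': String(retryAfterSeconds) } }
}

console.log(responseForFailure(new TransientError('db failover')))
```

Unknown errors are treated as transient on purpose: with an idempotent handler, an unnecessary retry is cheap, while a wrongly permanent 4xx loses the event and can count toward endpoint deactivation.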
Closing
The compressed version of this post: providers deliver at-least-once, each with its own retry schedule and budget. Your handler must be idempotent. Persist before acknowledging, work async with your own worker-level retries, and use a dead-letter queue for events that exhaust retries. Have a replay path for the bugs neither retry layer catches. Measure first-attempt success and eventual success per provider; alert on DLQ inserts.
If you want the inspector + forwarding + replay layer in five minutes, that's WebhookWhisper — paste your endpoint URL into your provider, point forwarding at your handler, and every event is captured durably (7 days on Free, 14 on Starter, 30 on Pro), replayable on demand. The combination of provider retries, worker retries, DLQ, and replay is what production webhook infrastructure actually looks like in 2026.
For deeper guides on the related operational patterns, see the broader webhook checklist, the webhook debugging triage process, and the security defense-in-depth stack.