What is exponential backoff?
Exponential backoff is a retry strategy where the delay between attempts doubles each time, often with random jitter added: after the first failure wait 1 second, then 2, 4, 8, 16, 32, 64. The "exponential" part means delays grow multiplicatively, not linearly, so an overloaded downstream service gets breathing room because its callers back off more and more aggressively. Modern retry libraries default to "full jitter" (the actual delay is uniform random in `[0, 2^attempt]`) to prevent synchronized retry storms when many clients fail at the same instant. First retries arrive within seconds; later retries take hours. A 12-hour-old retry is normal, not a bug.
Exponential backoff is the standard pattern for retry delays. After failure 1 you wait 1 second, after failure 2 you wait 2 seconds, then 4, 8, 16, 32, 64. The base and the multiplier vary; "exponential" just means the delay grows multiplicatively, not linearly.
Why exponential, not constant or linear: a downstream service that's overloaded recovers faster if its callers back off increasingly. Constant retries (1s, 1s, 1s) create a thundering herd; linear (1s, 2s, 3s) helps but still grows too slowly when many clients are retrying together. Exponential gets out of the way fast.
Two variants matter. Pure exponential (1, 2, 4, 8, 16, 32) is what Stripe and most providers use. Exponential with jitter adds randomness — instead of "wait exactly 4 seconds" it's "wait 0-4 seconds." Jitter prevents synchronized retry storms, where a thousand clients fail at the same instant and all retry at exactly the 1s, 2s, 4s marks together. Most modern retry libraries default to "full jitter" — the actual delay is uniform random in [0, 2^attempt].
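To make the difference concrete, here is a minimal sketch of both variants. The 1-second base, 60-second cap, and function names are illustrative assumptions, not any provider's actual schedule.

```javascript
// Sketch of the two variants, assuming a 1-second base and a 60-second cap.
const BASE_MS = 1000
const CAP_MS = 60_000

// Pure exponential: 1s, 2s, 4s, 8s, ... capped at CAP_MS.
function pureDelayMs(attempt) {
  return Math.min(CAP_MS, BASE_MS * 2 ** attempt)
}

// Full jitter: uniform random between 0 and the pure-exponential delay.
function fullJitterDelayMs(attempt) {
  return Math.random() * pureDelayMs(attempt)
}
```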
For a receiver, the practical implications:
- First retry is fast (often 1-30 seconds). If your handler is briefly broken, say during a 60-second deploy, you'll see retries arriving almost immediately on recovery. This is when idempotency matters most: your retried events haven't aged out, and your handler may have partially processed them (see the sketch after this list).
- Later retries are slow. By attempt 5-6 the delay is in minutes; by attempt 8-10 it's in hours. A retry that arrives 12 hours after the original event is normal, not a bug.
- The total retry window is hours to days, not minutes. Stripe retries for 3 days, GitHub for ~24 hours. If your incident response time is shorter than that, you'll likely catch most retries before the source gives up.
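Here is a minimal idempotency sketch for the receiver side: deduplicate retried deliveries by event ID. The `event.id` field, the `applyBusinessLogic` helper, and the in-memory `Set` are assumptions for illustration; in practice the dedup store lives in your database and the ID field depends on the provider.

```javascript
// Minimal idempotency sketch. Assumes each delivery carries a stable event.id.
// In production, back this with a database table, not an in-memory Set.
const processedEvents = new Set()

async function handleWebhook(event) {
  if (processedEvents.has(event.id)) {
    // A retry of something we already handled: acknowledge and do nothing.
    return { status: 200, body: 'already processed' }
  }
  await applyBusinessLogic(event)   // hypothetical: your actual side effects
  processedEvents.add(event.id)
  return { status: 200, body: 'ok' }
}
```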
When *you're* the source — sending webhooks to your customers — exponential backoff is the right default. Start at 30s-1min for the first retry, double up to 5-15 minutes, cap at maybe 12-24 hours total. Always add jitter. Always cap max attempts so a permanently broken endpoint doesn't generate retries forever (5-10 attempts is typical).
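A sketch of such a sender-side schedule, assuming a 30-second base, a 15-minute per-attempt cap, and 8 total attempts; the constants and function name are illustrative, not a recommendation from any particular provider.

```javascript
// Illustrative sender-side schedule: 30s base, doubling, capped at 15 minutes
// per attempt, full jitter, at most 8 attempts. All constants are assumptions.
const SEND_BASE_MS = 30_000
const SEND_CAP_MS = 15 * 60_000
const MAX_ATTEMPTS = 8

// Returns the delay before retry number `attempt` (0-based), or null to give up.
function nextRetryDelayMs(attempt) {
  if (attempt >= MAX_ATTEMPTS) return null
  const exp = Math.min(SEND_CAP_MS, SEND_BASE_MS * 2 ** attempt)
  return Math.random() * exp   // full jitter
}
```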
The math on a 10-attempt exponential schedule: 1+2+4+8+16+32+64+128+256+512 = 1023 base units, which is about 17 minutes if the base unit is 1 second and about 17 hours if it's 1 minute. Pick the base so the total retry window matches your tolerance.
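If you want to sanity-check a schedule before shipping it, the worst-case window is just the sum of the per-attempt delays. A quick check of the 1-second-base case:

```javascript
// 10 attempts, 1-second base, pure exponential:
// 1 + 2 + ... + 512 = 1023 seconds, roughly 17 minutes.
const totalMs = Array.from({ length: 10 }, (_, i) => 1000 * 2 ** i)
  .reduce((sum, delay) => sum + delay, 0)
console.log(totalMs / 1000)  // 1023
```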
Example
```javascript
// Wait before the next retry using exponential backoff with full jitter.
async function delayForAttempt(attempt) {
  const baseMs = 1000                                  // first delay: ~1 second
  const maxMs = 60_000                                 // never wait more than 60 seconds
  const exp = Math.min(maxMs, baseMs * 2 ** attempt)   // 1s, 2s, 4s, ... capped at 60s
  // Full jitter: actual delay is uniform in [0, exp)
  const ms = Math.floor(Math.random() * exp)
  return new Promise(resolve => setTimeout(resolve, ms))
}
```

See Exponential Backoff in real traffic
WebhookWhisper captures every webhook with full headers, body, signature, and timing — so concepts like exponential backoff stop being abstract and become something you can inspect.
Start Free