Webhooks fail in ways nothing else fails. A REST API call that breaks, you can re-run from your terminal. A cron job that breaks, you can re-trigger by hand. A webhook that breaks happened minutes ago in someone else's data center, was retried four times against your handler before you noticed, and now exists only as a single line in a delivery log you have to find.
I've been debugging webhooks for ten years across about a dozen companies. I've built and now run WebhookWhisper, which means most weeks I'm also reading other people's debug stories — what they got stuck on, what fixed it, what they wish they'd checked first. This post is the consolidated playbook: the order to check things in, the tools that actually help, the failure modes you'll hit again and again, and the production-grade fixes that mean you don't hit them a third time.
The structure mirrors how I actually debug. Confirm the event was sent, capture the exact bytes, compare to what your handler received, identify which of seven failure classes you're in, fix the bug, and replay what you missed. Six steps, in that order. Skip a step and you'll spend an hour on the wrong cause.
Why webhooks are hard to debug in the first place
The reasons matter because they tell you which tools work and which don't.
The sender is a third party. Stripe, GitHub, Shopify, Twilio — you can't add a console.log on their side. You can't step through their code. You only see what they tell you in their delivery dashboard, and how complete that dashboard is varies wildly by provider.
The trigger is an event, not an HTTP call you control. To reproduce the bug, you sometimes have to perform a real action — place an order, push a commit, complete a Stripe checkout, rotate a Twilio phone number. That's slow and sometimes expensive. Many bugs only show up under specific event subtypes you can't easily fire on demand.
Delivery is asynchronous. By the time your monitoring screams about a 500 in a webhook handler, the event happened ten minutes ago. The provider is now partway through its retry schedule. You're racing the retry window: Stripe will keep trying for 3 days; Shopify gives up after 48 hours; Twilio gives up after about 4. If you don't fix it in time, the event is permanently lost.
Failures are silent by default. Most providers don't email you when a webhook handler returns 500. They retry quietly. Days later you discover that the order-fulfilment webhook has been failing the entire week and 40 customers paid for products that never shipped. The first symptom is a customer support ticket, not an alert.
The body is bytes, not data. Webhook signatures are computed over exact request bytes — what we call the raw body. Half of all webhook bugs come from middleware in your stack quietly mutating the body before your handler sees it. The data still looks right; the signature doesn't match. Your tools have to show you bytes, not pretty-printed JSON.
If you keep these five facts in mind while debugging, the rest of this post is mostly tactical. If you don't, you'll waste hours.
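The "bytes, not data" point is easier to believe once you see it. Here is a minimal sketch, assuming a made-up secret and payload (neither comes from a real provider): the parsed JSON is identical, but re-serializing it changes the bytes, so the HMAC no longer matches.

```javascript
const crypto = require('node:crypto')

// Hypothetical secret and payload, for illustration only.
const secret = 'whsec_example'
const rawBody = Buffer.from('{"id": "evt_1", "amount": 1000}\n', 'utf8')

// HMAC-SHA256 over the given bytes, hex encoded.
const sign = (bytes) =>
  crypto.createHmac('sha256', secret).update(bytes).digest('hex')

// What you end up signing if a JSON parser ran first and you re-serialize its output.
const reserialized = Buffer.from(
  JSON.stringify(JSON.parse(rawBody.toString('utf8'))),
  'utf8'
)

console.log(rawBody.length, reserialized.length)   // 32 28
console.log(sign(rawBody) === sign(reserialized))  // false
```

Four bytes of whitespace are gone, the data "looks right", and verification fails every time.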
Step 1 — Confirm the event was actually sent
Before you touch your handler, your logs, your network, your code — confirm the provider sent the event. Half the "the webhook isn't firing" support tickets I've ever seen turned out to be that the event simply didn't fire. The trigger condition wasn't met, the event type wasn't subscribed, the webhook endpoint was disabled in the dashboard.
Every major provider has a delivery log. Open it first.
- Stripe: Developers → Webhooks → click your endpoint → "Events" tab. Shows every delivery attempt with status, response body, and full request body.
- GitHub: Settings → Webhooks → click your webhook → "Recent Deliveries". Each entry expands to show request and response.
- Shopify: Partners Dashboard → App → Webhooks → delivery history. Shopify's log is the thinnest of the major providers — you get status and timestamp, sometimes not full body.
- Twilio: Console → Monitor → Logs → Errors. Filter by your webhook URL.
- Slack: No first-party delivery log for incoming webhooks. You're flying blind unless you have a forwarding inspector in front of your endpoint.
What you're looking for, in order:
- Is the event in the log at all? If not — the trigger didn't fire, or the webhook isn't subscribed to that event type. Stop debugging your handler. Go check the event subscription.
- Was the event sent and a 2xx returned? Provider thinks delivery succeeded. The bug is in your downstream business logic, not in webhook delivery itself.
- Was the event sent and a 4xx returned? Your handler explicitly rejected it. Most likely: signature verification failed (401/400) or auth header missing (401). Jump to Step 4.
- Was the event sent and a 5xx returned? Your handler crashed. Check your application logs for the request ID or timestamp. Jump to Step 5.
- Was the event sent and a timeout reported? Your handler took too long. Stripe times out at 30 seconds, Shopify at 5, GitHub at 10. Jump to "Timeouts" below.
This single sort already eliminates 60% of "weird webhook bug" tickets. Always start here.
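If you want that sort written down for a runbook, it might look like the sketch below. The `delivery` shape is my own invention (null when the event is missing from the log, otherwise a status code or a timeout flag); no provider returns exactly this.

```javascript
// Map what the provider's delivery log shows to the next place to look.
// `delivery` is a hypothetical shape: null when the event is not in the log,
// otherwise { status: <HTTP status number> } or { timedOut: true }.
function triageDelivery(delivery) {
  if (delivery == null) return 'check trigger and event subscription'
  if (delivery.timedOut) return 'timeouts: move work off the request path'
  const { status } = delivery
  if (status >= 200 && status < 300) return 'downstream business logic'
  if (status >= 400 && status < 500) return 'signature or auth (Step 4)'
  if (status >= 500) return 'handler crash (Step 5)'
  return 'unexpected status: read the raw delivery entry'
}

console.log(triageDelivery({ status: 401 })) // signature or auth (Step 4)
```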
Step 2 — Capture the exact bytes
The provider delivery log will show you what it thinks it sent. To debug seriously, you also want to see what your endpoint actually received, byte for byte, before any middleware touches it. The two sometimes don't match — proxies, CDNs, and WAFs can quietly rewrite headers or transform bodies on the inbound path. (The most common offender is Cloudflare's "Auto Minify" feature, which removes whitespace from JSON responses; some setups misconfigure it to apply to inbound POST bodies too.)
The cleanest way to capture exact bytes is to point the provider at an inspector temporarily and trigger the event again. Three options, in order of friction:
Option A — Capture-and-forward inspector. Create a free WebhookWhisper endpoint, paste it into your provider's dashboard as the webhook URL, and add a forwarding rule from that endpoint to your real handler. Now every event is logged with full headers + raw body before being relayed to you. You see the provider's exact bytes; your handler sees them too. If the provider sends and you don't see it in the inspector, the request never reached your network. If the inspector shows it but your handler doesn't see it, your local routing or middleware ate it.
Option B — A request-bin tool with no forwarding. webhook.site, requestbin, beeceptor. Paste the URL into the provider, fire the event, see the request. Faster than option A but you've broken your real handler in the process — you have to flip the URL back when you're done. Use this for one-off captures, not for ongoing debugging.
Option C — tcpdump on your own server. If the bug only happens in production and you don't want to redirect traffic, capture the wire bytes directly:
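```shell
# On your server (as root): print TCP segments whose payload starts with "POST".
# This only matches plaintext HTTP, so run it on the hop behind your TLS
# terminator and swap 443 for the port your app listens on.
# It captures POSTs to every path, not just /webhooks/stripe.
sudo tcpdump -A -s0 -i any 'tcp port 443 and tcp[((tcp[12:1] & 0xf0) >> 2):4] = 0x504F5354'
```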
Slow, noisy, requires you to terminate TLS somewhere visible. I've used it twice in ten years. Almost always option A is what you want.
Step 3 — Compare what your handler actually received
Once you have the provider's exact bytes from Step 2, compare them to what your handler logs. Specifically:
- Body length. If the provider sent 4,812 bytes and your handler sees 4,803, something stripped 9 bytes. (Almost always trailing whitespace removed by an over-eager middleware.)
- Headers. Specifically the signature header (`Stripe-Signature`, `X-Hub-Signature-256`, `X-Shopify-Hmac-Sha256`) and content-type. If your reverse proxy is stripping the signature header, you'll never verify it.
- Body content. A diff of the raw bytes. Most middleware mangles whitespace, key ordering, or Unicode escapes. These are subtle byte-level changes that don't change the semantic JSON but break HMAC every time.
The diagnostic move I use most: log the SHA-256 of the raw request body in your handler, log the same SHA-256 of what the inspector captured, and compare.
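```javascript
// Node: log a SHA-256 of the exact bytes your handler sees
import crypto from 'crypto'
import express from 'express'

const app = express()

app.post('/webhooks/stripe',
  // express.raw keeps req.body as a Buffer of the original bytes
  express.raw({ type: 'application/json' }),
  (req, res) => {
    const bodyHash = crypto.createHash('sha256').update(req.body).digest('hex')
    // `log` is whatever structured logger you already use
    log.info({ bodyHash, len: req.body.length, sig: req.headers['stripe-signature'] },
      'webhook received')
    // ... rest of handler
  }
)
```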
If the hash matches the inspector and verification still fails, your secret is wrong. If the hash differs, something in your stack is mutating the body before your handler reads it — that's the bug.
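When the hashes differ, the next question is where. A rough sketch of a helper for that, assuming you have both captures as Buffers (the function and field names are mine):

```javascript
const crypto = require('node:crypto')

const sha256 = (buf) => crypto.createHash('sha256').update(buf).digest('hex')

// Compare the bytes the inspector captured with the bytes the handler saw.
// Returns null when they are identical, otherwise where they first diverge.
function diffBodies(sent, received) {
  if (sent.equals(received)) return null
  const limit = Math.min(sent.length, received.length)
  let offset = 0
  while (offset < limit && sent[offset] === received[offset]) offset++
  const start = Math.max(0, offset - 10)
  return {
    sentLength: sent.length,
    receivedLength: received.length,
    firstDifferenceAt: offset,
    // A few bytes either side of the divergence, for eyeballing.
    sentContext: sent.subarray(start, offset + 10).toString('utf8'),
    receivedContext: received.subarray(start, offset + 10).toString('utf8'),
    sentHash: sha256(sent),
    receivedHash: sha256(received),
  }
}
```

The offset usually tells you the culprit at a glance: a difference at the very end points at trimmed trailing whitespace, and a difference a few bytes in points at re-serialized JSON.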
Step 4 — Identify which failure class you're in
From the symptoms in steps 1-3, every webhook bug I've ever seen falls into one of seven classes. Triage to the right class fast and the fix is usually in the next paragraph.
| Symptom | Failure class | Where to look first |
|---|---|---|
| Provider says event not sent / not in delivery log | Trigger or subscription | Provider dashboard: event subscription, endpoint enabled, filter conditions |
| Provider got 4xx (401/400) from your handler | Signature or auth | Raw body middleware, signing secret, signature header passthrough |
| Provider got 5xx from your handler | Handler crash | Application logs at the timestamp of the delivery; null-deref in payload parsing is the most common |
| Provider got timeout | Slow handler or queue saturation | Are you doing synchronous work? DB queries hanging? External API calls inline? |
| Same event arrives multiple times, your code processes it twice | Missing idempotency | Add deduplication on the event ID — every provider does at-least-once delivery. The fix pattern is in our duplicate-event troubleshooting. |
| Event arrives, signature verifies, but the data is "wrong" | Misunderstood payload shape | Provider's payload reference; some events have nested objects with the same field names at different scopes |
| Sporadic 1-5% failure rate with no pattern | Race condition or load balancer behavior | Logs by node/pod; sticky sessions, stateful deduplication in only one replica, clock skew |
Step 5 — Fix the actual bug
The most common fixes by class, with the language-specific incantations.
Signature verification failures
The pattern is universal across providers: pass the raw bytes to the verifier, never a parsed object. The framework-specific escape hatches:
- Express: `express.raw({ type: 'application/json' })` on the route, before `express.json()`.
- Next.js App Router: `const body = await req.text()`, not `req.json()`.
- Next.js Pages Router: `export const config = { api: { bodyParser: false } }`.
- Django: `request.body` for raw bytes, with `@csrf_exempt`.
- Flask: `request.get_data()` for raw bytes; don't call `request.get_json()` first.
- FastAPI: `await request.body()`, not `await request.json()`.
- Rails: `request.body.read`, with `skip_before_action :verify_authenticity_token`.
- Go: `io.ReadAll(r.Body)`. The standard library leaves you alone here.
- PHP: `file_get_contents('php://input')`.
If you're stuck on a Stripe-specific failure, the deeper guide is Stripe webhook signature verification; the same principles apply to GitHub's X-Hub-Signature-256, Shopify's X-Shopify-Hmac-Sha256, and Twilio's X-Twilio-Signature. The error-page reference for this class is signature mismatch (and the Stripe-specific provider page for header format). To compute and compare signatures interactively, the free signature playground handles five providers and lets you paste a body + secret to see exactly what the header should be.
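If you'd rather see the check itself than take it on faith, here is a minimal sketch for a GitHub-style header, which carries `sha256=` followed by the hex HMAC-SHA256 of the raw body. The function name is mine. Stripe and Shopify encode their headers differently, so treat this as an illustration of the principle rather than a drop-in verifier for every provider.

```javascript
const crypto = require('node:crypto')

// Verify a header of the form "sha256=<hex digest>".
// `rawBody` must be a Buffer of the bytes exactly as received.
function verifyHubSignature(rawBody, header, secret) {
  if (typeof header !== 'string' || !header.startsWith('sha256=')) return false
  const expected = crypto.createHmac('sha256', secret).update(rawBody).digest()
  const provided = Buffer.from(header.slice('sha256='.length), 'hex')
  // timingSafeEqual throws on a length mismatch, so check lengths first.
  if (provided.length !== expected.length) return false
  return crypto.timingSafeEqual(provided, expected)
}
```

The constant-time comparison is there so the check doesn't leak how many leading characters of a forged signature were correct.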
Handler crashes (5xx)
The most common crash I see in webhook handlers: assuming a field exists in the payload when it's optional. Stripe events for canceled subscriptions don't include current_period_end (the subscription is over). GitHub push events for deleted branches have a null head_commit. Shopify product events include nested variant arrays that may be empty. Defensive payload parsing matters more in webhooks than in REST endpoints because you can't predict every event subtype the provider will send you.
The fix pattern: validate the payload shape on entry with a schema (Zod, Pydantic, struct tags), and log + skip events that don't match the shape rather than crashing the whole handler.
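A hand-rolled version of that pattern, using the deleted-branch push event as the example. I've kept it dependency-free here; a schema library does the same job with less code. The two fields checked (`ref`, `head_commit`) are the ones mentioned above, and everything else about the shape is an assumption you should check against the provider's payload reference.

```javascript
// Validate only the fields this handler actually uses.
function parsePushEvent(payload) {
  if (payload === null || typeof payload !== 'object') {
    return { ok: false, reason: 'payload is not an object' }
  }
  if (typeof payload.ref !== 'string') {
    return { ok: false, reason: 'missing ref' }
  }
  // head_commit is null when a branch is deleted, so treat it as optional.
  const head = payload.head_commit ?? null
  return {
    ok: true,
    value: { ref: payload.ref, headCommitId: head ? head.id : null },
  }
}

// Log and skip events that don't match, instead of throwing a 500.
function handlePush(payload, log = console) {
  const parsed = parsePushEvent(payload)
  if (!parsed.ok) {
    log.warn(`skipping push event: ${parsed.reason}`)
    return 'skipped'
  }
  return parsed.value.headCommitId ? 'processed' : 'processed-without-commit'
}
```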
Timeouts
If your handler is taking 5+ seconds, you're heading for a 504 gateway timeout on the provider's side. Two options:
Option 1: Acknowledge fast, work async. The canonical pattern. Verify the signature, enqueue the work in a job queue, return 200 immediately. The handler itself does no business work — only durable enqueueing.
```javascript
app.post('/webhooks/stripe',
  express.raw({ type: 'application/json' }),
  async (req, res) => {
    let event
    try {
      // Throws if the signature does not match the raw body
      event = stripe.webhooks.constructEvent(
        req.body, req.headers['stripe-signature'], secret
      )
    } catch (err) {
      // Reject explicitly rather than letting the throw surface as a 500
      return res.status(400).send('invalid signature')
    }
    // Persist the raw event for replay if the worker fails
    await db.webhookEvents.insert({ id: event.id, type: event.type, payload: req.body })
    // Enqueue and acknowledge
    await queue.add('process-stripe-event', { eventId: event.id })
    res.json({ received: true })
  }
)
```
Option 2: A capture-and-forward service that retries for you. If your handler is slow and unfixably synchronous, point the provider at WebhookWhisper, set the forwarding target to your handler, and let the forwarder retry on timeout while keeping the provider happy with a fast 200. This is the operational pattern for "I have a legacy handler I can't easily refactor but the provider keeps marking it as failing."
Duplicate processing
Every major webhook provider uses at-least-once delivery. If your handler returns 5xx or times out, the same event arrives again. If your code is not idempotent, you fulfill orders twice, send duplicate emails, double-charge customers.
The minimal fix: deduplicate on the provider's event ID before doing any work. The convention behind this is the idempotency key. If you want a deeper treatment of how retries and idempotency interact, see our guide to retry schedules across providers.
```javascript
async function handleStripeEvent(event) {
  // Atomic insert-if-not-exists. If this row already exists,
  // a previous delivery already started processing this event.
  const result = await db.query(`
    INSERT INTO processed_webhook_events (provider, event_id, started_at)
    VALUES ($1, $2, NOW())
    ON CONFLICT (provider, event_id) DO NOTHING
    RETURNING event_id
  `, ['stripe', event.id])
  if (result.rowCount === 0) {
    // Already being processed by another worker, or already done.
    return { received: true, deduplicated: true }
  }
  // First time seeing this event. Do the work.
  try {
    await processEvent(event)
  } catch (err) {
    // Release the claim so the provider's retry is not deduplicated away.
    await db.query(`DELETE FROM processed_webhook_events
      WHERE provider = $1 AND event_id = $2`, ['stripe', event.id])
    throw err
  }
  await db.query(`UPDATE processed_webhook_events SET completed_at = NOW()
    WHERE provider = $1 AND event_id = $2`, ['stripe', event.id])
  return { received: true, deduplicated: false }
}
```
The unique constraint on (provider, event_id) is what makes this safe under concurrent retries. Don't try to do "check if exists, then insert" in two queries — there's a race window between them. Use the database's atomic insert.
Step 6 — Replay the events you missed
Once you've shipped the fix, you have three recovery paths for events that failed while the bug was live.
Provider retries. Stripe retries over 3 days, Shopify over 48 hours. GitHub is the exception: it does not redeliver failed deliveries automatically, so for GitHub you go straight to the manual path below. If you fix the handler within the retry window, some events will be retried automatically and succeed. You can't control timing, and events that hit max attempts before your fix shipped are gone.
Manual re-trigger from the provider dashboard. Stripe lets you "Resend" any past event from the dashboard. GitHub has the same on Recent Deliveries. Shopify does not — once delivered, you can't replay from Shopify's side. Use this for a small number of events.
Replay from a capture-and-forward inspector. If you had WebhookWhisper (or Hookdeck, or any similar service) in front of your handler the entire time, every event is durably stored. You fix the bug, click Replay on each failed event, and the inspector re-fires the exact original payload (with optionally regenerated timestamps) at your handler. This is the only path that scales to "we had a bug for two weeks and missed 4,000 events."
This is the unappreciated argument for putting an inspector in front of every production webhook from day one — not for daily debugging, but for the day a bug ships and you need to replay a thousand events from a week ago.
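If your stored events live in your own table rather than in a hosted inspector, the replay loop itself is small. A sketch, assuming each stored event has an `id` and that `send` is a function you supply that re-POSTs the stored raw body to your handler and resolves to the HTTP status (both are my assumptions, not any service's API):

```javascript
// Replay stored events one at a time, in order, and collect what still fails.
async function replayEvents(events, send) {
  const failed = []
  for (const event of events) {
    try {
      const status = await send(event)
      if (status < 200 || status >= 300) failed.push({ id: event.id, status })
    } catch (err) {
      // Network errors and the like: record and keep going.
      failed.push({ id: event.id, error: err.message })
    }
  }
  return { attempted: events.length, failed }
}
```

Replaying sequentially is slower than firing everything at once, but it preserves ordering and won't knock over the handler you just fixed. With the idempotency table in place, it is also safe to replay events that had already succeeded.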
The forwarding stack: making bugs visible by default
I've described tools to use after a bug shows up. The better setup is one where bugs are visible from the start. The pattern that's worked best for me, on three different products:
- Provider → capture-and-forward service → your handler. The service stores every event durably (retention varies by service and tier — WebhookWhisper is 7 days on Free, 14 on Starter, 30 on Pro). Your handler still does the work; the inspector just sits in the path.
- Structured logs in your handler. Log: provider, event ID, body hash (SHA-256), signature header presence, time-to-first-byte, time-to-200, error if any. Every webhook handler should produce one structured log line per event with these fields. You will be glad of it the day you need to query "all events from the last 24 hours where verification failed."
- Alerting on signature-verification failure rate and timeout rate. A spike in signature failures is either an attack or a deploy that broke middleware ordering. A spike in timeouts is your queue backing up. Both deserve a page.
- Idempotency table from day one. Don't add it after the first duplicate-charge incident. Add it when you write the first webhook handler. The cost is one table; the benefit is a whole category of bug that can't happen.
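For the structured log line in particular, something like this sketch is enough. The field names are illustrative, not a standard, and I've collapsed the timing fields into a single receive-to-response duration.

```javascript
const crypto = require('node:crypto')

// Build one JSON log line per webhook event.
// Logs whether a signature header was present, never its value.
function webhookLogLine({ provider, eventId, rawBody, signatureHeader, receivedAt, respondedAt, error }) {
  return JSON.stringify({
    provider,
    eventId,
    bodyHash: crypto.createHash('sha256').update(rawBody).digest('hex'),
    bodyLength: rawBody.length,
    signaturePresent: Boolean(signatureHeader),
    msToResponse: respondedAt - receivedAt,
    error: error ? String(error.message ?? error) : null,
  })
}
```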
None of this is exotic. All of it is what production webhook setups at companies that take webhooks seriously look like. The pattern is in our webhook best practices guide in more detail.
The debug checklist (printable)
When something is broken right now, work through this in order:
- ☑ Is the event in the provider's delivery log? If not, the bug is in the trigger / subscription / event filter, not your handler.
- ☑ What HTTP status did the provider record? 4xx, 5xx, timeout, or 2xx — each routes to a different debug path.
- ☑ Capture the exact request bytes (inspector, request-bin, or `tcpdump`). Compare body length and SHA-256 to what your handler logs.
- ☑ Verify your route is using raw body middleware, not parsed JSON. This is half of all signature failures.
- ☑ Verify the signing secret matches the endpoint you registered. (Test mode vs live, CLI vs Dashboard, dev vs prod.)
- ☑ Check your handler logs at the exact timestamp of the delivery. Look for null-deref in payload parsing.
- ☑ If timing out, check whether you're doing synchronous work. Move it to a queue.
- ☑ If duplicates are being processed, add an idempotency table keyed on `(provider, event_id)`.
- ☑ After the fix ships, replay the events that failed during the bug window. From the provider, from your inspector, or both.
- ☑ Add a structured log line per event so the next time this happens, debugging takes 5 minutes instead of 5 hours.
Frequently asked questions
What's the fastest way to inspect a webhook payload without changing my handler?
Create a free WebhookWhisper endpoint, paste it into your provider as a second webhook URL (most providers allow multiple endpoints subscribed to the same events), trigger the event, and inspect the captured request. Your real handler keeps running unaffected. When you're done, delete the second endpoint.
Can I debug webhooks without giving up my localhost?
Yes. Forwarding from a public URL to localhost is exactly this — the provider hits a public WebhookWhisper URL, the WebhookWhisper service forwards to http://localhost:3000/webhooks/... via an outbound connection your laptop opens, and you debug locally with full breakpoints, hot reload, and IDE tooling. The Stripe CLI and ngrok are alternative shapes of the same idea.
How do I tell whether a webhook is failing because of my code or because of an upstream proxy?
Compute a SHA-256 of the raw request body in your handler and log it. Compare to a SHA-256 of what an inspector captured before your proxy. If they match, your code is the issue. If they differ, the proxy or middleware is mutating the body. The most common culprits are CDN body transformations, express.json() firing before your route handler, or NGINX with an unusual proxy_buffering config.
How do I replay webhook events after fixing a bug?
Three options: provider's dashboard "Resend" button (works for one-off events from Stripe and GitHub; not Shopify), automatic provider retries within their retry window (no control over timing), or replay from a capture-and-forward inspector that stored the original payload. The third is the only one that scales to many events or to providers that don't support manual resend.
Is it OK to skip signature verification in development?
It's tempting and I've done it. Don't. The dev-only "skip verification" flag is exactly how production secrets get committed to .env.example with values, how the wrong code path ships, and how a forgotten SKIP_SIGNATURE=1 in production exposes you. Use a real test-mode secret in dev and verify properly. Our in-browser HMAC playground can generate valid headers for synthetic test payloads if you need them.
How long should I retain webhook events for forensic debugging?
Long enough to cover the longest provider retry window plus your bug-discovery time. Stripe retries for 3 days; if a bug ships and you discover it 4 days later, the events from days 0-1 of the bug are already gone from Stripe's retry queue and from any inspector that retains less than 4 days. 14 days covers most teams' bug-discovery windows (WebhookWhisper offers 7 days on Free, 14 on Starter, 30 on Pro); the longer end is appropriate for high-stakes payment integrations where the cost of losing an event materially exceeds the storage cost. Don't rely on a retention window shorter than the time it realistically takes you to notice and fix a bug.
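As arithmetic, the rule of thumb is just a sum. A tiny sketch, using the retry windows quoted in this post and a discovery time you would have to estimate for your own team:

```javascript
// Retention should cover the longest provider retry window
// plus the time it typically takes you to notice a bug.
function minRetentionDays(retryWindowDays, discoveryDays) {
  return Math.max(...retryWindowDays) + discoveryDays
}

// Stripe 3 days, Shopify 2 days (48 hours), and a 4-day discovery time.
console.log(minRetentionDays([3, 2], 4)) // 7
```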
Closing
Most webhook bugs are two or three failure modes deep in a stack of middleware, async work, and provider retry behavior — but they're not infinite. The seven classes in the table above cover everything I've personally debugged in ten years. Once you've got a triage process, an inspector in front of your handler, an idempotency table, and structured logs, the average time to root cause drops from hours to minutes.
If you want the inspector + forwarding + replay setup running in five minutes, that's exactly what WebhookWhisper's free tier is for — paste your endpoint URL into the provider, point forwarding at your localhost or production handler, and every event is captured, replayable, and inspectable (7 days on Free, 14 on Starter, 30 on Pro). The bugs you can't reproduce on demand become bugs you can replay until they're fixed.