A webhook 502 Bad Gateway means a reverse proxy in front of your handler couldn't get a response from the upstream. Cloudflare, nginx, ALB, Caddy — the proxy was up, but your application server wasn't. Webhook deliveries during a 502 window are retried by the source, so you usually don't lose events — but the operational cause is worth fixing.
Root Causes
1. Application server restart during a deploy
Your deploy stops the old container, starts the new one, and the proxy sees 502s during the gap. With webhook traffic averaging 100 req/min (about 1.7 req/s), even a 5-second gap means roughly 8 failed deliveries. Most of them retry, but it's noise in dashboards and customer-visible delay.
2. Upstream connection refused / timeout
Your application crashed, or its port isn't accepting connections. The proxy tries, fails, returns 502. Common after OOM kills, panics, or container exit.
3. Idle timeout mismatch between proxy and upstream
nginx's proxy_read_timeout defaults to 60s, and Cloudflare waits 100s for the origin, so a handler that simply runs long (slow business logic, a long DB query) usually produces a 504 from nginx or a 524 from Cloudflare rather than a 502. The 502 flavor of this problem is a keepalive mismatch: the upstream closes idle connections sooner than the proxy expects (Node's default keepAliveTimeout is 5s; an ALB's default idle timeout is 60s), the proxy reuses a connection the app just closed, and the resulting reset surfaces to the client as a 502.
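On the application side, the usual fix is to keep the upstream's keepalive timeout longer than the proxy's idle timeout. A minimal Node sketch, assuming an ALB-style proxy with the default 60s idle timeout (the port and exact margins are illustrative):

// Keep the app's keepalive timeout above the proxy's idle timeout
// so the proxy never reuses a connection the app already closed.
const express = require('express')

const app = express()
const server = app.listen(3000)

// Node's default keepAliveTimeout is 5s; an ALB's default idle timeout is 60s.
// Making the app's timeout longer means the proxy always closes first.
server.keepAliveTimeout = 65_000
// headersTimeout must exceed keepAliveTimeout, or Node may still
// drop the socket while a new request's headers are arriving.
server.headersTimeout = 66_000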
4. Health check failure
The load balancer's health check fails (your /health endpoint is broken or slow), the LB marks every instance unhealthy, and all traffic 502s.
Fix It
Graceful shutdown during deploys
// Express: drain in-flight requests before exiting
const express = require('express')
const pino = require('pino')

const app = express()
const log = pino() // any structured logger works
const server = app.listen(3000)

process.on('SIGTERM', () => {
  log.info('SIGTERM received, draining…')
  // Stop accepting new connections; in-flight requests finish first
  server.close((err) => {
    if (err) log.error({ err }, 'shutdown error')
    process.exit(err ? 1 : 0)
  })
  // Force exit after 10s if the drain hangs
  setTimeout(() => process.exit(1), 10_000).unref()
})
Match proxy timeouts to handler latency
# nginx: give webhook routes a longer timeout
location /webhooks/ {
    proxy_read_timeout 120s;
    proxy_send_timeout 120s;
    proxy_pass         http://upstream;
}
Health check that actually reflects readiness
// /health should fail BEFORE the process exits
let isShuttingDown = false
process.on('SIGTERM', () => { isShuttingDown = true })

app.get('/health', (req, res) => {
  if (isShuttingDown) {
    return res.status(503).send('shutting down')
  }
  // Optional: check DB and Redis pings here
  res.status(200).send('ok')
})
Webhook-Specific Mitigations
- Webhook handlers should return 200 in under 100ms; slow handlers are what push you into the timeout failure modes above. Queue first, process async (see the sketch after this list).
- Run multiple replicas behind the LB so one container restart never takes all traffic offline.
- Use blue-green or rolling deploys — never restart all replicas simultaneously.
- Webhook senders retry, so a brief 502 window is recoverable. But don't ignore a pattern of 502s; it's a signal of operational fragility.
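Here is what queue-first looks like in practice. This is a sketch, assuming BullMQ over a local Redis; handleEvent is a hypothetical stand-in for your real processing logic:

// Queue-first webhook handler: acknowledge fast, process async
const express = require('express')
const { Queue, Worker } = require('bullmq')

const connection = { host: 'localhost', port: 6379 }
const app = express()
const webhookQueue = new Queue('webhooks', { connection })

// Fast path: persist the event and return 200 immediately
app.post('/webhooks/:source', express.json(), async (req, res) => {
  await webhookQueue.add('event', {
    source: req.params.source,
    payload: req.body,
  })
  res.status(200).send('queued')
})

// Slow path: business logic runs out of band, retried by the queue
new Worker('webhooks', async (job) => {
  await handleEvent(job.data) // hypothetical: your processing logic
}, { connection })

app.listen(3000)

The proxy only ever sees the fast path, so handler latency stays far below any timeout, and the queue's retry semantics replace ad-hoc error handling in the request path.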
How to Reproduce
Deliberately kill -9 your application process while a webhook delivery is in flight. Watch the proxy's response; it should be 502. Then add the graceful shutdown handler and send SIGTERM instead (SIGKILL can't be caught, so it represents the worst case); the in-flight request should complete cleanly with no 502.
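If you prefer to script the experiment, here is a rough Node sketch. It assumes the app sits behind the proxy on localhost:8080 and writes its PID to app.pid; both are assumptions, so adjust to your setup. It works best against a deliberately slow handler (say, ~1s):

// repro.js: fire a webhook at the proxy, then SIGKILL the app mid-flight
const { readFileSync } = require('node:fs')

const pid = Number(readFileSync('app.pid', 'utf8')) // assumed PID file

// Node 18+ global fetch
const req = fetch('http://localhost:8080/webhooks/test', {
  method: 'POST',
  headers: { 'content-type': 'application/json' },
  body: JSON.stringify({ ping: true }),
})

// Give the request a moment to reach the app, then kill it hard
setTimeout(() => process.kill(pid, 'SIGKILL'), 100)

req
  .then((res) => console.log('status:', res.status)) // expect 502 pre-fix
  .catch((err) => console.error('request failed:', err.message))

// After adding graceful shutdown, swap 'SIGKILL' for 'SIGTERM':
// the in-flight request should complete with a 2xx instead.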
Frequently Asked Questions
How do I deploy without any 502s?
Rolling deploy with health-check-aware load balancing. A new container starts, the LB waits for it to pass health checks, then drains the old one (giving it 30s to finish in-flight requests) before stopping it. No 502 window.
Cloudflare sometimes 502s even when my server is fine. Why?
Cloudflare's edge can intermittently fail to reach your origin even when it's healthy: transient network blips between edge and origin, individual edge node failures. A small baseline of 502s is normal noise. Alert on the rate (say, >1% sustained), not on individual 502s.
Should I worry about webhook events lost during 502s?
Usually no — sources retry. Verify by checking the source's webhook dashboard for 'recent failed deliveries' after a 502 incident. If retries succeed, you didn't lose events.