TL;DR
Teams believe retries improve resilience. They actually hide real error rates and amplify load during degradation. Retries delay detection because error dilution spreads failures across attempts, success bias masks degradation in metrics, load amplification multiplies traffic under partial failure, and queue growth occurs without corresponding error spikes.
- Retries are added everywhere as resilience mechanisms, creating false confidence that errors are handled
- Error dilution spreads failures across multiple attempts, making error rates appear lower than they are
- Success bias in metrics hides degradation because retries eventually succeed while masking underlying problems
- Load amplification multiplies traffic during partial failure, accelerating system collapse
- Queue growth occurs without corresponding error spikes, delaying detection until systems are overwhelmed
The Pattern
Retries are added everywhere. In production, we routinely see retries implemented in clients, SDKs, proxies, and platforms. They're framed as resilience mechanisms that handle transient failures. This framing persists because retries appear to improve reliability by automatically recovering from errors.
Retries reduce complex failure behavior to simple retry logic. A system that experiences partial degradation, dependency failures, and capacity constraints gets reduced to "retry on error." This simplification hides the complexity it's meant to handle. The retry mechanism becomes the solution, even when it's hiding the problem.
This pattern appears across system types. Client libraries retry failed requests while upstream services degrade. Load balancers retry failed connections while backends saturate. Message queues retry failed deliveries while consumers fail. The pattern is universal: retries hide failure until it's too late.
Why It's Invisible
Retries hide failure through structural mechanisms: error dilution spreads failures across attempts, success bias masks degradation in metrics, load amplification multiplies traffic under partial failure, and queue growth occurs without corresponding error spikes.
Error dilution spreads failures across multiple attempts. A single user request that fails three times before succeeding shows up in metrics as four attempts: three errors and one success. Successful retries also inflate the denominator, so the attempt-level error rate understates how many users were affected: if 30% of user requests fail on the first attempt and succeed on a single retry, dashboards count 30 errors across 130 attempts, roughly a 23% error rate, even though nearly a third of users sat through a failure. Metrics count attempts, not user requests.
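As a minimal sketch of that arithmetic (the request counts are assumed for illustration, not taken from any real system):

```python
# Assumed numbers: 100 user requests; 30 fail on the first attempt and
# succeed on a single retry. The dashboard counts attempts, not requests.
user_requests = 100
first_attempt_failures = 30            # each is retried once; the retry succeeds

attempts = user_requests + first_attempt_failures   # 130 attempts logged
errors = first_attempt_failures                     # 30 failed attempts

print(f"attempt-level error rate: {errors / attempts:.0%}")                               # 23%
print(f"user requests that saw a failure: {first_attempt_failures / user_requests:.0%}")  # 30%
```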
Success bias masks degradation. Retries eventually succeed for most requests, so error rates stay low even as systems degrade. A system that's 50% functional appears healthy because retries push success rates back to acceptable levels. Metrics show success, not the cost of achieving it.
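A rough illustration of the effect, assuming attempts succeed independently and each request gets up to two retries (three attempts total); the success rates are illustrative:

```python
def end_to_end_success(per_attempt_success, max_attempts=3):
    """Probability that at least one of max_attempts attempts succeeds."""
    return 1 - (1 - per_attempt_success) ** max_attempts

# The backend degrades sharply; the dashboard's success rate barely moves.
for p in (0.99, 0.90, 0.70, 0.50):
    print(f"per-attempt success {p:.0%} -> reported success {end_to_end_success(p):.1%}")
# 99% -> 100.0%, 90% -> 99.9%, 70% -> 97.3%, 50% -> 87.5%
```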
Load amplification multiplies traffic during partial failure. When a system is 50% functional and every failed attempt is retried until it succeeds, retries double the load; at 33% functional, they triple it. The system receives 200% or 300% of normal load while appearing to handle it, because the retries eventually succeed. A bounded retry budget softens the multiplier but does not remove it.
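The amplification factor is just the expected number of attempts per user request. A small sketch, assuming independent attempts; the success rates and retry budget are illustrative:

```python
def expected_attempts(per_attempt_success, max_attempts=None):
    """Mean attempts per request. max_attempts=None models retry-until-success;
    a finite value models a fixed retry budget."""
    q = 1 - per_attempt_success
    if max_attempts is None:
        return 1 / per_attempt_success        # geometric series: 1 + q + q^2 + ...
    return sum(q ** k for k in range(max_attempts))

print(f"{expected_attempts(0.50):.2f}x")      # 2.00x -> 50% functional doubles the load
print(f"{expected_attempts(1 / 3):.2f}x")     # 3.00x -> 33% functional triples the load
print(f"{expected_attempts(0.50, 3):.2f}x")   # 1.75x -> a 3-attempt budget still adds 75% load
```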
Queue growth occurs without corresponding error spikes. Retries fill queues with pending requests while error rates stay low. Queues grow because retries accumulate faster than the degraded system can drain them, not because demand has actually increased. Queue depth climbs while error metrics remain acceptable, delaying detection until the system is overwhelmed.
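A toy simulation of that dynamic, with assumed arrival, capacity, and failure numbers: callers never receive an error because every failed attempt is requeued, yet the backlog grows without bound once retries push demand past the backend's effective capacity.

```python
arrival_rate = 100      # new user requests per second
capacity = 120          # attempts the backend can process per second
failure_rate = 0.6      # fraction of processed attempts that fail and are requeued

queue = 0.0
for second in range(1, 1201):                  # 20 minutes
    queue += arrival_rate                      # new work arrives
    processed = min(queue, capacity)           # backend drains what it can
    queue -= processed
    queue += processed * failure_rate          # failed attempts re-enter the queue as retries
    if second % 300 == 0:
        print(f"t={second:4d}s  queue depth ≈ {queue:,.0f}")

# Effective capacity is only 120 * (1 - 0.6) = 48 requests/s, well below the
# 100 requests/s arriving, so the queue grows even though no errors surface.
```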
A Failure Scenario
An API gateway handled requests for multiple services. Upstream services began responding slowly. Response times increased from 200ms to 5 seconds. The gateway retried failed requests automatically.
Error rates stayed low. Dashboards showed acceptable success rates. Alerts didn't fire. Queue depth increased from 100 requests to 10,000 requests over 20 minutes.
The gateway retried each failed request up to three times. With upstream services succeeding on only 40% of attempts, retries multiplied traffic by roughly 2.2x. Upstream services received more than twice their normal load while operating at 40% capacity. Queue depth grew because retries accumulated faster than the degraded upstreams could process them.
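Under the assumed model that each attempt succeeds independently 40% of the time and the gateway makes at most four attempts (the original request plus three retries), the multiplier works out to roughly 2.2x:

```python
p_fail = 0.6                                                # 40% of attempts succeed
attempts_per_request = sum(p_fail ** k for k in range(4))   # 1 + 0.6 + 0.36 + 0.216
print(f"traffic multiplier ≈ {attempts_per_request:.2f}x")  # ≈ 2.18x
```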
The incident lasted 35 minutes before someone noticed the queue depth. Error rates stayed acceptable. Success rates stayed acceptable. The system was collapsing under retry load. Retries didn't cause the incident. They hid it until it was too late.
Why Competent Teams Miss This
Retries are recommended everywhere. Client libraries recommend retries for resilience. Platform documentation recommends retries for reliability. Industry standards recommend retries for fault tolerance. Teams implement retries because they're recommended, not because they understand their limitations.
Retries are encouraged by platforms. SDKs include retry logic by default. Load balancers retry failed connections automatically. Message queues retry failed deliveries automatically. Teams implement retries because platforms encourage them, not because they question their effectiveness.
Retries improve reliability until they don't. In normal operation, retries handle transient failures effectively. They recover from network glitches, temporary overloads, and brief outages. This success validates the assumption that retries improve resilience. Teams trust retries because they work most of the time. But "most of the time" isn't "all of the time," and the failures happen when systems are already degraded.
This is a systemic blind spot, not poor engineering. Competent teams implement retries correctly. They follow platform recommendations. They follow best practices. They configure retries properly. But retries have structural limitations that aren't obvious until they fail. The retry mechanism feels safe because it's automatic. The success rate feels reassuring because it's high. But the automation hides failure, and the success rate hides degradation.
Teams miss this because retries are designed to hide failure. They're meant to provide automatic recovery. They're meant to reduce complex failure behavior to simple retry logic. This design works when failures are transient, but it fails when failures are systemic. The design that makes retries useful also makes them hide degradation.