TL;DR
Teams believe health checks represent system health. They actually measure whether a simplified probe succeeds. Health checks stay green during degradation because binary pass/fail masks partial failure, sampling differs from real traffic, dependency chains aren't represented, and success criteria are detached from user experience.
- Health checks reduce complex behavior into binary signals, creating false confidence that green means healthy
- Binary pass/fail masks partial failure and gradual degradation that users experience
- Health check sampling differs from real traffic patterns, so checks pass while real requests fail
- Dependency chains aren't represented in health checks, so upstream failures remain invisible
- Success criteria are detached from user experience, so checks pass while users time out
The Pattern
Health checks are assumed to guarantee system health. In production, we often see teams interpret green health checks as evidence that systems are operating correctly. This assumption feels safe because health checks are required by platforms, encouraged by best practices, and appear to validate operational status.
Health checks reduce complex behavior into binary signals. A system that processes thousands of requests per second, handles multiple dependencies, and serves diverse traffic patterns gets reduced to a single pass or fail. This simplification feels safe because it provides clear signals. But the simplification hides the complexity it's meant to represent.
Operationally, health checks create false confidence. Passing checks signal health, suppress alerts, and prevent investigation. The binary signal becomes truth, even when it's not.
This pattern appears across system types. Kubernetes pod health checks pass while applications fail. Load balancer health checks pass while backends degrade. Database health checks pass while queries time out. The pattern is universal: binary signals hide partial failure.
Why It's Invisible
Health checks lie through structural mechanisms: binary pass/fail masks partial failure, sampling differs from real traffic, dependency chains aren't represented, and success criteria are detached from user experience.
Binary pass/fail collapses continuous degradation into discrete states. A system that's 90% functional and one that's 10% functional both report "healthy" as long as the probe itself succeeds. The signal doesn't change until failure is complete. Health checks pass while 30% of requests fail, response times degrade by 500%, and dependency failures accumulate. The binary signal records whether the probe passed, not how far the system has degraded.
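A minimal sketch makes the collapse visible. In this hypothetical Go handler (the metric names and thresholds are invented for illustration), the rolling error rate and latency are available at the moment the check runs, but the response reduces them to a single bit:

```go
package main

import (
	"fmt"
	"net/http"
	"sync/atomic"
)

// Rolling metrics the service already tracks elsewhere (hypothetical;
// in practice these would come from a metrics library).
var (
	errorRatePercent atomic.Int64 // 30 means 30% of requests are failing
	p99LatencyMillis atomic.Int64 // 12000 means 12s at the 99th percentile
)

func healthz(w http.ResponseWriter, r *http.Request) {
	errs := errorRatePercent.Load()
	p99 := p99LatencyMillis.Load()

	// The continuous degradation data is right here, but the verdict is
	// a single bit: a 30% error rate and a 500% latency regression land
	// on the same side of this threshold as a perfectly healthy system.
	if errs >= 100 || p99 >= 60_000 {
		http.Error(w, "unhealthy", http.StatusServiceUnavailable)
		return
	}
	fmt.Fprintln(w, "ok")
}

func main() {
	http.HandleFunc("/healthz", healthz)
	http.ListenAndServe(":8080", nil)
}
```

Wherever the threshold sits, everything on the passing side of it is indistinguishable to the orchestrator.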
Sampling differs from real traffic. Health checks hit lightweight endpoints, use simple requests, and run at fixed intervals. Real traffic hits complex paths, carries complex payloads, and arrives in bursts. Health checks pass every 10 seconds while real requests hit 30-second timeouts. The check succeeds because it exercises a simpler path. Real traffic fails because it exercises system behavior the health checks never touch.
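For a concrete sense of the gap, here is a hedged sketch of what a typical probe actually exercises; the URL, interval, and timeout are illustrative, not taken from any specific platform configuration:

```go
package main

import (
	"log"
	"net/http"
	"time"
)

func main() {
	// Probe traffic: one cheap GET against a dedicated endpoint every
	// 10 seconds, with a 1-second timeout, no payload, no auth, no bursts.
	probe := &http.Client{Timeout: 1 * time.Second}
	for range time.Tick(10 * time.Second) {
		resp, err := probe.Get("http://localhost:8080/healthz")
		if err != nil {
			log.Println("probe failed:", err)
			continue
		}
		resp.Body.Close()
		log.Println("probe passed:", resp.Status)

		// Real traffic looks nothing like this: POSTs with large payloads,
		// 30-second client timeouts, multiple database queries per request,
		// and bursts of concurrent requests. None of that is exercised here,
		// so the probe keeps passing while the real path fails.
	}
}
```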
Dependency chains aren't represented. Health checks verify that a service responds, not that its dependencies are functional. A service passes checks while its database is slow, its cache is stale, or its external APIs are degraded. The check succeeds because the process is running. The service can't handle real requests because its dependencies are failing. Health checks pass while downstream services time out, database connection pools are exhausted, and external APIs rate-limit. Dependency failures remain invisible.
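One partial mitigation is a dependency-aware readiness check that pings each dependency with a short timeout and reports per-dependency status. A sketch, assuming hypothetical Pinger-style clients for the database, cache, and gateway:

```go
package main

import (
	"context"
	"encoding/json"
	"net/http"
	"time"
)

// Pinger stands in for whatever cheap connectivity check your dependency
// clients expose; the name and interface are illustrative.
type Pinger interface {
	Ping(ctx context.Context) error
}

// readyHandler probes each dependency with a short timeout and reports
// per-dependency status instead of only "the process is running".
func readyHandler(deps map[string]Pinger) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
		defer cancel()

		status := map[string]string{}
		healthy := true
		for name, dep := range deps {
			if err := dep.Ping(ctx); err != nil {
				status[name] = err.Error()
				healthy = false
			} else {
				status[name] = "ok"
			}
		}
		if !healthy {
			w.WriteHeader(http.StatusServiceUnavailable)
		}
		json.NewEncoder(w).Encode(status)
	}
}

// stubDep lets the sketch compile; swap in real database, cache, and
// gateway clients.
type stubDep struct{}

func (stubDep) Ping(ctx context.Context) error { return nil }

func main() {
	deps := map[string]Pinger{"database": stubDep{}, "cache": stubDep{}, "gateway": stubDep{}}
	http.Handle("/readyz", readyHandler(deps))
	http.ListenAndServe(":8080", nil)
}
```

A common caveat: checks like this usually belong in readiness rather than liveness, because restarting a pod doesn't fix a failing dependency and can turn a degraded dependency into a restart storm.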
Success criteria are detached from user experience. Health checks measure whether a probe succeeds, not whether users can complete workflows. A check passes in 50 milliseconds while users experience 30-second timeouts. A check passes while users see errors or can't complete transactions. The check measures technical success, not user success. Health checks pass while user-facing workflows fail, critical paths time out, and business logic errors accumulate.
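Closing that gap means probing the path users actually take and holding it to the budget users actually experience. A sketch of a synthetic workflow check; the endpoint, payload, and two-second budget are invented for illustration:

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"net/http"
	"time"
)

const (
	userBudget = 2 * time.Second  // latency users will actually tolerate (illustrative)
	hardCap    = 10 * time.Second // absolute cap on the synthetic request
)

// checkCheckoutFlow exercises the same path a user would: the real endpoint,
// a realistic payload, and the user-facing latency budget. The URL and
// payload are invented for illustration.
func checkCheckoutFlow(ctx context.Context) error {
	ctx, cancel := context.WithTimeout(ctx, hardCap)
	defer cancel()

	payload := []byte(`{"cart_id":"synthetic-check","amount_cents":100}`)
	req, err := http.NewRequestWithContext(ctx, http.MethodPost,
		"https://payments.example.internal/api/checkout/dry-run",
		bytes.NewReader(payload))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")

	start := time.Now()
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return fmt.Errorf("checkout flow failed: %w", err)
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("checkout flow returned %d", resp.StatusCode)
	}
	if elapsed := time.Since(start); elapsed > userBudget {
		return fmt.Errorf("checkout flow took %v, over the %v budget", elapsed, userBudget)
	}
	return nil
}

func main() {
	if err := checkCheckoutFlow(context.Background()); err != nil {
		fmt.Println("degraded:", err)
		return
	}
	fmt.Println("user path ok")
}
```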
A Failure Scenario
A payment processing service ran in Kubernetes. Pod health checks passed. Kubernetes marked pods as healthy. Dashboards showed green.
Users reported payment failures. Support tickets increased. Payment attempts timed out. Health checks continued passing. Alerts didn't fire.
The service had three dependencies: a payment gateway API, a database, and a cache. Gateway API response times increased from 200ms to 8 seconds. The database connection pool was exhausted. The cache returned stale data. The health check endpoint didn't call any of these dependencies. It returned a simple HTTP 200 response.
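In effect, the endpoint amounted to something like this sketch (a reconstruction for illustration, not the team's code):

```go
package main

import "net/http"

func main() {
	// The check proves the process can accept a connection and write a
	// response; the payment gateway, database, and cache are never touched.
	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK) // always 200 while the process is up
	})
	http.ListenAndServe(":8080", nil)
}
```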
Real payment requests failed because they required all three dependencies. Requests hit the slow gateway API, waited for unavailable database connections, and received stale cache data. The health check used a lightweight endpoint that didn't match real request patterns. Real requests used complex payloads, multiple database queries, and external API calls. Health checks used simple HTTP requests that returned immediately.
The incident lasted 45 minutes before someone noticed. Health checks stayed green. Dashboards showed nominal metrics. Users couldn't complete payments. The health check measured the wrong thing.
Why Competent Teams Miss This
Health checks are required by platforms. Kubernetes requires them for pod lifecycle management. Load balancers require them for traffic routing. Orchestration platforms require them for service discovery. Teams implement health checks because they're mandatory, not because they question their effectiveness.
Health checks are encouraged by best practices. Documentation, platform guides, and industry standards all recommend them. Teams implement health checks because they're recommended, not because they understand their limitations.
Health checks work until they don't. In normal operation, health checks accurately reflect system health. They pass when systems are healthy. They fail when systems are down outright. This success validates the assumption that health checks represent system health. Teams trust health checks because they work most of the time. But "most of the time" isn't "all of the time," and the failures happen when systems are already degraded.
This is a systemic blind spot, not negligence. Competent teams implement health checks correctly. They follow platform requirements. They follow best practices. They configure health checks properly. But health checks have structural limitations that aren't obvious until they fail. The binary signal feels safe because it's simple. The green status feels reassuring because it's clear. But the simplicity hides complexity, and the clarity hides degradation.
Teams miss this because health checks are designed to hide complexity. They're meant to provide simple signals. They're meant to reduce complex behavior to binary states. This design works when systems are healthy, but it fails when systems are degrading. The design that makes health checks useful also makes them lie.