This error occurs when a Docker container's health check command fails consecutively for the configured number of retries (default: 3). The container is then marked as 'unhealthy'. Common fixes include increasing the retry count, extending timeout and start_period values, and debugging the underlying health check command.
The "Health check failed after maximum retries" error in Docker indicates that the container's HEALTHCHECK instruction has failed repeatedly, exhausting all configured retry attempts. By default, Docker allows 3 consecutive failures before marking a container as unhealthy.
Docker's health check system works as follows:
1. Docker runs the health check command at each `interval` (default: 30s); during the `start_period` (default: 0s), checks still run but failures do not count toward the retry limit
2. Each check must complete within the `timeout` (default: 30s) or it counts as a failure
3. After `retries` (default: 3) consecutive failures, the container is marked "unhealthy"
4. A single successful check resets the failure counter
When this error occurs, it means the application inside the container has failed to respond correctly to the health probe command multiple times in a row. The container keeps running but is flagged as unhealthy, which affects orchestration decisions and service dependencies.
Common scenarios that trigger this error:
- The application starts slowly and isn't ready before health check failures begin counting
- The health check command itself is misconfigured or relies on tools not present in the image
- Network or resource issues prevent the application from responding
- The application has crashed or become unresponsive
- Timeout values are too aggressive for the health check to complete
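The counting rules can be sketched as a small simulation. This is a hypothetical helper for illustration (`simulate_health` is not a Docker API): it replays a sequence of check results through the failing-streak logic described above.

```python
# Hypothetical simulator of Docker's failing-streak rules (illustrative only).
# results: list of booleans, True = check passed, False = check failed.
def simulate_health(results, retries=3, start_period_checks=0):
    status = "starting"
    streak = 0
    for i, passed in enumerate(results):
        in_start_period = status == "starting" and i < start_period_checks
        if passed:
            streak = 0           # a single success resets the failure counter
            status = "healthy"   # a success during start_period also marks healthy
        elif not in_start_period:
            streak += 1          # failures during start_period are ignored
            if streak >= retries:
                status = "unhealthy"
    return status

print(simulate_health([False, False, False]))                         # unhealthy
print(simulate_health([False, True, False, False]))                   # healthy (streak was reset)
print(simulate_health([False, False, False], start_period_checks=3))  # starting
```

Note how three consecutive failures trip the default `retries=3` limit, while the same failures inside the start period are ignored entirely.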
First, examine exactly why the health checks are failing by inspecting the container's health status:
# View current container status
docker ps -a
# Get detailed health check information
docker inspect --format='{{json .State.Health}}' <container_name> | jq .
# View the last several health check results with timestamps
docker inspect <container_name> | jq '.[0].State.Health.Log'
Example output showing failed retries:
{
  "Status": "unhealthy",
  "FailingStreak": 3,
  "Log": [
    {
      "Start": "2024-01-15T10:00:00.000Z",
      "End": "2024-01-15T10:00:30.000Z",
      "ExitCode": 1,
      "Output": "curl: (7) Failed to connect to localhost port 8080"
    },
    {
      "Start": "2024-01-15T10:00:30.000Z",
      "End": "2024-01-15T10:01:00.000Z",
      "ExitCode": 1,
      "Output": "curl: (7) Failed to connect to localhost port 8080"
    },
    {
      "Start": "2024-01-15T10:01:00.000Z",
      "End": "2024-01-15T10:01:30.000Z",
      "ExitCode": 1,
      "Output": "curl: (7) Failed to connect to localhost port 8080"
    }
  ]
}
This shows the health check failed 3 times consecutively with exit code 1, indicating connection issues.
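When triaging many containers, the same inspection can be scripted. A minimal sketch, using Python's standard json module on the output of `docker inspect --format='{{json .State.Health}}'` (the sample string below mirrors the example output above):

```python
import json

# Sample output of: docker inspect --format='{{json .State.Health}}' <container>
raw = '''{
  "Status": "unhealthy",
  "FailingStreak": 3,
  "Log": [
    {"Start": "2024-01-15T10:00:00.000Z", "End": "2024-01-15T10:00:30.000Z",
     "ExitCode": 1, "Output": "curl: (7) Failed to connect to localhost port 8080"}
  ]
}'''

health = json.loads(raw)
print(f"status={health['Status']} failing_streak={health['FailingStreak']}")
for entry in health["Log"]:
    # The Output field usually contains the probe's stderr/stdout, e.g. the curl error
    print(f"  exit={entry['ExitCode']}: {entry['Output']}")
```

In a real script you would feed `raw` from `subprocess.run(["docker", "inspect", ...])` instead of a literal string.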
If your application occasionally fails health checks due to temporary conditions (heavy load, garbage collection, etc.), increase the retry count:
In docker-compose.yml:
services:
  api:
    image: my-api:latest
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 5  # Increased from default 3
      start_period: 60s
In Dockerfile:
HEALTHCHECK --interval=30s --timeout=10s --retries=5 --start-period=60s \
  CMD curl -f http://localhost:8080/health || exit 1
Guidelines for retry values:
- 3 retries (default): Good for stable applications
- 5 retries: Better for applications with occasional hiccups
- 10+ retries: Use sparingly, only for known-flaky services
Note: More retries means longer time before a truly unhealthy container is flagged.
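That tradeoff is simple arithmetic: the time from the first failure to the "unhealthy" flag is roughly interval times retries (plus up to one timeout per attempt). A quick illustration, using a hypothetical helper:

```python
# Rough rule of thumb: time to flag a dead container is about interval * retries.
# detection_window_s is an illustrative helper, not a Docker API.
def detection_window_s(interval_s, retries):
    return interval_s * retries

print(detection_window_s(30, 3))   # defaults: ~90s before "unhealthy"
print(detection_window_s(30, 5))   # retries=5: ~150s
print(detection_window_s(30, 10))  # retries=10: ~300s (5 minutes)
```

Shortening the interval is usually a better lever than reducing retries if you need faster detection without more false positives.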
The start_period option gives your application time to initialize before health check failures count toward the retry limit. This is crucial for applications that need significant startup time.
Example for a Java/Spring Boot application:
services:
  backend:
    image: spring-app:latest
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/actuator/health"]
      interval: 10s
      timeout: 5s
      retries: 3
      start_period: 120s  # 2 minutes for JVM warmup
Example for a database service:
services:
  postgres:
    image: postgres:15
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 30s
How start_period works:
- During start_period, failed health checks do NOT count toward retries
- If a health check succeeds during start_period, the container is marked healthy
- After start_period expires, normal retry counting begins
Recommended start_period values:
- Node.js/Go applications: 10-30s
- Python/Ruby applications: 20-60s
- Java applications: 60-180s
- Database services: 30-60s
If your health check command takes longer than the timeout value, each check is marked as failed. Increase the timeout if your health endpoint performs complex checks:
services:
  api:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health/deep"]
      interval: 30s
      timeout: 30s  # Increased for deep health checks
      retries: 3
      start_period: 60s
Signs that timeout is too short:
- Health check logs show "context deadline exceeded" or similar timeout messages
- Exit codes indicate timeout rather than connection failure
- The health endpoint works when tested manually but fails in automated checks
Testing your health check duration:
# Time how long the health check takes
docker exec <container> sh -c 'time curl -f http://localhost:8080/health'
(Running time through sh -c uses the shell's built-in timer, so it works even when no standalone time binary is installed.)
If the health check takes 15 seconds, set timeout to at least 20-30 seconds.
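That sizing rule can be expressed as a one-liner. The doubling factor and 5-second floor below are assumptions for illustration, not Docker defaults (`recommend_timeout_s` is a hypothetical helper):

```python
import time

# Illustrative sizing rule: timeout ~= 2x the observed check duration, with a floor.
def recommend_timeout_s(observed_s, safety_factor=2.0, floor_s=5):
    return max(floor_s, int(round(observed_s * safety_factor)))

# Measuring a probe's duration; the sleep stands in for the real curl/HTTP call.
start = time.monotonic()
time.sleep(0.05)  # stand-in for: curl -f http://localhost:8080/health
observed = time.monotonic() - start
print(f"observed: {observed:.2f}s -> timeout: {recommend_timeout_s(observed)}s")

print(recommend_timeout_s(15))  # a 15-second check -> 30s timeout
print(recommend_timeout_s(1))   # a fast check still gets the 5s floor
```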
Run the health check command manually inside the container to identify issues:
# Get a shell in the container
docker exec -it <container_name> sh
# Run the health check command manually
curl -f http://localhost:8080/health
echo "Exit code: $?"
# Check if the service is listening
netstat -tlnp | grep 8080
# or
ss -tlnp | grep 8080
Common issues and solutions:
Command not found:
sh: curl: not found
Solution: Install curl in your Dockerfile or use an alternative tool.
Connection refused:
curl: (7) Failed to connect to localhost port 8080: Connection refused
Solution: Verify your application is running and listening on the correct port.
HTTP error:
curl: (22) The requested URL returned error: 503
Solution: Check your application logs - the health endpoint may be returning errors.
DNS resolution:
curl: (6) Could not resolve host: localhost
Solution: Use 127.0.0.1 instead of localhost in minimal containers.
Many minimal Docker images don't include curl or wget. Add them to your Dockerfile:
For Alpine-based images:
FROM node:20-alpine
RUN apk add --no-cache curl
# ... rest of Dockerfile
For Debian/Ubuntu-based images:
FROM python:3.12-slim
RUN apt-get update && apt-get install -y curl && rm -rf /var/lib/apt/lists/*
# ... rest of Dockerfile
Alternative: Use wget (often pre-installed on Alpine):
healthcheck:
  test: ["CMD", "wget", "--spider", "-q", "http://localhost:8080/health"]
Alternative: Use native language tools:
# Node.js - no extra dependencies needed
healthcheck:
  test: ["CMD", "node", "-e", "require('http').get('http://localhost:8080/health', (r) => process.exit(r.statusCode === 200 ? 0 : 1)).on('error', () => process.exit(1))"]
Alternative: Use netcat for TCP-only checks:
healthcheck:
  test: ["CMD", "nc", "-z", "localhost", "8080"]
Incorrect syntax in docker-compose.yml is a common cause of health check failures:
CMD format - each argument must be separate:
# WRONG - arguments combined into one string
healthcheck:
  test: ["CMD", "curl -f http://localhost:8080/health"]
# CORRECT - arguments separated
healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
CMD-SHELL format - use for shell features like || or pipes:
# CORRECT - shell form handles || exit 1
healthcheck:
  test: ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"]
String format - simplest syntax:
# CORRECT - string is automatically run in shell
healthcheck:
  test: curl -f http://localhost:8080/health || exit 1
Important: Always add || exit 1 when using curl or wget. Tools like curl return various exit codes (7 for connection refused, 22 for HTTP errors), but Docker's HEALTHCHECK only defines 0 (healthy) and 1 (unhealthy); appending || exit 1 normalizes every failure to 1.
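The same 0-or-1 convention matters for script-based checks. As an illustration, here is a Python-native probe (handy for images that ship Python but not curl) that maps every outcome to exactly 0 or 1; the `probe` helper and the throwaway demo server are assumptions for this sketch, not part of any Docker tooling:

```python
import threading
import urllib.request
import urllib.error
from http.server import BaseHTTPRequestHandler, HTTPServer

def probe(url, timeout=5.0):
    """Return 0 for a 2xx response, 1 for anything else -- the only two
    exit codes Docker's HEALTHCHECK defines."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 0 if 200 <= resp.status < 300 else 1
    except (urllib.error.URLError, OSError):
        return 1  # connection refused, DNS failure, HTTP 4xx/5xx, timeout, ...

# Demo against a throwaway local server (port 0 = pick any free port).
class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200 if self.path == "/health" else 404)
        self.end_headers()
    def log_message(self, *args):
        pass  # keep the demo output quiet

server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

print(probe(f"http://127.0.0.1:{port}/health"))   # 0 (healthy)
print(probe(f"http://127.0.0.1:{port}/missing"))  # 1 (unhealthy)
```

In a healthcheck you would inline the `probe` logic as a `python -c "..."` one-liner, much like the Node.js example earlier, and pass its return value to sys.exit().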
Ensure your application has a properly functioning health endpoint:
1. Check what the health endpoint returns:
docker exec <container> curl -v http://localhost:8080/health
The endpoint should return HTTP 200 for healthy status. With -f, curl exits non-zero on HTTP 4xx/5xx responses, which marks the check as failed.
2. Create a proper health endpoint in your application:
Node.js/Express:
app.get('/health', (req, res) => {
  // Quick check that the server is responding
  res.status(200).json({ status: 'healthy' });
});

// More thorough check including dependencies
app.get('/health/ready', async (req, res) => {
  try {
    await db.query('SELECT 1');
    res.status(200).json({ status: 'ready' });
  } catch (err) {
    res.status(503).json({ status: 'not ready', error: err.message });
  }
});
Python/FastAPI:
@app.get("/health")
async def health():
    return {"status": "healthy"}

@app.get("/health/ready")
async def ready():
    try:
        await database.execute("SELECT 1")
        return {"status": "ready"}
    except Exception as e:
        raise HTTPException(status_code=503, detail=str(e))
3. For database containers, use native tools:
# PostgreSQL
test: ["CMD-SHELL", "pg_isready -U postgres -d mydb"]
# MySQL - use CMD-SHELL (with $$ in Compose) so the password variable expands inside the container
test: ["CMD-SHELL", "mysqladmin ping -h localhost -u root -p$$MYSQL_ROOT_PASSWORD"]
# Redis
test: ["CMD", "redis-cli", "ping"]
# MongoDB
test: ["CMD", "mongosh", "--eval", "db.adminCommand('ping')"]
Understanding the retry mechanism in detail:
The health check retry mechanism works as follows:
1. start_period begins when the container starts
2. During start_period, health checks run but failures don't count
3. After start_period (or after the first successful check), counting begins
4. Each failed check increments FailingStreak
5. When FailingStreak equals retries, container becomes "unhealthy"
6. A single successful check resets FailingStreak to 0
Combining interval and retries strategically:
For a 5-minute detection window with frequent checks:
healthcheck:
  interval: 30s  # Check every 30 seconds
  retries: 10    # 10 failures = 5 minutes to declare unhealthy
For quick detection of critical failures:
healthcheck:
  interval: 10s  # Check every 10 seconds
  retries: 3     # 3 failures = 30 seconds to declare unhealthy
Health checks in orchestration systems:
Docker Swarm: Automatically replaces unhealthy containers in services. Configure with:
docker service create --name api \
  --health-cmd "curl -f http://localhost:8080/health" \
  --health-interval 30s \
  --health-timeout 10s \
  --health-retries 3 \
  myimage
Kubernetes: Uses separate liveness and readiness probes with similar retry logic:
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  failureThreshold: 3  # Equivalent to retries
  periodSeconds: 30    # Equivalent to interval
Monitoring health check failures:
Use Docker events to monitor health status changes:
docker events --filter event=health_status
Or programmatically with the Docker API/SDK to trigger alerts when containers become unhealthy.
Auto-healing without orchestration:
Use the willfarrell/autoheal container to automatically restart unhealthy containers:
services:
  autoheal:
    image: willfarrell/autoheal
    restart: always
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    environment:
      - AUTOHEAL_CONTAINER_LABEL=all
Debugging in CI/CD pipelines:
Add verbose logging to health checks for CI debugging:
healthcheck:
  test: ["CMD-SHELL", "curl -v http://localhost:8080/health 2>&1 | tee /dev/stderr | grep -q 'HTTP.*200' || exit 1"]
Or use a wait-for-healthy script in your pipeline:
timeout 120 bash -c 'until docker inspect --format="{{.State.Health.Status}}" mycontainer | grep -q healthy; do sleep 5; done'