A failing liveness probe causes Kubernetes to repeatedly restart the container, creating a CrashLoopBackOff. The liveness probe detects whether a container is deadlocked or unresponsive and triggers a restart to recover. Either a misconfigured probe or a genuine application problem can cause these restart loops.
The liveness probe is a health check that tells Kubernetes: "If I fail, the container is dead and should be restarted." Kubernetes uses it to:
1. Detect deadlocked containers (hung processes that no longer respond)
2. Auto-recover unhealthy containers by restarting them
3. Take the pod out of service while it restarts, preventing cascading failures
When a liveness probe fails consistently:
- Kubernetes kills the container (SIGTERM, then SIGKILL after the termination grace period)
- The container restarts (creating a new process)
- The probe fails again immediately (if the underlying issue is not fixed)
- Repeat → CrashLoopBackOff
Unlike readiness probes (which remove the pod from load balancing), liveness probes restart containers.
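To confirm the restarts are probe-driven rather than startup crashes, check the restart count and look for "Liveness probe failed" events. A quick sketch:
kubectl get pod <pod-name> -n <namespace>   # RESTARTS count keeps climbing
kubectl describe pod <pod-name> -n <namespace> | grep -i "liveness"
kubectl get events -n <namespace> --field-selector involvedObject.name=<pod-name>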
View the probe definition:
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A 15 "livenessProbe:"
kubectl describe pod <pod-name> -n <namespace> # Shows probe details
Example configuration:
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 10  # Wait before first probe
  periodSeconds: 10        # Check every 10 seconds
  timeoutSeconds: 5        # Give 5 seconds to respond
  failureThreshold: 3      # Restart after 3 consecutive failures
Note each parameter and compare the values against how long your application actually takes to start and respond.
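To compare quickly, you can also dump just the probe settings with kubectl's jsonpath output (a sketch; adjust the container index if the pod runs more than one container):
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[0].livenessProbe}{"\n"}'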
Run the exact probe check from inside the pod:
# For HTTP probe:
kubectl exec -it <pod-name> -n <namespace> -- sh
curl -v http://localhost:8080/health
echo $? # Check exit code
# For TCP probe:
kubectl exec <pod-name> -- nc -zv localhost 8080
# For exec probe:
kubectl exec <pod-name> -- /bin/health-check.sh
echo $? # Should be 0 for success
# Check response details:
kubectl exec <pod-name> -- curl -v http://localhost:8080/health | head -20
The probe must succeed consistently: exit code 0 for exec probes, or an HTTP status in the 200-399 range for HTTP probes.
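To catch intermittent failures or slow responses, run the check in a loop (a sketch, assuming curl is available in the image and the probe targets /health on port 8080):
kubectl exec <pod-name> -n <namespace> -- sh -c 'for i in $(seq 1 10); do curl -s -o /dev/null -w "%{http_code} %{time_total}s\n" http://localhost:8080/health; sleep 1; done'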
Verify the health check is implemented:
# Check app logs for health check handling:
kubectl logs <pod-name> -n <namespace> | grep -i health
# Search code for the endpoint:
git grep -i "health\|liveness" -- app/
# For Java:
grep -r "@GetMapping.*health" src/
grep -r "@RequestMapping.*health" src/
# For Node.js:
grep -r "app.get.*health" .
grep -r "router.get.*health" .
# For Python:
grep -r "@app.route.*health" .If not found, implement a simple health endpoint:
from flask import Flask
app = Flask(__name__)
@app.route('/health')
def health():
    return {'status': 'ok'}, 200
Verify the app starts before the liveness probe runs:
# Watch pod start:
kubectl logs <pod-name> -n <namespace> -f
# Look for startup messages:
grep -i "listening\|started\|ready" /var/log/app.log
# Time how long the health endpoint takes to respond:
time curl http://localhost:8080/health
# If startup is slow, increase initialDelaySeconds:
kubectl patch deployment <name> -n <namespace> -p '{"spec":{"template":{"spec":{"containers":[{"name":"<container>","livenessProbe":{"initialDelaySeconds":60}}]}}}}'
Rule of thumb: set initialDelaySeconds to at least twice the longest startup time you observe.
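To estimate real startup time from the cluster's point of view, compare when the container started with when the pod first became Ready (a sketch using kubectl's jsonpath filters):
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[0].state.running.startedAt}{" -> "}{.status.conditions[?(@.type=="Ready")].lastTransitionTime}{"\n"}'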
Check if the pod has enough resources:
kubectl describe pod <pod-name> -n <namespace> | grep -E "Limits|Requests"
kubectl top pods <pod-name> -n <namespace>
# Check for OOMKill:
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -i "reason\|exitCode"
# Increase resource limits:
kubectl set resources deployment <name> -n <namespace> \
  --limits=cpu=1,memory=1Gi \
  --requests=cpu=500m,memory=512Mi
If the application is slow because of resource constraints, the probe will time out and fail.
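A more targeted way to check whether the previous container instance was OOM-killed (a sketch; the field is empty if the container has never been terminated):
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}'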
Make the probe more lenient to transient issues:
# Edit the deployment:
kubectl edit deployment <name> -n <namespace>
# Or patch:
kubectl patch deployment <name> -n <namespace> -p '...'
Example of a more tolerant configuration:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: stable-app
spec:
  template:
    spec:
      containers:
      - name: app
        image: myapp:latest
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30  # Longer startup wait
          periodSeconds: 10        # Check less frequently
          timeoutSeconds: 10       # Give more time to respond
          failureThreshold: 5      # Allow 5 failures before restarting
A higher failureThreshold makes the probe more tolerant of transient failures.
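With these values, a container that never responds is restarted roughly initialDelaySeconds + failureThreshold × periodSeconds after it starts: 30s + 5 × 10s = 80 seconds (plus any probe timeouts).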
Create a robust health check that reports success only when the application is actually healthy:
# Python/Flask example:
from flask import Flask, jsonify

app = Flask(__name__)

@app.route('/health', methods=['GET'])
def health():
    # Check dependencies (db and cache are assumed to be initialized elsewhere,
    # e.g. Flask-SQLAlchemy and a Redis client)
    try:
        # Check database connection
        db.session.execute('SELECT 1')
        # Check cache
        cache.ping()
        # Check other critical services
        return jsonify({'status': 'healthy'}), 200
    except Exception as e:
        print(f"Health check failed: {e}")
        return jsonify({'status': 'unhealthy'}), 503
Important: A liveness endpoint should check ONLY whether the container itself is healthy. Don't check external dependencies unrelated to core functionality; deep dependency checks like the database and cache calls above belong in a readiness endpoint instead.
Distinguish between "ready to serve traffic" vs "alive":
apiVersion: apps/v1
kind: Deployment
metadata:
  name: healthy-app
spec:
  template:
    spec:
      containers:
      - name: app
        image: myapp:latest
        # Readiness: app started and can handle traffic
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
          failureThreshold: 2
        # Liveness: container is alive (not deadlocked)
        livenessProbe:
          httpGet:
            path: /alive
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 30
          failureThreshold: 3  # Only restart if truly dead
The readiness probe checks that the app has finished initializing and can serve traffic; the liveness probe checks only whether the process is alive (not deadlocked).
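A minimal sketch of the two endpoints in Flask (the app_initialized flag is an illustrative placeholder; set it once your startup work completes, and put any dependency checks in /ready rather than /alive):
from flask import Flask, jsonify

app = Flask(__name__)
app_initialized = False  # flip to True after config, connections, caches, etc. are ready

@app.route('/alive')
def alive():
    # Liveness: only proves the process can still serve a request
    return jsonify({'status': 'alive'}), 200

@app.route('/ready')
def ready():
    # Readiness: only report ready once initialization has finished
    if not app_initialized:
        return jsonify({'status': 'starting'}), 503
    return jsonify({'status': 'ready'}), 200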
Failing liveness probes are usually a configuration issue, not a Kubernetes bug. The most common mistakes are setting initialDelaySeconds too low (the app isn't ready yet), setting timeoutSeconds too short (a slow response counts as a failure), or checking external services instead of the pod's own health. Never use liveness probes to check database connectivity: if the database is down, restarting the pod won't help. Use readiness probes instead to remove the pod from load balancing. For JVM apps, increase initialDelaySeconds (JVM startup is slow) and timeoutSeconds (GC pauses add latency). Implement proper logging in health check endpoints to debug failures. Startup probes (generally available since Kubernetes 1.20) are a better fit than a high initialDelaySeconds for slow-starting apps. Consider using framework-provided health checks (Spring Boot Actuator endpoints, Node.js health-check libraries) rather than writing your own. In production, log all probe failures to understand patterns.
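A sketch of the startup-probe approach for a slow-starting app (values are illustrative): the startup probe gives the container up to failureThreshold × periodSeconds to come up, and liveness checking only begins once it has succeeded.
startupProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 5
  failureThreshold: 36  # up to 3 minutes (36 x 5s) to start before the kubelet gives up
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 10
  failureThreshold: 3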