A failing liveness probe causes Kubernetes to repeatedly restart the container, creating a CrashLoopBackOff. The liveness probe detects whether a container is deadlocked or unresponsive and triggers a restart to recover. Either a misconfigured probe or a genuine application problem can cause these restart loops.
The liveness probe is a health check that tells Kubernetes: "If I fail, the container is dead and should be restarted." Kubernetes uses it to:
1. Detect deadlocked containers (hung processes that no longer respond)
2. Auto-recover unhealthy containers by restarting them
3. Take the pod out of service while it restarts, preventing cascading failures
When a liveness probe fails consistently:
- Kubernetes kills the container (SIGTERM, then SIGKILL after the termination grace period)
- The container restarts (creating a new process)
- The probe fails again immediately (if the underlying issue is not fixed)
- Repeat → CrashLoopBackOff
Unlike readiness probes (which remove the pod from load balancing), liveness probes restart containers.
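To confirm the restarts are probe-driven rather than startup crashes, check the restart count and look for "Liveness probe failed" events. A quick sketch:
kubectl get pod <pod-name> -n <namespace>   # RESTARTS count keeps climbing
kubectl describe pod <pod-name> -n <namespace> | grep -i "liveness"
kubectl get events -n <namespace> --field-selector involvedObject.name=<pod-name>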
View the probe definition:
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A 15 "livenessProbe:"
kubectl describe pod <pod-name> -n <namespace> # Shows probe details
Example configuration:
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 10  # Wait before first probe
  periodSeconds: 10        # Check every 10 seconds
  timeoutSeconds: 5        # Give 5 seconds to respond
  failureThreshold: 3      # Restart after 3 consecutive failures
Note each parameter and compare the values against how long your application actually takes to start and respond.
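To compare quickly, you can also dump just the probe settings with kubectl's jsonpath output (a sketch; adjust the container index if the pod runs more than one container):
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[0].livenessProbe}{"\n"}'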
Run the exact probe check from inside the pod:
# For HTTP probe:
kubectl exec -it <pod-name> -n <namespace> -- sh
curl -v http://localhost:8080/health
echo $? # Check exit code
# For TCP probe:
kubectl exec <pod-name> -- nc -zv localhost 8080
# For exec probe:
kubectl exec <pod-name> -- /bin/health-check.sh
echo $? # Should be 0 for success
# Check response details:
kubectl exec <pod-name> -- curl -v http://localhost:8080/health | head -20
The probe must succeed consistently: exit code 0 for exec probes, or an HTTP status in the 200-399 range for HTTP probes.
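To catch intermittent failures or slow responses, run the check in a loop (a sketch, assuming curl is available in the image and the probe targets /health on port 8080):
kubectl exec <pod-name> -n <namespace> -- sh -c 'for i in $(seq 1 10); do curl -s -o /dev/null -w "%{http_code} %{time_total}s\n" http://localhost:8080/health; sleep 1; done'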
Verify the health check is implemented:
# Check app logs for health check handling:
kubectl logs <pod-name> -n <namespace> | grep -i health
# Search code for the endpoint:
git grep -i "health\|liveness" -- app/
# For Java:
grep -r "@GetMapping.*health" src/
grep -r "@RequestMapping.*health" src/
# For Node.js:
grep -r "app.get.*health" .
grep -r "router.get.*health" .
# For Python:
grep -r "@app.route.*health" .If not found, implement a simple health endpoint:
from flask import Flask
app = Flask(__name__)
@app.route('/health')
def health():
    return {'status': 'ok'}, 200
Verify the app starts before the liveness probe runs:
# Watch pod start:
kubectl logs <pod-name> -n <namespace> -f
# Look for startup messages:
grep -i "listening\|started\|ready" /var/log/app.log
# Time how long the health endpoint takes to respond:
time curl http://localhost:8080/health
# If startup is slow, increase initialDelaySeconds:
kubectl patch deployment <name> -n <namespace> -p '{"spec":{"template":{"spec":{"containers":[{"name":"<container>","livenessProbe":{"initialDelaySeconds":60}}]}}}}'
Rule of thumb: set initialDelaySeconds to at least twice the longest startup time you observe.
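To estimate real startup time from the cluster's point of view, compare when the container started with when the pod first became Ready (a sketch using kubectl's jsonpath filters):
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[0].state.running.startedAt}{" -> "}{.status.conditions[?(@.type=="Ready")].lastTransitionTime}{"\n"}'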
Check if the pod has enough resources:
kubectl describe pod <pod-name> -n <namespace> | grep -E "Limits|Requests"
kubectl top pods <pod-name> -n <namespace>
# Check for OOMKill:
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -i "reason\|exitCode"
# Increase resource limits:
kubectl set resources deployment <name> -n <namespace> \
  --limits=cpu=1,memory=1Gi \
  --requests=cpu=500m,memory=512Mi
If the application is slow because of resource constraints, the probe will time out and fail.
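A more targeted way to check whether the previous container instance was OOM-killed (a sketch; the field is empty if the container has never been terminated):
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}'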
Make the probe more lenient to transient issues:
# Edit the deployment:
kubectl edit deployment <name> -n <namespace>
# Or patch:
kubectl patch deployment <name> -n <namespace> -p '...'
Example of a more tolerant configuration:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: stable-app
spec:
  template:
    spec:
      containers:
      - name: app
        image: myapp:latest
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30  # Longer startup wait
          periodSeconds: 10        # Check less frequently
          timeoutSeconds: 10       # Give more time to respond
          failureThreshold: 5      # Allow 5 failures before restarting
A higher failureThreshold makes the probe more tolerant of transient failures.
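With these values, a container that never responds is restarted roughly initialDelaySeconds + failureThreshold × periodSeconds after it starts: 30s + 5 × 10s = 80 seconds (plus any probe timeouts).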
Create a robust health check that reports success only when the application is actually healthy:
# Python/Flask example:
from flask import Flask, jsonify

app = Flask(__name__)

@app.route('/health', methods=['GET'])
def health():
    # Check dependencies (db and cache are assumed to be initialized elsewhere,
    # e.g. Flask-SQLAlchemy and a Redis client)
    try:
        # Check database connection
        db.session.execute('SELECT 1')
        # Check cache
        cache.ping()
        # Check other critical services
        return jsonify({'status': 'healthy'}), 200
    except Exception as e:
        print(f"Health check failed: {e}")
        return jsonify({'status': 'unhealthy'}), 503
Important: A liveness endpoint should check ONLY whether the container itself is healthy. Don't check external dependencies unrelated to core functionality; deep dependency checks like the database and cache calls above belong in a readiness endpoint instead.
Distinguish between "ready to serve traffic" vs "alive":
apiVersion: apps/v1
kind: Deployment
metadata:
  name: healthy-app
spec:
  template:
    spec:
      containers:
      - name: app
        image: myapp:latest
        # Readiness: app started and can handle traffic
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
          failureThreshold: 2
        # Liveness: container is alive (not deadlocked)
        livenessProbe:
          httpGet:
            path: /alive
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 30
          failureThreshold: 3  # Only restart if truly dead
The readiness probe checks that the app has finished initializing and can serve traffic; the liveness probe checks only whether the process is alive (not deadlocked).
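A minimal sketch of the two endpoints in Flask (the app_initialized flag is an illustrative placeholder; set it once your startup work completes, and put any dependency checks in /ready rather than /alive):
from flask import Flask, jsonify

app = Flask(__name__)
app_initialized = False  # flip to True after config, connections, caches, etc. are ready

@app.route('/alive')
def alive():
    # Liveness: only proves the process can still serve a request
    return jsonify({'status': 'alive'}), 200

@app.route('/ready')
def ready():
    # Readiness: only report ready once initialization has finished
    if not app_initialized:
        return jsonify({'status': 'starting'}), 503
    return jsonify({'status': 'ready'}), 200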
Failing liveness probes are usually a configuration issue, not a Kubernetes bug. The most common mistakes are setting initialDelaySeconds too low (the app isn't ready yet), setting timeoutSeconds too short (a slow response counts as a failure), or checking external services instead of the pod's own health. Never use liveness probes to check database connectivity: if the database is down, restarting the pod won't help. Use readiness probes instead to remove the pod from load balancing. For JVM apps, increase initialDelaySeconds (JVM startup is slow) and timeoutSeconds (GC pauses add latency). Implement proper logging in health check endpoints to debug failures. Startup probes (generally available since Kubernetes 1.20) are a better fit than a high initialDelaySeconds for slow-starting apps. Consider using framework-provided health checks (Spring Boot Actuator endpoints, Node.js health-check libraries) rather than writing your own. In production, log all probe failures to understand patterns.
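A sketch of the startup-probe approach for a slow-starting app (values are illustrative): the startup probe gives the container up to failureThreshold × periodSeconds to come up, and liveness checking only begins once it has succeeded.
startupProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 5
  failureThreshold: 36  # up to 3 minutes (36 x 5s) to start before the kubelet gives up
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 10
  failureThreshold: 3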