BackoffLimitExceeded occurs when a Kubernetes Job has reached its maximum number of retry attempts and all pods have failed. Jobs have a built-in retry mechanism controlled by the backoffLimit field (default: 6) that automatically recreates failed pods with exponential backoff delays (10s, 20s, 40s, capped at 6 minutes). When this limit is reached, the Job is marked Failed and no further retries occur. This is a permanent failure state: the Job controller will not restart the Job on its own, so it requires manual investigation and intervention. The exponential backoff keeps rapidly failing pods from overwhelming the cluster, but once the limit is exceeded you must fix the underlying issue and either create a new Job or adjust backoffLimit.
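To confirm that a Job has hit this state, inspect its status conditions (a quick check; substitute your Job name and namespace):
kubectl get job <job-name> -n <namespace> -o jsonpath='{.status.conditions[*].reason}'
A Job that has exhausted its retries reports a Failed condition with the reason BackoffLimitExceeded.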
Get detailed information about the failed Job:
kubectl describe job <job-name> -n <namespace>
This shows the Job's status, failure reason, and event history. Then check the associated pods:
kubectl get pods -n <namespace> -l job-name=<job-name>
kubectl describe pod <pod-name> -n <namespace>
Look for the container status, exit codes, and termination reason. The 'Last State' section reveals the exit code and reason (e.g., OOMKilled, Error).
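If you only need the exit code, a jsonpath query can pull it straight from the pod status (a sketch assuming a single-container pod whose container has terminated):
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[0].state.terminated.exitCode}'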
Retrieve logs from the failed pod:
kubectl logs <pod-name> -n <namespace>
If the container crashed and was restarted, check the previous instance's logs:
kubectl logs <pod-name> -n <namespace> --previous
For init container failures:
kubectl logs <pod-name> -n <namespace> -c <init-container-name>
Look for error messages, stack traces, or exceptions. Common indicators: 'FATAL', 'ERROR', 'panic', 'out of memory', 'command not found'.
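If the Job created several pods across its retries, you can fetch logs from all of them at once using the job-name label (a convenience sketch; --prefix tags each line with the pod it came from):
kubectl logs -n <namespace> -l job-name=<job-name> --all-containers --prefix --tail=100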
Check if containers are being killed due to resource constraints:
kubectl describe node <node-name>
kubectl top nodes
kubectl top pods -n <namespace>
If the exit code is 137 (OOMKilled), increase the memory limit:
spec:
  template:
    spec:
      containers:
      - name: app
        resources:
          requests:
            memory: "256Mi"
            cpu: "100m"
          limits:
            memory: "512Mi"
            cpu: "500m"
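You can also check the event stream for the pod, which often records OOM kills, evictions, and failed scheduling (a quick check; substitute the pod name):
kubectl get events -n <namespace> --field-selector involvedObject.name=<pod-name> --sort-by=.lastTimestamp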
Ensure the container image exists and can be pulled:
kubectl describe pod <pod-name> -n <namespace> | grep -i 'image\|pull'
Check the command and args in your Job spec. Exit code 127 means the command was not found; exit code 126 means it was found but could not be executed (for example, it lacks execute permissions).
Test the command locally:
docker run --rm myimage:v1 ls -la /path/to/command
docker run --rm myimage:v1 /bin/sh -c "./run.sh"
If the failures are transient (temporary network issues, brief resource spikes), increase the backoffLimit:
apiVersion: batch/v1
kind: Job
metadata:
  name: my-job
spec:
  backoffLimit: 10             # increased from the default of 6
  activeDeadlineSeconds: 3600  # the Job fails if not done within 1 hour
  template:
    spec:
      containers:
      - name: app
        image: myimage:v1
      restartPolicy: Never
Do NOT blindly increase backoffLimit for bugs in application code; this only wastes cluster resources.
Once you've identified the root cause, implement the fix:
For application errors, fix the code, rebuild the image, and update the Job:
docker build -t myimage:v2 .
docker push myimage:v2
Delete the old Job and reapply:
kubectl delete job <old-job-name> -n <namespace>
kubectl apply -f job.yaml
For configuration issues, fix the init containers or environment variables before recreating the Job.
BackoffLimit behavior varies with Job configuration: (1) With restartPolicy: Never, the Job controller creates a new pod for each failure, up to backoffLimit. (2) With restartPolicy: OnFailure, the kubelet restarts the failed container inside the same pod; those container restarts also count toward backoffLimit.
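If you want retries in place rather than new pods, set the restart policy in the pod template (a minimal fragment; container restarts still count toward backoffLimit):
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - name: app
        image: myimage:v1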
For Indexed Jobs, Kubernetes 1.28+ adds backoffLimitPerIndex, which sets a per-index retry limit instead of a global one, providing better control over large parallel Jobs.
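A minimal sketch of an Indexed Job using these fields (names and values are illustrative; backoffLimitPerIndex requires completionMode: Indexed):
apiVersion: batch/v1
kind: Job
metadata:
  name: indexed-job
spec:
  completions: 10
  parallelism: 3
  completionMode: Indexed
  backoffLimitPerIndex: 2   # each index may fail up to 2 times
  maxFailedIndexes: 5       # the whole Job fails once 5 indexes have failed
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: myimage:v1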
Pod failure policy (Kubernetes 1.25+) allows you to mark certain exit codes as non-retriable (e.g., application bugs with exit code 1) to fail the Job faster without wasting retries on unrecoverable errors.
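A sketch of such a policy, failing fast on exit code 1 while not counting node-drain disruptions against the retry budget (podFailurePolicy requires restartPolicy: Never; the container name is illustrative):
spec:
  backoffLimit: 6
  podFailurePolicy:
    rules:
    - action: FailJob          # exit code 1 is an application bug, do not retry
      onExitCodes:
        containerName: app
        operator: In
        values: [1]
    - action: Ignore           # pod disruptions do not count toward backoffLimit
      onPodConditions:
      - type: DisruptionTarget
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: app
        image: myimage:v1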
If activeDeadlineSeconds is also set, the Job and its pods are terminated and the Job is marked Failed as soon as the total runtime exceeds that deadline, regardless of how many retries remain under backoffLimit.