This error means the Job has failed repeatedly and exhausted all retry attempts specified in backoffLimit. The Job is marked as permanently failed and requires manual intervention.
When a Kubernetes Job encounters this error, it means the Job has failed repeatedly and the Job controller has exhausted all retry attempts specified in the backoffLimit field. By default, Kubernetes allows 6 retry attempts for a Job before marking it as permanently failed. The Job controller recreates failed Pods with an exponential back-off delay (10 seconds, 20 seconds, 40 seconds, and so on) capped at six minutes between retries. Once the backoffLimit is reached, no more Pods are created, the Job is marked as Failed, and the Job controller stops attempting to recover it. This is a safeguard to prevent infinite retry loops and unnecessary resource consumption. The error indicates that the underlying issue must be fixed before the Job can succeed.
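You can confirm that a Job failed for this reason by inspecting its status conditions (the job name and namespace below are placeholders):
# A Job that exhausted its retries reports a condition with
# type: Failed and reason: BackoffLimitExceeded
kubectl get job <job-name> -n <namespace> -o jsonpath='{.status.conditions}'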
Use kubectl describe to get comprehensive information about the Job's state:
kubectl describe job <job-name> -n <namespace>
Look for the 'Conditions' and 'Events' sections. Note the exact error messages and timestamps to understand the failure pattern.
Retrieve logs from the Job's failed pods:
# List all pods associated with the job
kubectl get pods -l job-name=<job-name> -n <namespace>
# View logs from a failed pod
kubectl logs <pod-name> -n <namespace>
# View logs from the previous container if pod restarted
kubectl logs <pod-name> --previous -n <namespace>
Pay close attention to error messages, stack traces, and the last lines of output.
Ensure all required configuration is present in the Job spec:
# Get the full Job definition
kubectl get job <job-name> -o yaml -n <namespace>
# Verify environment variables in pod spec
kubectl get job <job-name> -o jsonpath='{.spec.template.spec.containers[0].env}' -n <namespace>
Verify that ConfigMaps and Secrets referenced in the Job exist:
kubectl get configmap <name> -n <namespace>
kubectl get secret <name> -n <namespace>
Verify that the Job's resource allocation matches available cluster resources:
# View resource requests and limits
kubectl get job <job-name> -o jsonpath='{.spec.template.spec.containers[0].resources}' -n <namespace>
# Check node capacity
kubectl describe nodes
If you see an OOMKilled or Evicted status, increase the memory request/limit. If pods cannot schedule, verify node selectors, affinity rules, and that nodes have sufficient free resources.
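As a rough sketch, memory and CPU can be raised in the Job's pod template; the container name and values below are placeholders to tune for your workload:
spec:
  template:
    spec:
      containers:
      - name: job-container
        resources:
          requests:
            memory: "512Mi"   # Raise if pods are OOMKilled
            cpu: "250m"
          limits:
            memory: "1Gi"
            cpu: "500m"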
Examine the exact exit code from failed containers:
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState}' -n <namespace>
Common exit codes:
- Exit code 0: Success
- Exit code 1: Application error or general exception
- Exit code 126: Permission denied - script not executable
- Exit code 127: Command not found
- Exit code 137: Killed (SIGKILL), typically reported as OOMKilled (out of memory)
- Exit code 139: Segmentation fault
- Exit code 143: Terminated gracefully (SIGTERM)
Once you've fixed the root cause, recreate the Job. For debugging, consider changing restartPolicy to 'Never':
apiVersion: batch/v1
kind: Job
metadata:
  name: debug-job
spec:
  backoffLimit: 3              # Reduce for faster feedback during debugging
  template:
    spec:
      restartPolicy: Never     # Prevents pod restart, preserves logs
      containers:
      - name: job-container
        image: your-image:fixed-version
Delete the old Job and apply:
kubectl delete job <old-job-name> -n <namespace>
kubectl apply -f job.yaml
kubectl get job -w # Watch for status changes
Understanding restartPolicy is critical for debugging: with restartPolicy: OnFailure, individual container restarts are counted toward backoffLimit, but the pod itself isn't deleted. With restartPolicy: Never, the pod isn't restarted at all; instead, the Job controller creates a new pod for each retry attempt.
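The two policies leave different evidence behind (pod and job names below are placeholders):
# With restartPolicy: OnFailure, a single pod accumulates restarts
# (check the RESTARTS column)
kubectl get pods -l job-name=<job-name> -n <namespace>
# With restartPolicy: Never, each retry leaves a separate pod in Error state,
# so logs from every attempt remain available
kubectl logs <failed-pod-name> -n <namespace>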
For complex scenarios, use a Pod Failure Policy (beta since Kubernetes 1.26, stable since 1.31) with .spec.podFailurePolicy to handle failures based on container exit codes. This also allows ignoring pod disruptions (preemption, eviction) so they don't count toward backoffLimit.
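A minimal sketch of such a policy, assuming a container named job-container and an application that uses exit code 42 for non-retriable errors (both are illustrative):
spec:
  backoffLimit: 6
  podFailurePolicy:
    rules:
    # Fail the Job immediately, skipping remaining retries, on a non-retriable exit code
    - action: FailJob
      onExitCodes:
        containerName: job-container
        operator: In
        values: [42]
    # Don't count disruptions (preemption, eviction, node drain) toward backoffLimit
    - action: Ignore
      onPodConditions:
      - type: DisruptionTarget
  template:
    spec:
      restartPolicy: Never    # Required when podFailurePolicy is used
      containers:
      - name: job-container
        image: your-image:fixed-version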
Always preserve failed pods for log inspection by not deleting the Job immediately after failure. Use terminationMessagePolicy: FallbackToLogsOnError to capture the last 2048 bytes (or 80 lines, whichever is smaller) of log output when the container exits with an error without writing a termination message.
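The policy is set per container in the Job's pod template; a minimal sketch:
spec:
  template:
    spec:
      containers:
      - name: job-container
        image: your-image:fixed-version
        # On error exits with no message written to /dev/termination-log,
        # fall back to the tail of the container's log output
        terminationMessagePolicy: FallbackToLogsOnError
The captured message then appears under Last State: Terminated when you run kubectl describe pod.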
In production, increase backoffLimit conservatively only if failures are transient. Repeated backoff indicates a fundamental application issue that needs fixing, not higher retry limits.