Temporary failure errors indicate transient problems during pod startup (network timeouts, brief API server unavailability, container image pull delays). These usually resolve with retry, but repeated failures suggest persistent issues.
Kubernetes reports "temporary failure" for transient errors such as:
1. Network timeouts during image pull
2. Brief API server connectivity loss
3. Container runtime temporary unavailability
4. Scheduler timeouts while finding resources
5. DNS resolution failures for image registries
The kubelet automatically retries, and pods eventually start if the underlying issue resolves. However, if the condition persists, pods remain stuck in Pending or ImagePullBackOff (a quick way to list affected pods is shown below).
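A quick way to surface affected pods across the cluster (the grep pattern is a convenience filter, not an official status selector):
kubectl get pods -A | grep -E 'Pending|ImagePullBackOff|ErrImagePull'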
Examine the pod:
kubectl describe pod <pod-name> -n <namespace>
Look at the Events section for:
- Recent "failed" or "backoff" events
- Container error messages
- Image pull attempts and failures
- Restart count and last state
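You can also pull just this pod's events, sorted by time (substitute the placeholders):
kubectl get events -n <namespace> --field-selector involvedObject.name=<pod-name> --sort-by=.lastTimestamp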
For repeated failures:
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A10 status:
Check container logs:
kubectl logs <pod-name> -n <namespace> # Current logs
kubectl logs <pod-name> -n <namespace> --previous # Previous attempt
From a working pod in the cluster, test registry access:
kubectl run test-pull --image=busybox --rm -it -- sh
Inside the pod (busybox ships ping and nslookup, but not dig or curl):
ping <registry-domain>
nslookup <registry-domain>
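If you need dig and curl, a throwaway pod from a network-tools image is easier; nicolaka/netshoot is one commonly used option (using it here is an assumption, any image with these tools works):
kubectl run test-net --image=nicolaka/netshoot --rm -it -- bash
Inside the pod:
dig <registry-domain>
curl -v https://<registry>/v2/ # 200 or 401 both mean the registry is reachable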
From the problematic node:
kubectl debug node/<node-name> -it --image=ubuntu
# Inside the debug container (the stock ubuntu image ships without these tools):
apt-get update && apt-get install -y iputils-ping telnet curl
ping registry.example.com
telnet registry.example.com 443
time curl https://registry.example.com/v2/
If DNS fails:
kubectl get svc -n kube-system kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns
If requests time out, likely causes include:
- Network latency (check with the time command)
- Registry rate limiting (check registry logs)
- Firewall blocking (check iptables and security groups; see the checks below)
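A minimal sketch of the connectivity and firewall checks from the node (registry.example.com is a placeholder):
nc -vz -w 5 registry.example.com 443 # Verify outbound TCP 443 to the registry is not blocked
sudo iptables -S | grep -iE 'drop|reject' # Look for host firewall rules that drop or reject traffic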
If using a private registry, check the secret:
kubectl get secret -n <namespace> | grep registry
kubectl describe secret <secret-name> -n <namespace>
Verify the pod spec includes the secret:
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A5 imagePullSecrets
Expected output:
imagePullSecrets:
- name: <secret-name>
If it is missing, add it to the pod spec, or patch the namespace's default ServiceAccount so new pods inherit it:
kubectl patch serviceaccount default -n <namespace> -p '{"imagePullSecrets": [{"name": "<secret-name>"}]}'
Test secret validity:
kubectl get secret <secret-name> -n <namespace> -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d | jq .
Verify the credentials work:
docker login -u <user> -p <pass> <registry>
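If the credentials turned out to be wrong or rotated, recreating the pull secret is usually simpler than editing it (all names below are placeholders):
kubectl create secret docker-registry <secret-name> \
  --docker-server=<registry> \
  --docker-username=<user> \
  --docker-password=<pass> \
  -n <namespace> --dry-run=client -o yaml | kubectl apply -f -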
Verify network stability on the node:
sudo mtr -r -c 100 <registry-domain> # Multi-hop traceroute
ping -c 100 <registry-domain> # Check packet loss
Check DNS resolution:
nslookup <registry-domain>
dig <registry-domain> +short
If DNS is slow or failing:
kubectl get service -n kube-system kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns | tail -50
Test from a pod:
kubectl run test-dns --restart=Never --image=busybox -- nslookup kubernetes.default
kubectl logs test-dns # Check the lookup result, then clean up with: kubectl delete pod test-dns
If DNS inside the cluster looks healthy but pulls still time out, the bottleneck is usually registry latency rather than DNS; tune the kubelet instead.
Adjust kubelet settings for slow registries:
sudo nano /etc/sysconfig/kubelet
Add or update (on newer kubelets these are set in a KubeletConfiguration file instead; see the sketch after the flag list):
--image-pull-progress-deadline=3m # Default 1 minute; applies only to the legacy Docker runtime
--max-pods=110 # Reduce if too many parallel pulls compete for bandwidth
--serialize-image-pulls=true # Pull images sequentially (slower but more reliable)
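On recent Kubernetes versions most kubelet flags are deprecated in favor of the KubeletConfiguration file; a minimal sketch, assuming the file lives at /var/lib/kubelet/config.yaml (the path varies by distribution):
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
serializeImagePulls: true # pull one image at a time
registryPullQPS: 5 # limit image pull requests per second to the registry
registryBurst: 10 # short burst allowance above the QPS limit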
For Deployments, avoid unnecessary re-pulls of cached images:
spec:
  template:
    spec:
      restartPolicy: Always
      containers:
      - image: myregistry.com/myimage:v1
        imagePullPolicy: IfNotPresent # Don't re-pull if already cached
Restart kubelet:
sudo systemctl restart kubelet
sudo journalctl -u kubelet -f
For frequently-used images, pre-cache them on nodes:
# SSH into node
sudo docker pull myregistry.com/myimage:v1
sudo ctr -n k8s.io images pull myregistry.com/myimage:v1 # containerd nodes
For DaemonSets or common images, use a node init script:
#!/bin/bash
# /var/lib/cloud/cloud-init.sh (example path; adjust for your node bootstrap mechanism)
docker pull nginx:latest # assumes the node runtime is Docker; use ctr/crictl on containerd nodes
docker pull postgres:13
For large deployments, build a base machine image with the images pre-cached (Packer, golden AMI).
Or use a cache warmer pod:
apiVersion: apps/v1 # DaemonSet is in the apps/v1 API group, not batch/v1
kind: DaemonSet
metadata:
  name: image-cache-warmer
spec:
  selector:
    matchLabels:
      app: image-cache-warmer
  template:
    metadata:
      labels:
        app: image-cache-warmer
    spec:
      hostNetwork: true
      containers:
      - name: warmer
        image: docker:cli # the docker CLI is required; busybox does not ship it (assumes Docker nodes)
        command:
        - /bin/sh
        - -c
        # keep the pod running afterwards so the DaemonSet does not restart it in a loop
        - for img in nginx postgres redis; do docker pull "$img:latest"; done; while true; do sleep 3600; done
        volumeMounts:
        - name: docker
          mountPath: /var/run/docker.sock
      volumes:
      - name: docker
        hostPath:
          path: /var/run/docker.sock
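On containerd nodes there is no Docker socket to mount. A runtime-agnostic alternative (a sketch; the image list is illustrative) is to let the kubelet itself pull and cache the images by running them as no-op init containers:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: image-prepuller
spec:
  selector:
    matchLabels:
      app: image-prepuller
  template:
    metadata:
      labels:
        app: image-prepuller
    spec:
      initContainers:
      # Each init container exits immediately; the side effect is that the
      # kubelet pulls and caches the image on every node.
      - name: pull-nginx
        image: nginx:latest
        command: ["true"]
      - name: pull-postgres
        image: postgres:13
        command: ["true"]
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9 # tiny placeholder that keeps the pod Running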
Set up proactive monitoring:
# Check the registry health endpoint
curl -v https://<registry>/v2/ # 200 OK, or 401 if auth is required; both mean it is reachable
# Monitor node network metrics
watch -n 1 "ethtool -S eth0 | grep -i error"
For Prometheus:
# Metric names vary by Kubernetes version; on older kubelets use
# kubelet_runtime_operations_duration_seconds{operation_type="pull_image"} instead
rate(kubelet_image_pull_duration_seconds_bucket[5m])
kubelet_image_pull_duration_seconds_count
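A p95 pull-latency query makes slow registries visible at a glance (the same caveat about version-specific metric names applies):
histogram_quantile(0.95,
  sum(rate(kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image"}[5m])) by (le))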
Alert on excessive image pull failures:
- alert: HighImagePullFailureRate
  # Metric names vary by version; kubelet_runtime_operations_errors_total{operation_type="pull_image"} is a common alternative
  expr: rate(kubelet_image_pull_failures_total[5m]) > 0.1
  for: 5m
For large clusters or heavily loaded registries, consider:
- Deploying a registry mirror (Docker Distribution)
- Using a caching registry (Harbor, Nexus)
- Rate-limiting image pulls on the nodes (kubelet registryPullQPS/registryBurst) to avoid overwhelming the registry
If a pod is stuck after several retry attempts:
# Delete and recreate
kubectl delete pod <pod-name> -n <namespace>
kubectl get pod -n <namespace> -w # Watch for the replacement pod (it is recreated only if managed by a controller)
For Deployments, trigger a rollout restart:
kubectl rollout restart deployment <name> -n <namespace>
kubectl rollout status deployment <name> -n <namespace>For StatefulSets:
kubectl rollout restart statefulset <name> -n <namespace>
Monitor the new attempt:
kubectl describe pod <new-pod> -n <namespace>
kubectl logs <new-pod> -n <namespace> -f
If the issue persists, it is no longer temporary; investigate the root cause (Steps 2-5).
Temporary failures are often symptoms of transient network issues; in Kubernetes, the retry mechanism is the feature, not the bug. However, if a pod retries indefinitely, something is broken. Image pull optimization is critical for large clusters: a single slow registry pull during a rolling update can delay the entire deployment.
Consider a circuit-breaker approach for registries: if pull latency exceeds a threshold, fall back to a cached image or a mirror. For air-gapped clusters, pre-loading images is mandatory. Multi-region clusters may see geographic latency to registries, so deploy regional replicas. Container image layer caching on the nodes (containerd image layers) speeds up subsequent pulls.
For production, use private registries with authentication; public registries may rate-limit (Docker Hub has historically limited anonymous users to roughly 100 pulls per 6 hours). Monitor kubelet image pull metrics and alert on anomalies. Distributed tracing (e.g., Jaeger) of your own pull path (DNS, registry front end, blob download) can help pinpoint where slow pulls spend their time.