Temporary failure errors indicate transient problems during pod startup (network timeouts, brief API server unavailability, container image pull delays). These usually resolve with retry, but repeated failures suggest persistent issues.
Kubernetes reports "temporary failure" for transient errors such as:
1. Network timeouts during image pull
2. Brief API server connectivity loss
3. Container runtime temporary unavailability
4. Scheduler timeouts while finding resources
5. DNS resolution failures for image registries
The kubelet automatically retries, and pods eventually start if the underlying issue resolves. However, if the condition persists, pods remain stuck in Pending or ImagePullBackOff (a quick way to list affected pods is shown below).
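A quick way to surface affected pods across the cluster (the grep pattern is a convenience filter, not an official status selector):
kubectl get pods -A | grep -E 'Pending|ImagePullBackOff|ErrImagePull'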
Examine the pod:
kubectl describe pod <pod-name> -n <namespace>
Look at the Events section for:
- Recent "failed" or "backoff" events
- Container error messages
- Image pull attempts and failures
- Restart count and last state
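You can also pull just this pod's events, sorted by time (substitute the placeholders):
kubectl get events -n <namespace> --field-selector involvedObject.name=<pod-name> --sort-by=.lastTimestamp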
For repeated failures:
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A10 status:
Check container logs:
kubectl logs <pod-name> -n <namespace> # Current logs
kubectl logs <pod-name> -n <namespace> --previous # Previous attempt
From a working pod in the cluster, test registry access:
kubectl run test-pull --image=busybox --rm -it -- sh
Inside the pod (busybox ships ping and nslookup, but not dig or curl):
ping <registry-domain>
nslookup <registry-domain>
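If you need dig and curl, a throwaway pod from a network-tools image is easier; nicolaka/netshoot is one commonly used option (using it here is an assumption, any image with these tools works):
kubectl run test-net --image=nicolaka/netshoot --rm -it -- bash
Inside the pod:
dig <registry-domain>
curl -v https://<registry>/v2/ # 200 or 401 both mean the registry is reachable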
From the problematic node:
kubectl debug node/<node-name> -it --image=ubuntu
# Inside the debug container (the stock ubuntu image ships without these tools):
apt-get update && apt-get install -y iputils-ping telnet curl
ping registry.example.com
telnet registry.example.com 443
time curl https://registry.example.com/v2/
If DNS fails:
kubectl get svc -n kube-system kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns
If requests time out, likely causes include:
- Network latency (check with the time command)
- Registry rate limiting (check registry logs)
- Firewall blocking (check iptables and security groups; see the checks below)
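A minimal sketch of the connectivity and firewall checks from the node (registry.example.com is a placeholder):
nc -vz -w 5 registry.example.com 443 # Verify outbound TCP 443 to the registry is not blocked
sudo iptables -S | grep -iE 'drop|reject' # Look for host firewall rules that drop or reject traffic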
If using a private registry, check the secret:
kubectl get secret -n <namespace> | grep registry
kubectl describe secret <secret-name> -n <namespace>
Verify the pod spec includes the secret:
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A5 imagePullSecrets
Expected output:
imagePullSecrets:
- name: <secret-name>
If it is missing, add it to the pod spec, or patch the namespace's default ServiceAccount so new pods inherit it:
kubectl patch serviceaccount default -n <namespace> -p '{"imagePullSecrets": [{"name": "<secret-name>"}]}'
Test secret validity:
kubectl get secret <secret-name> -n <namespace> -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d | jq .
Verify the credentials work:
docker login -u <user> -p <pass> <registry>
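If the credentials turned out to be wrong or rotated, recreating the pull secret is usually simpler than editing it (all names below are placeholders):
kubectl create secret docker-registry <secret-name> \
  --docker-server=<registry> \
  --docker-username=<user> \
  --docker-password=<pass> \
  -n <namespace> --dry-run=client -o yaml | kubectl apply -f -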
Verify network stability on the node:
sudo mtr -r -c 100 <registry-domain> # Multi-hop traceroute
ping -c 100 <registry-domain> # Check packet loss
Check DNS resolution:
nslookup <registry-domain>
dig <registry-domain> +short
If DNS is slow or failing:
kubectl get service -n kube-system kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns | tail -50
Test from a pod:
kubectl run test-dns --restart=Never --image=busybox -- nslookup kubernetes.default
kubectl logs test-dns # Check the lookup result, then clean up with: kubectl delete pod test-dns
If DNS inside the cluster looks healthy but pulls still time out, the bottleneck is usually registry latency rather than DNS; tune the kubelet instead.
Adjust kubelet settings for slow registries:
sudo nano /etc/sysconfig/kubelet
Add or update (on newer kubelets these are set in a KubeletConfiguration file instead; see the sketch after the flag list):
--image-pull-progress-deadline=3m # Default 1 minute; applies only to the legacy Docker runtime
--max-pods=110 # Reduce if too many parallel pulls compete for bandwidth
--serialize-image-pulls=true # Pull images sequentially (slower but more reliable)
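On recent Kubernetes versions most kubelet flags are deprecated in favor of the KubeletConfiguration file; a minimal sketch, assuming the file lives at /var/lib/kubelet/config.yaml (the path varies by distribution):
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
serializeImagePulls: true # pull one image at a time
registryPullQPS: 5 # limit image pull requests per second to the registry
registryBurst: 10 # short burst allowance above the QPS limit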
For Deployments, avoid unnecessary re-pulls of cached images:
spec:
  template:
    spec:
      restartPolicy: Always
      containers:
      - image: myregistry.com/myimage:v1
        imagePullPolicy: IfNotPresent # Don't re-pull if already cached
Restart kubelet:
sudo systemctl restart kubelet
sudo journalctl -u kubelet -f
For frequently-used images, pre-cache them on nodes:
# SSH into node
sudo docker pull myregistry.com/myimage:v1
sudo ctr -n k8s.io images pull myregistry.com/myimage:v1 # containerd nodes
For DaemonSets or common images, use a node init script:
#!/bin/bash
# /var/lib/cloud/cloud-init.sh (example path; adjust for your node bootstrap mechanism)
docker pull nginx:latest # assumes the node runtime is Docker; use ctr/crictl on containerd nodes
docker pull postgres:13
For large deployments, build a base machine image with the images pre-cached (Packer, golden AMI).
Or use a cache warmer pod:
apiVersion: apps/v1 # DaemonSet is in the apps/v1 API group, not batch/v1
kind: DaemonSet
metadata:
  name: image-cache-warmer
spec:
  selector:
    matchLabels:
      app: image-cache-warmer
  template:
    metadata:
      labels:
        app: image-cache-warmer
    spec:
      hostNetwork: true
      containers:
      - name: warmer
        image: docker:cli # the docker CLI is required; busybox does not ship it (assumes Docker nodes)
        command:
        - /bin/sh
        - -c
        # keep the pod running afterwards so the DaemonSet does not restart it in a loop
        - for img in nginx postgres redis; do docker pull "$img:latest"; done; while true; do sleep 3600; done
        volumeMounts:
        - name: docker
          mountPath: /var/run/docker.sock
      volumes:
      - name: docker
        hostPath:
          path: /var/run/docker.sock
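On containerd nodes there is no Docker socket to mount. A runtime-agnostic alternative (a sketch; the image list is illustrative) is to let the kubelet itself pull and cache the images by running them as no-op init containers:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: image-prepuller
spec:
  selector:
    matchLabels:
      app: image-prepuller
  template:
    metadata:
      labels:
        app: image-prepuller
    spec:
      initContainers:
      # Each init container exits immediately; the side effect is that the
      # kubelet pulls and caches the image on every node.
      - name: pull-nginx
        image: nginx:latest
        command: ["true"]
      - name: pull-postgres
        image: postgres:13
        command: ["true"]
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9 # tiny placeholder that keeps the pod Running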
Set up proactive monitoring:
# Check the registry health endpoint
curl -v https://<registry>/v2/ # 200 OK, or 401 if auth is required; both mean it is reachable
# Monitor node network metrics
watch -n 1 "ethtool -S eth0 | grep -i error"
For Prometheus:
# Metric names vary by Kubernetes version; on older kubelets use
# kubelet_runtime_operations_duration_seconds{operation_type="pull_image"} instead
rate(kubelet_image_pull_duration_seconds_bucket[5m])
kubelet_image_pull_duration_seconds_count
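A p95 pull-latency query makes slow registries visible at a glance (the same caveat about version-specific metric names applies):
histogram_quantile(0.95,
  sum(rate(kubelet_runtime_operations_duration_seconds_bucket{operation_type="pull_image"}[5m])) by (le))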
Alert on excessive image pull failures:
- alert: HighImagePullFailureRate
  # Metric names vary by version; kubelet_runtime_operations_errors_total{operation_type="pull_image"} is a common alternative
  expr: rate(kubelet_image_pull_failures_total[5m]) > 0.1
  for: 5m
For large clusters or heavily loaded registries, consider:
- Deploying a registry mirror (Docker Distribution)
- Using a caching registry (Harbor, Nexus)
- Rate-limiting image pulls on the nodes (kubelet registryPullQPS/registryBurst) to avoid overwhelming the registry
If a pod is stuck after several retry attempts:
# Delete and recreate
kubectl delete pod <pod-name> -n <namespace>
kubectl get pod -n <namespace> -w # Watch for the replacement pod (it is recreated only if managed by a controller)
For Deployments, trigger a rollout restart:
kubectl rollout restart deployment <name> -n <namespace>
kubectl rollout status deployment <name> -n <namespace>For StatefulSets:
kubectl rollout restart statefulset <name> -n <namespace>
Monitor the new attempt:
kubectl describe pod <new-pod> -n <namespace>
kubectl logs <new-pod> -n <namespace> -f
If the issue persists, it is no longer temporary; investigate the root cause (Steps 2-5).
Temporary failures are often symptoms of transient network issues; in Kubernetes, the retry mechanism is the feature, not the bug. However, if a pod retries indefinitely, something is broken. Image pull optimization is critical for large clusters: a single slow registry pull during a rolling update can delay the entire deployment.
Consider a circuit-breaker approach for registries: if pull latency exceeds a threshold, fall back to a cached image or a mirror. For air-gapped clusters, pre-loading images is mandatory. Multi-region clusters may see geographic latency to registries, so deploy regional replicas. Container image layer caching on the nodes (containerd image layers) speeds up subsequent pulls.
For production, use private registries with authentication; public registries may rate-limit (Docker Hub has historically limited anonymous users to roughly 100 pulls per 6 hours). Monitor kubelet image pull metrics and alert on anomalies. Distributed tracing (e.g., Jaeger) of your own pull path (DNS, registry front end, blob download) can help pinpoint where slow pulls spend their time.