A sync loop hang occurs when the kubelet's main reconciliation loop becomes blocked or unresponsive. The node stops processing pod changes, leaving pods in inconsistent states and blocking cluster operations.
The kubelet runs a sync loop that:
1. Monitors desired pod state from the API server
2. Reconciles actual pod state on the node
3. Creates/updates/deletes containers to match desired state
When this loop hangs (blocks indefinitely), the kubelet:
- Stops processing new pods
- Stops cleaning up terminated pods
- Cannot respond to API server requests
- Appears frozen but may not crash
Common causes: deadlocks in kubelet code, blocking I/O operations, network timeouts, or resource exhaustion.
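A hung kubelet usually shows up first as a stale node heartbeat, so a quick check from a workstation (a minimal sketch; <node-name> is a placeholder) can confirm the symptom before you SSH in:
kubectl get nodes   # affected node typically shows NotReady or Unknown
kubectl get node <node-name> -o jsonpath='{.status.conditions[?(@.type=="Ready")].lastHeartbeatTime}'   # last heartbeat timestamp
If the heartbeat is several minutes old while the node is still powered on, suspect a stuck kubelet.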
SSH into the affected node and check kubelet status:
ps aux | grep kubelet
sudo systemctl status kubelet
Check kubelet logs for hang indicators:
sudo journalctl -u kubelet --no-pager | tail -100
Look for:
- No log entries for an extended period (> 30 seconds)
- Repeating error messages (stuck in retry loop)
- "sync pod" or "sync loop" keywords
- "context deadline exceeded"
Check if kubelet process is responsive:
sudo kill -0 <kubelet-pid>   # Exit code 0 = process exists
sudo strace -p <kubelet-pid> 2>&1 | head -10   # See what it's doing
If strace shows the process blocked on I/O or network system calls, it is stuck.
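Because the kubelet is a Go program, a goroutine dump often shows exactly where the sync loop is blocked. A sketch, assuming the kubelet's debugging handlers are enabled (the default) and your credentials allow nodes/proxy access:
kubectl get --raw "/api/v1/nodes/<node-name>/proxy/debug/pprof/goroutine?debug=2" > kubelet-goroutines.txt
grep -c '^goroutine ' kubelet-goroutines.txt             # total goroutines
grep -A5 'minutes]:' kubelet-goroutines.txt | head -40   # goroutines blocked for minutes (likely the hang)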
Verify node has sufficient resources:
free -h # Memory
df -h # Disk space
vmstat 1 3   # I/O wait (wa column)
top -b -n1 | head -20   # CPU and process info
Check disk I/O specifically:
sudo iostat -x 1 5   # 5 reports, 1-second intervals
Look for %util near 100% (disk saturation).
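If %util is pegged, it helps to see which process is generating the I/O. A sketch, assuming iotop is installed:
sudo iotop -b -n 1 | head -20   # one batch-mode sample of the top I/O consumers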
Check file descriptor limit:
ulimit -n                                                # shell limit (for reference)
sudo cat /proc/<kubelet-pid>/limits | grep 'open files'  # kubelet's actual limit
sudo lsof -p <kubelet-pid> | wc -l                       # files the kubelet has open
If disk is full or heavily I/O bound:
sudo lsof | sort -k7 -n | tail -20 # Files by size
sudo du -sh /var/lib/kubelet/* | sort -rh | head
Free space if needed:
sudo rm -rf /var/lib/kubelet/pods/*/volumes/*   # temporary pod volumes (caution: deletes volume data for every pod on the node)
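Unused container images are often a safer and larger win than deleting volume data. A sketch, assuming a CRI runtime with crictl v1.23+ installed:
sudo crictl rmi --prune   # remove images not referenced by any container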
The sync loop communicates with the API server. Test connectivity from the node:
curl -k https://<api-server-ip>:6443/api/v1/nodes
An HTTP 401/403 response still proves connectivity. If the request times out:
- Network partition or firewall issue
- API server overloaded
- Local connection exhaustion (e.g., many kubelet sockets stuck in TIME_WAIT)
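To check for connection-state problems, count socket states toward the API server (a sketch; port 6443 assumed):
ss -tan | grep ':6443' | awk '{print $1}' | sort | uniq -c   # e.g. many TIME_WAIT or SYN-SENT entries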
Reset network connections:
sudo systemctl restart networking # On some distros
sudo ip route flush cache
Check the kubelet request timeout setting:
ps aux | grep kubelet | grep -o -- "--request-timeout=[^ ]*"
# Edit kubelet config
sudo nano /etc/sysconfig/kubelet   # /etc/default/kubelet on Debian/Ubuntu
# Add: --request-timeout=120s
sudo systemctl restart kubelet
The sync loop calls the container runtime frequently. Verify the runtime is responsive:
containerd:
sudo ctr -a /run/containerd/containerd.sock version
sudo ctr -a /run/containerd/containerd.sock -n k8s.io tasks list   # kubelet-managed containers live in the k8s.io namespace
Docker:
sudo docker ps
sudo docker stats --no-stream
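The kubelet talks to the runtime over CRI, so a crictl check exercises the same code path the sync loop uses. A sketch, assuming containerd's default socket (adjust for CRI-O or cri-dockerd):
sudo crictl --runtime-endpoint unix:///run/containerd/containerd.sock ps | head        # CRI-level container list
sudo crictl --runtime-endpoint unix:///run/containerd/containerd.sock info | head -20  # runtime status
If these commands hang or take more than a few seconds, the runtime, not the kubelet, is the bottleneck.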
If the runtime is slow or hung:
- Check runtime logs (see kubelet-runtime-error article)
- Restart runtime:
sudo systemctl restart containerd   # or docker, crio
- But wait for the kubelet to sync first; don't restart both simultaneously
Monitor runtime operations:
sudo strace -p <runtime-pid> -e trace=network,open,openat 2>&1 | head -20
If the sync loop hang started after a recent change:
Check current kubelet version:
kubelet --version
kubectl get nodes -o wide   # shows kubelet version per node
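To compare kubelet versions across all nodes at a glance (a minimal sketch):
kubectl get nodes -o custom-columns='NAME:.metadata.name,KUBELET:.status.nodeInfo.kubeletVersion,OS:.status.nodeInfo.osImage'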
Check release notes for known issues:
- https://github.com/kubernetes/kubernetes/releases
- Search for "sync loop" and "hang"
Common problematic versions have patches. Upgrade if available:
# For kubeadm clusters
sudo kubeadm upgrade plan
sudo kubeadm upgrade node
For managed services (EKS, GKE, AKS), check if a node update is available:
# AWS EKS
aws eks describe-nodegroup --cluster-name <cluster> --nodegroup-name <group>
# GKE
gcloud container node-pools update <pool> --cluster <cluster> --enable-autoupgrade
Enable debug logging to see what the kubelet is doing:
sudo nano /etc/sysconfig/kubelet
# Add: --v=4   # increase verbosity (higher = more detail; 4 is usually enough)
sudo systemctl restart kubelet
Then monitor logs in real-time:
sudo journalctl -u kubelet -f --output=short-iso
High verbosity will show:
- Pod sync operations
- API server requests
- Container operations
- Lock acquisitions
Watch for:
- Stuck on one operation for > 30 seconds
- "Timed out waiting for..." messages
- Panic or fatal errors
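To cut through the noise at high verbosity, filter the live stream for the interesting lines (a sketch; the pattern is only a starting point):
sudo journalctl -u kubelet -f | grep -iE 'sync(loop|pod)|deadline exceeded|timed out|panic'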
After debugging, reduce verbosity to avoid log spam:
sudo nano /etc/sysconfig/kubelet
# Change: --v=2 # Or remove
sudo systemctl restart kubelet
If the sync loop is hung and blocking cluster operations:
# Cordon node to prevent new pods
kubectl cordon <node-name>
# Force-restart kubelet (may lose data on tmpfs volumes)
sudo systemctl stop kubelet
sleep 5
sudo systemctl start kubelet
# Monitor recovery
sudo journalctl -u kubelet -f
kubectl get nodes -w
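If you prefer a one-shot check over watching, kubectl wait can block until the node reports Ready again (a sketch):
kubectl wait --for=condition=Ready node/<node-name> --timeout=300s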
If the kubelet won't stop:
sudo killall -9 kubelet
Then restart:
sudo systemctl start kubelet
Pods affected by the restart may show errors. Let the node recover:
kubectl uncordon <node-name>
kubectl get pods -A --field-selector spec.nodeName=<node-name> -w
After recovery, investigate the root cause (Steps 1-5).
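Recent node events can also confirm that the kubelet is posting status again (a sketch):
kubectl get events -A --field-selector involvedObject.kind=Node,involvedObject.name=<node-name> --sort-by=.lastTimestamp | tail -20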
If the node exhibits sync loop hangs consistently:
1. Cordon the node:
kubectl cordon <node-name>
2. Drain workloads:
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
3. Remove the node from the cluster:
kubectl delete node <node-name>
4. Terminate the actual node:
# AWS
aws ec2 terminate-instances --instance-ids <instance-id>
# GCP
gcloud compute instances delete <instance-name>
# Azure
az vm delete --resource-group <group> --name <vm-name>
5. The new node will auto-join (if cluster auto-scaling is enabled):
kubectl get nodes -w
For on-prem, decommission and replace the physical or VM hardware.
Sync loop hangs are rare in stable Kubernetes versions but more common in bleeding-edge or heavily patched clusters. They indicate either a kubelet bug (file an issue on GitHub) or an environmental problem (I/O bottleneck, network issue). For production, deploy Node Problem Detector to automatically detect a stuck kubelet and trigger node replacement. Kubernetes 1.18+ includes improvements that make the sync loop more responsive. Monitor kubelet restarts with Prometheus, for example by alerting when the kubelet's process_start_time_seconds metric changes. For long-running production nodes, periodically restarting the kubelet (for example, every 30 days) can prevent slow accumulation of lock-related issues. A true deadlock in the sync loop is a kubelet bug, but memory pressure can mimic one: long GC pauses can stall the loop, so make sure the kubelet is not using excessive memory. WSL2 may exhibit sync loop issues under heavy I/O; upgrade the WSL2 kernel or consider a native Linux deployment.
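On the node itself, systemd tracks how often the kubelet has been restarted, which is a cheap signal to watch alongside Prometheus. A sketch, assuming systemd 235 or newer:
systemctl show kubelet -p NRestarts   # number of automatic restarts since the unit was loaded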