An unresponsive kubelet means the node no longer communicates with the API server or cannot process requests. The node becomes NotReady, pods cannot be scheduled, and existing pods may be evicted. This is a critical operational issue.
When the kubelet stops responding, it cannot:
1. Send heartbeat messages to the API server
2. Accept new pod scheduling requests
3. Report node metrics (CPU, memory)
4. Process pod changes (create, update, delete)
5. Manage pod lifecycle
The node controller marks the node NotReady after node-monitor-grace-period expires (default 40 seconds), and pods are evicted after pod-eviction-timeout (default 5 minutes). A completely unresponsive kubelet leaves the cluster unable to schedule workloads on that node.
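To see when the kubelet last reported in, check the node's conditions and its lease object (a quick sketch; in a healthy cluster the kubelet renews its lease roughly every 10 seconds):
kubectl describe node <node-name> | grep -A5 Conditions # LastHeartbeatTime per condition
kubectl get lease -n kube-node-lease <node-name> -o yaml | grep renewTime # Last lease renewal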
From control plane or another node:
ping <node-ip>
ssh ubuntu@<node-ip> # Or your user
If ping fails:
- Network partition (check routing, switches, cables)
- Firewall blocking ICMP (check security groups, UFW)
- Node powered off
If SSH fails:
- SSH daemon not running
- Firewall blocking port 22
- Network interface down
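To tell a blocked port apart from a dead host, probe the SSH port directly from another machine (a quick sketch; substitute your node's IP):
nc -vz -w 5 <node-ip> 22 # TCP connect test against sshd, independent of ICMP
ping -c 3 <node-ip> # ICMP reachability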
From control plane, check cluster connectivity:
kubectl get nodes -o wide # See node IP
kubectl describe node <node> | grep Status
If the network is unresponsive, use out-of-band access:
Physical servers:
- IPMI/iLO/iDRAC console
- Serial port access
- Physical reboot button (last resort)
Cloud instances:
# AWS
aws ec2-instance-connect send-ssh-public-key --instance-os-user ubuntu ...
aws ssm start-session --target <instance-id> # Systems Manager Session
# GCP
gcloud compute instances add-metadata <instance-name> --metadata serial-port-enable=TRUE # Enable serial console access
gcloud compute connect-to-serial-port <instance-name>
# Azure
az vm run-command invoke -g <group> -n <vm-name> --command-id RunShellScript --scripts "ps aux | grep kubelet"
Once connected:
ps aux | grep kubelet
free -h # Check OOM
df -h # Check disk full
sudo journalctl -u kubelet -n 50
SSH into the node:
ps aux | grep kubelet
If no kubelet process:
sudo systemctl status kubelet
sudo systemctl start kubelet
sudo journalctl -u kubelet -f
If kubelet is running:
# Test if port 10250 is listening
sudo netstat -tlnp | grep 10250
# Test connectivity to kubelet API
kubectl debug node/<node-name> -it --image=ubuntu
# Inside:
curl -k https://<node-ip>:10250/pods # Even a 401 Unauthorized proves the kubelet API is responding
If kubelet is listening but not responding:
sudo journalctl -u kubelet --no-pager | tail -100
Look for:
- panic or fatal errors
- stuck operations
- lock contention
Check if node is out of resources:
# Memory
free -h
cat /proc/meminfo | grep MemAvailable
# If OOM:
dmesg | grep "Out of memory"
If memory is exhausted:
- Kubelet was killed by OOM killer
- Other processes consuming memory
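To confirm whether the kubelet itself was the OOM victim, and how often systemd has already restarted it (a sketch; unit and log locations assume a standard systemd-managed kubelet):
sudo dmesg | grep -i "killed process" | grep -i kubelet # OOM-killer victims named kubelet
systemctl show kubelet -p NRestarts # How many times systemd has restarted the unit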
# Disk
df -h
du -sh /* | sort -rh # Find space consumers
# If disk full:
# Clear logs
sudo journalctl --vacuum-size=1G
# Clear package manager cache
sudo apt-get clean
sudo yum clean all
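If container images are the main space consumers, you can also prune unused images through the runtime (a sketch assuming a containerd-based node with crictl installed):
sudo crictl rmi --prune # Remove images not referenced by any container
sudo du -sh /var/lib/containerd # Re-check the runtime's storage usage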
# Load
top -b -n1 | head -20
uptime
cat /proc/loadavg
If the load average exceeds the number of cores, the system is heavily contended and the kubelet may be starved of CPU time.
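To see which processes are consuming the CPU, and whether the kubelet is getting any time at all, sort by CPU usage:
ps -eo pid,pcpu,pmem,comm --sort=-pcpu | head -15 # Top CPU consumers
nproc # Core count, to compare against the load average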
Review detailed logs:
sudo journalctl -u kubelet --no-pager | head -100
sudo tail -f /var/log/kubelet.log # If your setup logs kubelet to a file instead of journald
Look for:
- "failed to connect to API server"
- "deadline exceeded"
- "permission denied"
- "resource exhausted"
- panic traces
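To scan for all of those patterns in one pass (a sketch; extend the pattern list as needed):
sudo journalctl -u kubelet --no-pager | grep -iE 'panic|deadline exceeded|permission denied|resource exhausted' | tail -20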
For recent logs:
sudo journalctl -u kubelet --since="5 minutes ago"
Enable verbose logging for troubleshooting:
sudo nano /etc/sysconfig/kubelet # Or /etc/default/kubelet on Debian/Ubuntu
# Add --v=4 to KUBELET_EXTRA_ARGS
sudo systemctl restart kubelet
sudo journalctl -u kubelet -f
After debugging, reduce verbosity:
# Remove --v=4
sudo systemctl restart kubelet
Test connectivity to the API server from the node:
curl -k --cert /var/lib/kubelet/pki/kubelet-client-current.pem \
     --key /var/lib/kubelet/pki/kubelet-client-current.pem \
     https://<api-server>:6443/api/v1/nodes/<node-name> # Cert paths are for kubeadm-managed kubelets
If the request times out or is refused:
- Check firewall rules
- Verify API server is running
- Check network routing
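Concrete checks for each of those causes (a sketch; substitute your API server address):
nc -vz -w 5 <api-server> 6443 # Is the API server port reachable at all?
ip route get <api-server-ip> # Which route/interface the node would use
sudo iptables -L OUTPUT -n | grep 6443 # Local firewall rules touching the port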
From control plane:
kubectl get nodes # List all nodes
kubectl get events -A # Check for API server errors
kubectl logs -n kube-system -l component=kube-apiserver | grep <node-name>
For network debugging:
# From node
sudo tcpdump -i eth0 -n host <api-server-ip> and port 6443
# From control plane
sudo netstat -tlnp | grep 6443
If kubelet is hung or stuck:
sudo systemctl restart kubelet
sudo journalctl -u kubelet -f
Monitor status in another terminal:
kubectl get nodes -w
The node should return to Ready within a minute or so once the underlying issue is resolved.
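You can also block until the node reports Ready, which is handy in recovery scripts:
kubectl wait --for=condition=Ready node/<node-name> --timeout=120s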
If restart fails:
sudo systemctl status kubelet
sudo systemctl stop kubelet
sleep 10
sudo systemctl start kubelet
If that still fails, force-kill the kubelet as a last resort:
sudo killall -9 kubelet
sudo systemctl start kubelet
Wait for the node to rejoin:
kubectl get nodes <node-name> -w
If the node is completely unreachable (network cable unplugged, routing failure):
Option A: Recover the node
Fix the network issue (reseat cable, fix routing, update firewall):
sudo systemctl restart networking # Or systemd-networkd/NetworkManager, depending on the distro
ip link show # Verify interfaces are up
ip route show # Verify default route exists
Once the network is restored:
ping <api-server-ip>
kubectl get nodes # From control plane, monitor recovery
Option B: Replace the node
If recovery is impossible (hardware failure):
kubectl cordon <node-name> # Prevent new pods
kubectl drain <node-name> --ignore-daemonsets # May also need --force --delete-emptydir-data if the node cannot respond
kubectl delete node <node-name> # Remove from cluster
For cloud instances:
# AWS
aws ec2 terminate-instances --instance-ids <id>
# GCP
gcloud compute instances delete <name>
# Azure
az vm delete -g <group> -n <name>
A new node will join automatically if auto-scaling (or a managed node group) is configured.
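To confirm the replacement registered, list nodes by creation time:
kubectl get nodes --sort-by=.metadata.creationTimestamp -o wide # Newest node appears last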
An unresponsive kubelet is a severe failure state, and most recovery requires out-of-band access (serial console, cloud provider session manager). For production, implement Node Problem Detector to automatically detect and remediate kubelet hangs, and consider node auto-replacement policies: if a node is NotReady for more than 10 minutes, automatically replace it. Kubelet deadlocks are usually in third-party plugins (CNI, storage drivers); check plugin logs if standard troubleshooting fails.
For cloud environments, machine images should include health check daemons (systemd services) that monitor the kubelet and restart it automatically if it hangs. Redundancy is key: ensure no single node failure impacts cluster availability. Multi-zone deployments with pod anti-affinity (or topology spread constraints) keep workloads available. For on-prem clusters, out-of-band management (IPMI) is essential for recovery. A WSL2-hosted kubelet may become unresponsive under high WSL2 system load; upgrade WSL2 or use native Linux.
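As a rough illustration of such a health check daemon, here is a minimal watchdog sketch. It assumes the kubelet healthz endpoint on its default localhost port 10248 and a systemd-managed kubelet, and it is meant to run from a systemd timer or cron; treat it as a starting point, not a drop-in solution.
#!/usr/bin/env bash
# kubelet-watchdog.sh: restart the kubelet if its healthz endpoint stops answering.
set -euo pipefail

HEALTHZ_URL="http://127.0.0.1:10248/healthz"  # Default kubelet --healthz-port (assumption)
MAX_FAILURES=3                                # Consecutive failures before restarting

failures=0
for _ in $(seq 1 "$MAX_FAILURES"); do
  if curl -sf --max-time 5 "$HEALTHZ_URL" >/dev/null; then
    exit 0                                    # Kubelet is healthy; nothing to do
  fi
  failures=$((failures + 1))
  sleep 10
done

logger -t kubelet-watchdog "kubelet healthz failed ${failures} consecutive checks; restarting"
systemctl restart kubelet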