An unresponsive kubelet means the node no longer communicates with the API server or cannot process requests. The node becomes NotReady, pods cannot be scheduled, and existing pods may be evicted. This is a critical operational issue.
When the kubelet stops responding, it cannot:
1. Send heartbeat messages to the API server
2. Accept new pod scheduling requests
3. Report node metrics (CPU, memory)
4. Process pod changes (create, update, delete)
5. Manage pod lifecycle
The node controller marks the node NotReady after node-monitor-grace-period expires (default 40 seconds), and pods are evicted after pod-eviction-timeout (default 5 minutes). A completely unresponsive kubelet leaves the cluster unable to schedule workloads on that node.
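To see when the kubelet last reported in, check the node's conditions and its lease object (a quick sketch; in a healthy cluster the kubelet renews its lease roughly every 10 seconds):
kubectl describe node <node-name> | grep -A5 Conditions # LastHeartbeatTime per condition
kubectl get lease -n kube-node-lease <node-name> -o yaml | grep renewTime # Last lease renewal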
From control plane or another node:
ping <node-ip>
ssh ubuntu@<node-ip> # Or your user
If ping fails:
- Network partition (check routing, switches, cables)
- Firewall blocking ICMP (check security groups, UFW)
- Node powered off
If SSH fails:
- SSH daemon not running
- Firewall blocking port 22
- Network interface down
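To tell a blocked port apart from a dead host, probe the SSH port directly from another machine (a quick sketch; substitute your node's IP):
nc -vz -w 5 <node-ip> 22 # TCP connect test against sshd, independent of ICMP
ping -c 3 <node-ip> # ICMP reachability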
From control plane, check cluster connectivity:
kubectl get nodes -o wide # See node IP
kubectl describe node <node> | grep Status
If the network is unresponsive, use out-of-band access:
Physical servers:
- IPMI/iLO/iDRAC console
- Serial port access
- Physical reboot button (last resort)
Cloud instances:
# AWS
aws ec2-instance-connect send-ssh-public-key --instance-os-user ubuntu ...
aws ssm start-session --target <instance-id> # Systems Manager Session
# GCP
gcloud compute instances add-metadata <instance-name> --metadata serial-port-enable=TRUE # Enable serial console access
gcloud compute connect-to-serial-port <instance-name>
# Azure
az vm run-command invoke -g <group> -n <vm-name> --command-id RunShellScript --scripts "ps aux | grep kubelet"
Once connected:
ps aux | grep kubelet
free -h # Check OOM
df -h # Check disk full
sudo journalctl -u kubelet -n 50
SSH into the node:
ps aux | grep kubelet
If no kubelet process:
sudo systemctl status kubelet
sudo systemctl start kubelet
sudo journalctl -u kubelet -f
If kubelet is running:
# Test if port 10250 is listening
sudo netstat -tlnp | grep 10250
# Test connectivity to kubelet API
kubectl debug node/<node-name> -it --image=ubuntu
# Inside:
curl -k https://<node-ip>:10250/pods # Even a 401 Unauthorized proves the kubelet API is responding
If kubelet is listening but not responding:
sudo journalctl -u kubelet --no-pager | tail -100
Look for:
- panic or fatal errors
- stuck operations
- lock contention
Check if node is out of resources:
# Memory
free -h
cat /proc/meminfo | grep MemAvailable
# If OOM:
dmesg | grep "Out of memory"
If memory is exhausted:
- Kubelet was killed by OOM killer
- Other processes consuming memory
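To confirm whether the kubelet itself was the OOM victim, and how often systemd has already restarted it (a sketch; unit and log locations assume a standard systemd-managed kubelet):
sudo dmesg | grep -i "killed process" | grep -i kubelet # OOM-killer victims named kubelet
systemctl show kubelet -p NRestarts # How many times systemd has restarted the unit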
# Disk
df -h
du -sh /* | sort -rh # Find space consumers
# If disk full:
# Clear logs
sudo journalctl --vacuum-size=1G
# Clear package manager cache
sudo apt-get clean
sudo yum clean all
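If container images are the main space consumers, you can also prune unused images through the runtime (a sketch assuming a containerd-based node with crictl installed):
sudo crictl rmi --prune # Remove images not referenced by any container
sudo du -sh /var/lib/containerd # Re-check the runtime's storage usage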
# Load
top -b -n1 | head -20
uptime
cat /proc/loadavg
If the load average exceeds the number of cores, the system is heavily contended and the kubelet may be starved of CPU time.
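To see which processes are consuming the CPU, and whether the kubelet is getting any time at all, sort by CPU usage:
ps -eo pid,pcpu,pmem,comm --sort=-pcpu | head -15 # Top CPU consumers
nproc # Core count, to compare against the load average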
Review detailed logs:
sudo journalctl -u kubelet --no-pager | head -100
sudo tail -f /var/log/kubelet.log # If your setup logs kubelet to a file instead of journald
Look for:
- "failed to connect to API server"
- "deadline exceeded"
- "permission denied"
- "resource exhausted"
- panic traces
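To scan for all of those patterns in one pass (a sketch; extend the pattern list as needed):
sudo journalctl -u kubelet --no-pager | grep -iE 'panic|deadline exceeded|permission denied|resource exhausted' | tail -20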
For recent logs:
sudo journalctl -u kubelet --since="5 minutes ago"
Enable verbose logging for troubleshooting:
sudo nano /etc/sysconfig/kubelet # Or /etc/default/kubelet on Debian/Ubuntu
# Add --v=4 to KUBELET_EXTRA_ARGS
sudo systemctl restart kubelet
sudo journalctl -u kubelet -f
After debugging, reduce verbosity:
# Remove --v=4
sudo systemctl restart kubelet
Test connectivity to the API server from the node:
curl -k --cert /var/lib/kubelet/pki/kubelet-client-current.pem \
     --key /var/lib/kubelet/pki/kubelet-client-current.pem \
     https://<api-server>:6443/api/v1/nodes/<node-name> # Cert paths are for kubeadm-managed kubelets
If the request times out or is refused:
- Check firewall rules
- Verify API server is running
- Check network routing
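Concrete checks for each of those causes (a sketch; substitute your API server address):
nc -vz -w 5 <api-server> 6443 # Is the API server port reachable at all?
ip route get <api-server-ip> # Which route/interface the node would use
sudo iptables -L OUTPUT -n | grep 6443 # Local firewall rules touching the port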
From control plane:
kubectl get nodes # List all nodes
kubectl get events -A # Check for API server errors
kubectl logs -n kube-system -l component=kube-apiserver | grep <node-name>
For network debugging:
# From node
sudo tcpdump -i eth0 -n host <api-server-ip> and port 6443
# From control plane
sudo netstat -tlnp | grep 6443
If kubelet is hung or stuck:
sudo systemctl restart kubelet
sudo journalctl -u kubelet -f
Monitor status in another terminal:
kubectl get nodes -w
The node should return to Ready within a minute or so once the underlying issue is resolved.
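You can also block until the node reports Ready, which is handy in recovery scripts:
kubectl wait --for=condition=Ready node/<node-name> --timeout=120s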
If restart fails:
sudo systemctl status kubelet
sudo systemctl stop kubelet
sleep 10
sudo systemctl start kubelet
If that still fails, force-kill the kubelet as a last resort:
sudo killall -9 kubelet
sudo systemctl start kubelet
Wait for the node to rejoin:
kubectl get nodes <node-name> -w
If the node is completely unreachable (network cable unplugged, routing failure):
Option A: Recover the node
Fix the network issue (reseat cable, fix routing, update firewall):
sudo systemctl restart networking # Or systemd-networkd/NetworkManager, depending on the distro
ip link show # Verify interfaces are up
ip route show # Verify default route exists
Once the network is restored:
ping <api-server-ip>
kubectl get nodes # From control plane, monitor recovery
Option B: Replace the node
If recovery is impossible (hardware failure):
kubectl cordon <node-name> # Prevent new pods
kubectl drain <node-name> --ignore-daemonsets # May also need --force --delete-emptydir-data if the node cannot respond
kubectl delete node <node-name> # Remove from cluster
For cloud instances:
# AWS
aws ec2 terminate-instances --instance-ids <id>
# GCP
gcloud compute instances delete <name>
# Azure
az vm delete -g <group> -n <name>
A new node will join automatically if auto-scaling (or a managed node group) is configured.
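To confirm the replacement registered, list nodes by creation time:
kubectl get nodes --sort-by=.metadata.creationTimestamp -o wide # Newest node appears last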
An unresponsive kubelet is a severe failure state, and most recovery requires out-of-band access (serial console, cloud provider session manager). For production, implement Node Problem Detector to automatically detect and remediate kubelet hangs, and consider node auto-replacement policies: if a node is NotReady for more than 10 minutes, automatically replace it. Kubelet deadlocks are usually in third-party plugins (CNI, storage drivers); check plugin logs if standard troubleshooting fails.
For cloud environments, machine images should include health check daemons (systemd services) that monitor the kubelet and restart it automatically if it hangs. Redundancy is key: ensure no single node failure impacts cluster availability. Multi-zone deployments with pod anti-affinity (or topology spread constraints) keep workloads available. For on-prem clusters, out-of-band management (IPMI) is essential for recovery. A WSL2-hosted kubelet may become unresponsive under high WSL2 system load; upgrade WSL2 or use native Linux.
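As a rough illustration of such a health check daemon, here is a minimal watchdog sketch. It assumes the kubelet healthz endpoint on its default localhost port 10248 and a systemd-managed kubelet, and it is meant to run from a systemd timer or cron; treat it as a starting point, not a drop-in solution.
#!/usr/bin/env bash
# kubelet-watchdog.sh: restart the kubelet if its healthz endpoint stops answering.
set -euo pipefail

HEALTHZ_URL="http://127.0.0.1:10248/healthz"  # Default kubelet --healthz-port (assumption)
MAX_FAILURES=3                                # Consecutive failures before restarting

failures=0
for _ in $(seq 1 "$MAX_FAILURES"); do
  if curl -sf --max-time 5 "$HEALTHZ_URL" >/dev/null; then
    exit 0                                    # Kubelet is healthy; nothing to do
  fi
  failures=$((failures + 1))
  sleep 10
done

logger -t kubelet-watchdog "kubelet healthz failed ${failures} consecutive checks; restarting"
systemctl restart kubelet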