PLEG (Pod Lifecycle Event Generator) is a critical kubelet component that monitors container runtime state. When PLEG becomes unhealthy, the node transitions to NotReady and prevents pod scheduling. This is usually caused by container runtime latency, high pod density, or resource exhaustion.
PLEG is a Kubernetes kubelet component that continuously polls the container runtime (Docker, containerd, CRI-O) to detect container state changes and synchronize them with the pod cache. The "PLEG is not healthy" error means the kubelet's relist process—which checks container status—took longer than 3 minutes to complete. When PLEG times out, the kubelet transitions the node to NotReady to prevent pod scheduling. This is a safety mechanism because if the kubelet can't track container state, it can't reliably manage pods. The underlying cause is usually the container runtime being too slow to respond to status queries.
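To confirm PLEG is the cause, check the kubelet journal for its health message. A minimal sketch over a sample log line (the exact wording varies by Kubernetes version; on a real node, pipe `sudo journalctl -u kubelet` instead of the hardcoded sample):

```shell
# Sample kubelet journal line (format varies by version); on a real node:
#   sudo journalctl -u kubelet --no-pager | grep 'PLEG is not healthy'
sample_log='I0101 12:00:00 kubelet.go:1234] skipping pod synchronization - PLEG is not healthy: pleg was last seen active 3m5.2s ago; threshold is 3m0s'

# Pull out how long ago PLEG last completed a relist.
echo "$sample_log" | grep -o 'last seen active [^;]*'
```

If the reported duration keeps growing across repeated checks, the runtime is stuck rather than merely slow.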
Get real-time metrics:
kubectl top nodes
kubectl top pods -n <namespace>
Check if CPU or memory is near 100%. From the node itself:
top # Interactive: shows CPU/memory usage
free -h # Memory usage
df -h # Disk usage
ps aux | grep kubelet # Check kubelet CPU %
ps aux | grep docker # Check Docker CPU %
If CPU is high, the relist can't complete in time.
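To quantify how much CPU the runtime stack itself is burning, the `%CPU` column of `ps aux` can be summed with awk. A sketch over hardcoded sample rows (the process names and numbers are illustrative):

```shell
# Sample `ps aux`-style rows (USER PID %CPU %MEM CMD); on a real node:
#   ps aux | awk '/kubelet|containerd|dockerd/ {cpu += $3} END {print cpu}'
sample_ps='root 812 85.0 2.1 kubelet
root 640 42.5 1.3 containerd
root 9001 0.5 0.1 sshd'

# Sum %CPU across kubelet and the container runtime.
echo "$sample_ps" | awk '/kubelet|containerd/ {cpu += $3} END {printf "%.1f\n", cpu}'
```

A combined figure near a full core or more is a strong hint that relists are CPU-starved.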
Check current pod density:
kubectl get pods --all-namespaces --field-selector=spec.nodeName=<node-name> | wc -l
kubectl describe node <node-name> | grep "Allocated resources"
Also check from the node:
docker ps | wc -l # Number of running containers
docker ps -a | wc -l # Total containers (including exited)
If the node is running more than ~400 containers, that's likely the bottleneck.
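The ~400-container rule of thumb can be wired into a quick check. A sketch with a hardcoded count (on a real node, substitute `$(docker ps -q | wc -l)`); the threshold is a heuristic, not a hard limit:

```shell
# Warn when container density approaches the point where the PLEG
# relist may no longer finish within its 3-minute deadline.
container_count=450   # example value; use $(docker ps -q | wc -l) on a real node
threshold=400         # heuristic, not a hard limit

if [ "$container_count" -gt "$threshold" ]; then
  echo "WARNING: $container_count containers; consider draining pods"
else
  echo "OK: $container_count containers"
fi
```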
View Docker/containerd logs for hanging operations:
# Docker:
sudo journalctl -u docker -n 100
sudo docker info # Check for warnings
# containerd:
sudo journalctl -u containerd -n 100
# CRI-O:
sudo journalctl -u crio -n 100
Look for:
- "inspect" operations timing out
- Hung goroutines or deadlocks
- Resource allocation warnings
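Timeouts in Go-based runtimes usually surface as `context deadline exceeded`. A sketch that counts those occurrences over sample journal lines (the message text is illustrative; the `context deadline exceeded` suffix is the standard Go timeout error, and on a real node you would pipe the actual `journalctl` output):

```shell
# Sample runtime journal lines; on a real node:
#   sudo journalctl -u containerd -n 1000 | grep -c 'context deadline exceeded'
sample_journal='level=error msg="inspect failed: context deadline exceeded"
level=info msg="container started"
level=error msg="CRI request timed out: context deadline exceeded"'

echo "$sample_journal" | grep -c 'context deadline exceeded'
```

A steadily climbing count over successive samples points at a hung runtime rather than transient load.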
Move pods to other nodes to reduce density below ~400 containers:
# Prevent new pods:
kubectl cordon <node-name>
# Gracefully move existing pods:
kubectl drain <node-name> --ignore-daemonsets
# Wait for PLEG to recover (2-5 minutes):
kubectl describe node <node-name>
# Once healthy, allow scheduling again:
kubectl uncordon <node-name>
Monitor recovery:
kubectl get nodes -w
Restarting Docker/containerd clears hung connections and resets state:
# Docker:
sudo systemctl restart docker
# containerd:
sudo systemctl restart containerd
Kubelet will automatically reconnect. Monitor:
kubectl describe node <node-name>
Wait 30-60 seconds for PLEG to recover.
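Recovery can also be watched with a simple poll loop instead of re-running describe by hand. A sketch with a stubbed readiness check; on a real cluster, replace `node_is_ready` with something like `kubectl get node <node-name> | grep -qw Ready` and uncomment the sleep:

```shell
# Poll until the node reports Ready, giving up after max_attempts.
attempts=0
max_attempts=10            # e.g. 10 checks x 6s = a 1-minute budget

node_is_ready() {
  # Stub that succeeds on the 3rd check; swap in a real kubectl test.
  [ "$attempts" -ge 3 ]
}

until node_is_ready; do
  attempts=$((attempts + 1))
  if [ "$attempts" -ge "$max_attempts" ]; then
    echo "timed out waiting for node to become Ready"
    exit 1
  fi
  # sleep 6                # uncomment on a real cluster
done
echo "node Ready after $attempts checks"
```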
Remove containers consuming memory/disk:
# Docker:
sudo docker container prune # Removes all exited containers
sudo docker image prune # Removes unused images
# containerd:
sudo crictl rm <container-id> # Remove specific container
sudo ctr container rm <container-id>
This reduces relist scan overhead by decreasing the number of containers to inspect.
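Before pruning, it can help to see which containers are actually exited. A sketch over sample `docker ps -a` output (on a real node, `docker ps -aq --filter status=exited` gives the same list directly):

```shell
# Sample `docker ps -a` rows (ID IMAGE STATUS...); awk prints the IDs of
# exited containers, i.e. the ones `docker container prune` would remove.
sample='CONTAINER_ID IMAGE STATUS
abc123 nginx Up 2 hours
def456 redis Exited (0) 3 days ago
789aaa app Exited (137) 1 day ago'

echo "$sample" | awk '/Exited/ {print $1}'
```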
PLEG deadlock was fixed in Kubernetes 1.14+. Check current version:
kubectl version --short # on kubectl 1.28+, use plain "kubectl version"
If using 1.13 or earlier, upgrade the cluster to pick up the fix. For cloud providers:
# GKE:
gcloud container clusters upgrade <cluster-name> --master --zone <zone>
# EKS:
aws eks update-cluster-version --name <cluster-name> --kubernetes-version 1.27
# AKS:
az aks upgrade --resource-group <group> --name <cluster-name> --kubernetes-version 1.27
Modern Kubernetes (1.27+) supports event-based container status (Evented PLEG) instead of polling, which dramatically reduces relist overhead:
# Check if supported:
kubectl version --short
If using 1.27+, enable it in the kubelet config:
sudo vi /var/lib/kubelet/config.yaml # common default path; some setups differ
# Add or update:
eventRecordQPS: 10
featureGates:
  EventedPLEG: true
Restart kubelet:
sudo systemctl restart kubelet
This replaces most PLEG polling with runtime events; a low-frequency relist remains as a fallback.
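For reference, the same settings in a complete KubeletConfiguration file; the apiVersion/kind are standard, while the surrounding values are illustrative:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
eventRecordQPS: 10
featureGates:
  # Requires a CRI runtime with evented PLEG support (e.g. containerd 1.7+)
  EventedPLEG: true
```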
If resource exhaustion is persistent, increase node capacity:
Add more nodes:
# Scale auto-scaling group (AWS):
aws autoscaling set-desired-capacity --auto-scaling-group-name <asg> --desired-capacity 5
# Scale GKE node pool:
gcloud container clusters resize <cluster-name> --node-pool <pool> --num-nodes 5
Upgrade node specs:
- Increase CPU/RAM on existing nodes
- Use larger instance types
- Add SSDs for faster disk I/O
Redistribute pods using node affinity or manually drain/reschedule.
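For the node-affinity route, a pod spec fragment like the following steers pods toward less-loaded nodes; the `node-load` label is hypothetical and would be maintained by your own tooling:

```yaml
# Pod spec fragment: prefer nodes labeled node-load=low (hypothetical label).
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      preference:
        matchExpressions:
        - key: node-load
          operator: In
          values: ["low"]
```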
PLEG is the critical health check for kubelet's container state tracking. When unhealthy, the kubelet intentionally marks the node NotReady to prevent cascading failures. This is a safety feature, not a bug. Docker has known inspect() hangs (moby#44300)—upgrading to 20.10.19+ improves stability. For high-density clusters, event-based PLEG (Kubernetes 1.27+) is transformational, reducing polling overhead by 90%+. In resource-constrained environments (edge, IoT), reduce pod density or allocate reserved CPU/memory for kubelet/containerd. For stateful workloads, high pod churn can cause repeated PLEG timeouts during rapid scaling—use pod disruption budgets to slow down evictions. WSL2-based Kubernetes may experience PLEG issues during heavy container churn—ensure adequate memory allocation to WSL VM.