PLEG (Pod Lifecycle Event Generator) is a critical kubelet component that monitors container runtime state. When PLEG becomes unhealthy, the node transitions to NotReady and prevents pod scheduling. This is usually caused by container runtime latency, high pod density, or resource exhaustion.
PLEG is a Kubernetes kubelet component that continuously polls the container runtime (Docker, containerd, CRI-O) to detect container state changes and synchronize them with the pod cache. The "PLEG is not healthy" error means the kubelet's relist process—which checks container status—took longer than 3 minutes to complete. When PLEG times out, the kubelet transitions the node to NotReady to prevent pod scheduling. This is a safety mechanism because if the kubelet can't track container state, it can't reliably manage pods. The underlying cause is usually the container runtime being too slow to respond to status queries.
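To confirm PLEG is the cause, check the kubelet journal for its health message. A minimal sketch over a sample log line (the exact wording varies by Kubernetes version; on a real node, pipe `sudo journalctl -u kubelet` instead of the hardcoded sample):

```shell
# Sample kubelet journal line (format varies by version); on a real node:
#   sudo journalctl -u kubelet --no-pager | grep 'PLEG is not healthy'
sample_log='I0101 12:00:00 kubelet.go:1234] skipping pod synchronization - PLEG is not healthy: pleg was last seen active 3m5.2s ago; threshold is 3m0s'

# Pull out how long ago PLEG last completed a relist.
echo "$sample_log" | grep -o 'last seen active [^;]*'
```

If the reported duration keeps growing across repeated checks, the runtime is stuck rather than merely slow.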
Get real-time metrics:
kubectl top nodes
kubectl top pods -n <namespace>
Check if CPU or memory is near 100%. From the node itself:
top # Interactive: shows CPU/memory usage
free -h # Memory usage
df -h # Disk usage
ps aux | grep kubelet # Check kubelet CPU %
ps aux | grep docker # Check Docker CPU %
If CPU is high, the relist can't complete in time.
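To quantify how much CPU the runtime stack itself is burning, the `%CPU` column of `ps aux` can be summed with awk. A sketch over hardcoded sample rows (the process names and numbers are illustrative):

```shell
# Sample `ps aux`-style rows (USER PID %CPU %MEM CMD); on a real node:
#   ps aux | awk '/kubelet|containerd|dockerd/ {cpu += $3} END {print cpu}'
sample_ps='root 812 85.0 2.1 kubelet
root 640 42.5 1.3 containerd
root 9001 0.5 0.1 sshd'

# Sum %CPU across kubelet and the container runtime.
echo "$sample_ps" | awk '/kubelet|containerd/ {cpu += $3} END {printf "%.1f\n", cpu}'
```

A combined figure near a full core or more is a strong hint that relists are CPU-starved.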
Check current pod density:
kubectl get pods --all-namespaces --field-selector=spec.nodeName=<node-name> | wc -l
kubectl describe node <node-name> | grep "Allocated resources"
Also check from the node:
docker ps | wc -l # Number of running containers
docker ps -a | wc -l # Total containers (including exited)
If the node is running more than ~400 containers, that's likely the bottleneck.
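The ~400-container rule of thumb can be wired into a quick check. A sketch with a hardcoded count (on a real node, substitute `$(docker ps -q | wc -l)`); the threshold is a heuristic, not a hard limit:

```shell
# Warn when container density approaches the point where the PLEG
# relist may no longer finish within its 3-minute deadline.
container_count=450   # example value; use $(docker ps -q | wc -l) on a real node
threshold=400         # heuristic, not a hard limit

if [ "$container_count" -gt "$threshold" ]; then
  echo "WARNING: $container_count containers; consider draining pods"
else
  echo "OK: $container_count containers"
fi
```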
View Docker/containerd logs for hanging operations:
# Docker:
sudo journalctl -u docker -n 100
sudo docker info # Check for warnings
# containerd:
sudo journalctl -u containerd -n 100
# CRI-O:
sudo journalctl -u crio -n 100
Look for:
- "inspect" operations timing out
- Hung goroutines or deadlocks
- Resource allocation warnings
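Timeouts in Go-based runtimes usually surface as `context deadline exceeded`. A sketch that counts those occurrences over sample journal lines (the message text is illustrative; the `context deadline exceeded` suffix is the standard Go timeout error, and on a real node you would pipe the actual `journalctl` output):

```shell
# Sample runtime journal lines; on a real node:
#   sudo journalctl -u containerd -n 1000 | grep -c 'context deadline exceeded'
sample_journal='level=error msg="inspect failed: context deadline exceeded"
level=info msg="container started"
level=error msg="CRI request timed out: context deadline exceeded"'

echo "$sample_journal" | grep -c 'context deadline exceeded'
```

A steadily climbing count over successive samples points at a hung runtime rather than transient load.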
Move pods to other nodes to reduce density below ~400 containers:
# Prevent new pods:
kubectl cordon <node-name>
# Gracefully move existing pods:
kubectl drain <node-name> --ignore-daemonsets
# Wait for PLEG to recover (2-5 minutes):
kubectl describe node <node-name>
# Once healthy, allow scheduling again:
kubectl uncordon <node-name>
Monitor recovery:
kubectl get nodes -w
Restarting Docker/containerd clears hung connections and resets state:
# Docker:
sudo systemctl restart docker
# containerd:
sudo systemctl restart containerd
Kubelet will automatically reconnect. Monitor:
kubectl describe node <node-name>
Wait 30-60 seconds for PLEG to recover.
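Recovery can also be watched with a simple poll loop instead of re-running describe by hand. A sketch with a stubbed readiness check; on a real cluster, replace `node_is_ready` with something like `kubectl get node <node-name> | grep -qw Ready` and uncomment the sleep:

```shell
# Poll until the node reports Ready, giving up after max_attempts.
attempts=0
max_attempts=10            # e.g. 10 checks x 6s = a 1-minute budget

node_is_ready() {
  # Stub that succeeds on the 3rd check; swap in a real kubectl test.
  [ "$attempts" -ge 3 ]
}

until node_is_ready; do
  attempts=$((attempts + 1))
  if [ "$attempts" -ge "$max_attempts" ]; then
    echo "timed out waiting for node to become Ready"
    exit 1
  fi
  # sleep 6                # uncomment on a real cluster
done
echo "node Ready after $attempts checks"
```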
Remove containers consuming memory/disk:
# Docker:
sudo docker container prune # Removes all exited containers
sudo docker image prune # Removes unused images
# containerd:
sudo crictl rm <container-id> # Remove specific container
sudo ctr container rm <container-id>
This reduces relist scan overhead by decreasing the number of containers to inspect.
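Before pruning, it can help to see which containers are actually exited. A sketch over sample `docker ps -a` output (on a real node, `docker ps -aq --filter status=exited` gives the same list directly):

```shell
# Sample `docker ps -a` rows (ID IMAGE STATUS...); awk prints the IDs of
# exited containers, i.e. the ones `docker container prune` would remove.
sample='CONTAINER_ID IMAGE STATUS
abc123 nginx Up 2 hours
def456 redis Exited (0) 3 days ago
789aaa app Exited (137) 1 day ago'

echo "$sample" | awk '/Exited/ {print $1}'
```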
PLEG deadlock was fixed in Kubernetes 1.14+. Check current version:
kubectl version --short # on kubectl 1.28+, use plain "kubectl version"
If using 1.13 or earlier, upgrade the cluster to pick up the fix. For cloud providers:
# GKE:
gcloud container clusters upgrade <cluster-name> --master --zone <zone>
# EKS:
aws eks update-cluster-version --name <cluster-name> --kubernetes-version 1.27
# AKS:
az aks upgrade --resource-group <group> --name <cluster-name> --kubernetes-version 1.27
Modern Kubernetes (1.27+) supports event-based container status (Evented PLEG) instead of polling, which dramatically reduces relist overhead:
# Check if supported:
kubectl version --short
If using 1.27+, enable it in the kubelet config:
sudo vi /var/lib/kubelet/config.yaml # common default path; some setups differ
# Add or update:
eventRecordQPS: 10
featureGates:
  EventedPLEG: true
Restart kubelet:
sudo systemctl restart kubelet
This replaces most PLEG polling with runtime events; a low-frequency relist remains as a fallback.
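For reference, the same settings in a complete KubeletConfiguration file; the apiVersion/kind are standard, while the surrounding values are illustrative:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
eventRecordQPS: 10
featureGates:
  # Requires a CRI runtime with evented PLEG support (e.g. containerd 1.7+)
  EventedPLEG: true
```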
If resource exhaustion is persistent, increase node capacity:
Add more nodes:
# Scale auto-scaling group (AWS):
aws autoscaling set-desired-capacity --auto-scaling-group-name <asg> --desired-capacity 5
# Scale GKE node pool:
gcloud container clusters resize <cluster-name> --node-pool <pool> --num-nodes 5
Upgrade node specs:
- Increase CPU/RAM on existing nodes
- Use larger instance types
- Add SSDs for faster disk I/O
Redistribute pods using node affinity or manually drain/reschedule.
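For the node-affinity route, a pod spec fragment like the following steers pods toward less-loaded nodes; the `node-load` label is hypothetical and would be maintained by your own tooling:

```yaml
# Pod spec fragment: prefer nodes labeled node-load=low (hypothetical label).
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      preference:
        matchExpressions:
        - key: node-load
          operator: In
          values: ["low"]
```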
PLEG is the critical health check for kubelet's container state tracking. When unhealthy, the kubelet intentionally marks the node NotReady to prevent cascading failures. This is a safety feature, not a bug. Docker has known inspect() hangs (moby#44300)—upgrading to 20.10.19+ improves stability. For high-density clusters, event-based PLEG (Kubernetes 1.27+) is transformational, reducing polling overhead by 90%+. In resource-constrained environments (edge, IoT), reduce pod density or allocate reserved CPU/memory for kubelet/containerd. For stateful workloads, high pod churn can cause repeated PLEG timeouts during rapid scaling—use pod disruption budgets to slow down evictions. WSL2-based Kubernetes may experience PLEG issues during heavy container churn—ensure adequate memory allocation to WSL VM.