The etcd "leader changed" error indicates that the etcd cluster detected a leadership change during an operation. While occasional changes are normal, frequent changes signal network issues, slow disk performance, or resource constraints that prevent timely heartbeats.
etcd uses the Raft consensus algorithm, in which a single elected leader coordinates writes. When the leader fails to send heartbeats within the election timeout, followers initiate a new election and choose a different leader; the "leader changed" error indicates this just happened. Occasional changes are normal in HA clusters, but frequent changes point to infrastructure problems preventing stable leadership.
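Leadership stability is tuned through two etcd flags. A sketch of how they might appear in a kubeadm-style static pod manifest — the flag names are real etcd flags, but the values and tuning guidance are general defaults, not cluster-specific advice:

```yaml
# /etc/kubernetes/manifests/etcd.yaml (excerpt; values illustrative)
    - --heartbeat-interval=100   # default 100 ms; set near the round-trip time between members
    - --election-timeout=1000    # default 1000 ms; rule of thumb is 10x the heartbeat interval
```

Raising these blindly masks infrastructure problems; measure latency first, then tune.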
Diagnose network and disk performance:
# Network latency between control nodes:
ping -c 100 <other-control-node-ip> | tail -1
# Disk I/O performance:
fio --name=randread --ioengine=libaio --iodepth=16 --rw=randread \
--bs=4k --direct=1 --size=1G --numjobs=1 \
--runtime=60 --group_reporting --directory=/var/lib/etcd
Look for:
- Network latency > 100ms (indicates potential issue)
- Disk read latency > 10ms
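As a rough sketch of how the latency threshold check could be automated, the snippet below parses ping's summary line; the sample line and the 100 ms threshold are illustrative, and in practice the input would come from `ping -c 100 <other-control-node-ip> | tail -1`:

```shell
# Sample summary line (illustrative); real input comes from ping's last line.
summary="rtt min/avg/max/mdev = 0.210/0.480/1.900/0.120 ms"
# Field 5 (splitting on "/") is the average RTT in milliseconds.
avg=$(echo "$summary" | awk -F'/' '{print $5}')
# Flag averages above the 100 ms threshold.
echo "$avg" | awk '{ if ($1 > 100) print "LATENCY HIGH"; else print "latency ok" }'
# prints "latency ok"
```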
Check for excessive leader elections:
# Prometheus query (if Prometheus installed):
etcd_server_has_leader # Should be 1 (has leader)
etcd_server_leader_changes_seen_total # Should be low
# Or check logs:
kubectl logs -n kube-system <etcd-pod> | grep "elected leader"
kubectl logs -n kube-system <etcd-pod> | grep "leader changed"
Frequent elections (more than one per minute) indicate problems.
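To catch frequent elections continuously rather than by grepping logs, a Prometheus alerting rule along these lines could be used, assuming the etcd metrics above are scraped; the group name, alert name, and thresholds are illustrative:

```yaml
groups:
  - name: etcd-leader-stability        # illustrative group name
    rules:
      - alert: EtcdFrequentLeaderChanges
        # Fires if a member saw more than 3 leader changes in 15 minutes.
        expr: increase(etcd_server_leader_changes_seen_total[15m]) > 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "etcd saw {{ $value }} leader changes in 15m"
```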
Check all etcd members are responding:
ETCDCTL_API=3 etcdctl \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
--endpoints=https://127.0.0.1:2379 \
endpoint health
# Check member list:
ETCDCTL_API=3 etcdctl member list
All members should show healthy. Remove stale members if any show down.
Many objects in etcd slow down lookups:
# Check pod count (rough etcd size indicator):
kubectl get pods --all-namespaces | wc -l
# Remove failed pods (evicted pods are recorded with phase Failed):
kubectl delete pods --all-namespaces --field-selector=status.phase=Failed
Note that status.reason is not a supported pod field selector, so evicted pods are cleaned up by the phase=Failed filter above.
Large numbers of accumulated objects slow etcd responses.
Ensure etcd has enough space:
df -h /var/lib/etcd
ls -lh /var/lib/etcd/member/snap/db
etcd has a default quota of 2 GiB. Monitor size growth:
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
endpoint status
If approaching the quota, increase it or compact the database.
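A quick way to gauge how close the database is to the quota is to compare the DB size reported by endpoint status against the 2 GiB default limit. A minimal sketch; the db_bytes value is illustrative and would in practice come from the endpoint status output:

```shell
# db_bytes is illustrative; in practice take DB SIZE from `endpoint status`.
db_bytes=1650000000
quota_bytes=2147483648   # etcd default quota (2 GiB)
# Integer percentage of quota used.
pct=$(( db_bytes * 100 / quota_bytes ))
echo "etcd DB at ${pct}% of quota"
if [ "$pct" -ge 80 ]; then
  echo "approaching quota: compact/defragment or raise --quota-backend-bytes"
fi
```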
After deleting many objects, defragment:
ETCDCTL_API=3 etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
defrag
Defragmentation may temporarily trigger a leader election.
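Manual compaction can be avoided by letting etcd compact history automatically, and the quota can be raised in the same place. A sketch of the relevant flags in the static pod manifest; the flag names are real etcd flags, the values illustrative:

```yaml
# /etc/kubernetes/manifests/etcd.yaml (excerpt; values illustrative)
    - --auto-compaction-mode=periodic
    - --auto-compaction-retention=1h        # compact history older than 1 hour
    - --quota-backend-bytes=4294967296      # raise quota to 4 GiB if needed
```

Compaction only frees logical space; a defrag is still required to return the space to the filesystem.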
Check number of etcd members:
ETCDCTL_API=3 etcdctl member list
Recommended:
- 3 members (tolerate 1 failure)
- 5 members (tolerate 2 failures)
- Avoid even member counts (4, 6): they raise the quorum size without adding failure tolerance
If using even number, add/remove to make odd.
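The quorum arithmetic behind these recommendations is floor(n/2) + 1; the loop below shows why an even member count adds no fault tolerance over the next smaller odd count:

```shell
# Quorum for an n-member etcd cluster is floor(n/2) + 1.
for n in 3 4 5 6; do
  q=$(( n / 2 + 1 ))
  echo "members=$n quorum=$q tolerates=$(( n - q )) failure(s)"
done
# members=3 quorum=2 tolerates=1 failure(s)
# members=4 quorum=3 tolerates=1 failure(s)  <- no better than 3
# members=5 quorum=3 tolerates=2 failure(s)
# members=6 quorum=4 tolerates=2 failure(s)  <- no better than 5
```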
Dedicate more resources to etcd:
# For static pods, edit manifest:
sudo vi /etc/kubernetes/manifests/etcd.yaml
# Increase CPU/memory requests:
resources:
requests:
cpu: 500m # Increase from 100m
memory: 2Gi # Increase from 1Gi
The kubelet watches the static pod manifest directory and restarts etcd automatically when the file changes; restarting the kubelet also forces a re-read:
sudo systemctl restart kubelet
Some platforms (Azure AKS) take automated etcd snapshots:
# Check Azure AKS backup settings:
az aks show --name <cluster> --resource-group <group>
Snapshots can trigger temporary leader changes. These are expected and not concerning if infrequent.
For multi-control-plane clusters, these operations should not be frequent.
Leader elections are etcd's self-healing mechanism: the cluster is working correctly even while an election runs. What matters is frequency and stability afterward. A single leader change is not a problem; multiple changes per minute indicate infrastructure issues. Network latency is the most common cause in cloud environments, so use dedicated high-performance networking for control plane nodes. For very large clusters (1000+ nodes), consider splitting etcd into a dedicated external cluster. Newer Kubernetes releases bundle newer etcd versions with performance improvements, so upgrading may help. Monitor the etcd_disk_backend_commit_duration_seconds metric to identify slow disk operations.