The "etcdserver: no leader" error indicates that the etcd cluster has lost quorum and cannot elect a leader. This is critical because the cluster cannot persist any changes: no new pods can be scheduled and configurations cannot be updated.
Etcd requires a quorum (a majority of its members) to be healthy and reachable in order to elect a leader. When more than half of the members are unreachable or unhealthy, no leader can be elected. Without a leader, etcd cannot process write operations (and rejects linearizable reads), so all cluster changes are blocked.
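To make the majority rule concrete, this small shell snippet prints the quorum size and fault tolerance for common cluster sizes:

# quorum = floor(N/2) + 1; fault tolerance = N - quorum
for n in 1 2 3 4 5; do
quorum=$(( n / 2 + 1 ))
echo "members=$n quorum=$quorum failures-tolerated=$(( n - quorum ))"
done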
Check which control plane nodes are running:
kubectl get nodes -l node-role.kubernetes.io/control-plane

For self-managed clusters, verify the nodes physically or via your cloud provider:
# AWS (adjust the tag filter to match how your control plane instances are tagged):
aws ec2 describe-instances --filters "Name=tag:Name,Values=*control-plane*"
# Azure:
az vm list --output table

Count how many control plane machines are healthy versus down; a majority must be healthy to maintain quorum.
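If the nodes are still registered with the API server, you can also count how many control plane nodes report Ready (this assumes the standard node-role.kubernetes.io/control-plane label):

kubectl get nodes -l node-role.kubernetes.io/control-plane --no-headers | awk '$2 == "Ready"' | wc -l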
From a healthy node, check etcd status:
ETCDCTL_API=3 etcdctl \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
--endpoints=https://127.0.0.1:2379 \
member list

Note which members are reachable. A 3-member cluster needs at least 2 healthy members; a 5-member cluster needs at least 3.
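To see which members actually respond, and whether any member currently claims leadership, you can also run endpoint status against all members (same certificate flags as above; --cluster expands the endpoint list from the membership data):

ETCDCTL_API=3 etcdctl \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
--endpoints=https://127.0.0.1:2379 \
endpoint status --cluster -w table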
If a majority of members are unreachable, restart those nodes:
# For cloud VMs:
aws ec2 reboot-instances --instance-ids <instance-id>
az vm restart --ids <vm-id> --resource-group <group>
# Or for on-prem:
power cycle the server

Wait 2-3 minutes for the nodes to rejoin the cluster. Monitor:
watch "ETCDCTL_API=3 etcdctl --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key --endpoints=https://127.0.0.1:2379 member list"

If failed nodes are permanently gone (and the surviving members still hold quorum, since membership changes require it), remove the dead members:
ETCDCTL_API=3 etcdctl \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
--endpoints=https://127.0.0.1:2379 \
member remove <member-id>

Removing a dead member shrinks the cluster, so quorum is computed over the remaining members. Be aware that shrinking from 3 members to 2 still requires 2 healthy members, so fault tolerance drops to zero; add a replacement member as soon as possible, for example as sketched below.
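If you have a replacement control plane node ready, register it as a new member before starting etcd on it (the member name and peer URL below are placeholders):

ETCDCTL_API=3 etcdctl \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
--endpoints=https://127.0.0.1:2379 \
member add <new-member-name> --peer-urls=https://<new-node-ip>:2380

Start the new member with --initial-cluster-state=existing so it joins the current cluster instead of bootstrapping a new one.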
Network problems may be isolating members:
# Test from each control node:
for node in <node1> <node2> <node3>; do
echo "Testing $node:"
ping -c 1 $node
telnet $node 2379 # etcd client port
telnet $node 2380 # etcd peer port
done

If the port checks fail, a firewall may be blocking etcd traffic. Open the ports:
sudo iptables -A INPUT -p tcp --dport 2379 -j ACCEPT
sudo iptables -A INPUT -p tcp --dport 2380 -j ACCEPT

Restart etcd so the member rejoins the cluster:
sudo systemctl restart etcd
# or for containerized:
sudo docker restart <etcd-container>

Monitor recovery:
watch "ETCDCTL_API=3 etcdctl --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key --endpoints=https://127.0.0.1:2379 endpoint health"

Leader election typically completes within 10-30 seconds once a quorum of members is healthy.
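Once a leader is elected, you can also confirm it from the API server's side; its verbose readiness report includes an etcd check (this assumes your kubeconfig can reach the API server):

kubectl get --raw='/readyz?verbose' | grep etcd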
If a majority of control plane nodes is permanently lost, rebuild the control plane:
# On one healthy node:
kubeadm init phase certs all
kubeadm init phase kubeconfig all
kubeadm init phase control-plane all
# Then join other nodes:
kubeadm join <control-plane-endpoint> \
--token <token> \
--discovery-token-ca-cert-hash sha256:<hash> \
--control-plane

After recovery, verify that etcd is healthy and performing normally:
ETCDCTL_API=3 etcdctl \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
--endpoints=https://127.0.0.1:2379 \
check perf

Also verify that the API server can communicate with etcd:
kubectl get nodes
kubectl get pods --all-namespaces

Implement safeguards to prevent a recurrence:
# Monitoring (Prometheus + alerting):
alert if etcd_server_has_leader == 0
alert if any etcd member's scrape target reports up == 0
# Recommended cluster size: odd number of members
# 3-member: tolerates 1 failure
# 5-member: tolerates 2 failures
# Avoid even-sized clusters (they add no extra fault tolerance)
# Backup etcd regularly:
ETCDCTL_API=3 etcdctl \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
--endpoints=https://127.0.0.1:2379 \
snapshot save backup.db

Quorum loss is one of the most severe Kubernetes incidents. Running three or more control plane nodes is essential in production. Stacked etcd (etcd on the same nodes as the control plane) couples their fates: if too many control plane nodes fail, etcd fails with them. An external etcd topology is more resilient. Always run an odd number of members (3, 5, 7), and automate monitoring for a lost leader. For disaster recovery, restore from an etcd backup using etcdctl snapshot restore. Prevent quorum loss with redundant network paths, dedicated high-performance networking for the control plane, and continuous member health monitoring.
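As a minimal sketch of that restore path (the snapshot path, member name, and IP below are placeholders; adjust them to your environment and point etcd, or its static pod manifest, at the restored data directory before starting it):

ETCDCTL_API=3 etcdctl snapshot restore /var/backups/etcd/backup.db \
--name <member-name> \
--data-dir /var/lib/etcd-from-backup \
--initial-cluster <member-name>=https://<node-ip>:2380 \
--initial-advertise-peer-urls https://<node-ip>:2380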