Leader election failures prevent controllers from achieving high availability. When multiple replicas of a controller (operator, webhook, scheduler) cannot elect a leader, none of them processes events, and cluster operations are disrupted.
Many Kubernetes controllers use leader election to ensure only one replica is active at a time:
1. Multiple replicas attempt to acquire a lease.
2. The winner becomes the leader and processes events.
3. If the leader fails, another replica takes over.
When leader election fails:
- No lease is acquired
- No replica becomes leader
- Events are not processed
- Cluster operations stall
Common causes: missing RBAC permissions, etcd unavailability, a network partition, or a misconfigured lease mechanism. Most controllers enable and tune leader election through command-line flags, as shown below.
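As a reference point, these are the flag names and defaults used by kube-controller-manager and kube-scheduler; third-party controllers use similar but not identical flags, so check your controller's --help output:
--leader-elect=true                  # enable leader election
--leader-elect-lease-duration=15s    # how long a lease is valid before it can be taken over
--leader-elect-renew-deadline=10s    # how long the leader keeps retrying renewal before giving up
--leader-elect-retry-period=2s       # how often candidates retry acquisition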
Check if the controller has leader election enabled:
kubectl get pods -n <namespace> -l app=<controller>
kubectl logs <controller-pod-name> -n <namespace> | grep -i "leader\|lease"
For external-dns, cert-manager, etc.:
kubectl get lease -A
Should show lease resources like:
NAMESPACE   NAME           HOLDER                   AGE
default     external-dns   external-dns-7f4c6f8d8   5m
default     cert-manager   cert-manager-5d6c9f9d9   3m
If no leases appear, leader election hasn't succeeded.
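The built-in control-plane components keep their own leases in kube-system, which makes a quick sanity check that the lease mechanism itself works:
kubectl get lease -n kube-system
# Expect held leases such as kube-controller-manager and kube-scheduler; if even these
# are missing or unheld, the problem is cluster-wide rather than specific to your controller.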
Verify the controller's ServiceAccount has lease permissions:
kubectl get rolebinding -A | grep <controller>
kubectl get clusterrolebinding | grep <controller>
Check the role definition:
kubectl get role <role-name> -n <namespace> -o yaml
Should include:
rules:
- apiGroups:
  - coordination.k8s.io
  resources:
  - leases
  verbs:
  - get
  - create
  - update
  - patch # For renewals
If missing, add permissions:
kubectl create role <controller>-leader --verb=get,create,update,patch --resource=leases -n <namespace>
kubectl create rolebinding <controller>-leader --role=<controller>-leader --serviceaccount=<namespace>:<controller> -n <namespace>
For cluster-wide leadership:
kubectl create clusterrole <controller>-leader --verb=get,create,update,patch --resource=leases
kubectl create clusterrolebinding <controller>-leader --clusterrole=<controller>-leader --serviceaccount=<namespace>:<controller>
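The same permissions can be granted declaratively instead of with kubectl create; a minimal sketch, with <controller> and <namespace> as placeholders for your controller's ServiceAccount and namespace:
kubectl apply -f - <<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: <controller>-leader
  namespace: <namespace>
rules:
- apiGroups: ["coordination.k8s.io"]
  resources: ["leases"]
  verbs: ["get", "create", "update", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: <controller>-leader
  namespace: <namespace>
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: <controller>-leader
subjects:
- kind: ServiceAccount
  name: <controller>
  namespace: <namespace>
EOF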
Leader election requires a working etcd backend:
# Check API server
kubectl cluster-info
kubectl get componentstatus # Deprecated but useful
# Check etcd
kubectl get pods -n kube-system -l component=etcd
kubectl logs -n kube-system etcd-<node> --tail=50
# Test API server connectivity
curl -k https://<api-server>:6443/api/v1
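If the raw curl fails, the API server's aggregated readiness endpoint and etcd's own health check help narrow down which layer is at fault (the certificate paths below are the kubeadm defaults):
# API server readiness, including its etcd check
kubectl get --raw='/readyz?verbose'
# etcd health, run from a control-plane node
sudo ETCDCTL_API=3 etcdctl endpoint health \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key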
If the API server is down, restart it:
# For kubeadm clusters
sudo systemctl restart kubelet
# For managed services
# Contact your provider
If etcd is corrupted:
# Backup and restore
sudo systemctl stop kubelet
# Certificate paths below are the kubeadm defaults; adjust for your cluster
sudo ETCDCTL_API=3 etcdctl snapshot save backup.db --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key
# Restore requires downtime
sudo systemctl start kubelet
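Before relying on the snapshot for a restore, verify that the file is readable (etcdctl prints its hash, revision, key count, and size):
sudo ETCDCTL_API=3 etcdctl snapshot status backup.db -w table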
Inspect the lease that the controller is trying to acquire:
kubectl get lease -A
kubectl describe lease <lease-name> -n <namespace>
Should show the current holder:
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: my-controller
  namespace: default
spec:
  holderIdentity: my-controller-pod-abc123
  leaseDurationSeconds: 60
  acquireTime: "2024-01-01T12:00:00Z"
  renewTime: "2024-01-01T12:01:00Z"
If holderIdentity is empty or stale:
# Delete the stuck lease to force re-election
kubectl delete lease <lease-name> -n <namespace>
Controller pods will immediately re-attempt election:
kubectl logs <controller-pod> -n <namespace> -f | grep -i leader
Watch for log entries like "became leader".
Get detailed error logs:
kubectl logs <controller-pod> -n <namespace> --tail=100
Look for:
- "failed to acquire lease"
- "permission denied"
- "connection refused"
- "deadline exceeded"
- "not found"
Enable debug logging:
kubectl set env deployment/<controller> -c <container> -n <namespace> \
LEADER_ELECTION_NAMESPACE=<namespace> \
V=4 # Verbose logging
kubectl rollout status deployment/<controller> -n <namespace>
kubectl logs deployment/<controller> -n <namespace> -f
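The V environment variable above is only honoured by some controllers; many take a klog-style verbosity flag instead. A hedged alternative is to append the flag to the container args (assumes the first container already defines an args list; the flag name varies per controller):
kubectl patch deployment <controller> -n <namespace> --type=json \
  -p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--v=4"}]'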
- "permission denied" → RBAC issue
- "connection refused" → API server unreachable
- "deadline exceeded" → etcd slow or unavailable
Check if multiple replicas exist:
kubectl get pods -l app=<controller> -n <namespace>
Should show multiple replicas (usually 2-3):
NAME                   READY   STATUS    RESTARTS
my-controller-abc123   1/1     Running   0
my-controller-def456   1/1     Running   0
If only one replica exists, scale up:
kubectl scale deployment <controller> --replicas=2 -n <namespace>
Test inter-pod network connectivity:
kubectl exec <controller-pod-1> -n <namespace> -- \
  curl -vk https://<api-server>:6443/apis/coordination.k8s.io/v1/leases
kubectl exec <controller-pod-2> -n <namespace> -- \
  ping -c 3 <controller-pod-1-ip>
If the network is broken, check:
- CNI plugin status
- Network policies blocking communication
- Service DNS resolution (a quick check is shown below)
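A quick test of Service DNS resolution from inside one of the controller pods (assumes nslookup is available in the container image):
kubectl exec <controller-pod-1> -n <namespace> -- \
  nslookup kubernetes.default.svc.cluster.local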
If multiple controllers are competing for the same lease:
kubectl get lease <lease-name> -n <namespace> -o yaml
If different pods keep acquiring the lease (frequent holder changes; see the watch command after this list):
- Lease duration too short (increase leaseDurationSeconds)
- Rapid pod restarts
- Network latency causing renewal failures
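To confirm how often the holder is actually changing, watch the lease before tuning anything; each update prints the current holder and renew time:
kubectl get lease <lease-name> -n <namespace> -w \
  -o custom-columns=HOLDER:.spec.holderIdentity,RENEWED:.spec.renewTime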
Check etcd metrics:
kubectl logs -n kube-system etcd-<node> | grep -i "slow\|took too long"
For performance issues:
- Monitor etcd write latency (see the check after this list)
- Check if other heavy workloads are using etcd
- Consider splitting leadership into multiple leases (one per component)
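Write latency can be read straight from etcd's Prometheus metrics endpoint; the sketch below assumes the kubeadm default of exposing metrics on 127.0.0.1:2381 (run it on a control-plane node and adjust the address if your cluster differs):
curl -s http://127.0.0.1:2381/metrics | \
  grep -E 'etcd_disk_(wal_fsync|backend_commit)_duration_seconds_(sum|count)'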
If a stale leader is blocking new elections:
# Delete all controller pods to force fresh start
kubectl delete pods -l app=<controller> -n <namespace>
# Monitor new pod startup and leader election
kubectl get pods -l app=<controller> -n <namespace> -w
kubectl logs -l app=<controller> -n <namespace> -f --all-containers=true
Watch logs for leader election:
[INFO] controller: became leader
[INFO] controller: started processing events
Alternatively, delete just the stuck lease:
kubectl delete lease <lease-name> -n <namespace>
# Controller will immediately try to re-acquire
Then verify a new leader is elected:
kubectl get lease <lease-name> -n <namespace>
Leader election is critical for HA controllers. Multiple replicas running without it perform conflicting operations (two cert-manager instances issuing certificates, two external-dns instances managing the same DNS records, and so on). The election mechanism uses Lease objects in the coordination.k8s.io API group, updated atomically through the API server and backed by etcd. The lease duration (15 seconds by default in client-go) balances quick failover against election churn; for wide-area networks, increase it. Operator frameworks (Kubebuilder, Operator SDK) provide built-in leader election.
Monitor leader transitions with Prometheus where the controller exposes leader-election metrics; components built on client-go typically export leader_election_master_status. For mission-critical controllers, implement additional health checks beyond leader election. Some older controllers use ConfigMap- or Endpoints-based locks instead of Leases; these have been superseded by the Lease API. Multi-cluster scenarios may need external coordination (Consul or a dedicated etcd cluster). For testing, disable leader election locally to simplify debugging, as in the sketch below.
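A sketch of that last point; the exact flag name is controller-specific (--leader-elect is used by kube-controller-manager and by recent Kubebuilder/Operator SDK scaffolding, older operators may use --enable-leader-election):
# Local testing only: run a single instance with leader election disabled
<controller-binary> --leader-elect=false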