A fencing error occurs when a node is unresponsive or partitioned from the cluster and Kubernetes cannot determine whether its pods should be evicted. Fencing prevents "split-brain" scenarios in which multiple copies of a stateful pod run simultaneously, which makes it critical for stateful applications and storage systems.
Fencing is a cluster-level safety mechanism that prevents concurrent access conflicts. When a node becomes unresponsive:
1. The control plane marks the node NotReady.
2. Kubernetes waits for node-monitor-grace-period (default 40s).
3. If the kubelet still has not reported in after the timeout, the node's pods are marked for eviction.
4. Before evicting pods, the cluster checks whether it is safe:
- Can the node be confirmed truly dead? (this is fencing)
- If yes, evict the pods and reschedule them elsewhere.
- If unsure, wait (to prevent split brain).
Fencing errors occur when the cluster cannot determine node status, delaying pod eviction and degrading service. You can watch this sequence happen, as shown below.
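When the grace period expires, the node controller taints the node with node.kubernetes.io/unreachable. A quick way to observe the sequence (a sketch; substitute your node for <node-name>):
# Look for the node.kubernetes.io/unreachable taint:
kubectl get node <node-name> -o jsonpath='{.spec.taints}'
# Review the node's recent lifecycle events:
kubectl get events --field-selector involvedObject.kind=Node,involvedObject.name=<node-name>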
Find the unresponsive node:
kubectl get nodes
kubectl describe node <node-name>
# List nodes whose Ready condition is not True:
kubectl get nodes -o json | jq -r '.items[] | select(.status.conditions[] | select(.type=="Ready") | .status != "True") | .metadata.name'
# Check when the node became NotReady:
kubectl describe node <node-name> | grep -A 5 "Conditions:"
# Try to access the node directly:
ssh <node-ip> # Will fail if truly unreachable
ping <node-ip>
Note the node name and when it became unresponsive.
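To pull the exact transition time without scanning describe output, one option is a jsonpath query (sketch; <node-name> is your node):
kubectl get node <node-name> -o jsonpath='{.status.conditions[?(@.type=="Ready")].lastTransitionTime}'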
Check if the node is actually down:
# Try SSH:
ssh -v <node-ip>
# Try ping:
ping -c 3 <node-ip>
# Try curl to kubelet:
curl -k https://<node-ip>:10250/
# Check cloud provider console:
# AWS: Check instance status
# Azure: Check VM status
# GCP: Check instance status
# Check kubelet logs on the node (if you can access it):
sudo tail -f /var/log/kubelet.log
# On systemd-based nodes, the kubelet usually logs to the journal instead:
sudo journalctl -u kubelet -f
If completely unresponsive, the node is likely down.
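If you want a single pass over the probes, a small bash sketch (assumes ping, nc, and curl are available, and that <node-ip> is substituted before running):
NODE_IP=<node-ip>
# Run each reachability probe and report pass/fail.
for check in "ping -c 2 -W 2 $NODE_IP" \
             "nc -z -w 3 $NODE_IP 22" \
             "curl -sk --max-time 3 https://$NODE_IP:10250/"; do
  if $check >/dev/null 2>&1; then echo "OK:   $check"; else echo "FAIL: $check"; fi
done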
Force evict pods if the node won't recover:
# First, mark the node as unschedulable (prevent new pods):
kubectl cordon <node-name>
# Then drain (gracefully remove) all pods:
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --grace-period=30
# If drain times out or fails, force-delete pods:
kubectl delete pods --all-namespaces --field-selector=spec.nodeName=<node-name> --grace-period=0 --force
# Verify pods are moved:
kubectl get pods -A --field-selector=spec.nodeName=<node-name>
Draining removes pods gracefully; force-delete removes them immediately. Only force-delete stateful pods once you are confident the node is truly down, otherwise you risk the split-brain scenario fencing exists to prevent.
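Stateful pods sometimes stay Pending after a force delete because their volumes are still recorded as attached to the dead node. One way to spot stale attachments (sketch; the grep target is your node name):
kubectl get volumeattachments.storage.k8s.io | grep <node-name>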
If the node won't come back, remove it permanently:
# Delete the node object:
kubectl delete node <node-name>
# Verify it's gone:
kubectl get nodes
# If using cloud infrastructure, terminate the instance:
# AWS:
aws ec2 terminate-instances --instance-ids <instance-id>
# Azure:
az vm delete --resource-group <rg> --name <vm-name>
# GCP:
gcloud compute instances delete <instance-name>
Deleting the node object prevents the cluster from waiting for it.
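Putting the whole decommission together, a sketch assuming an AWS-backed node and that <node-name> and <instance-id> are filled in (adapt the last step for your provider):
#!/usr/bin/env bash
set -euo pipefail
NODE=<node-name>
INSTANCE=<instance-id>
kubectl cordon "$NODE"                                    # stop new scheduling
kubectl drain "$NODE" --ignore-daemonsets \
  --delete-emptydir-data --grace-period=30 || true        # best-effort drain
kubectl delete node "$NODE"                               # remove the Node object
aws ec2 terminate-instances --instance-ids "$INSTANCE"    # fence at the infra level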
Adjust how long the cluster waits before marking a node NotReady. These flags belong to the kube-controller-manager, not the API server (how often the node reports in is the kubelet's nodeStatusUpdateFrequency, set in the kubelet-config ConfigMap). On kubeadm clusters, edit the controller manager's static pod manifest on a control plane node rather than the running pod:
sudo vi /etc/kubernetes/manifests/kube-controller-manager.yaml
Add/modify the flags:
--node-monitor-grace-period=40s # Default: 40s; how long a node may go unreported before NotReady
--node-monitor-period=5s # How often the controller checks node status
--pod-eviction-timeout=5m # Wait before evicting pods; with taint-based eviction, pods' tolerationSeconds (default 300s) applies instead
A lower grace period means faster detection but more false positives; keep node-monitor-grace-period several times node-monitor-period.
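Per workload, the same knob is the pod's toleration for the unreachable taint. A hedged example, assuming a Deployment named <name> with no existing tolerations (with existing ones, the JSON-patch path would need /- appended):
kubectl patch deployment <name> --type=json -p '[
  {"op": "add", "path": "/spec/template/spec/tolerations", "value": [
    {"key": "node.kubernetes.io/unreachable", "operator": "Exists",
     "effect": "NoExecute", "tolerationSeconds": 60}
  ]}
]'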
Set up fencing for production clusters to handle network partitions:
Option 1: Cloud provider API access
# Enable the cloud provider integration so the controller can confirm node status via the provider API.
# On kubeadm clusters, edit the static pod manifest (a running static pod cannot be edited in place):
sudo vi /etc/kubernetes/manifests/kube-controller-manager.yaml
Add:
--cloud-provider=aws # or azure, gce (newer clusters run an external cloud-controller-manager instead)
--cloud-config=/etc/kubernetes/cloud-config.conf
Option 2: IPMI (for on-prem)
# Configure IPMI fencing on each node:
ipmitool -I lanplus -H <ipmi-ip> -U <user> -P <password> power status
ipmitool -I lanplus -H <ipmi-ip> -U <user> -P <password> power off
Option 3: Use HAProxy or keepalived for node health checks
Fencing allows the cluster to confirm whether a node is truly down before evicting its pods.
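As a concrete sketch of option 2, a wrapper that refuses to force-evict until the BMC confirms power-off (hypothetical helper script, assuming ipmitool and the same placeholder credentials as above):
# Fence first, evict second: only force-delete pods once power-off is confirmed.
STATUS=$(ipmitool -I lanplus -H <ipmi-ip> -U <user> -P <password> power status)
if [ "$STATUS" = "Chassis Power is off" ]; then
  kubectl delete pods -A --field-selector=spec.nodeName=<node-name> --grace-period=0 --force
else
  echo "Node still powered on; refusing to force-evict (split-brain risk)." >&2
fi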
Set up alerts for fencing issues:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-fencing-alerts
spec:
  groups:
  - name: kubernetes-nodes
    interval: 30s
    rules:
    - alert: NodeNotReady
      expr: |
        kube_node_status_condition{condition="Ready",status="true"} == 0
      for: 5m
    - alert: NodeUnreachable
      expr: |
        rate(node_network_transmit_packets_total[5m]) == 0
      for: 10m
Alert on:
- Nodes in NotReady state > 5 minutes
- No network activity from a node > 10 minutes
- Pod eviction delays
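To cross-check the eviction-delay alert manually, a jq one-liner that surfaces pods stuck Terminating (sketch; any pod with a deletion timestamp still present is mid-eviction):
kubectl get pods -A -o json | jq -r '
  .items[]
  | select(.metadata.deletionTimestamp != null)
  | "\(.metadata.namespace)/\(.metadata.name)  Terminating since \(.metadata.deletionTimestamp)"'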
Ensure critical workloads can survive node failures:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: stateful-app-pdb
spec:
  minAvailable: 2 # At least 2 replicas always running
  selector:
    matchLabels:
      app: stateful-app
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: stateful-app
spec:
  serviceName: stateful-app
  replicas: 3
  selector:
    matchLabels:
      app: stateful-app
  template:
    metadata:
      labels:
        app: stateful-app
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: stateful-app
            topologyKey: kubernetes.io/hostname # Spread across different nodes
      containers:
      - name: stateful-app
        image: <your-image>
Using pod anti-affinity and a PodDisruptionBudget ensures pods are distributed across nodes and protected from simultaneous disruption.
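To confirm the spread and the budget actually hold (assumes the manifests above were applied as-is):
kubectl get pods -l app=stateful-app -o wide # each replica should land on a different node
kubectl get pdb stateful-app-pdb # ALLOWED DISRUPTIONS should be 1 with 3 healthy replicas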
Fencing is critical for stateful systems (databases, message queues): without it, a split-brain scenario can leave the same persistent volume mounted on multiple nodes and corrupt data. Keep in mind:
- Cloud providers (AWS, Azure, GCP) offer built-in fencing through their instance status APIs; for on-premises clusters, use IPMI, Redfish, or similar out-of-band management.
- pod-eviction-timeout (default 5m) governs how long the controller waits before evicting pods from an unreachable node; with taint-based eviction, each pod's tolerationSeconds (default 300s) applies instead.
- For highly available systems, run StatefulSets with at least 3 replicas across different failure domains, and use Pod Disruption Budgets to prevent cascading failures.
- Storage systems (etcd, databases) need stronger fencing guarantees; use cluster-aware storage or replication.
- Monitor node network activity and kubelet heartbeats, and alert on delays.
- In cloud environments, enable auto-scaling so failed nodes are replaced automatically.