A fencing error occurs when a node is unresponsive or partitioned from the cluster and Kubernetes cannot determine whether its pods should be evicted. Fencing prevents "split-brain" scenarios in which multiple copies of a stateful pod run simultaneously, which makes it critical for stateful applications and storage systems.
Fencing is a cluster-level safety mechanism that prevents concurrent access conflicts. When a node becomes unresponsive:
1. The control plane marks the node NotReady.
2. Kubernetes waits for node-monitor-grace-period (default 40s).
3. If the kubelet still has not reported in after the timeout, the node's pods are marked for eviction.
4. Before evicting pods, the cluster checks whether it is safe:
- Can the node be confirmed truly dead? (this is fencing)
- If yes, evict the pods and reschedule them elsewhere.
- If unsure, wait (to prevent split brain).
Fencing errors occur when the cluster cannot determine node status, delaying pod eviction and degrading service. You can watch this sequence happen, as shown below.
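When the grace period expires, the node controller taints the node with node.kubernetes.io/unreachable. A quick way to observe the sequence (a sketch; substitute your node for <node-name>):
# Look for the node.kubernetes.io/unreachable taint:
kubectl get node <node-name> -o jsonpath='{.spec.taints}'
# Review the node's recent lifecycle events:
kubectl get events --field-selector involvedObject.kind=Node,involvedObject.name=<node-name>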
Find the unresponsive node:
kubectl get nodes
kubectl describe node <node-name>
# List nodes whose Ready condition is not True:
kubectl get nodes -o json | jq -r '.items[] | select(.status.conditions[] | select(.type=="Ready") | .status != "True") | .metadata.name'
# Check when the node became NotReady:
kubectl describe node <node-name> | grep -A 5 "Conditions:"
# Try to access the node directly:
ssh <node-ip> # Will fail if truly unreachable
ping <node-ip>
Note the node name and when it became unresponsive.
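To pull the exact transition time without scanning describe output, one option is a jsonpath query (sketch; <node-name> is your node):
kubectl get node <node-name> -o jsonpath='{.status.conditions[?(@.type=="Ready")].lastTransitionTime}'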
Check if the node is actually down:
# Try SSH:
ssh -v <node-ip>
# Try ping:
ping -c 3 <node-ip>
# Try curl to kubelet:
curl -k https://<node-ip>:10250/
# Check cloud provider console:
# AWS: Check instance status
# Azure: Check VM status
# GCP: Check instance status
# Check kubelet logs on the node (if you can access it):
sudo tail -f /var/log/kubelet.log
# On systemd-based nodes, the kubelet usually logs to the journal instead:
sudo journalctl -u kubelet -f
If completely unresponsive, the node is likely down.
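If you want a single pass over the probes, a small bash sketch (assumes ping, nc, and curl are available, and that <node-ip> is substituted before running):
NODE_IP=<node-ip>
# Run each reachability probe and report pass/fail.
for check in "ping -c 2 -W 2 $NODE_IP" \
             "nc -z -w 3 $NODE_IP 22" \
             "curl -sk --max-time 3 https://$NODE_IP:10250/"; do
  if $check >/dev/null 2>&1; then echo "OK:   $check"; else echo "FAIL: $check"; fi
done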
Force evict pods if the node won't recover:
# First, mark the node as unschedulable (prevent new pods):
kubectl cordon <node-name>
# Then drain (gracefully remove) all pods:
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --grace-period=30
# If drain times out or fails, force-delete pods:
kubectl delete pods --all-namespaces --field-selector=spec.nodeName=<node-name> --grace-period=0 --force
# Verify pods are moved:
kubectl get pods -A --field-selector=spec.nodeName=<node-name>
Draining removes pods gracefully; force-delete removes them immediately. Only force-delete stateful pods once you are confident the node is truly down, otherwise you risk the split-brain scenario fencing exists to prevent.
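Stateful pods sometimes stay Pending after a force delete because their volumes are still recorded as attached to the dead node. One way to spot stale attachments (sketch; the grep target is your node name):
kubectl get volumeattachments.storage.k8s.io | grep <node-name>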
If the node won't come back, remove it permanently:
# Delete the node object:
kubectl delete node <node-name>
# Verify it's gone:
kubectl get nodes
# If using cloud infrastructure, terminate the instance:
# AWS:
aws ec2 terminate-instances --instance-ids <instance-id>
# Azure:
az vm delete --resource-group <rg> --name <vm-name>
# GCP:
gcloud compute instances delete <instance-name>
Deleting the node object prevents the cluster from waiting for it.
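Putting the whole decommission together, a sketch assuming an AWS-backed node and that <node-name> and <instance-id> are filled in (adapt the last step for your provider):
#!/usr/bin/env bash
set -euo pipefail
NODE=<node-name>
INSTANCE=<instance-id>
kubectl cordon "$NODE"                                    # stop new scheduling
kubectl drain "$NODE" --ignore-daemonsets \
  --delete-emptydir-data --grace-period=30 || true        # best-effort drain
kubectl delete node "$NODE"                               # remove the Node object
aws ec2 terminate-instances --instance-ids "$INSTANCE"    # fence at the infra level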
Adjust how long the cluster waits before marking a node NotReady. These flags belong to the kube-controller-manager, not the API server (how often the node reports in is the kubelet's nodeStatusUpdateFrequency, set in the kubelet-config ConfigMap). On kubeadm clusters, edit the controller manager's static pod manifest on a control plane node rather than the running pod:
sudo vi /etc/kubernetes/manifests/kube-controller-manager.yaml
Add/modify the flags:
--node-monitor-grace-period=40s # Default: 40s; how long a node may go unreported before NotReady
--node-monitor-period=5s # How often the controller checks node status
--pod-eviction-timeout=5m # Wait before evicting pods; with taint-based eviction, pods' tolerationSeconds (default 300s) applies instead
A lower grace period means faster detection but more false positives; keep node-monitor-grace-period several times node-monitor-period.
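Per workload, the same knob is the pod's toleration for the unreachable taint. A hedged example, assuming a Deployment named <name> with no existing tolerations (with existing ones, the JSON-patch path would need /- appended):
kubectl patch deployment <name> --type=json -p '[
  {"op": "add", "path": "/spec/template/spec/tolerations", "value": [
    {"key": "node.kubernetes.io/unreachable", "operator": "Exists",
     "effect": "NoExecute", "tolerationSeconds": 60}
  ]}
]'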
Set up fencing for production clusters to handle network partitions:
Option 1: Cloud provider API access
# Enable the cloud provider integration so the controller can confirm node status via the provider API.
# On kubeadm clusters, edit the static pod manifest (a running static pod cannot be edited in place):
sudo vi /etc/kubernetes/manifests/kube-controller-manager.yaml
Add:
--cloud-provider=aws # or azure, gce (newer clusters run an external cloud-controller-manager instead)
--cloud-config=/etc/kubernetes/cloud-config.conf
Option 2: IPMI (for on-prem)
# Configure IPMI fencing on each node:
ipmitool -I lanplus -H <ipmi-ip> -U <user> -P <password> power status
ipmitool -I lanplus -H <ipmi-ip> -U <user> -P <password> power off
Option 3: Use HAProxy or keepalived for node health checks
Fencing allows the cluster to confirm whether a node is truly down before evicting its pods.
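As a concrete sketch of option 2, a wrapper that refuses to force-evict until the BMC confirms power-off (hypothetical helper script, assuming ipmitool and the same placeholder credentials as above):
# Fence first, evict second: only force-delete pods once power-off is confirmed.
STATUS=$(ipmitool -I lanplus -H <ipmi-ip> -U <user> -P <password> power status)
if [ "$STATUS" = "Chassis Power is off" ]; then
  kubectl delete pods -A --field-selector=spec.nodeName=<node-name> --grace-period=0 --force
else
  echo "Node still powered on; refusing to force-evict (split-brain risk)." >&2
fi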
Set up alerts for fencing issues:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-fencing-alerts
spec:
  groups:
  - name: kubernetes-nodes
    interval: 30s
    rules:
    - alert: NodeNotReady
      expr: |
        kube_node_status_condition{condition="Ready",status="true"} == 0
      for: 5m
    - alert: NodeUnreachable
      expr: |
        rate(node_network_transmit_packets_total[5m]) == 0
      for: 10m
Alert on:
- Nodes in NotReady state > 5 minutes
- No network activity from a node > 10 minutes
- Pod eviction delays
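To cross-check the eviction-delay alert manually, a jq one-liner that surfaces pods stuck Terminating (sketch; any pod with a deletion timestamp still present is mid-eviction):
kubectl get pods -A -o json | jq -r '
  .items[]
  | select(.metadata.deletionTimestamp != null)
  | "\(.metadata.namespace)/\(.metadata.name)  Terminating since \(.metadata.deletionTimestamp)"'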
Ensure critical workloads can survive node failures:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: stateful-app-pdb
spec:
  minAvailable: 2 # At least 2 replicas always running
  selector:
    matchLabels:
      app: stateful-app
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: stateful-app
spec:
  serviceName: stateful-app
  replicas: 3
  selector:
    matchLabels:
      app: stateful-app
  template:
    metadata:
      labels:
        app: stateful-app
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: stateful-app
            topologyKey: kubernetes.io/hostname # Spread across different nodes
      containers:
      - name: stateful-app
        image: <your-image>
Using pod anti-affinity and a PodDisruptionBudget ensures pods are distributed across nodes and protected from simultaneous disruption.
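To confirm the spread and the budget actually hold (assumes the manifests above were applied as-is):
kubectl get pods -l app=stateful-app -o wide # each replica should land on a different node
kubectl get pdb stateful-app-pdb # ALLOWED DISRUPTIONS should be 1 with 3 healthy replicas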
Fencing is critical for stateful systems (databases, message queues): without it, a split-brain scenario can leave the same persistent volume mounted on multiple nodes and corrupt data. Keep in mind:
- Cloud providers (AWS, Azure, GCP) offer built-in fencing through their instance status APIs; for on-premises clusters, use IPMI, Redfish, or similar out-of-band management.
- pod-eviction-timeout (default 5m) governs how long the controller waits before evicting pods from an unreachable node; with taint-based eviction, each pod's tolerationSeconds (default 300s) applies instead.
- For highly available systems, run StatefulSets with at least 3 replicas across different failure domains, and use Pod Disruption Budgets to prevent cascading failures.
- Storage systems (etcd, databases) need stronger fencing guarantees; use cluster-aware storage or replication.
- Monitor node network activity and kubelet heartbeats, and alert on delays.
- In cloud environments, enable auto-scaling so failed nodes are replaced automatically.