This error occurs when an Elasticsearch node loses connection to the cluster, preventing communication between nodes. It typically indicates network issues, node failures, or resource constraints that disrupt cluster coordination. The disconnected node cannot participate in search, indexing, or cluster management operations.
The "NodeDisconnectedException: [node] disconnected" error indicates that one or more Elasticsearch nodes have lost connection to the cluster. Elasticsearch is a distributed system where nodes communicate to coordinate operations like search queries, indexing, and cluster state management. When a node disconnects, it can no longer: 1. Participate in search operations (shard queries won't reach the disconnected node) 2. Receive index updates or replication requests 3. Vote in master elections or receive cluster state updates 4. Serve as a data node for shards allocated to it This error is particularly critical because: - It can cause shards to become unassigned if the disconnected node was hosting primary shards - Search queries may fail or return partial results - Indexing operations may fail if the write consistency level can't be met - The cluster health status may change to yellow or red The error message typically includes the node name or ID in brackets, helping identify which specific node has disconnected from the cluster.
First, examine the cluster health to understand the scope of the problem:
# Check overall cluster health
curl -X GET "localhost:9200/_cluster/health?pretty" -u "username:password"
# Check cluster state for detailed node information
curl -X GET "localhost:9200/_cluster/state?pretty&filter_path=metadata,nodes" -u "username:password"
# List all nodes in the cluster
curl -X GET "localhost:9200/_cat/nodes?v&h=name,ip,heap.percent,ram.percent,cpu,load_1m,node.role,master" -u "username:password"
# Check for unassigned shards (often a symptom of node disconnection)
curl -X GET "localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason,node" -u "username:password" | grep UNASSIGNED
# Check master node status
curl -X GET "localhost:9200/_cat/master?v" -u "username:password"Look for:
- Nodes missing from the nodes list
- Unassigned shards that were on the disconnected node
- Changes in cluster health status (green → yellow/red)
- Master node changes or election issues
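If you run these checks often, a small script can automate the triage. The following is a minimal sketch, not part of any standard tooling: it assumes basic auth, jq installed, the cluster reachable on localhost:9200, and EXPECTED_NODES set to your actual cluster size:
#!/usr/bin/env bash
# Hypothetical triage helper: flag a node-count drop and list unassigned shards
ES_URL="localhost:9200"
AUTH="username:password"   # placeholder credentials
EXPECTED_NODES=3           # placeholder - set to your cluster size
health=$(curl -s -u "$AUTH" "$ES_URL/_cluster/health")
status=$(echo "$health" | jq -r '.status')
nodes=$(echo "$health" | jq -r '.number_of_nodes')
echo "Cluster status: $status, nodes: $nodes/$EXPECTED_NODES"
if [ "$nodes" -lt "$EXPECTED_NODES" ]; then
  echo "Nodes currently in the cluster:"
  curl -s -u "$AUTH" "$ES_URL/_cat/nodes?h=name,ip,node.role,master"
  echo "Unassigned shards (and why):"
  curl -s -u "$AUTH" "$ES_URL/_cat/shards?h=index,shard,prirep,state,unassigned.reason,node" | grep UNASSIGNED
fi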
Test network connectivity between cluster nodes:
# From each node, test connectivity to other nodes (adjust ports as needed)
# Test transport port (default 9300)
nc -zv other-node-ip 9300
# Test HTTP port (default 9200)
nc -zv other-node-ip 9200
# Check DNS resolution
nslookup other-node-hostname
# Check firewall rules
sudo iptables -L -n | grep 9300
sudo iptables -L -n | grep 9200
# For cloud environments, check security groups/network ACLs
# AWS: Check security group inbound rules for ports 9200 and 9300
# GCP: Check firewall rules
# Azure: Check network security groups
# Check network interface status and configuration
ip addr show
netstat -tulpn | grep java
# Test with telnet (if available)
telnet other-node-ip 9300
Common network issues to fix:
- Firewall blocking ports 9200/9300 between nodes
- Security group misconfiguration in cloud environments
- DNS resolution failures
- Network interface misconfiguration
- Routing issues between subnets
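If the cluster has more than a handful of nodes, a quick loop over both ports saves repetition. A minimal sketch, assuming nc is installed and the hostnames below are replaced with your own:
# Hypothetical node list - replace with your actual hosts
NODES="es-node-1 es-node-2 es-node-3"
for node in $NODES; do
  for port in 9200 9300; do
    if nc -z -w 3 "$node" "$port" >/dev/null 2>&1; then
      echo "OK      $node:$port"
    else
      echo "FAILED  $node:$port"
    fi
  done
done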
Check Elasticsearch logs on all nodes, especially the disconnected node if accessible:
# Check Elasticsearch logs (location varies by installation)
# Systemd installations:
sudo journalctl -u elasticsearch --since "1 hour ago" | tail -100
# Tarball installations:
tail -100 /path/to/elasticsearch/logs/elasticsearch.log
# Look for specific error patterns:
grep -i "disconnect" /var/log/elasticsearch/elasticsearch.log
grep -i "timeout" /var/log/elasticsearch/elasticsearch.log
grep -i "heartbeat" /var/log/elasticsearch/elasticsearch.log
grep -i "master_not_discovered" /var/log/elasticsearch/elasticsearch.log
# Check for out of memory errors
grep -i "outofmemory" /var/log/elasticsearch/elasticsearch.log
grep -i "gc" /var/log/elasticsearch/elasticsearch.log | tail -20
# Check for long garbage collection pauses
# Enable GC logging in jvm.options if not already:
# -Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,pid,tags:filecount=32,filesize=64m
# Check system logs for hardware/OS issues
dmesg | tail -50
sudo tail -50 /var/log/syslog
Key log patterns indicating disconnection causes:
- "failed to connect" or "connection refused" - Network issues
- "long gc" or "out of memory" - Resource exhaustion
- "heartbeat timeout" - Network latency or node overload
- "master not discovered" - Cluster formation issues
If the node is accessible, restart it and monitor cluster recovery:
# Gracefully restart the Elasticsearch service
sudo systemctl restart elasticsearch
# Or for tarball installations
# First, find the PID
ps aux | grep elasticsearch
# Send SIGTERM for graceful shutdown
kill -TERM <pid>
# Wait for shutdown, then restart
/path/to/elasticsearch/bin/elasticsearch -d -p pid
# Monitor startup logs
tail -f /var/log/elasticsearch/elasticsearch.log
# After restart, check if node rejoins cluster
curl -X GET "localhost:9200/_cat/nodes?v" -u "username:password"
# Check cluster health recovery
watch -n 5 'curl -s "localhost:9200/_cluster/health?pretty" -u "username:password"'
# If shards are recovering, monitor progress
curl -X GET "localhost:9200/_cat/recovery?v&active_only" -u "username:password"Important considerations:
1. Rolling restart: If multiple nodes need restarting, do them one at a time, waiting for the cluster to recover between nodes (a shard-allocation sketch follows this list)
2. Shard allocation: After node restart, shards may take time to reallocate. Monitor with _cat/recovery
3. Index status: Check if indices become yellow/red during recovery
4. Timeout settings: If the node takes too long to rejoin, review the discovery timeouts - discovery.zen.ping_timeout and discovery.zen.fd.ping_timeout on Elasticsearch 6.x and earlier, or the cluster.fault_detection.* settings on 7.x and later
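For the rolling restart in point 1, a common pattern is to pause replica allocation before stopping a node and re-enable it once the node has rejoined, so the cluster does not start rebuilding shards the moment the node leaves. A sketch using the cluster settings API (credentials and URL are placeholders):
# Pause replica allocation before stopping the node
curl -X PUT "localhost:9200/_cluster/settings" -u "username:password" -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "cluster.routing.allocation.enable": "primaries"
  }
}
'
# ... restart the node and wait for it to appear in _cat/nodes ...
# Re-enable allocation (null resets the setting to its default, "all")
curl -X PUT "localhost:9200/_cluster/settings" -u "username:password" -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "cluster.routing.allocation.enable": null
  }
}
'
# Wait for green before moving on to the next node
curl -s "localhost:9200/_cluster/health?wait_for_status=green&timeout=120s" -u "username:password"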
If nodes frequently disconnect due to network latency or GC pauses, adjust timeout settings:
# Check current discovery settings
curl -X GET "localhost:9200/_cluster/settings?include_defaults=true&filter_path=**.discovery.*" -u "username:password" | jq .
# Increase ping timeouts (adjust values based on your environment).
# On Elasticsearch 6.x and earlier these legacy Zen discovery settings are
# node-level settings - add them to elasticsearch.yml on each node and restart:
discovery.zen.ping_timeout: 30s
discovery.zen.fd.ping_timeout: 2m
discovery.zen.fd.ping_retries: 10
discovery.zen.no_master_block: write
# For Elasticsearch 7.x and later, fault detection is controlled by the
# cluster.fault_detection.* settings; these are likewise node-level settings
# in elasticsearch.yml:
cluster.fault_detection.leader_check.interval: 2s
cluster.fault_detection.leader_check.timeout: 10s
cluster.fault_detection.follower_check.interval: 1s
cluster.fault_detection.follower_check.timeout: 10s
# Adjust thread pool queue sizes if nodes are overloaded. thread_pool.* settings
# are also static node settings and cannot be changed through the cluster
# settings API - set them in elasticsearch.yml and restart the node:
thread_pool.search.queue_size: 2000
thread_pool.write.queue_size: 500
# Node-level settings in elasticsearch.yml only take effect after a node restart.
# For settings applied through the _cluster/settings API, "persistent" values
# survive a full cluster restart while "transient" values do not.
Warning: Increasing timeouts too much can mask real problems. Use monitoring to identify root causes rather than just increasing timeouts indefinitely.
Set up monitoring to detect and prevent future node disconnections:
# Configure Elasticsearch monitoring (X-Pack or open source alternatives)
# Enable monitoring in elasticsearch.yml:
xpack.monitoring.enabled: true
xpack.monitoring.elasticsearch.collection.enabled: true
# Set up alerts for node disconnections
# Example using Elasticsearch Watcher or external monitoring:
# Create a watch that alerts when the node count drops; replace
# {{expected_node_count}} with your cluster's expected number of nodes
curl -X PUT "localhost:9200/_watcher/watch/node_disconnect_alert" -u "username:password" -H 'Content-Type: application/json' -d'
{
"trigger": {
"schedule": {
"interval": "30s"
}
},
"input": {
"search": {
"request": {
"indices": [".monitoring-es-*"],
"body": {
"query": {
"bool": {
"must": [
{
"range": {
"timestamp": {
"gte": "now-1m"
}
}
},
{
"term": {
"type": "node_stats"
}
}
]
}
},
"aggs": {
"nodes": {
"terms": {
"field": "node.name",
"size": 10
}
}
},
"size": 0
}
}
}
},
"condition": {
"compare": {
"ctx.payload.aggregations.nodes.buckets.length": {
"lt": "{{expected_node_count}}"
}
}
},
"actions": {
"send_email": {
"email": {
"to": ["[email protected]"],
"subject": "Elasticsearch Node Disconnected",
"body": "Node count dropped from {{expected_node_count}} to {{ctx.payload.aggregations.nodes.buckets.length}}"
}
}
}
}
'
# Monitor key metrics:
# - Node count over time
# - Network errors in transport layer
# - GC duration and frequency
# - Heap usage trends
# - Thread pool queue sizes
# - Disk I/O latency
# Set up external monitoring (Prometheus + Grafana example):
# 1. Install and configure Elasticsearch exporter
# 2. Set up Prometheus to scrape metrics
# 3. Create Grafana dashboards for:
# - Node status and count
# - Network connectivity
# - Resource utilization
# - Cluster health status
Preventive measures:
1. Regular health checks: Implement automated cluster health monitoring (a cron-friendly sketch follows this list)
2. Capacity planning: Monitor resource usage trends and scale before limits are reached
3. Network redundancy: Use multiple network paths and validate connectivity regularly
4. Configuration management: Use tools like Ansible, Puppet, or Chef to ensure consistent node configuration
5. Backup and recovery plans: Regular snapshots and tested recovery procedures
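For point 1, the health check can be as simple as a cron job that exits non-zero when something looks wrong. A minimal sketch (ES_URL, credentials, and EXPECTED_NODES are placeholders; wire the output into whatever alerting you already use):
#!/usr/bin/env bash
# Hypothetical cron health check: non-zero exit when status is not green
# or a node is missing
ES_URL="localhost:9200"
AUTH="username:password"
EXPECTED_NODES=3
health=$(curl -sf -u "$AUTH" "$ES_URL/_cluster/health") || { echo "cluster unreachable"; exit 2; }
status=$(echo "$health" | jq -r '.status')
nodes=$(echo "$health" | jq -r '.number_of_nodes')
if [ "$status" != "green" ] || [ "$nodes" -lt "$EXPECTED_NODES" ]; then
  echo "ALERT: status=$status nodes=$nodes/$EXPECTED_NODES"
  exit 1
fi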
## Advanced Troubleshooting for Persistent Node Disconnections
### Network Diagnostics
For complex network environments, use advanced tools:
- tcpdump: Capture and analyze traffic between nodes
sudo tcpdump -i any port 9300 -w elasticsearch.pcap
- mtr: Combine traceroute and ping for path analysis
mtr --report --report-cycles 10 other-node-ip
- netstat: Check connection states and counts
netstat -an | grep 9300 | grep ESTABLISHED | wc -l
### JVM and Garbage Collection Tuning
If GC pauses cause disconnections:
1. Enable detailed GC logging in jvm.options:
-Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,pid,tags:filecount=32,filesize=64m
2. Analyze GC logs with tools like GCViewer or Elasticsearch's own monitoring
3. Adjust heap size based on workload (see the jvm.options sketch after this list):
- Not too small (causes frequent GC)
- Not too large (causes long GC pauses)
- General rule: no more than 50% of available RAM, and stay below the ~32GB compressed-oops threshold
4. Consider G1GC for large heaps (default in recent Elasticsearch versions)
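As an example of point 3, a data node with 64GB of RAM would typically get a fixed heap of around 30GB. A sketch using a jvm.options.d override file (supported from Elasticsearch 7.7; paths assume a package install, and the size is a placeholder):
# Create an override file instead of editing jvm.options directly
sudo tee /etc/elasticsearch/jvm.options.d/heap.options <<'EOF'
-Xms30g
-Xmx30g
EOF
# Restart the node for the new heap size to take effect
sudo systemctl restart elasticsearch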
### Split-Brain Scenarios
When network partitions cause multiple masters:
1. Minimum master nodes: On Elasticsearch 6.x and earlier, configure discovery.zen.minimum_master_nodes to (master_eligible_nodes / 2) + 1; the setting was removed in 7.0
2. For Elasticsearch 7+: Use cluster.initial_master_nodes for initial cluster formation (example configuration after this list)
3. Quorum-based voting: Ensure voting configuration prevents split-brain
4. Recovery: May require manual intervention to resolve conflicting cluster states
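For a cluster with three master-eligible nodes, the relevant elasticsearch.yml settings look roughly like this (node names are placeholders):
# Elasticsearch 6.x and earlier - quorum of 3 master-eligible nodes is 2
discovery.zen.minimum_master_nodes: 2
# Elasticsearch 7.x and later - voting is managed automatically;
# initial_master_nodes is only needed when bootstrapping a brand-new cluster
discovery.seed_hosts: ["es-master-1", "es-master-2", "es-master-3"]
cluster.initial_master_nodes: ["es-master-1", "es-master-2", "es-master-3"]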
### Cloud-Specific Considerations
AWS/Azure/GCP environments:
- Instance metadata service: Ensure it's accessible for node identification
- Load balancer health checks: Configure proper intervals and thresholds
- Auto-scaling groups: Use lifecycle hooks for graceful node termination
- Spot instances: Implement checkpointing for sudden termination
- Network peering: Verify VPC peering or transit gateway configurations
### Security Plugin Issues
If security plugins cause disconnections:
1. Certificate expiration: Check TLS certificate validity periods (see the openssl check after this list)
2. Authentication timeouts: Adjust security-related timeout settings
3. Role mapping: Ensure nodes have appropriate security roles
4. Audit logging: Check for authentication/authorization failures
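For point 1, certificate expiry can be checked from the shell. A sketch, assuming TLS is enabled on the transport port and openssl is available (hostname and file path are placeholders):
# Print the expiry date of the certificate presented on the transport port
echo | openssl s_client -connect es-node-1:9300 2>/dev/null | openssl x509 -noout -enddate
# Or inspect a certificate file directly
openssl x509 -in /etc/elasticsearch/certs/node.crt -noout -enddate -subject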
### Performance Optimization
To reduce disconnection risk:
1. Shard allocation awareness: Use rack/zone awareness for fault tolerance (example settings after this list)
2. Index lifecycle management: Automate index rotation to control shard count
3. Query optimization: Use search profiler to identify expensive queries
4. Bulk request tuning: Optimal batch sizes and concurrency settings
5. Circuit breakers: Monitor and adjust circuit breaker limits appropriately
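For point 1, zone awareness needs two pieces of configuration: a custom attribute on each node and the awareness setting that references it. A minimal elasticsearch.yml sketch (the attribute name and zone value are placeholders):
# On each node, tag the node with its zone
node.attr.zone: zone-a
# On all nodes (or dynamically via the cluster settings API): spread shard copies across zones
cluster.routing.allocation.awareness.attributes: zone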
### Disaster Recovery Planning
1. Regular snapshots: Automated snapshot policies to external repositories (an SLM sketch follows this list)
2. Cross-cluster replication: For critical indices, maintain replica clusters
3. Documentation: Maintain runbooks for common failure scenarios
4. Testing: Regularly test node failure and recovery procedures
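For point 1, snapshot lifecycle management (available from Elasticsearch 7.4) can run the snapshot policy for you. A sketch, assuming a snapshot repository named my_backup_repo has already been registered (policy name and schedule are placeholders):
curl -X PUT "localhost:9200/_slm/policy/nightly-snapshots" -u "username:password" -H 'Content-Type: application/json' -d'
{
  "schedule": "0 30 1 * * ?",
  "name": "<nightly-snap-{now/d}>",
  "repository": "my_backup_repo",
  "config": { "indices": ["*"] },
  "retention": { "expire_after": "30d", "min_count": 5, "max_count": 50 }
}
'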