This error occurs when the Elasticsearch cluster state has not yet been fully recovered or initialized after startup or node failures. The cluster is in a transitional state and temporarily blocks operations. To resolve it, wait for the cluster to recover, verify node connectivity, ensure adequate disk space, and check the cluster configuration.
The "ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized]" error indicates that the Elasticsearch cluster is in the process of recovering its state and is not yet ready to serve requests. When Elasticsearch starts up or after node failures occur, the cluster must go through a state recovery phase. During this phase, the master node collects metadata from all nodes and rebuilds the cluster state. Until this process completes, all data operations are blocked with a SERVICE_UNAVAILABLE exception (HTTP 503). This is a protective mechanism to prevent clients from receiving incomplete or incorrect results while the cluster is unstable. Once all nodes join the cluster and the state is fully recovered, the block is automatically removed and operations resume normally.
First, examine what state the cluster is in and what operations are pending:
# Check cluster health
curl -X GET "localhost:9200/_cluster/health?pretty" -u "username:password"
# List pending cluster tasks (may timeout during recovery)
curl -X GET "localhost:9200/_cluster/pending_tasks?pretty" -u "username:password"
# Check cluster state version
curl -X GET "localhost:9200/_cluster/state/metadata?pretty" -u "username:password"
# Get node discovery status
curl -X GET "localhost:9200/_nodes?pretty" -u "username:password"
# Check node count
curl -X GET "localhost:9200/_cat/nodes?v" -u "username:password"Look for:
- All expected nodes present in the node list
- Active master node elected
- Number of pending tasks (high number = still recovering)
- Cluster version number (should be incrementing)
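To pull just those fields in a single call, the health request can be combined with the standard filter_path response filter (the field names below are part of the normal _cluster/health response):
# Show only the recovery-relevant health fields
curl -X GET "localhost:9200/_cluster/health?filter_path=status,number_of_nodes,number_of_pending_tasks,initializing_shards,unassigned_shards&pretty" -u "username:password"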
Insufficient disk space can prevent nodes from starting and cause state recovery to fail:
# Check disk usage on all nodes
curl -X GET "localhost:9200/_nodes/stats/fs?pretty" -u "username:password"
# More detailed allocation information
curl -X GET "localhost:9200/_cat/allocation?v&pretty" -u "username:password"
# On each node, check filesystem directly
df -h /path/to/elasticsearch/data
Ensure:
- All nodes have at least 10% free disk space
- No node is above 85% disk usage
- Elasticsearch data directories are not full
If disk space is low:
# Delete old indices to free space
curl -X DELETE "localhost:9200/old-index-name" -u "username:password"
# Add new disk volume and restart node
# Or delete expired snapshots if present
curl -X DELETE "localhost:9200/_snapshot/repo-name/snapshot-name" -u "username:password"Network connectivity issues can prevent cluster formation:
# Check node connectivity from primary
curl -X GET "localhost:9200/_nodes?pretty" -u "username:password"
# On each node, check the logs for connection errors
tail -f /var/log/elasticsearch/elasticsearch.log | grep -i "connection\|error\|discovery"
# Verify network connectivity between nodes manually
ping <other-node-ip>
telnet <other-node-ip> 9300
# Check discovery settings in elasticsearch.yml:
# discovery.seed_hosts: ["node1:9300", "node2:9300", "node3:9300"]
# cluster.initial_master_nodes: ["node1", "node2", "node3"]
Ensure:
- All nodes can reach the transport port (default 9300) on all other nodes
- Discovery seed hosts are correctly configured
- Firewall rules allow node-to-node communication
- Nodes are on the same network or have proper routing
In many cases, the cluster will recover automatically given enough time:
# Monitor recovery progress
watch -n 5 'curl -s "localhost:9200/_cluster/health?pretty" -u "username:password" | head -20'
# Wait for the cluster to reach at least yellow status (returns when reached or when the timeout expires)
curl -X GET "localhost:9200/_cluster/health?wait_for_status=yellow&timeout=30s&pretty" -u "username:password"
# For large clusters, recovery can take several minutes to hours
# Increase the timeout and monitor logs
Recovery timeline:
- Small clusters (< 5 nodes): Usually 30 seconds to 2 minutes
- Medium clusters (5-20 nodes): 2-10 minutes
- Large clusters (> 20 nodes): 10+ minutes depending on shard count
Do NOT force restart nodes while recovery is in progress.
Incorrect gateway settings can cause the cluster to wait indefinitely:
# Check current gateway settings
curl -X GET "localhost:9200/_cluster/settings?include_defaults=true&filter_path=**.gateway.*&pretty" -u "username:password"
# For a 3-node cluster, check these settings in elasticsearch.yml:
# gateway.recover_after_nodes: 2
# gateway.recover_after_time: 5m
# gateway.expected_nodes: 3
# Verify the minimum master nodes setting (Elasticsearch 6.x and earlier)
curl -X GET "localhost:9200/_cluster/settings?include_defaults=true&filter_path=**.discovery.zen.*&pretty" -u "username:password"
# For Elasticsearch 7.0+, check the node-level setting instead:
curl -X GET "localhost:9200/_nodes/settings?filter_path=**.cluster.initial_master_nodes&pretty" -u "username:password"
Guidelines for gateway settings:
- gateway.recover_after_nodes: Set to (total_nodes / 2) + 1
- gateway.expected_nodes: Set to actual number of nodes
- gateway.recover_after_time: Set to 5-10 minutes for initial recovery
Example for a 3-node cluster (the gateway.* settings are static node settings, so they belong in elasticsearch.yml on each master-eligible node and take effect after a restart; they cannot be changed through the cluster settings API):
# elasticsearch.yml on each master-eligible node:
# gateway.recover_after_nodes: 2
# gateway.expected_nodes: 3
# gateway.recover_after_time: 5m
# Restart the node for the change to take effect:
systemctl restart elasticsearch
As a last resort, if the cluster is completely stuck and unresponsive:
# CAUTION: Only use these steps if recovery has not completed after 30+ minutes
# and all nodes are present and reachable
# Option 1: Restart the master node
# 1. Identify current master
curl -X GET "localhost:9200/_nodes?filter_path=nodes.*.name,nodes.*.master_node&pretty" -u "username:password"
# 2. Stop the master node gracefully
# On the master node server:
pkill -TERM -f "org.elasticsearch.bootstrap.Elasticsearch"
# 3. Wait 30 seconds, then restart it
systemctl restart elasticsearch
# or
./bin/elasticsearch
# Option 2: Temporarily lower the minimum master nodes requirement (Elasticsearch 6.x and earlier only; the setting is ignored in 7.0+)
curl -X PUT "localhost:9200/_cluster/settings?pretty" -u "username:password" -H 'Content-Type: application/json' -d'
{
"transient": {
"discovery.zen.minimum_master_nodes": 1
}
}
'
# Option 3: Reset cluster state (destructive - last resort only!)
# Stop all nodes completely, delete data/nodes directory, restart fresh
# This will lose all cluster state but allows starting over:
systemctl stop elasticsearch
rm -rf /var/lib/elasticsearch/nodes/0
systemctl start elasticsearch
Warning: Options 2 and 3 can cause data loss or corruption. Only attempt after confirming normal recovery will not work.
## Advanced Cluster Recovery Topics
### Understanding Cluster State Recovery
The Elasticsearch cluster state contains:
- Cluster metadata (indices, shards, mappings)
- Node information
- Cluster settings
- Index settings and aliases
After a master election, the new master must:
1. Collect metadata from all nodes
2. Rebuild the full cluster state
3. Determine shard allocation
4. Assign shards to nodes
5. Announce the new state to all nodes
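To see how far the master has gotten, selected sections of the cluster state can be requested directly; master_node, nodes, and blocks are standard metrics of the _cluster/state API, and the blocks section lists the SERVICE_UNAVAILABLE/1 block while it is still active (note that this call may itself be refused until enough of the state is available):
# Inspect the elected master, known nodes, and any active cluster blocks
curl -X GET "localhost:9200/_cluster/state/master_node,nodes,blocks?pretty" -u "username:password"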
### Gateway Recovery Process
The gateway module persists cluster state to disk on all nodes:
- When cluster starts, nodes load persisted state
- Master waits for minimum nodes to report their state
- After gateway.recover_after_time, recovery proceeds even if not all nodes present
- If a node is still missing after recovery, its shards become unassigned
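If recovery proceeded without some nodes, the resulting unassigned shards can be listed directly; unassigned.reason is a standard _cat/shards column:
# List shards left unassigned after recovery, with the reason they are unassigned
curl -X GET "localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason" -u "username:password" | grep UNASSIGNED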
### Common Scenarios
Multi-Node Failure Scenario:
If multiple nodes crash simultaneously:
1. Remaining nodes detect failures via heartbeat timeout (default 30s)
2. Master initiates re-election if quorum still exists
3. New master waits for gateway.recover_after_nodes to join
4. Once met, recovery proceeds and shards are re-allocated
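After the re-election completes, the winner can be confirmed with the _cat/master endpoint:
# Confirm which node is currently the elected master
curl -X GET "localhost:9200/_cat/master?v" -u "username:password"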
Single-Node Cluster:
Single-node clusters have special requirements:
- No quorum needed
- Recovery happens immediately on startup
- gateway.recover_after_nodes: 1 is typical
- No replication, so data loss if node fails
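A minimal configuration sketch for such a node, assuming Elasticsearch 7.0+ where discovery.type: single-node skips master election and quorum checks (cluster and node names are placeholders):
# elasticsearch.yml on the standalone node:
# cluster.name: my-single-node-cluster
# node.name: node-1
# discovery.type: single-node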
Split Brain Prevention:
Elasticsearch prevents split-brain via quorum:
- Quorum of master-eligible nodes = (total_master_eligible_nodes / 2) + 1
- Avoid running the cluster with an even number of master-eligible nodes
- Three, five, or seven master-eligible nodes are ideal
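For Elasticsearch 6.x and earlier the quorum had to be configured explicitly; a sketch for three master-eligible nodes (7.0+ computes the quorum automatically and ignores this setting):
# elasticsearch.yml on each master-eligible node (6.x and earlier):
# discovery.zen.minimum_master_nodes: 2   # (3 / 2) + 1 = 2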
### Monitoring Recovery Progress
# Check recovery status
curl -X GET "localhost:9200/_recovery?human&pretty" -u "username:password"
# Monitor shard allocation
curl -X GET "localhost:9200/_cat/shards?h=index,shard,prirep,state,node&v" -u "username:password"
# Check cluster routing decisions
curl -X GET "localhost:9200/_cluster/allocation/explain?pretty" -u "username:password"### Performance Tuning for Recovery
# Increase recovery parallelism for faster recovery
curl -X PUT "localhost:9200/_cluster/settings?pretty" -u "username:password" -H 'Content-Type: application/json' -d'
{
"transient": {
"cluster.routing.allocation.node_concurrent_recoveries": 5,
"indices.recovery.max_bytes_per_sec": "200mb",
"indices.recovery.concurrent_streams": 5
}
}
'
# For very large indices, reduce recovery pressure
curl -X PUT "localhost:9200/_cluster/settings?pretty" -u "username:password" -H 'Content-Type: application/json' -d'
{
"transient": {
"indices.recovery.max_bytes_per_sec": "50mb",
"cluster.routing.allocation.node_concurrent_recoveries": 2
}
}
'
### Disk-Based Block vs State Recovery Block
- Disk block: Related to disk space thresholds (FORBIDDEN/12)
- State recovery block: Related to cluster formation (SERVICE_UNAVAILABLE/1)
The SERVICE_UNAVAILABLE/1 (state recovery) block applies when:
- Master hasn't recovered cluster state yet
- Not enough nodes in cluster
- Cluster metadata is corrupted
FORBIDDEN (disk and read-only) blocks apply when:
- Disk usage exceeds thresholds
- Index-level blocks applied
- Read-only settings enforced
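If the block turns out to be the disk-based one rather than state recovery, the flood-stage read-only flag can be cleared once disk space has been freed; index.blocks.read_only_allow_delete is the standard setting behind that block:
# Clear the flood-stage read-only block on all indices after freeing disk space
curl -X PUT "localhost:9200/_all/_settings?pretty" -u "username:password" -H 'Content-Type: application/json' -d'
{
"index.blocks.read_only_allow_delete": null
}
'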
### Long-Running Recovery Scenarios
For very large clusters:
1. Create a separate recovery cluster if possible
2. Temporarily reduce shard count
3. Use forced merge to consolidate segments
4. Increase heap size during recovery (Xms and Xmx equal)
5. Monitor CPU and I/O, not just network
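Heap size is set through JVM options rather than elasticsearch.yml; a sketch assuming a 7.7+ package install that supports the jvm.options.d directory (the 16g value is only an example; keep heap at roughly half of RAM and below ~30 GB):
# /etc/elasticsearch/jvm.options.d/heap.options
# -Xms16g
# -Xmx16g
# Restart the node afterwards for the new heap size to apply:
systemctl restart elasticsearch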