This error occurs when a Redis cluster becomes unavailable due to node failures or missing hash slot coverage. The cluster stops accepting requests until all hash slots are covered again by available nodes.
The "CLUSTERDOWN The cluster is down" error indicates that the Redis cluster has entered an unavailable state. By default, cluster nodes refuse all queries if they detect that at least one hash slot (a portion of the data space) is not covered by any working node.

Redis Cluster divides the keyspace into 16,384 hash slots. Each key maps to exactly one slot, and each node is responsible for a range of slots. When a master node and all of its replicas go down simultaneously, the slots they manage become uncovered, triggering a cluster-wide shutdown. This behavior is controlled by the `cluster-require-full-coverage` configuration setting (default: `yes`) and exists as a safety mechanism to prevent the cluster from serving stale or incorrect data during partial outages.

**Common scenarios:**

- A master node and all its replicas crash simultaneously
- Network partitions isolate master nodes from the cluster
- Long-running commands or Lua scripts cause nodes to appear dead to the cluster
- Slot migration operations hang, leaving slots in an inconsistent state
- Multiple nodes fail before the cluster can rebalance
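The key-to-slot mapping can be reproduced client-side. Below is a minimal Python sketch of Redis Cluster's slot function (CRC-16/XMODEM modulo 16384), including hash-tag handling, where only the substring between the first `{` and the following `}` is hashed when the tag is non-empty:

```python
def crc16(data: bytes) -> int:
    """CRC-16/XMODEM, the checksum Redis Cluster uses for key hashing."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc

def key_slot(key: str) -> int:
    """Map a key to one of the 16,384 hash slots, honoring {hash tags}."""
    k = key.encode()
    start = k.find(b"{")
    if start != -1:
        end = k.find(b"}", start + 1)
        if end != -1 and end != start + 1:  # only a non-empty tag counts
            k = k[start + 1:end]
    return crc16(k) % 16384

# Keys sharing a hash tag land in the same slot, so they can be used
# together in multi-key operations on a cluster:
print(key_slot("{user:1000}.profile") == key_slot("{user:1000}.cart"))  # True
```

This is why losing one master takes out a contiguous range of slots: every key deterministically belongs to exactly one of them.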
Get a quick overview of the cluster state:
```shell
# On any accessible cluster node
redis-cli -h <node-ip> -p <port> cluster info

# Check detailed slot status
redis-cli -h <node-ip> -p <port> cluster slots

# Check individual node status
redis-cli -h <node-ip> -p <port> cluster nodes
```

Look for:

- `cluster_state:fail` - confirms the cluster is down
- `cluster_slots_assigned` vs `cluster_slots_ok` - reveals slots that are assigned but not being served
- `cluster_my_epoch` and node epoch values - help identify outdated nodes
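These fields are easy to check programmatically. A small Python sketch of parsing CLUSTER INFO output (the field names match real output; the sample values below are illustrative):

```python
def parse_cluster_info(raw: str) -> dict:
    """Parse the 'key:value' lines returned by CLUSTER INFO."""
    info = {}
    for line in raw.strip().splitlines():
        key, _, value = line.partition(":")
        info[key.strip()] = value.strip()
    return info

# Illustrative output from a cluster that has lost a third of its slots
sample = """cluster_state:fail
cluster_slots_assigned:16384
cluster_slots_ok:10923
cluster_slots_pfail:0
cluster_slots_fail:5461"""

info = parse_cluster_info(sample)
healthy = info["cluster_state"] == "ok" and info["cluster_slots_ok"] == "16384"
missing = int(info["cluster_slots_assigned"]) - int(info["cluster_slots_ok"])
print(healthy, missing)  # False 5461
```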
Identify failed nodes to understand the scope:
```shell
# From a node that's still accessible
redis-cli -h <node-ip> -p <port> cluster nodes | grep -E 'master|slave'

# Output shows node status (connected/disconnected). Example:
# 1a2b3c... 10.0.0.1:6379 master - 0 1500000000000 1 connected 0-5460
# 4d5e6f... 10.0.0.2:6379 slave 1a2b3c... 0 1500000001000 1 connected
```

Key indicators in the output:

- `master`/`slave`: node role
- `connected`/`disconnected`: current connectivity status
- Epoch value: helps detect split-brain scenarios
- Slot ranges: which slots the node manages
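The CLUSTER NODES format is stable (id, address, flags, master-id, ping-sent, pong-received, config-epoch, link-state, then slot ranges), so failed masters can be extracted with a few lines of Python. The sample lines below are illustrative:

```python
def parse_cluster_nodes(raw: str):
    """Parse CLUSTER NODES lines into simple node records."""
    nodes = []
    for line in raw.strip().splitlines():
        parts = line.split()
        nodes.append({
            "id": parts[0],
            "addr": parts[1],
            "role": "master" if "master" in parts[2].split(",") else "slave",
            "link": parts[7],    # connected / disconnected
            "slots": parts[8:],  # slot ranges (masters only)
        })
    return nodes

sample = (
    "1a2b3c 10.0.0.1:6379@16379 master - 0 1500000000000 1 connected 0-5460\n"
    "4d5e6f 10.0.0.2:6379@16379 slave 1a2b3c 0 1500000001000 1 connected\n"
    "7g8h9i 10.0.0.3:6379@16379 master,fail - 0 1500000002000 2 disconnected 5461-10922"
)

down_masters = [n["addr"] for n in parse_cluster_nodes(sample)
                if n["role"] == "master" and n["link"] != "connected"]
print(down_masters)  # ['10.0.0.3:6379@16379']
```

Any master in the result owns slots that are currently uncovered.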
The quickest fix is often to restart the failed nodes:
```shell
# Via systemd (Linux)
sudo systemctl restart redis-server

# Or if using the redis_<port> naming convention
sudo systemctl restart redis_6379

# Via direct command
redis-server /etc/redis/redis.conf

# Check if the node rejoins the cluster
redis-cli -h <node-ip> -p <port> cluster info
```

Monitor the restart:

```shell
# Watch cluster recovery
watch -n 1 'redis-cli -h <node-ip> -p <port> cluster info | grep cluster_state'
```

The cluster should recover automatically once the node becomes available and syncs with the rest of the cluster.
If slot migration is stuck, force the slots to a stable state:
```shell
# Find stuck slots: they appear in CLUSTER NODES output as
# [<slot>->-<node-id>] (migrating) or [<slot>-<-<node-id>] (importing)
redis-cli -h <node-ip> -p <port> cluster nodes

# On BOTH the source and target nodes, stabilize the slot
redis-cli -h <source-node> -p <port> CLUSTER SETSLOT <slot-number> STABLE
redis-cli -h <target-node> -p <port> CLUSTER SETSLOT <slot-number> STABLE

# Example: stabilize slot 5000
redis-cli -h 10.0.0.1 -p 6379 CLUSTER SETSLOT 5000 STABLE
redis-cli -h 10.0.0.2 -p 6379 CLUSTER SETSLOT 5000 STABLE

# Verify the cluster recovers
redis-cli -h <node-ip> -p <port> cluster info
```

This clears any importing/migrating state without losing data.
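In CLUSTER NODES output, a migrating slot is shown on the source node as `[<slot>->-<destination-id>]` and an importing slot on the target as `[<slot>-<-<source-id>]`. A small Python sketch for extracting those stuck slots from the raw output (the sample line is illustrative):

```python
import re

# [slot->-nodeid] = migrating out, [slot-<-nodeid] = importing in
STUCK = re.compile(r"\[(\d+)(->-|-<-)([0-9a-f]+)\]")

def stuck_slots(raw: str):
    """Return (slot, direction) pairs for slots caught mid-migration."""
    return [(int(slot), "migrating" if arrow == "->-" else "importing")
            for slot, arrow, _node in STUCK.findall(raw)]

line = ("1a2b3c 10.0.0.1:6379@16379 master - 0 1500000000000 1 connected "
        "0-5460 [5000->-4d5e6f]")
print(stuck_slots(line))  # [(5000, 'migrating')]
```

Each slot found this way is a candidate for `CLUSTER SETSLOT <slot> STABLE` on both ends of the migration.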
If only a subset of nodes are down and you need the cluster operational, temporarily disable full-coverage:
```shell
# On EVERY reachable node (the setting is enforced per node)
redis-cli -h <node-ip> -p <port> CONFIG SET cluster-require-full-coverage no

# Verify the change
redis-cli -h <node-ip> -p <port> CONFIG GET cluster-require-full-coverage

# The cluster should now serve requests for available slots
redis-cli -h <node-ip> -p <port> cluster info
```

IMPORTANT: This is a temporary measure. The cluster will only serve keys in slots that are still covered. Make sure to:

1. Restart the failed nodes as soon as possible
2. Restore `cluster-require-full-coverage yes` in production configs
3. Investigate the root cause of the node failures
If nodes have incorrect slot assignments after failure recovery:
```shell
# Use the redis-cli --cluster fix command (Redis 5.0+)
redis-cli --cluster fix <node-ip>:<port>

# For older Redis versions, use redis-trib.rb
./redis-trib.rb fix <node-ip>:<port>

# Check the results
redis-cli -h <node-ip> -p <port> cluster info
```

This command:

- Detects missing slots
- Identifies nodes with extra slots
- Automatically rebalances slot assignments
- Repairs cluster metadata inconsistencies

Before running it:

- Ensure all nodes are running and connected
- Keep a backup of your cluster configuration
- Run it during a maintenance window if possible
If nodes are being marked as failed too aggressively, increase timeout thresholds:
```shell
# On each cluster node, temporarily raise the failure-detection timeout
redis-cli -h <node-ip> -p <port> CONFIG SET cluster-node-timeout 30000

# Also check and adjust the Lua script time limit (default 5000 ms)
redis-cli -h <node-ip> -p <port> CONFIG SET lua-time-limit 10000

# Verify changes
redis-cli -h <node-ip> -p <port> CONFIG GET cluster-node-timeout
redis-cli -h <node-ip> -p <port> CONFIG GET lua-time-limit
```

Configuration guide:

- `cluster-node-timeout`: milliseconds a node may be unreachable before it is considered failed (default: 15000)
- `lua-time-limit`: milliseconds a Lua script may run before Redis starts accepting `SCRIPT KILL` / `SHUTDOWN NOSAVE` (default: 5000); it does not terminate the script on its own

Important: make these changes permanent by updating redis.conf on every node and restarting.
After recovery, monitor the cluster to prevent future outages:
```shell
# Real-time cluster status monitoring
while true; do
  echo "=== Cluster Status at $(date) ==="
  redis-cli -h <node-ip> -p <port> cluster info | grep -E 'cluster_state|cluster_slots'
  redis-cli -h <node-ip> -p <port> cluster nodes | grep -E 'master|slave' | awk '{print $1, $2, $3, $8}'
  sleep 10
done

# Or use redis-cli's built-in cluster health check
redis-cli --cluster check <node-ip>:<port>
```

Key metrics to watch:

- `cluster_state`: should be "ok"
- `cluster_slots_ok`: should equal 16384
- `cluster_slots_assigned`: should equal 16384
- Node epoch values: should be synchronized
- All nodes: should show "connected" status
Address the underlying problem to prevent recurrence:
```shell
# Check Redis logs for errors
tail -f /var/log/redis/redis-server.log

# Monitor system resources on affected nodes
watch -n 1 'free -h && df -h && ps aux | grep redis'

# Check for out-of-memory conditions
redis-cli -h <node-ip> -p <port> info memory

# Look for replication lag (if applicable)
redis-cli -h <node-ip> -p <port> info replication
```

Common root causes to investigate:

- Out of memory: increase the memory limit or evict old keys
- Disk full: free up disk space; review persistence settings
- Long-running commands: optimize slow queries, adjust script timeouts
- Network issues: check firewall rules and connectivity between nodes
- High CPU/load: profile and optimize application code
- Version mismatch: ensure all nodes run compatible Redis versions
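Memory pressure is the most common of these causes and is simple to quantify from INFO memory output. A Python sketch using the real `used_memory` and `maxmemory` fields (the sample values are illustrative):

```python
def memory_pressure(info_memory: str) -> float:
    """Return used_memory / maxmemory from INFO memory output (0.0 if no limit set)."""
    fields = dict(
        line.split(":", 1) for line in info_memory.strip().splitlines()
        if ":" in line and not line.startswith("#")
    )
    used = int(fields["used_memory"])
    limit = int(fields.get("maxmemory", "0"))
    return used / limit if limit else 0.0

sample = """# Memory
used_memory:3221225472
maxmemory:4294967296"""

print(memory_pressure(sample))  # 0.75
```

A ratio approaching 1.0 on any node is a warning sign: evictions, OOM kills, or stalls may follow, any of which can knock the node out of the cluster.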
Understanding Redis cluster architecture:
Redis Cluster does not rely on an external consensus service; instead, every node participates in a gossip protocol, exchanging ping/pong messages to detect failures. If a node misses pongs from another node within `cluster-node-timeout`, it marks that node as possibly failed (PFAIL). When a majority of master nodes agree the node is down, it is promoted to a confirmed failure (FAIL) state and failover can begin.
Hash slots and cluster availability:
Redis divides the keyspace into 16,384 hash slots using CRC16:
- Each key maps to a slot: slot = CRC16(key) % 16384
- Each cluster node manages a range of slots
- When a node goes offline, its slots become unreachable
- With cluster-require-full-coverage: yes (default), ANY uncovered slot causes cluster-wide shutdown
- With cluster-require-full-coverage: no, uncovered slots only fail reads/writes for keys that map to those slots
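Given the slot ranges owned by the reachable masters, the uncovered ranges that trigger CLUSTERDOWN can be computed directly. A sketch, assuming the (start, end) pairs come from CLUSTER NODES or CLUSTER SLOTS output:

```python
def uncovered_slots(owned_ranges):
    """Return uncovered slot ranges, given inclusive (start, end) covered pairs."""
    covered = [False] * 16384
    for start, end in owned_ranges:
        for slot in range(start, end + 1):
            covered[slot] = True
    gaps, i = [], 0
    while i < 16384:
        if covered[i]:
            i += 1
            continue
        j = i  # extend the gap to its last uncovered slot
        while j + 1 < 16384 and not covered[j + 1]:
            j += 1
        gaps.append((i, j))
        i = j + 1
    return gaps

# Two of three masters reachable; the middle third of the keyspace is lost:
print(uncovered_slots([(0, 5460), (10923, 16383)]))  # [(5461, 10922)]
```

With `cluster-require-full-coverage: yes`, any non-empty result here means the whole cluster refuses queries; with `no`, only keys hashing into the returned ranges fail.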
Preventing CLUSTERDOWN outages:
1. Use replica nodes: Deploy at least one replica per master. If a master fails, its replica can take over.
2. Tune cluster-node-timeout: Set it higher than expected network latencies but low enough to detect real failures (recommended: 15000-30000ms).
3. Avoid long-running commands: Configure `maxmemory-policy` appropriately and set `lua-time-limit` conservatively.
4. Monitor cluster metrics: Track slot coverage, node connectivity, and memory usage continuously.
5. Use managed Redis services: Cloud Redis offerings (AWS ElastiCache, Google Cloud Memorystore) handle cluster management automatically.
Recovery considerations:
When restarting failed nodes, the cluster goes through:
1. Sync phase: Node loads its RDB snapshot and/or replays its AOF log
2. Handshake phase: Node joins cluster gossip protocol
3. Rebalance phase: Cluster updates slot assignments if needed
For large datasets, this can take minutes. During this time, the cluster remains unavailable if using cluster-require-full-coverage: yes. Monitor info replication to track progress.
Cluster failover modes:
- Automatic failover: Replica automatically promotes to master when master fails (requires working cluster)
- Manual failover: Admin forces a specific replica to become master using CLUSTER FAILOVER command
- Forced failover: Admin removes failed master and reassigns its slots (dangerous, use as last resort)
Split-brain scenarios:
If your cluster splits into isolated partitions, each partition may think it's the complete cluster. This can lead to:
- Conflicting slot assignments
- Data inconsistencies
- Nodes with mismatched configuration epochs
Always resolve network partitions immediately and run --cluster fix to repair any inconsistencies.