This error occurs when a Redis cluster becomes unavailable due to node failures or missing hash slot coverage. The cluster stops accepting requests until all hash slots are covered again by available nodes.
The "CLUSTERDOWN The cluster is down" error indicates that the Redis cluster has entered an unavailable state. By default, cluster nodes refuse all queries if they detect that at least one hash slot (a portion of the data space) is not covered by any working node.

Redis Cluster divides the keyspace into 16,384 hash slots. Each key maps to exactly one slot, and each node is responsible for a range of slots. When a master node and all of its replicas go down simultaneously, the slots they manage become uncovered, triggering a cluster-wide shutdown. This behavior is controlled by the `cluster-require-full-coverage` configuration setting (default: `yes`) and exists as a safety mechanism to prevent the cluster from serving stale or incorrect data during partial outages.

**Common scenarios:**

- A master node and all its replicas crash simultaneously
- Network partitions isolate master nodes from the cluster
- Long-running commands or Lua scripts cause nodes to appear dead to the cluster
- Slot migration operations hang, leaving slots in an inconsistent state
- Multiple nodes fail before the cluster can rebalance
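The key-to-slot mapping can be reproduced client-side. Below is a minimal Python sketch of Redis Cluster's slot function (CRC-16/XMODEM modulo 16384), including hash-tag handling, where only the substring between the first `{` and the following `}` is hashed when the tag is non-empty:

```python
def crc16(data: bytes) -> int:
    """CRC-16/XMODEM, the checksum Redis Cluster uses for key hashing."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc

def key_slot(key: str) -> int:
    """Map a key to one of the 16,384 hash slots, honoring {hash tags}."""
    k = key.encode()
    start = k.find(b"{")
    if start != -1:
        end = k.find(b"}", start + 1)
        if end != -1 and end != start + 1:  # only a non-empty tag counts
            k = k[start + 1:end]
    return crc16(k) % 16384

# Keys sharing a hash tag land in the same slot, so they can be used
# together in multi-key operations on a cluster:
print(key_slot("{user:1000}.profile") == key_slot("{user:1000}.cart"))  # True
```

This is why losing one master takes out a contiguous range of slots: every key deterministically belongs to exactly one of them.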
Get a quick overview of the cluster state:
```shell
# On any accessible cluster node
redis-cli -h <node-ip> -p <port> cluster info

# Check detailed slot status
redis-cli -h <node-ip> -p <port> cluster slots

# Check individual node status
redis-cli -h <node-ip> -p <port> cluster nodes
```

Look for:

- `cluster_state:fail` - confirms the cluster is down
- `cluster_slots_assigned` vs `cluster_slots_ok` - reveals slots that are assigned but not being served
- `cluster_my_epoch` and node epoch values - help identify outdated nodes
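These fields are easy to check programmatically. A small Python sketch of parsing CLUSTER INFO output (the field names match real output; the sample values below are illustrative):

```python
def parse_cluster_info(raw: str) -> dict:
    """Parse the 'key:value' lines returned by CLUSTER INFO."""
    info = {}
    for line in raw.strip().splitlines():
        key, _, value = line.partition(":")
        info[key.strip()] = value.strip()
    return info

# Illustrative output from a cluster that has lost a third of its slots
sample = """cluster_state:fail
cluster_slots_assigned:16384
cluster_slots_ok:10923
cluster_slots_pfail:0
cluster_slots_fail:5461"""

info = parse_cluster_info(sample)
healthy = info["cluster_state"] == "ok" and info["cluster_slots_ok"] == "16384"
missing = int(info["cluster_slots_assigned"]) - int(info["cluster_slots_ok"])
print(healthy, missing)  # False 5461
```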
Identify failed nodes to understand the scope:
```shell
# From a node that's still accessible
redis-cli -h <node-ip> -p <port> cluster nodes | grep -E 'master|slave'

# Output shows node status (connected/disconnected). Example:
# 1a2b3c... 10.0.0.1:6379 master - 0 1500000000000 1 connected 0-5460
# 4d5e6f... 10.0.0.2:6379 slave 1a2b3c... 0 1500000001000 1 connected
```

Key indicators in the output:

- `master`/`slave`: node role
- `connected`/`disconnected`: current connectivity status
- Epoch value: helps detect split-brain scenarios
- Slot ranges: which slots the node manages
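The CLUSTER NODES format is stable (id, address, flags, master-id, ping-sent, pong-received, config-epoch, link-state, then slot ranges), so failed masters can be extracted with a few lines of Python. The sample lines below are illustrative:

```python
def parse_cluster_nodes(raw: str):
    """Parse CLUSTER NODES lines into simple node records."""
    nodes = []
    for line in raw.strip().splitlines():
        parts = line.split()
        nodes.append({
            "id": parts[0],
            "addr": parts[1],
            "role": "master" if "master" in parts[2].split(",") else "slave",
            "link": parts[7],    # connected / disconnected
            "slots": parts[8:],  # slot ranges (masters only)
        })
    return nodes

sample = (
    "1a2b3c 10.0.0.1:6379@16379 master - 0 1500000000000 1 connected 0-5460\n"
    "4d5e6f 10.0.0.2:6379@16379 slave 1a2b3c 0 1500000001000 1 connected\n"
    "7g8h9i 10.0.0.3:6379@16379 master,fail - 0 1500000002000 2 disconnected 5461-10922"
)

down_masters = [n["addr"] for n in parse_cluster_nodes(sample)
                if n["role"] == "master" and n["link"] != "connected"]
print(down_masters)  # ['10.0.0.3:6379@16379']
```

Any master in the result owns slots that are currently uncovered.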
The quickest fix is often to restart the failed nodes:
```shell
# Via systemd (Linux)
sudo systemctl restart redis-server

# Or if using the redis_<port> naming convention
sudo systemctl restart redis_6379

# Via direct command
redis-server /etc/redis/redis.conf

# Check if the node rejoins the cluster
redis-cli -h <node-ip> -p <port> cluster info
```

Monitor the restart:

```shell
# Watch cluster recovery
watch -n 1 'redis-cli -h <node-ip> -p <port> cluster info | grep cluster_state'
```

The cluster should recover automatically once the node becomes available and syncs with the rest of the cluster.
If slot migration is stuck, force the slots to a stable state:
```shell
# Find stuck slots: they appear in CLUSTER NODES output as
# [<slot>->-<node-id>] (migrating) or [<slot>-<-<node-id>] (importing)
redis-cli -h <node-ip> -p <port> cluster nodes

# On BOTH the source and target nodes, stabilize the slot
redis-cli -h <source-node> -p <port> CLUSTER SETSLOT <slot-number> STABLE
redis-cli -h <target-node> -p <port> CLUSTER SETSLOT <slot-number> STABLE

# Example: stabilize slot 5000
redis-cli -h 10.0.0.1 -p 6379 CLUSTER SETSLOT 5000 STABLE
redis-cli -h 10.0.0.2 -p 6379 CLUSTER SETSLOT 5000 STABLE

# Verify the cluster recovers
redis-cli -h <node-ip> -p <port> cluster info
```

This clears any importing/migrating state without losing data.
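In CLUSTER NODES output, a migrating slot is shown on the source node as `[<slot>->-<destination-id>]` and an importing slot on the target as `[<slot>-<-<source-id>]`. A small Python sketch for extracting those stuck slots from the raw output (the sample line is illustrative):

```python
import re

# [slot->-nodeid] = migrating out, [slot-<-nodeid] = importing in
STUCK = re.compile(r"\[(\d+)(->-|-<-)([0-9a-f]+)\]")

def stuck_slots(raw: str):
    """Return (slot, direction) pairs for slots caught mid-migration."""
    return [(int(slot), "migrating" if arrow == "->-" else "importing")
            for slot, arrow, _node in STUCK.findall(raw)]

line = ("1a2b3c 10.0.0.1:6379@16379 master - 0 1500000000000 1 connected "
        "0-5460 [5000->-4d5e6f]")
print(stuck_slots(line))  # [(5000, 'migrating')]
```

Each slot found this way is a candidate for `CLUSTER SETSLOT <slot> STABLE` on both ends of the migration.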
If only a subset of nodes are down and you need the cluster operational, temporarily disable full-coverage:
```shell
# On EVERY reachable node (the setting is enforced per node)
redis-cli -h <node-ip> -p <port> CONFIG SET cluster-require-full-coverage no

# Verify the change
redis-cli -h <node-ip> -p <port> CONFIG GET cluster-require-full-coverage

# The cluster should now serve requests for available slots
redis-cli -h <node-ip> -p <port> cluster info
```

IMPORTANT: This is a temporary measure. The cluster will only serve keys in slots that are still covered. Make sure to:

1. Restart the failed nodes as soon as possible
2. Restore `cluster-require-full-coverage yes` in production configs
3. Investigate the root cause of the node failures
If nodes have incorrect slot assignments after failure recovery:
```shell
# Use the redis-cli --cluster fix command (Redis 5.0+)
redis-cli --cluster fix <node-ip>:<port>

# For older Redis versions, use redis-trib.rb
./redis-trib.rb fix <node-ip>:<port>

# Check the results
redis-cli -h <node-ip> -p <port> cluster info
```

This command:

- Detects missing slots
- Identifies nodes with extra slots
- Automatically rebalances slot assignments
- Repairs cluster metadata inconsistencies

Before running it:

- Ensure all nodes are running and connected
- Keep a backup of your cluster configuration
- Run it during a maintenance window if possible
If nodes are being marked as failed too aggressively, increase timeout thresholds:
```shell
# On each cluster node, temporarily raise the failure-detection timeout
redis-cli -h <node-ip> -p <port> CONFIG SET cluster-node-timeout 30000

# Also check and adjust the Lua script time limit (default 5000 ms)
redis-cli -h <node-ip> -p <port> CONFIG SET lua-time-limit 10000

# Verify changes
redis-cli -h <node-ip> -p <port> CONFIG GET cluster-node-timeout
redis-cli -h <node-ip> -p <port> CONFIG GET lua-time-limit
```

Configuration guide:

- `cluster-node-timeout`: milliseconds a node may be unreachable before it is considered failed (default: 15000)
- `lua-time-limit`: milliseconds a Lua script may run before Redis starts accepting `SCRIPT KILL` / `SHUTDOWN NOSAVE` (default: 5000); it does not terminate the script on its own

Important: make these changes permanent by updating redis.conf on every node and restarting.
After recovery, monitor the cluster to prevent future outages:
```shell
# Real-time cluster status monitoring
while true; do
  echo "=== Cluster Status at $(date) ==="
  redis-cli -h <node-ip> -p <port> cluster info | grep -E 'cluster_state|cluster_slots'
  redis-cli -h <node-ip> -p <port> cluster nodes | grep -E 'master|slave' | awk '{print $1, $2, $3, $8}'
  sleep 10
done

# Or use redis-cli's built-in cluster health check
redis-cli --cluster check <node-ip>:<port>
```

Key metrics to watch:

- `cluster_state`: should be "ok"
- `cluster_slots_ok`: should equal 16384
- `cluster_slots_assigned`: should equal 16384
- Node epoch values: should be synchronized
- All nodes: should show "connected" status
Address the underlying problem to prevent recurrence:
```shell
# Check Redis logs for errors
tail -f /var/log/redis/redis-server.log

# Monitor system resources on affected nodes
watch -n 1 'free -h && df -h && ps aux | grep redis'

# Check for out-of-memory conditions
redis-cli -h <node-ip> -p <port> info memory

# Look for replication lag (if applicable)
redis-cli -h <node-ip> -p <port> info replication
```

Common root causes to investigate:

- Out of memory: increase the memory limit or evict old keys
- Disk full: free up disk space; review persistence settings
- Long-running commands: optimize slow queries, adjust script timeouts
- Network issues: check firewall rules and connectivity between nodes
- High CPU/load: profile and optimize application code
- Version mismatch: ensure all nodes run compatible Redis versions
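Memory pressure is the most common of these causes and is simple to quantify from INFO memory output. A Python sketch using the real `used_memory` and `maxmemory` fields (the sample values are illustrative):

```python
def memory_pressure(info_memory: str) -> float:
    """Return used_memory / maxmemory from INFO memory output (0.0 if no limit set)."""
    fields = dict(
        line.split(":", 1) for line in info_memory.strip().splitlines()
        if ":" in line and not line.startswith("#")
    )
    used = int(fields["used_memory"])
    limit = int(fields.get("maxmemory", "0"))
    return used / limit if limit else 0.0

sample = """# Memory
used_memory:3221225472
maxmemory:4294967296"""

print(memory_pressure(sample))  # 0.75
```

A ratio approaching 1.0 on any node is a warning sign: evictions, OOM kills, or stalls may follow, any of which can knock the node out of the cluster.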
Understanding Redis cluster architecture:
Redis Cluster does not rely on an external consensus service; instead, every node participates in a gossip protocol, exchanging ping/pong messages to detect failures. If a node misses pongs from another node within `cluster-node-timeout`, it marks that node as possibly failed (PFAIL). When a majority of master nodes agree the node is down, it is promoted to a confirmed failure (FAIL) state and failover can begin.
Hash slots and cluster availability:
Redis divides the keyspace into 16,384 hash slots using CRC16:
- Each key maps to a slot: slot = CRC16(key) % 16384
- Each cluster node manages a range of slots
- When a node goes offline, its slots become unreachable
- With cluster-require-full-coverage: yes (default), ANY uncovered slot causes cluster-wide shutdown
- With cluster-require-full-coverage: no, uncovered slots only fail reads/writes for keys that map to those slots
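Given the slot ranges owned by the reachable masters, the uncovered ranges that trigger CLUSTERDOWN can be computed directly. A sketch, assuming the (start, end) pairs come from CLUSTER NODES or CLUSTER SLOTS output:

```python
def uncovered_slots(owned_ranges):
    """Return uncovered slot ranges, given inclusive (start, end) covered pairs."""
    covered = [False] * 16384
    for start, end in owned_ranges:
        for slot in range(start, end + 1):
            covered[slot] = True
    gaps, i = [], 0
    while i < 16384:
        if covered[i]:
            i += 1
            continue
        j = i  # extend the gap to its last uncovered slot
        while j + 1 < 16384 and not covered[j + 1]:
            j += 1
        gaps.append((i, j))
        i = j + 1
    return gaps

# Two of three masters reachable; the middle third of the keyspace is lost:
print(uncovered_slots([(0, 5460), (10923, 16383)]))  # [(5461, 10922)]
```

With `cluster-require-full-coverage: yes`, any non-empty result here means the whole cluster refuses queries; with `no`, only keys hashing into the returned ranges fail.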
Preventing CLUSTERDOWN outages:
1. Use replica nodes: Deploy at least one replica per master. If a master fails, its replica can take over.
2. Tune cluster-node-timeout: Set it higher than expected network latencies but low enough to detect real failures (recommended: 15000-30000ms).
3. Avoid long-running commands: Configure `maxmemory-policy` appropriately and set `lua-time-limit` conservatively.
4. Monitor cluster metrics: Track slot coverage, node connectivity, and memory usage continuously.
5. Use managed Redis services: Cloud Redis offerings (AWS ElastiCache, Google Cloud Memorystore) handle cluster management automatically.
Recovery considerations:
When restarting failed nodes, the cluster goes through:
1. Sync phase: Node loads its RDB snapshot and/or replays its AOF log
2. Handshake phase: Node joins cluster gossip protocol
3. Rebalance phase: Cluster updates slot assignments if needed
For large datasets, this can take minutes. During this time, the cluster remains unavailable if using cluster-require-full-coverage: yes. Monitor info replication to track progress.
Cluster failover modes:
- Automatic failover: Replica automatically promotes to master when master fails (requires working cluster)
- Manual failover: Admin forces a specific replica to become master using CLUSTER FAILOVER command
- Forced failover: Admin removes failed master and reassigns its slots (dangerous, use as last resort)
Split-brain scenarios:
If your cluster splits into isolated partitions, each partition may think it's the complete cluster. This can lead to:
- Conflicting slot assignments
- Data inconsistencies
- Nodes with mismatched configuration epochs
Always resolve network partitions immediately and run --cluster fix to repair any inconsistencies.