The StaleShardVersion error (code 63) occurs when MongoDB components have outdated metadata about shard versions in a sharded cluster. This typically happens during chunk migrations, configuration changes, or when clients cache stale routing information. Applications should implement retry logic to handle this transient error.
The "StaleShardVersion: shard version mismatch" error is a MongoDB-specific error (code 63) that occurs in sharded cluster environments. It indicates that a mongos router, driver, or client has cached shard-version metadata that no longer matches the current state of the cluster.

Sharded MongoDB clusters maintain version metadata for each chunk of data across shards. When chunks are migrated between shards for load balancing, or when the cluster configuration changes, these version numbers are incremented. Components that interact with the cluster cache this metadata to optimize routing decisions. The error occurs when:

1. A mongos instance has stale routing table information
2. A client driver caches shard metadata that becomes outdated
3. Chunk migrations are in progress and metadata hasn't propagated
4. Network partitions or delays prevent metadata synchronization

This is generally a transient error that applications should handle gracefully with retry mechanisms, as the underlying issue is typically resolved automatically by MongoDB's metadata synchronization processes.
Add retry logic to handle StaleShardVersion errors gracefully. Most MongoDB drivers provide built-in retry mechanisms for transient errors.
```javascript
// Example in Node.js MongoDB driver
const client = new MongoClient(uri, {
  retryWrites: true,
  retryReads: true,
  maxPoolSize: 10,
  minPoolSize: 5,
});

// Or implement custom retry logic
async function executeWithRetry(operation, maxRetries = 3) {
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await operation();
    } catch (error) {
      if (error.code === 63 || error.codeName === 'StaleShardVersion') {
        // Wait before retrying (exponential backoff)
        await new Promise(resolve =>
          setTimeout(resolve, Math.pow(2, i) * 100)
        );
        continue;
      }
      throw error;
    }
  }
  throw new Error('Max retries exceeded');
}
```

Verify that all shards and config servers are healthy. Check if the balancer is actively migrating chunks.
```javascript
// Check balancer status (mongosh)
sh.getBalancerState()
sh.isBalancerRunning()

// Or inspect the balancer lock directly (older deployments)
use config
db.locks.find({ _id: "balancer" })

// Check chunk distribution
db.getSiblingDB("config").chunks.find().count()

// Monitor active migrations
db.currentOp().inprog.forEach(op => {
  if (op.desc && op.desc.includes("migrate")) {
    printjson(op);
  }
});
```

Also check MongoDB logs for balancer activity:
```shell
# Check mongos logs (-E enables the | alternation in the pattern)
tail -f /var/log/mongodb/mongos.log | grep -iE "balancer|migrate"

# Check config server logs
tail -f /var/log/mongodb/mongod-config.log
```

Force clients to refresh their metadata by reconnecting or using driver-specific methods to clear caches.
```javascript
// For the MongoDB Node.js driver, you can:

// 1. Reconnect the client
await client.close();
await client.connect();

// 2. Clear operation-specific caches (if available)
//    Some drivers allow clearing session or metadata caches

// 3. Ensure you're using the latest driver version
//    Older drivers may have bugs in metadata handling
```

For applications, consider:
- Implementing connection pooling with reasonable TTLs
- Using shorter session timeouts
- Ensuring proper error handling for all database operations
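The pooling and timeout suggestions above can be expressed as client options. A minimal sketch for the Node.js driver, shown as a plain options object; the specific values are assumptions to tune for your workload:

```javascript
// Connection-pool settings sketch for `new MongoClient(uri, clientOptions)`.
// Values here are illustrative starting points, not recommendations.
const clientOptions = {
  maxPoolSize: 20,                 // cap concurrent connections per client
  minPoolSize: 2,                  // keep a few warm connections
  maxIdleTimeMS: 60000,            // recycle idle connections (a pooling "TTL")
  serverSelectionTimeoutMS: 5000,  // fail fast if no mongos is reachable
  retryReads: true,
  retryWrites: true,
};
```

Shorter `maxIdleTimeMS` values mean connections (and any per-connection state) are recycled sooner, at the cost of more frequent connection setup.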
Review your chunk distribution and consider adjusting chunk size or shard key if migrations are too frequent.
```javascript
// Check chunk size settings
use config
db.settings.findOne({ _id: "chunksize" })

// View chunk distribution per shard
db.getSiblingDB("config").chunks.aggregate([
  { $group: { _id: "$shard", count: { $sum: 1 } } }
])

// Check for jumbo chunks (too large to migrate)
db.getSiblingDB("config").chunks.find({ jumbo: true })
```

Considerations:
- Default chunk size is 64MB (128MB beginning in MongoDB 6.0), adjustable from 1MB to 1024MB
- Too many small chunks can increase metadata overhead
- Too few large chunks can cause uneven distribution
- Jumbo chunks may need manual splitting
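If a jumbo chunk is blocking migrations, it can often be split manually. A sketch in mongosh, run against a live sharded cluster, assuming an `orders` collection sharded on `customerId` (both names are placeholders for your own):

```javascript
// Split the chunk containing this shard-key value at a server-chosen point
sh.splitFind("mydb.orders", { customerId: "123" })

// Or split at an explicit shard-key boundary
sh.splitAt("mydb.orders", { customerId: "500" })
```

After splitting, the balancer can usually migrate the resulting smaller chunks normally.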
Ensure proper network connectivity between all cluster components with appropriate timeouts.
```javascript
// Check network connectivity from each mongos to each shard and config server
const hosts = ['shard1:27017', 'shard2:27017', 'config1:27019'];

for (const host of hosts) {
  try {
    const testClient = new MongoClient(`mongodb://${host}/test`, {
      serverSelectionTimeoutMS: 5000,
      connectTimeoutMS: 10000,
      socketTimeoutMS: 30000
    });
    await testClient.connect();
    console.log(`✓ ${host} reachable`);
    await testClient.close();
  } catch (error) {
    console.error(`✗ ${host} unreachable:`, error.message);
  }
}
```

Network considerations:
- Ensure DNS resolution is consistent across all nodes
- Check firewall rules allow all necessary ports
- Verify network latency is within acceptable bounds
- Consider using shorter operation timeouts for faster failure detection
Evaluate whether your application design contributes to the issue. Avoid scatter-gather queries and ensure proper shard key usage.
```javascript
// Bad: query without shard key (scatter-gather, hits every shard)
db.orders.find({ status: "pending" }) // if status is not the shard key

// Good: query with shard key prefix (targeted to a single shard)
db.orders.find({ customerId: "123", status: "pending" }) // if customerId is the shard key

// Inspect in-flight operations against the collection
db.currentOp().inprog.forEach(op => {
  if (op.ns === "mydb.orders") {
    printjson(op.command); // the command document shows which filters were sent
  }
});
```

Best practices:
- Always include shard key in queries when possible
- Avoid updates that change shard key values (requires document migration)
- Use targeted operations rather than broadcast operations
- Consider using zones for geographic or logical data separation
## Deep Dive: MongoDB Sharding Metadata Architecture
### Metadata Versioning System
MongoDB uses a versioned metadata system for sharded clusters:
1. Config servers store the authoritative metadata (chunk ranges, shard mappings)
2. mongos routers cache this metadata for routing decisions
3. Each chunk has a version number that increments on migration
4. Epoch values track major metadata changes
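The versioning rules above can be illustrated with a simplified staleness check. This is a sketch of the concept only; the field names (`epoch`, `major`, `minor`) model MongoDB's documented behavior and are not its internal representation:

```javascript
// Simplified model: is a cached shard version stale relative to the
// authoritative version held by the config servers?
function isStale(cached, authoritative) {
  // A differing epoch means a major metadata change (e.g. drop/reshard):
  // the cache is unconditionally stale.
  if (cached.epoch !== authoritative.epoch) return true;
  // Otherwise compare (major, minor); the major component is bumped by
  // chunk migrations, the minor component by chunk splits.
  if (cached.major !== authoritative.major) {
    return cached.major < authoritative.major;
  }
  return cached.minor < authoritative.minor;
}

const cached = { epoch: "e1", major: 4, minor: 2 };
const current = { epoch: "e1", major: 5, minor: 0 };
console.log(isStale(cached, current)); // true: a migration bumped major
```

When the server detects this mismatch it returns error code 63, and the router refreshes its routing table before retrying.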
### When StaleShardVersion Becomes Critical
While typically transient, persistent StaleShardVersion errors can indicate:
- Network partitions preventing metadata propagation
- Config server issues where metadata becomes inconsistent
- Driver bugs in metadata caching logic
- Excessive chunk migrations overwhelming the system
### Performance Implications
Frequent StaleShardVersion errors can indicate:
1. Too aggressive balancing: Chunk size too small or imbalance thresholds too tight
2. Poor shard key choice: Leading to frequent chunk splits and migrations
3. Hot shards: Uneven workload distribution requiring constant rebalancing
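One way to spot the imbalance driving excessive migrations is to compare per-shard chunk counts (e.g. from the aggregation over `config.chunks` shown earlier). A sketch, where the threshold of 8 chunks mirrors the classic balancer default but should be treated as an assumption to check against your server version's documentation:

```javascript
// Flag a cluster whose per-shard chunk counts diverge enough that the
// balancer would likely be migrating chunks continuously.
function findImbalance(chunkCountsByShard, threshold = 8) {
  const counts = Object.values(chunkCountsByShard);
  const max = Math.max(...counts);
  const min = Math.min(...counts);
  return { max, min, spread: max - min, needsBalancing: max - min >= threshold };
}

const result = findImbalance({ shard1: 120, shard2: 95, shard3: 118 });
console.log(result.needsBalancing); // true: spread of 25 exceeds threshold
```

A persistently large spread usually points back at the shard key rather than the balancer itself.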
### Monitoring and Alerting
Set up monitoring for:
- sharding.versions.stale metrics in MongoDB Atlas or Ops Manager
- Error code 63 frequency in application logs
- Balancer lock contention in config server logs
- Chunk migration rates and durations
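To track error code 63 frequency in application logs, a small counter can feed your alerting system. This sketch assumes JSON-per-line logs with an `error.code` field, which is a hypothetical format to adapt to your logger:

```javascript
// Count StaleShardVersion (code 63) occurrences in a batch of log lines.
function countStaleShardVersionErrors(logLines) {
  return logLines.reduce((n, line) => {
    try {
      const entry = JSON.parse(line);
      return n + (entry.error && entry.error.code === 63 ? 1 : 0);
    } catch {
      return n; // skip lines that are not valid JSON
    }
  }, 0);
}

const sample = [
  '{"msg":"ok"}',
  '{"error":{"code":63,"codeName":"StaleShardVersion"}}',
  'plain text line',
];
console.log(countStaleShardVersionErrors(sample)); // 1
```

Alert on the rate over a time window rather than single occurrences, since isolated code-63 errors are expected during normal migrations.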
### When to Contact MongoDB Support
Consider professional support if:
- Errors persist for hours despite retries
- Multiple clients experience simultaneous version mismatches
- Cluster becomes unstable or unresponsive
- You suspect metadata corruption in config servers
### Alternative Approaches
For read-heavy workloads experiencing frequent version mismatches:
1. Secondary read preference: Spread read load across replica set members (note that in a sharded cluster these reads still route through mongos)
2. Client-side caching: Cache query results to reduce database load
3. Connection pinning: Use driver features to maintain affinity to specific mongos instances
Remember: StaleShardVersion is a normal part of sharded cluster operations. Well-designed applications should handle it gracefully rather than treating it as a critical failure.