The StaleShardVersion error (code 63) occurs when MongoDB components have outdated metadata about shard versions in a sharded cluster. This typically happens during chunk migrations, configuration changes, or when clients cache stale routing information. Applications should implement retry logic to handle this transient error.
The "StaleShardVersion: shard version mismatch" error is a MongoDB-specific error (code 63) that occurs in sharded cluster environments. It indicates that a mongos router, driver, or client has cached shard-version metadata that no longer matches the current state of the cluster.

Sharded MongoDB clusters maintain version metadata for each chunk of data across shards. When chunks are migrated between shards for load balancing, or when the cluster configuration changes, these version numbers are incremented. Components that interact with the cluster cache this metadata to optimize routing decisions. The error occurs when:

1. A mongos instance has stale routing table information
2. A client driver caches shard metadata that becomes outdated
3. Chunk migrations are in progress and metadata hasn't propagated
4. Network partitions or delays prevent metadata synchronization

This is generally a transient error that applications should handle gracefully with retry mechanisms, as the underlying issue is typically resolved automatically by MongoDB's metadata synchronization processes.
Add retry logic to handle StaleShardVersion errors gracefully. Most MongoDB drivers provide built-in retry mechanisms for transient errors.
```javascript
// Example in Node.js MongoDB driver
const client = new MongoClient(uri, {
  retryWrites: true,
  retryReads: true,
  maxPoolSize: 10,
  minPoolSize: 5,
});

// Or implement custom retry logic
async function executeWithRetry(operation, maxRetries = 3) {
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await operation();
    } catch (error) {
      if (error.code === 63 || error.codeName === 'StaleShardVersion') {
        // Wait before retrying (exponential backoff)
        await new Promise(resolve =>
          setTimeout(resolve, Math.pow(2, i) * 100)
        );
        continue;
      }
      throw error;
    }
  }
  throw new Error('Max retries exceeded');
}
```

Verify that all shards and config servers are healthy. Check if the balancer is actively migrating chunks.
```javascript
// Check balancer status (mongosh)
sh.getBalancerState()
sh.isBalancerRunning()

// Or inspect the balancer lock directly (older deployments)
use config
db.locks.find({ _id: "balancer" })

// Check chunk distribution
db.getSiblingDB("config").chunks.find().count()

// Monitor active migrations
db.currentOp().inprog.forEach(op => {
  if (op.desc && op.desc.includes("migrate")) {
    printjson(op);
  }
});
```

Also check MongoDB logs for balancer activity:
```shell
# Check mongos logs (-E enables the | alternation in the pattern)
tail -f /var/log/mongodb/mongos.log | grep -iE "balancer|migrate"

# Check config server logs
tail -f /var/log/mongodb/mongod-config.log
```

Force clients to refresh their metadata by reconnecting or using driver-specific methods to clear caches.
```javascript
// For the MongoDB Node.js driver, you can:

// 1. Reconnect the client
await client.close();
await client.connect();

// 2. Clear operation-specific caches (if available)
//    Some drivers allow clearing session or metadata caches

// 3. Ensure you're using the latest driver version
//    Older drivers may have bugs in metadata handling
```

For applications, consider:
- Implementing connection pooling with reasonable TTLs
- Using shorter session timeouts
- Ensuring proper error handling for all database operations
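The pooling and timeout suggestions above can be expressed as client options. A minimal sketch for the Node.js driver, shown as a plain options object; the specific values are assumptions to tune for your workload:

```javascript
// Connection-pool settings sketch for `new MongoClient(uri, clientOptions)`.
// Values here are illustrative starting points, not recommendations.
const clientOptions = {
  maxPoolSize: 20,                 // cap concurrent connections per client
  minPoolSize: 2,                  // keep a few warm connections
  maxIdleTimeMS: 60000,            // recycle idle connections (a pooling "TTL")
  serverSelectionTimeoutMS: 5000,  // fail fast if no mongos is reachable
  retryReads: true,
  retryWrites: true,
};
```

Shorter `maxIdleTimeMS` values mean connections (and any per-connection state) are recycled sooner, at the cost of more frequent connection setup.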
Review your chunk distribution and consider adjusting chunk size or shard key if migrations are too frequent.
```javascript
// Check chunk size settings
use config
db.settings.findOne({ _id: "chunksize" })

// View chunk distribution per shard
db.getSiblingDB("config").chunks.aggregate([
  { $group: { _id: "$shard", count: { $sum: 1 } } }
])

// Check for jumbo chunks (too large to migrate)
db.getSiblingDB("config").chunks.find({ jumbo: true })
```

Considerations:
- Default chunk size is 64MB (128MB beginning in MongoDB 6.0), adjustable from 1MB to 1024MB
- Too many small chunks can increase metadata overhead
- Too few large chunks can cause uneven distribution
- Jumbo chunks may need manual splitting
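If a jumbo chunk is blocking migrations, it can often be split manually. A sketch in mongosh, run against a live sharded cluster, assuming an `orders` collection sharded on `customerId` (both names are placeholders for your own):

```javascript
// Split the chunk containing this shard-key value at a server-chosen point
sh.splitFind("mydb.orders", { customerId: "123" })

// Or split at an explicit shard-key boundary
sh.splitAt("mydb.orders", { customerId: "500" })
```

After splitting, the balancer can usually migrate the resulting smaller chunks normally.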
Ensure proper network connectivity between all cluster components with appropriate timeouts.
```javascript
// Check network connectivity from each mongos to each shard and config server
const hosts = ['shard1:27017', 'shard2:27017', 'config1:27019'];

for (const host of hosts) {
  try {
    const testClient = new MongoClient(`mongodb://${host}/test`, {
      serverSelectionTimeoutMS: 5000,
      connectTimeoutMS: 10000,
      socketTimeoutMS: 30000
    });
    await testClient.connect();
    console.log(`✓ ${host} reachable`);
    await testClient.close();
  } catch (error) {
    console.error(`✗ ${host} unreachable:`, error.message);
  }
}
```

Network considerations:
- Ensure DNS resolution is consistent across all nodes
- Check firewall rules allow all necessary ports
- Verify network latency is within acceptable bounds
- Consider using shorter operation timeouts for faster failure detection
Evaluate whether your application design contributes to the issue. Avoid scatter-gather queries and ensure proper shard key usage.
```javascript
// Bad: query without shard key (scatter-gather, hits every shard)
db.orders.find({ status: "pending" }) // if status is not the shard key

// Good: query with shard key prefix (targeted to a single shard)
db.orders.find({ customerId: "123", status: "pending" }) // if customerId is the shard key

// Inspect in-flight operations against the collection
db.currentOp().inprog.forEach(op => {
  if (op.ns === "mydb.orders") {
    printjson(op.command); // the command document shows which filters were sent
  }
});
```

Best practices:
- Always include shard key in queries when possible
- Avoid updates that change shard key values (requires document migration)
- Use targeted operations rather than broadcast operations
- Consider using zones for geographic or logical data separation
## Deep Dive: MongoDB Sharding Metadata Architecture
### Metadata Versioning System
MongoDB uses a versioned metadata system for sharded clusters:
1. Config servers store the authoritative metadata (chunk ranges, shard mappings)
2. mongos routers cache this metadata for routing decisions
3. Each chunk has a version number that increments on migration
4. Epoch values track major metadata changes
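The versioning rules above can be illustrated with a simplified staleness check. This is a sketch of the concept only; the field names (`epoch`, `major`, `minor`) model MongoDB's documented behavior and are not its internal representation:

```javascript
// Simplified model: is a cached shard version stale relative to the
// authoritative version held by the config servers?
function isStale(cached, authoritative) {
  // A differing epoch means a major metadata change (e.g. drop/reshard):
  // the cache is unconditionally stale.
  if (cached.epoch !== authoritative.epoch) return true;
  // Otherwise compare (major, minor); the major component is bumped by
  // chunk migrations, the minor component by chunk splits.
  if (cached.major !== authoritative.major) {
    return cached.major < authoritative.major;
  }
  return cached.minor < authoritative.minor;
}

const cached = { epoch: "e1", major: 4, minor: 2 };
const current = { epoch: "e1", major: 5, minor: 0 };
console.log(isStale(cached, current)); // true: a migration bumped major
```

When the server detects this mismatch it returns error code 63, and the router refreshes its routing table before retrying.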
### When StaleShardVersion Becomes Critical
While typically transient, persistent StaleShardVersion errors can indicate:
- Network partitions preventing metadata propagation
- Config server issues where metadata becomes inconsistent
- Driver bugs in metadata caching logic
- Excessive chunk migrations overwhelming the system
### Performance Implications
Frequent StaleShardVersion errors can indicate:
1. Too aggressive balancing: Chunk size too small or imbalance thresholds too tight
2. Poor shard key choice: Leading to frequent chunk splits and migrations
3. Hot shards: Uneven workload distribution requiring constant rebalancing
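One way to spot the imbalance driving excessive migrations is to compare per-shard chunk counts (e.g. from the aggregation over `config.chunks` shown earlier). A sketch, where the threshold of 8 chunks mirrors the classic balancer default but should be treated as an assumption to check against your server version's documentation:

```javascript
// Flag a cluster whose per-shard chunk counts diverge enough that the
// balancer would likely be migrating chunks continuously.
function findImbalance(chunkCountsByShard, threshold = 8) {
  const counts = Object.values(chunkCountsByShard);
  const max = Math.max(...counts);
  const min = Math.min(...counts);
  return { max, min, spread: max - min, needsBalancing: max - min >= threshold };
}

const result = findImbalance({ shard1: 120, shard2: 95, shard3: 118 });
console.log(result.needsBalancing); // true: spread of 25 exceeds threshold
```

A persistently large spread usually points back at the shard key rather than the balancer itself.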
### Monitoring and Alerting
Set up monitoring for:
- sharding.versions.stale metrics in MongoDB Atlas or Ops Manager
- Error code 63 frequency in application logs
- Balancer lock contention in config server logs
- Chunk migration rates and durations
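To track error code 63 frequency in application logs, a small counter can feed your alerting system. This sketch assumes JSON-per-line logs with an `error.code` field, which is a hypothetical format to adapt to your logger:

```javascript
// Count StaleShardVersion (code 63) occurrences in a batch of log lines.
function countStaleShardVersionErrors(logLines) {
  return logLines.reduce((n, line) => {
    try {
      const entry = JSON.parse(line);
      return n + (entry.error && entry.error.code === 63 ? 1 : 0);
    } catch {
      return n; // skip lines that are not valid JSON
    }
  }, 0);
}

const sample = [
  '{"msg":"ok"}',
  '{"error":{"code":63,"codeName":"StaleShardVersion"}}',
  'plain text line',
];
console.log(countStaleShardVersionErrors(sample)); // 1
```

Alert on the rate over a time window rather than single occurrences, since isolated code-63 errors are expected during normal migrations.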
### When to Contact MongoDB Support
Consider professional support if:
- Errors persist for hours despite retries
- Multiple clients experience simultaneous version mismatches
- Cluster becomes unstable or unresponsive
- You suspect metadata corruption in config servers
### Alternative Approaches
For read-heavy workloads experiencing frequent version mismatches:
1. Secondary read preference: Spread read load across replica set members (note that in a sharded cluster these reads still route through mongos)
2. Client-side caching: Cache query results to reduce database load
3. Connection pinning: Use driver features to maintain affinity to specific mongos instances
Remember: StaleShardVersion is a normal part of sharded cluster operations. Well-designed applications should handle it gracefully rather than treating it as a critical failure.