This error occurs when attempting to create a new snapshot in Elasticsearch while another snapshot or restore operation is already running on the same repository. Elasticsearch enforces snapshot serialization per repository to maintain data consistency and prevent conflicts in the repository state.
The "ConcurrentSnapshotExecutionException: cannot snapshot while a snapshot/restore is in progress" error is thrown when Elasticsearch detects an attempt to initiate a snapshot while another snapshot or restore operation is already executing on the same repository. Elasticsearch enforces strict concurrency rules for snapshot operations at the repository level:
1. Only one snapshot creation can run at a time per repository
2. Snapshot creation cannot occur while a restore is in progress on the same repository
3. Snapshot deletion also blocks new snapshot creation until it completes
4. Multiple repositories can have concurrent snapshots, but operations within each repository are serialized
This limitation exists to prevent repository corruption, ensure atomic operations, and maintain consistency of the repository's metadata and data files. The error typically happens when:
- Automated snapshot schedules overlap (a snapshot takes longer than the schedule interval)
- A manual snapshot is triggered while a scheduled snapshot is running
- Snapshot Lifecycle Management (SLM) policies are misconfigured
- A previous snapshot operation is stuck and hasn't completed
- A restore operation is running when a snapshot is triggered
The error protects your data by preventing simultaneous operations that could corrupt the repository or create inconsistent backups.
First, identify any operations currently running on the repository:
# Check all currently running snapshots
curl -X GET "localhost:9200/_snapshot/_status" -u "username:password"
# Check snapshots for a specific repository
curl -X GET "localhost:9200/_snapshot/my_repository/_status" -u "username:password"
# List all snapshots in the repository
curl -X GET "localhost:9200/_snapshot/my_repository/_all" -u "username:password"
# Check for ongoing restore operations
curl -X GET "localhost:9200/_cat/recovery?v&active_only=true" -u "username:password"
# Get detailed snapshot information
curl -X GET "localhost:9200/_snapshot/my_repository/_current" -u "username:password"
Look for:
- Snapshots with "state": "IN_PROGRESS"
- Long-running snapshots (check start_time_in_millis)
- Failed snapshots still listed as in progress
- Restore operations in the recovery API output
This identifies whether the repository is truly busy or if a stuck operation is blocking new snapshots.
If a legitimate operation is running, wait for it to finish. If it's stuck, cancel it:
# Option 1: Wait for the current snapshot to complete
# Monitor progress:
curl -X GET "localhost:9200/_snapshot/my_repository/current_snapshot" -u "username:password"
# Check snapshot statistics
curl -X GET "localhost:9200/_snapshot/my_repository/_current?verbose=true" -u "username:password"
# Option 2: Cancel a stuck or unwanted snapshot
# This stops the in-progress snapshot
curl -X DELETE "localhost:9200/_snapshot/my_repository/snapshot_name" -u "username:password"
# Option 3: Cancel a stuck restore operation
# There is no dedicated "cancel restore" API; a restore is cancelled by
# deleting the indices that are being restored
curl -X DELETE "localhost:9200/restored_index_name" -u "username:password"
# Verify the operation was cancelled
curl -X GET "localhost:9200/_snapshot/_status" -u "username:password"
Important notes:
- Deleting an in-progress snapshot safely cancels it
- Cancelled snapshots are removed from the repository
- If deletion fails, a cluster restart may be required (master node restart usually sufficient)
- Monitor cluster logs during cancellation for errors
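When scripting manual snapshots, a pre-flight check avoids triggering this error in the first place. Below is a minimal sketch, assuming bash and curl against a security-enabled cluster; the `snapshot_in_progress` and `wait_for_idle` helper names are ours, not part of the Elasticsearch API:

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check: poll until the repository reports no running
# snapshot, then it is safe to start a new one.
ES="localhost:9200"
AUTH="username:password"

# Succeeds (exit 0) if the _snapshot/_status JSON on stdin contains a running
# snapshot (the status API reports states such as STARTED; listings use IN_PROGRESS)
snapshot_in_progress() {
  grep -qE '"state" *: *"(IN_PROGRESS|STARTED|INIT)"'
}

# Poll every 30 seconds until no snapshot is running
wait_for_idle() {
  while curl -s -u "$AUTH" "http://$ES/_snapshot/_status" | snapshot_in_progress; do
    echo "A snapshot is still running; waiting..."
    sleep 30
  done
}

# Usage against a live cluster:
# wait_for_idle
# curl -X PUT "http://$ES/_snapshot/my_repository/snap-$(date +%Y%m%d%H%M%S)" -u "$AUTH"
```

This does not fully eliminate the race (another client could still start a snapshot between the check and the PUT), but it prevents the common case of colliding with a long-running scheduled snapshot.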
Modify your SLM policy or cron schedule to ensure snapshots complete before the next one starts:
# Check existing SLM policies
curl -X GET "localhost:9200/_slm/policy" -u "username:password"
# Update SLM policy with longer interval
curl -X PUT "localhost:9200/_slm/policy/daily-snapshots" -u "username:password" -H 'Content-Type: application/json' -d'
{
"schedule": "0 0 2 * * ?",
"name": "<daily-snap-{now/d}>",
"repository": "my_repository",
"config": {
"indices": ["*"],
"ignore_unavailable": false,
"include_global_state": true
},
"retention": {
"expire_after": "30d",
"min_count": 5,
"max_count": 50
}
}
'
# For manual cron-based snapshots, adjust timing
# Example: Change from every 30 minutes to every 2 hours
# Before: */30 * * * * (every 30 min)
# After: 0 */2 * * * (every 2 hours)
Schedule adjustment strategies:
- Measure average snapshot duration first
- Set schedule interval to 2-3x the average snapshot time
- Use daily schedules for large clusters
- Schedule snapshots during low-traffic periods
- Consider different schedules for different repositories
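To measure average snapshot duration before choosing an interval, the duration_in_millis field from GET _snapshot/&lt;repository&gt;/_all can be extracted with standard shell tools. A sketch (the `snapshot_durations` helper name is ours):

```shell
# Print each completed snapshot's duration in seconds, one per line,
# given the JSON from GET _snapshot/<repository>/_all on stdin
snapshot_durations() {
  grep -o '"duration_in_millis":[0-9]*' | cut -d: -f2 | awk '{printf "%.1f\n", $1/1000}'
}

# Against a live cluster, e.g. to find the longest recent snapshot:
# curl -s -u "username:password" "localhost:9200/_snapshot/my_repository/_all" \
#   | snapshot_durations | sort -n | tail -n1
```

Set the schedule interval to a comfortable multiple (2-3x) of the largest value this reports.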
Adjust Elasticsearch settings to control snapshot concurrency:
# Check current snapshot settings
curl -X GET "localhost:9200/_cluster/settings?include_defaults=true&filter_path=*.snapshot.*" -u "username:password"
# Lower the maximum number of concurrent snapshot operations (default is 1000)
# Note: This setting applies cluster-wide, not per-repository
curl -X PUT "localhost:9200/_cluster/settings" -u "username:password" -H 'Content-Type: application/json' -d'
{
"persistent": {
"snapshot.max_concurrent_operations": 5
}
}
'
# For production clusters with multiple repositories:
# You can run snapshots to different repositories concurrently
curl -X PUT "localhost:9200/_snapshot/repo1/snapshot1?wait_for_completion=false" -u "username:password"
curl -X PUT "localhost:9200/_snapshot/repo2/snapshot2?wait_for_completion=false" -u "username:password"
# These will run concurrently since they use different repositories
Important considerations:
- Each repository still allows only one snapshot at a time
- Concurrent operations setting affects overall cluster resources
- Higher concurrency increases CPU and I/O load
- For single repository setups, adjust scheduling instead of concurrency
Implement multiple repositories to enable concurrent snapshot operations:
# Create additional repositories for parallel snapshots
curl -X PUT "localhost:9200/_snapshot/repository_hourly" -u "username:password" -H 'Content-Type: application/json' -d'
{
"type": "fs",
"settings": {
"location": "/mnt/snapshots/hourly",
"compress": true
}
}
'
curl -X PUT "localhost:9200/_snapshot/repository_daily" -u "username:password" -H 'Content-Type: application/json' -d'
{
"type": "s3",
"settings": {
"bucket": "my-snapshots-daily",
"region": "us-east-1",
"compress": true
}
}
'
# Create SLM policies for each repository
# Hourly snapshots to fast local storage
curl -X PUT "localhost:9200/_slm/policy/hourly-local" -u "username:password" -H 'Content-Type: application/json' -d'
{
"schedule": "0 0 * * * ?",
"repository": "repository_hourly",
"config": {
"indices": ["critical-*"],
"include_global_state": false
},
"retention": {
"expire_after": "24h",
"min_count": 3,
"max_count": 24
}
}
'
# Daily snapshots to S3 for long-term retention
curl -X PUT "localhost:9200/_slm/policy/daily-s3" -u "username:password" -H 'Content-Type: application/json' -d'
{
"schedule": "0 0 2 * * ?",
"repository": "repository_daily",
"config": {
"indices": ["*"],
"include_global_state": true
},
"retention": {
"expire_after": "30d",
"min_count": 7,
"max_count": 30
}
}
'
# These will run concurrently without conflicts
Multi-repository benefits:
- Different snapshot frequencies without conflicts
- Separate fast/slow storage tiers
- Geographic redundancy
- Different retention policies per repository
Implement monitoring to detect and resolve stuck snapshot operations:
# Create a monitoring script to detect long-running snapshots
# Monitor snapshot duration
curl -X GET "localhost:9200/_snapshot/_status" -u "username:password" | jq '.snapshots[] | {repository: .repository, snapshot: .snapshot, state: .state, duration_seconds: (.stats.time_in_millis / 1000)}'
# Check SLM execution history
curl -X GET "localhost:9200/_slm/policy/daily-snapshots?human=true" -u "username:password"
# View SLM statistics
curl -X GET "localhost:9200/_slm/stats" -u "username:password"
# If snapshot is genuinely stuck (hours without progress):
# 1. Check cluster health
curl -X GET "localhost:9200/_cluster/health" -u "username:password"
# 2. Check master node logs for errors
# Look for repository connection issues, disk space, permissions
# 3. If deletion fails, restart the master node
# This clears stuck operations from cluster state
# Find master node:
curl -X GET "localhost:9200/_cat/master?v" -u "username:password"
# After identifying master, perform rolling restart:
# systemctl restart elasticsearch (on master node only)
# 4. Verify repository after restart
curl -X POST "localhost:9200/_snapshot/my_repository/_verify" -u "username:password"
Monitoring best practices:
- Set up alerts for snapshot duration exceeding thresholds
- Monitor snapshot success/failure rates
- Track repository storage usage
- Log SLM policy execution results
- Implement automatic cleanup of old snapshots
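The duration-threshold alert suggested above can be a small cron job. A sketch, assuming the JSON from GET _snapshot/_status is piped in; the threshold value and helper names are ours:

```shell
#!/usr/bin/env bash
# Hypothetical duration check for alerting: exit non-zero if any running
# snapshot has exceeded the threshold, so cron/monitoring can pick it up.
ALERT_THRESHOLD_SECONDS=3600

# Extract every time_in_millis value from stdin and report the largest,
# rounded down to whole seconds
longest_running_seconds() {
  grep -o '"time_in_millis":[0-9]*' | cut -d: -f2 | sort -n | tail -n1 | awk '{print int($1/1000)}'
}

check_snapshot_duration() {
  secs=$(longest_running_seconds)
  if [ "${secs:-0}" -gt "$ALERT_THRESHOLD_SECONDS" ]; then
    echo "ALERT: snapshot has been running for ${secs}s"
    return 1
  fi
  return 0
}

# Against a live cluster (cron-friendly; non-zero exit triggers the alert):
# curl -s -u "username:password" "localhost:9200/_snapshot/_status" | check_snapshot_duration
```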
## Advanced Snapshot Concurrency Management
### SLM Policy Skip vs. Queue Behavior
Elasticsearch SLM policies have specific behaviors when snapshots overlap:
- Current behavior: If a snapshot is already running when the schedule triggers, the new snapshot is skipped entirely
- Queuing alternative: queuing overlapping snapshots instead of skipping them has been proposed, but queuing can lead to accumulation if snapshots consistently take longer than the interval
- Best practice: Design schedules with sufficient buffer time rather than relying on queuing
### Snapshot Performance Optimization
To reduce snapshot duration and prevent overlaps:
- Enable snapshot compression: Reduces storage size but increases CPU usage
- Adjust thread pool settings: thread_pool.snapshot.max controls concurrent shard snapshots
- Use incremental snapshots: Only changed segments are copied after the first full snapshot
- Optimize repository storage: Use faster storage tiers or increase IOPS for cloud storage
- Shard-level parallelism: Smaller shards snapshot faster than large monolithic shards
### Repository-Level vs. Cluster-Level Concurrency
Understanding concurrency controls:
- Per-repository limitation: Only one snapshot operation per repository (hard limit)
- Cluster-wide setting: snapshot.max_concurrent_operations controls total snapshot threads across all repositories
- Shard-level parallelism: Multiple shards can be snapshotted concurrently within a single snapshot operation
- Cross-repository concurrency: Different repositories can snapshot simultaneously
### Handling Stuck Snapshots in Production
Advanced recovery techniques:
1. Cluster state inspection: Check GET _cluster/state/metadata?filter_path=metadata.snapshots for stuck snapshot metadata
2. Manual cluster state cleanup: In extreme cases, editing cluster state can remove stuck references (requires cluster restart)
3. Repository re-registration: Delete and re-create repository registration (doesn't delete snapshot files)
4. Master node election: Forcing a master re-election can clear stuck operations: POST _cluster/voting_config_exclusions?node_names=current_master. Afterwards, clear the exclusion with DELETE _cluster/voting_config_exclusions so the node can rejoin the voting configuration
### Cloud Storage Repository Considerations
For S3, GCS, Azure repositories:
- Throttling limits: Cloud providers may throttle high-frequency API calls, slowing snapshots
- Network partitions: Transient network issues can leave snapshots in uncertain state
- Storage class impact: Standard storage is faster than archival tiers (Glacier, Cold Storage)
- Cross-region latency: Snapshots to remote regions take longer
- Concurrent connection limits: S3 has connection limits that can slow large snapshots
### Snapshot vs. Restore Concurrency Rules
Different operation types have different restrictions:
- Snapshot + Snapshot: Cannot run concurrently on same repository
- Snapshot + Restore: Cannot run concurrently on same repository
- Restore + Restore: Multiple restore operations can run concurrently
- Snapshot + Delete: Deletion blocks new snapshots until complete
- Cross-repository: All operations can run concurrently across different repositories
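The rules above can be condensed into a small decision helper for orchestration scripts. This encoding is ours, not an Elasticsearch API; it simply restates the list:

```shell
# Hypothetical helper: do two operations conflict under the per-repository
# serialization rules? Usage: ops_conflict <op1> <repo1> <op2> <repo2>,
# where an op is "snapshot", "restore", or "delete".
ops_conflict() {
  # Different repositories never conflict
  [ "$2" != "$4" ] && return 1
  # Multiple restores may overlap on the same repository
  [ "$1" = "restore" ] && [ "$3" = "restore" ] && return 1
  # Everything else on the same repository is serialized
  return 0
}
```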
### Disaster Recovery Implications
Concurrency limitations affect DR planning:
- RTO considerations: Single repository serialization increases recovery time objectives
- Multi-repository strategy: Use multiple repositories for parallel backups of critical data
- Snapshot prioritization: Critical indices should snapshot to dedicated repositories
- Failover snapshots: Maintain snapshots in multiple geographic locations using different repositories
### Monitoring and Alerting Strategies
Implement comprehensive monitoring:
- Snapshot duration trends: Alert when snapshots exceed historical averages by 50%+
- SLM policy success rate: Alert on consecutive failures (3+ failures indicates systemic issue)
- Repository storage growth: Monitor for unexpected growth indicating failed deletions
- Concurrent operation attempts: Log and alert on ConcurrentSnapshotExecutionException frequency
- Master node snapshot metrics: Track _nodes/stats snapshot thread pool metrics
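For the thread pool metric in the last bullet, the snapshot pool's queue depth is a useful backlog proxy. A sketch (the helper name is ours); feed it the output of GET _nodes/stats/thread_pool:

```shell
# Report the deepest snapshot thread-pool queue across all nodes, given the
# JSON from _nodes/stats/thread_pool on stdin
busiest_snapshot_queue() {
  grep -o '"queue":[0-9]*' | cut -d: -f2 | sort -n | tail -n1
}

# Against a live cluster:
# curl -s -u "username:password" \
#   "localhost:9200/_nodes/stats/thread_pool?filter_path=nodes.*.thread_pool.snapshot" \
#   | busiest_snapshot_queue
```

A persistently non-zero queue while snapshots are slow suggests the snapshot thread pool, not the repository, is the bottleneck.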
### Performance Tuning Parameters
Advanced settings for snapshot optimization. Note that snapshot.max_concurrent_operations and indices.recovery.max_bytes_per_sec are dynamic cluster settings, while thread_pool.snapshot.max and repositories.fs.compress are static node settings that belong in elasticsearch.yml and require a restart:
{
  "snapshot.max_concurrent_operations": 5,
  "indices.recovery.max_bytes_per_sec": "100mb",
  "thread_pool.snapshot.max": 5,
  "repositories.fs.compress": true
}
### Version-Specific Behavior
Snapshot concurrency handling has evolved across versions:
- 7.x and earlier: Stricter serialization, less visibility into queued operations
- 8.x: Improved error messages and SLM statistics
- Future versions: Proposals for optional snapshot queuing and better overlap handling