This error occurs when attempting to use the "sampler" or "diversified_sampler" aggregation on an aggregation type that does not support sampling. Elasticsearch sampling aggregations work by selecting a subset of documents for analysis, but they can only wrap certain aggregation types such as "terms", "date_histogram", and other bucket aggregations.
The "AggregationExecutionException: Aggregation [agg_name] does not support sampling" error occurs when you try to apply a sampling aggregation (either "sampler" or "diversified_sampler") to an aggregation type that doesn't support this feature. Sampling aggregations in Elasticsearch select a representative subset of documents from your search results before performing the actual aggregation. This can significantly improve performance on large datasets, since far fewer documents need to be processed. However, not all aggregation types support this sampling approach.

The error specifically indicates that the aggregation named "[agg_name]" in your query is of a type that cannot be wrapped with a sampler. Common examples include:

1. **Metric aggregations** (sum, avg, min, max, stats, etc.) - these calculate values across all matching documents
2. **Pipeline aggregations** (bucket_script, derivative, cumulative_sum, etc.) - these operate on the output of other aggregations
3. **Matrix aggregations** - these work with multiple fields simultaneously
4. **Parent aggregations** that don't support child aggregations with sampling

Sampling is primarily designed for **bucket aggregations** that group documents, such as:

- terms
- date_histogram
- histogram
- range
- filters
- significant_terms

When you wrap an unsupported aggregation type with a sampler, Elasticsearch throws this exception to prevent incorrect or misleading results.
First, examine your query to identify the specific aggregation named in the error message:
// Example problematic query
{
"size": 0,
"aggs": {
"sample_docs": {
"sampler": {
"shard_size": 100
},
"aggs": {
"avg_price": { // This is a METRIC aggregation - doesn't support sampling!
"avg": {
"field": "price"
}
}
}
}
}
}

The error message will specify which aggregation doesn't support sampling. Look for:
- Metric aggregations: avg, sum, min, max, stats, extended_stats, value_count, cardinality
- Pipeline aggregations: bucket_script, derivative, cumulative_sum, moving_avg, bucket_sort
- Matrix aggregations: matrix_stats
Check your query structure:

# Re-run the search itself - the exception message names the offending aggregation
# (the Explain API only explains why a single document matches a query; it does not accept aggregations)
curl -X POST "localhost:9200/your-index/_search?size=0" -H 'Content-Type: application/json' -d'
{
  "query": { ... },
  "aggs": { ... }
}
'

Sampling should wrap bucket aggregations, not metric aggregations. Restructure your query:
// BEFORE: Incorrect - sampler wraps metric aggregation
{
"aggs": {
"sample_docs": {
"sampler": {
"shard_size": 100
},
"aggs": {
"avg_price": { // METRIC - doesn't support sampling
"avg": { "field": "price" }
}
}
}
}
}
// AFTER: Correct - sampler wraps bucket aggregation, metric is nested inside
{
"aggs": {
"sample_docs": {
"sampler": {
"shard_size": 100
},
"aggs": {
"product_categories": { // BUCKET aggregation - supports sampling
"terms": {
"field": "category.keyword",
"size": 10
},
"aggs": {
"avg_price": { // METRIC - now nested inside bucket aggregation
"avg": { "field": "price" }
}
}
}
}
}
}
}
// Alternative: Use diversified_sampler for more representative sampling
{
"aggs": {
"sample_docs": {
"diversified_sampler": {
"shard_size": 100,
"field": "user_id.keyword" // Ensure samples come from different users
},
"aggs": {
"popular_products": {
"terms": {
"field": "product_id.keyword",
"size": 20
}
}
}
}
}
}

Key principles:
1. Sampler wraps bucket aggregations, not metric aggregations
2. Metric aggregations go inside bucket aggregations
3. Pipeline aggregations operate on bucket aggregation results (see the sketch after this list)
4. Consider whether you really need sampling - your dataset may not be large enough to justify it
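As a concrete illustration of principle 3, here is a minimal sketch (the `timestamp` and `amount` field names are assumptions) of a pipeline aggregation that reads the output of date_histogram buckets rather than individual documents - which is exactly why it cannot be wrapped with a sampler:

// Sketch: the derivative pipeline aggregation consumes the "daily_total" values
// produced by the date_histogram buckets; it never sees raw documents
{
  "size": 0,
  "aggs": {
    "sales_per_day": {
      "date_histogram": {
        "field": "timestamp",
        "calendar_interval": "day"
      },
      "aggs": {
        "daily_total": {
          "sum": { "field": "amount" }
        },
        "daily_change": {
          "derivative": { "buckets_path": "daily_total" }
        }
      }
    }
  }
}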
If you need to calculate metrics on a sampled subset, use one of these approaches:
// APPROACH 1: Filter the query to limit the dataset before aggregation
// random_score assigns each document a uniform score in [0, 1); with boost_mode
// "replace" that random value becomes the document score, and min_score then keeps
// only documents whose score clears the threshold (roughly 10% here), so only that
// random subset reaches the aggregation
{
  "query": {
    "function_score": {
      "query": { "match_all": {} },
      "random_score": {},
      "boost_mode": "replace"
    }
  },
  "min_score": 0.9,
  "size": 0,
  "aggs": {
    "avg_price": {
      "avg": { "field": "price" }
    }
  }
}
// APPROACH 2: Use bucket selector for conditional sampling
{
"aggs": {
"price_ranges": {
"range": {
"field": "price",
"ranges": [
{ "to": 50 },
{ "from": 50, "to": 100 },
{ "from": 100 }
]
},
"aggs": {
"sample_check": {
"bucket_selector": {
"buckets_path": { "count": "_count" },
"script": "params.count > 1000" // Only include buckets with >1000 docs
}
},
"avg_rating": {
"avg": { "field": "rating" }
}
}
}
}
}
// APPROACH 3: Use scripted metric aggregation with sampling logic
{
"aggs": {
"sampled_stats": {
"scripted_metric": {
"init_script": "state.samples = []; state.count = 0",
"map_script": """
if (Math.random() < 0.1) { // 10% sampling
state.samples.add(doc['price'].value);
state.count++;
}
""",
"combine_script": "return state",
"reduce_script": """
double sum = 0;
long total = 0;
for (state in states) {
for (sample in state.samples) {
sum += sample;
total++;
}
}
// Guard against an empty sample to avoid returning NaN
return total > 0 ? ['avg': sum/total, 'count': total] : ['avg': null, 'count': 0L];
"""
}
}
}
}

Each approach has trade-offs:
- Query-level filtering: Simple but doesn't guarantee representative sampling
- Bucket selector: Good for conditional aggregation but not true random sampling
- Scripted metric: Most flexible but has performance implications
Test different aggregation structures to find what works:
# Test 1: Verify which aggregations support sampling
curl -X POST "localhost:9200/test-index/_search?size=0" -H 'Content-Type: application/json' -d'
{
"aggs": {
"test_sampler": {
"sampler": { "shard_size": 100 },
"aggs": {
"test_agg": {
"terms": { "field": "category.keyword" } // This should work
}
}
}
}
}
'
# Test 2: Try the diversified_sampler variant
curl -X POST "localhost:9200/test-index/_search?size=0" -H 'Content-Type: application/json' -d'
{
"aggs": {
"test_diversified": {
"diversified_sampler": {
"shard_size": 100,
"field": "user_id.keyword"
},
"aggs": {
"test_agg": {
"date_histogram": { // Another bucket aggregation that supports sampling
"field": "timestamp",
"calendar_interval": "day"
}
}
}
}
}
}
'
# Test 3: Check aggregation documentation for compatibility
# Refer to Elasticsearch documentation for each aggregation type
# Bucket aggregations that typically support sampling:
# - terms, significant_terms
# - date_histogram, histogram
# - range, date_range, ip_range
# - filters
# - nested, reverse_nested
# - geohash_grid, geotile_grid

Create a compatibility matrix:
- ✅ Support sampling: terms, date_histogram, histogram, range, filters, geohash_grid
- ❌ Do NOT support sampling: avg, sum, min, max, stats, cardinality, percentiles
- ⚠️ Conditional support: nested (depends on child aggregations), composite (limited)
If sampling isn't suitable for your use case, consider these optimizations:
// OPTIMIZATION 1: Use doc_values and keyword fields
{
"aggs": {
"categories": {
"terms": {
"field": "category.keyword", // keyword field uses doc_values
"size": 100,
"execution_hint": "map" // Optimize execution
}
}
}
}
// OPTIMIZATION 2: Limit aggregation scope with query filters
{
"query": {
"range": {
"timestamp": {
"gte": "now-7d/d" // Only aggregate last 7 days
}
}
},
"aggs": {
"daily_sales": {
"date_histogram": {
"field": "timestamp",
"calendar_interval": "day",
"min_doc_count": 1
},
"aggs": {
"total_sales": {
"sum": { "field": "amount" }
}
}
}
}
}
// OPTIMIZATION 3: Use composite aggregation for pagination
{
"size": 0,
"aggs": {
"paged_results": {
"composite": {
"size": 1000,
"sources": [
{
"category": {
"terms": { "field": "category.keyword" }
}
}
]
},
"aggs": {
"avg_price": {
"avg": { "field": "price" }
}
}
}
}
}
// OPTIMIZATION 4: Enable request cache for repeated queries
curl -X PUT "localhost:9200/your-index/_settings" -H 'Content-Type: application/json' -d'
{
"index.requests.cache.enable": true
}
'

Additional performance tips:
1. Use `keyword` fields for aggregations instead of analyzed `text` fields
2. Keep `doc_values` enabled (the default) on the keyword, numeric, and date fields you aggregate on - see the mapping sketch below
3. Set reasonable `size` limits on terms aggregations
4. Use `execution_hint: "map"` for medium-sized datasets
5. Leave `eager_global_ordinals` at its default of `false` for high-cardinality fields unless you need faster first aggregations after each refresh
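For reference, here is a minimal mapping sketch that follows tips 1 and 2 (the index and field names are assumptions): a `keyword` subfield for aggregations, with doc_values left at their default of enabled on both the keyword and numeric fields.

# Example mapping: aggregate on "category.keyword"; doc_values are on by default
# for keyword and numeric fields, so nothing extra needs to be set
curl -X PUT "localhost:9200/your-index" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "properties": {
      "category": {
        "type": "text",
        "fields": {
          "keyword": { "type": "keyword" }
        }
      },
      "price": {
        "type": "double"
      }
    }
  }
}
'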
Set up monitoring to ensure your aggregations perform well:
# Monitor aggregation performance with Profile API
curl -X POST "localhost:9200/your-index/_search?size=0" -H 'Content-Type: application/json' -d'
{
"profile": true,
"aggs": {
"test_aggregation": {
"terms": {
"field": "category.keyword",
"size": 50
}
}
}
}
'
# Check aggregation memory usage
curl -X GET "localhost:9200/_nodes/stats/indices/fielddata?human&fields=*" -H 'Content-Type: application/json'
# Monitor circuit breaker trips (related to aggregation memory)
curl -X GET "localhost:9200/_nodes/stats/breaker?human&pretty"
# Use the Validate API to check query syntax
# (note: _validate/query checks the query only; aggregations are parsed at search time)
curl -X GET "localhost:9200/your-index/_validate/query?explain" -H 'Content-Type: application/json' -d'
{
  "query": { ... }
}
'

Create alerts for:
1. High memory usage in fielddata or request circuit breakers
2. Slow aggregation queries (use Profile API timing)
3. Frequent aggregation errors including sampling errors
4. High cardinality fields causing performance issues
Best practices:
- Test with production-like data before deploying new aggregation queries
- Use the `_validate` API to catch syntax errors early
- Monitor the `search` thread pool for queue buildup (see the `_cat` example below)
- Consider dedicated coordinating nodes for aggregation-heavy workloads
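A quick way to watch for queue buildup is the `_cat/thread_pool` API, sketched here for the default `search` thread pool:

# Show active, queued, and rejected tasks for the search thread pool on each node
curl -X GET "localhost:9200/_cat/thread_pool/search?v&h=node_name,name,active,queue,rejected"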
## Understanding Elasticsearch Sampling Aggregations
### Sampler vs Diversified Sampler
Elasticsearch offers two sampling aggregations:
1. `sampler`: Keeps the top-scoring documents on each shard
- Uses the shard_size parameter to control the sample size per shard
- Selection is relevance-based: only the best-matching documents per shard feed the child aggregations (pair it with a random_score query if you want a random rather than relevance-based sample)
- Good for general-purpose, low-cost sampling
2. `diversified_sampler`: Samples while ensuring diversity on a field
- Requires field parameter to diversify samples
- Ensures samples come from different values of the specified field
- Better for avoiding bias when sampling grouped data
- Supports max_docs_per_value to limit samples per field value (see the sketch below)
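Here is a minimal sketch of `max_docs_per_value` (the `user_id` and `product_id` field names are assumptions), capping the sample at three documents per user so that heavy users cannot dominate it:

{
  "size": 0,
  "aggs": {
    "diverse_sample": {
      "diversified_sampler": {
        "shard_size": 200,
        "field": "user_id.keyword",
        "max_docs_per_value": 3   // at most 3 sampled docs per distinct user_id
      },
      "aggs": {
        "top_products": {
          "terms": { "field": "product_id.keyword", "size": 10 }
        }
      }
    }
  }
}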
### How Sampling Works Internally
1. Shard-level sampling: Each shard independently samples documents
2. Sample size: Controlled by shard_size (default: 100)
3. Document selection: The highest-scoring documents on each shard are kept; pair the sampler with a random_score query if you need random rather than relevance-based selection
4. Aggregation execution: Only sampled documents are passed to child aggregations
### Performance Characteristics
- Memory usage: Reduces memory by processing fewer documents
- Accuracy trade-off: Smaller samples = faster but less accurate
- Shard consideration: Each shard samples independently, so the total sample size is up to shard_size × number_of_shards (see the example after this list)
- Cardinality impact: High-cardinality fields may not be well-represented in small samples
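For example, assuming an index with 5 primary shards and a target total sample of roughly 1,000 documents, set shard_size to 1000 / 5 = 200 so each shard contributes up to 200 documents:

// Assumed: 5 primary shards, target sample of ~1,000 documents total
{
  "size": 0,
  "aggs": {
    "sample_docs": {
      "sampler": { "shard_size": 200 },
      "aggs": {
        "top_categories": {
          "terms": { "field": "category.keyword" }
        }
      }
    }
  }
}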
### When to Use Sampling
✅ Good use cases:
- Exploratory data analysis on large datasets
- Quick approximate aggregations for dashboard widgets
- Testing aggregation logic before full execution
- Identifying top categories/trends without exact counts
❌ Poor use cases:
- Exact calculations (revenue, counts, sums)
- Small datasets where sampling overhead isn't justified
- Aggregations requiring complete data (percentiles, cardinality)
- Compliance/reporting requiring exact figures
### Alternative Sampling Techniques
1. Query-time filtering: Use function_score with random_score
2. Index-time sampling: Store a "sample" flag during indexing (see the sketch after this list)
3. Rollup indices: Pre-aggregated data at different granularities
4. Transform API: Create summarized indices for common aggregations
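For technique 2, a minimal sketch assuming a hypothetical boolean field named `is_sample` that your indexing pipeline sets to true on roughly 10% of documents; queries that can tolerate approximation then simply filter on it:

// Index time (application side): set "is_sample": true on ~10% of documents
// Query time: restrict the aggregation to the pre-flagged sample
{
  "query": {
    "term": { "is_sample": true }
  },
  "size": 0,
  "aggs": {
    "avg_price": {
      "avg": { "field": "price" }
    }
  }
}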
### Compatibility Matrix
| Aggregation Type | Supports Sampler? | Notes |
|-----------------|-------------------|-------|
| terms | ✅ Yes | Primary use case for sampling |
| date_histogram | ✅ Yes | Good for time-series sampling |
| histogram | ✅ Yes | Works with numeric ranges |
| range/date_range | ✅ Yes | |
| filters | ✅ Yes | Filter buckets are computed over the sampled documents |
| significant_terms | ✅ Yes | |
| geohash_grid | ✅ Yes | Geographic sampling |
| avg/sum/min/max | ❌ No | Metric aggregations don't support sampling |
| stats/extended_stats | ❌ No | |
| cardinality | ❌ No | Requires complete dataset for accuracy |
| percentiles | ❌ No | |
| pipeline aggregations | ❌ No | Operate on bucket results, not documents |
| matrix_stats | ❌ No | |
### Production Recommendations
1. Start with shard_size = 100 and adjust based on accuracy needs
2. Use diversified_sampler when sampling user/entity data to avoid bias
3. Monitor accuracy by comparing sampled vs full results periodically (see the sketch after this list)
4. Consider data distribution - skewed data may require larger samples
5. Test with different shard_sizes to find optimal performance/accuracy balance
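One way to spot-check accuracy, sketched below with assumed index and field names, is to run the sampled and unsampled versions of the same aggregation side by side and compare the results:

# Sampled version of the aggregation
curl -X POST "localhost:9200/your-index/_search?size=0" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "sample_docs": {
      "sampler": { "shard_size": 200 },
      "aggs": {
        "top_categories": {
          "terms": { "field": "category.keyword", "size": 10 }
        }
      }
    }
  }
}
'
# Full (unsampled) version for comparison
curl -X POST "localhost:9200/your-index/_search?size=0" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "top_categories": {
      "terms": { "field": "category.keyword", "size": 10 }
    }
  }
}
'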