Alertmanager notification failures occur when alerts cannot be delivered to configured receivers like email, Slack, or webhooks. Common causes include SMTP misconfiguration, network connectivity issues, invalid receiver endpoints, TLS certificate errors, and timeout problems.
When Alertmanager logs "notification failed", it means an alert was triggered and routed to a receiver, but the delivery to that receiver (email, Slack, webhook, PagerDuty, etc.) encountered an error. This doesn't mean the alert itself failed—Alertmanager received the alert from Prometheus and determined it should be sent somewhere, but the actual delivery step failed. Notification failures are tracked per receiver integration. If multiple Alertmanager instances are running, the failure of one instance doesn't impact delivery if another instance can send it. However, if all instances fail, the alert won't reach anyone, which is why monitoring Alertmanager's own notification metrics is critical.
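Because failures are counted per integration, a per-integration failure ratio makes it obvious at a glance which receiver type is affected. A PromQL sketch, assuming Prometheus scrapes Alertmanager's own /metrics endpoint (kube-prometheus-stack sets this up by default):
sum by (integration) (rate(alertmanager_notifications_failed_total[1h])) / sum by (integration) (rate(alertmanager_notifications_total[1h]))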
View the Alertmanager pod logs to see why notification delivery failed:
kubectl logs -n monitoring alertmanager-0 --tail=100
Look for messages like:
- "x509: certificate signed by unknown authority" → TLS issue
- "connection refused" → Receiver unreachable
- "context deadline exceeded" → Timeout
- "authentication failed" → SMTP auth issue
For more verbose logging, check if debug mode is enabled:
kubectl get deployment,statefulset -n monitoring -o yaml | grep -i "log.level"
If debug is not enabled, restart Alertmanager with --log.level=debug added to the container args.
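With kube-prometheus-stack, Alertmanager is usually managed by the Prometheus Operator, so the log level is set on the Alertmanager custom resource rather than the pod spec directly. A sketch, assuming the CR is named main (check kubectl get alertmanager -n monitoring for the actual name):
kubectl patch alertmanager main -n monitoring --type merge -p '{"spec":{"logLevel":"debug"}}'
The operator then rolls the pods with the new flag for you.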
Test connectivity from Alertmanager pod to receiver:
kubectl exec -it alertmanager-0 -n monitoring -- sh
# For email (SMTP)
telnet smtp.gmail.com 587
# For webhook
curl -v https://your-webhook-endpoint.com/alerts
# For Slack
curl -X POST https://hooks.slack.com/services/YOUR/WEBHOOK/URL \
-H "Content-Type: application/json" \
-d '{"text":"Test message"}'If connection refused or timeout, the receiver is down or unreachable. Check receiver service status, firewalls, security groups, and DNS resolution.
If using email receiver, check your Alertmanager config:
kubectl get secret alertmanager-config -n monitoring -o jsonpath="{.data.alertmanager\.yml}" | base64 -d | grep -A 10 "global:"
Verify:
- smtp_smarthost is correct (e.g., "smtp.gmail.com:587")
- smtp_auth_username and smtp_auth_password are set
- smtp_require_tls is true (for most providers)
- smtp_from address is valid
Example correct config:
global:
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_auth_username: [email protected]
  smtp_auth_password: your-app-password
  smtp_require_tls: true
  smtp_from: [email protected]
Note: Gmail requires app-specific passwords, not your regular password. Also note the global field is smtp_from, not from.
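After editing, it helps to validate the file before putting it back into the cluster. A sketch, assuming the secret and key names used above and that amtool (shipped with Alertmanager releases) is available locally:
kubectl get secret alertmanager-config -n monitoring -o jsonpath="{.data.alertmanager\.yml}" | base64 -d > alertmanager.yml
amtool check-config alertmanager.yml
kubectl create secret generic alertmanager-config -n monitoring --from-file=alertmanager.yml --dry-run=client -o yaml | kubectl apply -f -
amtool check-config catches syntax errors and unknown fields before Alertmanager ever sees the config.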
For webhook receivers, validate the configuration:
kubectl get alertmanagerconfig -n monitoring -o yaml
# or
kubectl get secret alertmanager-config -n monitoring -o jsonpath="{.data.alertmanager\.yml}" | base64 -d
Look for webhook receivers:
receivers:
- name: webhook-receiver
  webhook_configs:
  - url: https://your-webhook-endpoint.com/alerts
    send_resolved: true
Test the webhook with a sample alert payload:
curl -X POST https://your-webhook-endpoint.com/alerts \
  -H "Content-Type: application/json" \
  -d '{"version":"4","status":"firing","alerts":[{"status":"firing","labels":{"alertname":"TestAlert"}}]}'
(Alertmanager sends a JSON object with a top-level "alerts" array, not a bare array, so this payload is closer to what the receiver will actually get.)
If the endpoint returns 4xx or 5xx, fix the endpoint configuration or the receiving service.
If logs show "x509: certificate signed by unknown authority", the receiver uses a self-signed or untrusted certificate:
Option 1 (Recommended): Fix the certificate
- Obtain proper certificate from trusted CA
- Update receiver service with valid certificate
Option 2: Skip verification (use with caution)
Add to Alertmanager config:
receivers:
- name: webhook-receiver
  webhook_configs:
  - url: https://self-signed-endpoint.com
    http_config:
      tls_config:
        insecure_skip_verify: true
(For webhook receivers, tls_config lives under http_config, not directly under webhook_configs.)
Update the secret:
kubectl edit secret alertmanager-config -n monitoring
(Secret data is base64-encoded, so decoding, editing, and re-applying the file as shown earlier is usually easier than editing it in place.)
Option 3: Add certificate to trust store
Mount the CA certificate into the Alertmanager pod:
kubectl create configmap ca-cert --from-file=ca.crt -n monitoring
Then reference it in the pod spec:
volumeMounts:
- name: ca-cert
  mountPath: /etc/ssl/certs/custom-ca.crt
  subPath: ca.crt
volumes:           # pod-level, alongside containers
- name: ca-cert
  configMap:
    name: ca-cert
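Instead of relying on the system trust store, the mounted CA can also be referenced explicitly in the receiver's HTTP client settings. A sketch using the mount path from above; receiver and endpoint names follow the earlier examples:
receivers:
- name: webhook-receiver
  webhook_configs:
  - url: https://self-signed-endpoint.com
    http_config:
      tls_config:
        ca_file: /etc/ssl/certs/custom-ca.crt
This keeps verification enabled while trusting only your internal CA.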
If alerts occasionally fail to send, the receiver might be slow:
kubectl logs -n monitoring alertmanager-0 | grep "context deadline exceeded"
Increase the timeout in the Alertmanager config:
receivers:
- name: webhook-receiver
  webhook_configs:
  - url: https://your-webhook-endpoint.com/alerts
    send_resolved: true
    # Note: webhook_configs has no headers field; Alertmanager always posts
    # application/json, so a custom Content-Type is neither needed nor supported.
    # Recent Alertmanager releases (0.27+) accept a per-webhook timeout; older
    # versions have no per-receiver timeout setting.
    timeout: 30s
Apply the change by patching the AlertmanagerConfig or editing the secret directly. Alternatively, optimize the receiving service to respond faster, and check the receiver's own logs for slow processing.
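To confirm that slowness rather than hard failures is the problem, look at Alertmanager's notification latency histogram. A PromQL sketch; the metric comes from Alertmanager's /metrics endpoint:
histogram_quantile(0.99, sum by (le, integration) (rate(alertmanager_notification_latency_seconds_bucket[5m])))
If the p99 latency approaches your timeout, the receiver is the bottleneck.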
Create a test alert rule in Prometheus to verify delivery:
groups:
- name: test-alerts
  rules:
  - alert: TestAlert
    expr: up == 1   # fires as long as at least one target is up
    for: 1m
    annotations:
      summary: "Test notification from Prometheus"
      description: "This is a test alert to verify Alertmanager notification delivery"
Apply the rule:
kubectl apply -f test-alert-rule.yaml
Wait ~1 minute, then check:
- Prometheus Alerts page shows the alert
- Alertmanager UI shows the alert
- Notification is delivered to receiver
If test alert arrives, your config is working. If not, keep debugging using earlier steps.
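Note that the YAML above is in plain Prometheus rule-file format. If Prometheus is managed by the Prometheus Operator (as in kube-prometheus-stack), kubectl apply expects it wrapped in a PrometheusRule resource instead. A sketch; the release label is an assumption and must match your Prometheus ruleSelector:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: test-alerts
  namespace: monitoring
  labels:
    release: kube-prometheus-stack   # assumption: verify against your Prometheus ruleSelector
spec:
  groups:
  - name: test-alerts
    rules:
    - alert: TestAlert
      expr: up == 1
      for: 1m
      annotations:
        summary: "Test notification from Prometheus"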
Monitor notification failures using Prometheus metrics:
kubectl exec -it prometheus-0 -n monitoring -- promtool query instant http://localhost:9090 'alertmanager_notifications_failed_total'
Or query in the Prometheus UI:
rate(alertmanager_notifications_failed_total[5m])
Set up an alert for notification failures:
- alert: AlertmanagerNotificationFailures
  expr: rate(alertmanager_notifications_failed_total[5m]) > 0
  for: 5m
  annotations:
    summary: "Alertmanager notifications are failing"
    description: "Check Alertmanager logs and receiver configuration"
This creates a meta-alert that fires when Alertmanager itself can't send notifications.
In production Kubernetes clusters running kube-prometheus-stack, notification failures often stem from NetworkPolicy restrictions blocking outbound traffic from the monitoring namespace to external services. Verify egress rules allow SMTP ports (25, 465, 587) and HTTPS (443) to receiver endpoints. For multi-instance Alertmanager deployments, ensure all replicas can reach the receivers; if one instance fails consistently while others succeed, check pod-to-receiver networking specifically for that pod. For Slack/Teams/PagerDuty integrations, validate that API tokens and endpoint URLs haven't expired or changed. Some cloud providers (EKS, GKE, AKS) may require additional configuration for egress via NAT gateways or outbound proxy rules. In air-gapped environments, webhook endpoints must be internal or reachable through a VPN or proxy. If the receiver rate-limits requests, make sure it can handle the alert volume: grouping more aggressively (broader group_by, longer group_interval) sends fewer, larger notifications. When troubleshooting intermittent failures, correlate them with receiver resource metrics (CPU, memory) and network latency. For Gmail SMTP, use app-specific passwords, not regular passwords; Microsoft 365 may require additional setup, such as enabling SMTP AUTH for the sending mailbox.
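If NetworkPolicies are in play, an egress rule along these lines is typically needed for the Alertmanager pods. A sketch; the pod selector label is an assumption, so verify it with kubectl get pods -n monitoring --show-labels:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: alertmanager-egress
  namespace: monitoring
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: alertmanager   # assumption: match your actual pod labels
  policyTypes:
  - Egress
  egress:
  - ports:
    - port: 25     # SMTP
      protocol: TCP
    - port: 465
      protocol: TCP
    - port: 587
      protocol: TCP
    - port: 443    # HTTPS webhooks / Slack / PagerDuty
      protocol: TCP
  - ports:         # allow DNS so receiver hostnames resolve
    - port: 53
      protocol: UDP
    - port: 53
      protocol: TCP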