AKS disk CSI errors occur when Azure Disk volumes fail to attach, mount, or provision in Kubernetes clusters. Common causes include incorrect RBAC permissions, disk resource group mismatches, case-sensitive URI format issues, or CSI driver installation problems. Fix by verifying service principal permissions, checking disk resource groups, validating volume configurations, and ensuring the Azure Disk CSI driver is properly deployed.
The Azure Disk Container Storage Interface (CSI) driver is a critical component in AKS that manages the lifecycle of persistent volumes backed by Azure Managed Disks. When CSI errors occur, pods cannot attach, mount, or provision disks, causing application failures. These errors stem from misaligned permissions between the AKS service principal and Azure resources, misconfigured volume references, or deployment issues with the CSI driver components. Unlike in-tree volume plugins, CSI drivers run as separate pods on cluster nodes and require explicit registration with kubelet for volume operations to succeed.
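Before changing anything, it can help to confirm the driver is actually registered with the cluster. The CSIDriver and CSINode objects checked below are standard Kubernetes APIs, not AKS-specific; the driver name disk.csi.azure.com assumes the default Azure Disk CSI driver:
# Confirm the Azure Disk CSI driver object exists in the cluster
kubectl get csidriver disk.csi.azure.com
# Each node should list disk.csi.azure.com once the node plugin has registered with kubelet
kubectl get csinodes -o custom-columns='NODE:.metadata.name,DRIVERS:.spec.drivers[*].name'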
The AKS cluster identity (managed identity or service principal) must have the Contributor role on the disk's resource group:
# Get the cluster identity's principal ID (for service-principal-based clusters,
# query servicePrincipalProfile.clientId instead)
SP_ID=$(az aks show --resource-group <rg> --name <cluster-name> \
--query identity.principalId -o tsv)
# Get the disk's resource group (typically the MC_* node resource group)
DISK_RG=$(az disk show --resource-group <rg> --name <disk-name> \
--query resourceGroup -o tsv)
# Check the role assignment
az role assignment list --assignee $SP_ID \
--resource-group $DISK_RG \
--query "[].roleDefinitionName" -o tableIf Contributor is missing, assign it:
az role assignment create \
--assignee $SP_ID \
--role Contributor \
--resource-group $DISK_RG
PersistentVolume disk resource IDs are case-sensitive:
# Get the correct disk resource ID
az disk show --resource-group <rg> --name <disk-name> --query id -o tsv
# Example output:
# /subscriptions/12345678-1234-1234-1234-123456789012/resourceGroups/MC_myRG_myCluster_eastus/providers/Microsoft.Compute/disks/my-disk
In your PersistentVolume YAML, ensure the diskURI exactly matches (case-sensitive):
apiVersion: v1
kind: PersistentVolume
metadata:
  name: my-pv
spec:
  capacity:
    storage: 32Gi
  accessModes:
    - ReadWriteOnce
  azureDisk:
    kind: Managed
    diskName: my-disk
    diskURI: /subscriptions/12345678-1234-1234-1234-123456789012/resourceGroups/MC_myRG_myCluster_eastus/providers/Microsoft.Compute/disks/my-disk
    fsType: ext4
AKS clusters create a managed resource group (MC_*) for node resources. Pre-created disks normally belong in this group so the cluster identity can manage them:
# Find your cluster's node resource group
NODE_RG=$(az aks show --resource-group <your-rg> --name <cluster-name> \
--query nodeResourceGroup -o tsv)
echo "Node resource group: $NODE_RG"
# List all disks in the node resource group
az disk list --resource-group $NODE_RG --query "[].{name:name, id:id}" -o table
If your disk is in a different resource group, either:
1. Move the disk to the node resource group (see the CLI sketch below), OR
2. Ensure the service principal has Contributor role on the disk's actual resource group
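For the move, a minimal CLI sketch, assuming the disk is detached from any VM and that your subscription allows moving managed disks; $NODE_RG comes from the command above, and <rg> and <disk-name> are placeholders:
# Move a detached managed disk into the node resource group
DISK_ID=$(az disk show --resource-group <rg> --name <disk-name> --query id -o tsv)
az resource move --destination-group "$NODE_RG" --ids "$DISK_ID"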
Moving the disk can also be done through the Azure Portal or Azure Resource Mover.
The CSI driver must be deployed and running on all worker nodes:
# Check if the CSI driver is running
kubectl get pods -n kube-system | grep azuredisk
# Look for azuredisk-csi-node and azuredisk-csi-controller pods
kubectl get daemonset -n kube-system | grep azuredisk
kubectl get deployment -n kube-system | grep azuredisk
Expected output should show:
- azuredisk-csi-node-*: Running on every worker node (DaemonSet)
- azuredisk-csi-controller-*: Running controller pod(s)
If pods are missing or failing:
kubectl describe pod -n kube-system <azuredisk-csi-node-pod>
kubectl logs -n kube-system <azuredisk-csi-node-pod>
Install the official Azure Disk CSI driver using Helm (mainly for self-managed clusters; AKS running Kubernetes 1.21+ ships the driver as a managed component):
# Add the Azure Disk CSI driver Helm repository
helm repo add azuredisk-csi-driver https://raw.githubusercontent.com/kubernetes-sigs/azuredisk-csi-driver/master/charts
helm repo update
# Install the driver
helm install azuredisk-csi-driver azuredisk-csi-driver/azuredisk-csi-driver \
-n kube-system \
--set controller.replicas=2 \
--set node.tolerations[0].key=node-role.kubernetes.io/master \
--set node.tolerations[0].operator=Exists \
--set node.tolerations[1].key=node-role.kubernetes.io/control-plane \
--set node.tolerations[1].operator=Exists
Or upgrade if already installed:
helm upgrade azuredisk-csi-driver azuredisk-csi-driver/azuredisk-csi-driver \
-n kube-system
Wait for deployment to complete:
kubectl rollout status daemonset/azuredisk-csi-node -n kube-system
kubectl rollout status deployment/azuredisk-csi-controller -n kube-system
Each Azure VM SKU has a maximum number of data disks that can attach:
# Check your node VM sizes
kubectl describe nodes | grep "node.kubernetes.io/instance-type"
# Common maximum data disk counts:
# Standard_B1s: 2 disks
# Standard_B2s: 4 disks
# Standard_D2s_v3: 4 disks
# Standard_D4s_v3: 8 disks
# Standard_D8s_v3: 16 disks
# Standard_D16s_v3: 32 disks
# Standard_E4s_v3: 8 disks
# Standard_E8s_v3: 16 disks
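Rather than relying on a static table, the limit for any SKU can be read from the compute SKU capabilities; a sketch, assuming the Azure CLI and substituting your own region and VM size:
# Query MaxDataDiskCount for a VM size in a region (eastus and Standard_D4s_v3 are placeholders)
az vm list-skus --location eastus --size Standard_D4s_v3 \
  --query "[0].capabilities[?name=='MaxDataDiskCount'].value" -o tsv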
Count current disk attachments on a node (the node status lists them under .status.volumesAttached):
kubectl get node <node-name> -o jsonpath='{.status.volumesAttached[*].name}' | wc -w
If approaching the limit, either:
1. Use larger VM SKUs with higher attachment limits, OR
2. Use Azure Container Storage or other storage solutions (NFS, Azure Files) instead (see the Azure Files sketch below)
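For option 2, a minimal sketch of a claim backed by Azure Files instead of a managed disk, assuming the azurefile-csi StorageClass that AKS creates by default (the claim name shared-data is illustrative):
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-data   # illustrative name
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: azurefile-csi
  resources:
    requests:
      storage: 10Gi
EOF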
Scale to a larger SKU by adding a new node pool (an existing pool's VM size cannot be changed in place), then drain and delete the old pool:
az aks nodepool add --resource-group <rg> --cluster-name <cluster> \
  --name <new-nodepool> --node-vm-size Standard_D8s_v3
Ensure your StorageClass uses the disk.csi.azure.com provisioner (for example, the built-in managed-csi class):
kubectl get storageclass
kubectl describe storageclass managed-csi
Expected output should show:
Provisioner: disk.csi.azure.com
Parameters:
  kind: Managed
  storageaccounttype: Premium_LRS # or Standard_LRS
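If no CSI-backed class exists (for example on a self-managed cluster), a minimal one can be created. This is a sketch using documented azuredisk-csi-driver parameters; the class name and skuName are illustrative:
cat <<'EOF' | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-csi   # illustrative; pick a non-conflicting name if the class already exists
provisioner: disk.csi.azure.com
parameters:
  skuName: StandardSSD_LRS
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
EOF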
Create a test PersistentVolumeClaim:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-azure-disk-pvc
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: managed-csi
  resources:
    requests:
      storage: 10Gi
Apply and check status:
kubectl apply -f test-pvc.yaml
kubectl describe pvc test-azure-disk-pvc
kubectl get pvc test-azure-disk-pvc -w # Watch for Bound status
If still failing, check events:
kubectl describe pvc test-azure-disk-pvc
kubectl get events -n default --sort-by='.lastTimestamp' | grep -i disk
Check the CSI driver logs to see the underlying Azure API calls:
# Check azuredisk-csi-controller logs
kubectl logs -n kube-system -l app=azuredisk-csi-controller -c azuredisk
# Check azuredisk-csi-node logs
kubectl logs -n kube-system -l app=azuredisk-csi-node -c azuredisk
# Look for specific errors like:
# - "RequestFailed"
# - "Forbidden"
# - "NotFound"
# - "InvalidParameter"For authentication issues:
# Verify cluster can reach Azure Resource Manager
kubectl run -it --image=mcr.microsoft.com/azure-cli:latest --rm debug -- bash
# Inside the pod (substitute your service principal credentials):
az login --service-principal -u $AZURE_CLIENT_ID -p $AZURE_CLIENT_SECRET --tenant $AZURE_TENANT_ID
az disk show --ids "/subscriptions/SUBSCRIPTION_ID/resourceGroups/RESOURCE_GROUP/providers/Microsoft.Compute/disks/DISK_NAME"
AKS disk CSI driver behavior varies by Kubernetes version and Azure region. Ensure your AKS cluster is on a supported Kubernetes version (1.18+); older versions may require custom CSI driver configurations. For stop/start scenarios, there is a known issue where disks do not reattach to VMSS instances after a cluster restart; manually detach the disks in the Azure Portal before restarting. Azure Container Storage is an emerging alternative that offers higher throughput and lower latency than individually managed disks. For multi-zone clusters, use allowedTopologies in the StorageClass to ensure disks are provisioned in the same zone as their pods. Premium_LRS disks cost more but guarantee IOPS; Standard_LRS is suitable for dev/test. Use fsGroupChangePolicy: OnRootMismatch in Kubernetes 1.20+ to avoid performance penalties during large permission changes. For rootless deployments and SELinux-enabled nodes, ensure the CSI driver pods have an appropriate securityContext. Consider using Azure Policy to enforce disk encryption and compliance requirements. Monitor CSI metrics via Prometheus if available; watch provisioning latency and attachment failure rates.
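Where zone pinning is needed, a sketch of a zone-restricted StorageClass, assuming a cluster in eastus and the azuredisk CSI topology key topology.disk.csi.azure.com/zone (the class name and zone value are illustrative):
cat <<'EOF' | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-csi-zone1   # illustrative name
provisioner: disk.csi.azure.com
parameters:
  skuName: Premium_LRS
volumeBindingMode: WaitForFirstConsumer
allowedTopologies:
  - matchLabelExpressions:
      - key: topology.disk.csi.azure.com/zone
        values:
          - eastus-1
EOF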