Document Version: 1.0
Date: August 2025
Author: AI Infrastructure Team
Note: This document uses placeholder values like <your-region>, ai-gpu-aks-rg, and ai-h100-cluster. Replace these with your organization's naming conventions and preferred Azure regions.
This document provides a comprehensive guide for deploying AI models using vLLM on Azure Kubernetes Service (AKS) with NVIDIA H100 GPUs and Multi-Instance GPU (MIG) technology. The solution enables running multiple AI models simultaneously on a single GPU with hardware isolation, optimizing cost and resource utilization.
- ✅ Single H100 GPU serving 2 models simultaneously
- ✅ Hardware isolation with guaranteed performance
- ✅ Production-ready automated management
- ✅ Cost optimization through GPU sharing
- Quick Start
- Architecture Overview
- Prerequisites
- Phase 1: AKS Cluster Creation
- Phase 2: GPU Operator Installation
- Phase 3: MIG Configuration
- Phase 4: Model Deployments
- Monitoring and Operations
- Troubleshooting
- Cost Analysis
- Security Considerations (Optional)
TL;DR: Complete deployment in ~30 minutes with these essential commands
- Azure CLI installed and logged in
- kubectl installed
- Sufficient Azure quota for H100 GPUs
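Optional: a one-minute tooling check before you start:

az account show -o table    # Azure CLI installed and logged in
kubectl version --client    # kubectl available
helm version --short        # Helm is needed for the chart installs below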
# 1. Create infrastructure (5 minutes)
az group create --name ai-gpu-aks-rg --location eastus
az aks create --resource-group ai-gpu-aks-rg --name ai-h100-cluster --location eastus --node-count 1 --generate-ssh-keys
az aks nodepool add --resource-group ai-gpu-aks-rg --cluster-name ai-h100-cluster --name gpupool --node-count 1 --node-vm-size Standard_NC40ads_H100_v5
az aks get-credentials --resource-group ai-gpu-aks-rg --name ai-h100-cluster
# 2. Install GPU components (10 minutes)
helm install --wait --create-namespace -n gpu-operator node-feature-discovery node-feature-discovery --repo https://kubernetes-sigs.github.io/node-feature-discovery/charts --set-json master.config.extraLabelNs='["nvidia.com"]'
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm install --wait gpu-operator -n gpu-operator nvidia/gpu-operator --set nfd.enabled=false --set driver.enabled=false
# 3. Configure MIG (2 minutes)
az aks nodepool update --cluster-name ai-h100-cluster --resource-group ai-gpu-aks-rg --nodepool-name gpupool --labels "nvidia.com/mig.config"="all-3g.47gb"
# 4. Deploy models (10 minutes)
kubectl create namespace vllm
# Apply the YAML manifests from the Phase 4 sections

Result:
- 2 AI models running simultaneously on 1 GPU
- 47.5GB memory per model instance
- Hardware isolation between workloads
- 50% cost savings vs separate GPUs
Continue reading for detailed explanations and configurations.
┌─────────────────────────────────────────────────────────────────┐
│ Azure Kubernetes Service (AKS) │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ vLLM Namespace │ │
│ │ ┌──────────────────┐ ┌──────────────────┐ │ │
│ │ │ BGE-M3 Service │ │ Granite Vision │ │ │
│ │ │ (Embeddings) │ │ Service (Chat) │ │ │
│ │ │ Port: 8001 │ │ Port: 8000 │ │ │
│ │ └──────────────────┘ └──────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ GPU Operator Namespace │ │
│ │ ┌────────────┐ ┌────────────┐ ┌─────────────────────────┐ │ │
│ │ │ NFD │ │Device Plugin│ │ MIG Manager │ │ │
│ │ └────────────┘ └────────────┘ └─────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ H100 GPU Node Pool │ │
│ │ ┌─────────────────────────────────────────────────────────┐ │ │
│ │ │ NVIDIA H100 NVL (94GB VRAM) │ │ │
│ │ │ ┌──────────────────┐ ┌──────────────────┐ │ │ │
│ │ │ │ MIG Instance 1 │ │ MIG Instance 2 │ │ │ │
│ │ │ │ 47.5GB Memory │ │ 47.5GB Memory │ │ │ │
│ │ │ │ 60 SM Units │ │ 60 SM Units │ │ │ │
│ │ │ │ │ │ │ │ │ │
│ │ │ │ BGE-M3 Model │ │ Granite Vision │ │ │ │
│ │ │ │ (~2GB Used) │ │ (~41GB Used) │ │ │ │
│ │ │ └──────────────────┘ └──────────────────┘ │ │ │
│ │ └─────────────────────────────────────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
| Component | Technology | Version | Purpose |
|---|---|---|---|
| Container Orchestration | Azure Kubernetes Service (AKS) | 1.30 | Container orchestration platform |
| GPU Hardware | NVIDIA H100 NVL | - | High-performance AI compute |
| GPU Virtualization | Multi-Instance GPU (MIG) | 3g.47gb profiles | Hardware-level GPU partitioning |
| GPU Management | NVIDIA GPU Operator | Latest | Automated GPU software stack |
| Model Serving | vLLM | Latest | High-performance LLM inference |
| Models | BGE-M3, Granite Vision 3.3-2B | Latest | Embeddings & Vision-Language models |
- Active Azure subscription with sufficient quota
- Resource group in preferred Azure region (with H100 availability)
- Azure CLI installed and configured
- kubectl installed and configured
| Resource | Quota Needed | Purpose |
|---|---|---|
| Standard NCads H100 v5 family vCPUs | 40 (1× Standard_NC40ads_H100_v5) | H100 GPU node |
| Total Regional vCPUs | 44+ | GPU node (40 vCPUs) + system node (4 vCPUs) |
| Premium Managed Disks | 300GB+ | OS disks (100GB system + 200GB GPU node) |
- Azure Kubernetes Service Contributor role
- Sufficient permissions to create/modify AKS resources
- Network access to install Helm charts and container images
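Before creating any resources, you can confirm that the target region has quota for the H100 VM family (a quick sanity check; the grep pattern simply narrows the table to the relevant family name):

# Show current usage vs. limit for the NCads H100 v5 family in your region
az vm list-usage --location <your-region> -o table | grep -i "NCads H100"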
# Create resource group in your preferred region
az group create \
--name ai-gpu-aks-rg \
  --location <your-region>

# Create AKS cluster with system node pool
az aks create \
--resource-group ai-gpu-aks-rg \
--name ai-h100-cluster \
--location <your-region> \
--node-count 1 \
--node-vm-size Standard_D4s_v5 \
--kubernetes-version 1.30 \
--enable-managed-identity \
--network-plugin azure \
--network-policy azure \
--node-osdisk-type Managed \
--node-osdisk-size 100 \
  --generate-ssh-keys

# Add GPU node pool with H100
az aks nodepool add \
--resource-group ai-gpu-aks-rg \
--cluster-name ai-h100-cluster \
--name gpupool \
--node-count 1 \
--node-vm-size Standard_NC40ads_H100_v5 \
--node-osdisk-type Managed \
--node-osdisk-size 200 \
--max-pods 110 \
  --kubernetes-version 1.30

# Get cluster credentials
az aks get-credentials \
--resource-group ai-gpu-aks-rg \
--name ai-h100-cluster \
--overwrite-existing
# Verify cluster access
kubectl get nodes

Expected Output:
NAME STATUS ROLES AGE VERSION
aks-gpupool-xxxxxxxx-vmss000001 Ready <none> 5m v1.30.14
aks-nodepool1-xxxxxxxx-vmss000001 Ready <none> 10m v1.30.14
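Optionally confirm the GPU node pool was created with the expected VM size and node count:

az aks nodepool show \
  --resource-group ai-gpu-aks-rg \
  --cluster-name ai-h100-cluster \
  --name gpupool \
  --query "{vmSize:vmSize, count:count, state:provisioningState}" -o table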
The NVIDIA GPU Operator automates the management of all NVIDIA software components needed for GPU workloads.
# Install NFD as prerequisite for GPU Operator
helm install --wait --create-namespace -n gpu-operator \
node-feature-discovery node-feature-discovery \
--repo https://kubernetes-sigs.github.io/node-feature-discovery/charts \
--set-json master.config.extraLabelNs='["nvidia.com"]' \
--set-json worker.tolerations='[
{
"effect": "NoSchedule",
"key": "sku",
"operator": "Equal",
"value": "gpu"
},
{
"effect": "NoSchedule",
"key": "mig",
"value": "notReady",
"operator": "Equal"
}
]'

# nfd-gpu-rule.yaml
apiVersion: nfd.k8s-sigs.io/v1alpha1
kind: NodeFeatureRule
metadata:
name: nfd-gpu-rule
namespace: gpu-operator
spec:
rules:
- name: "nfd-gpu-rule"
labels:
"feature.node.kubernetes.io/pci-10de.present": "true"
matchFeatures:
- feature: pci.device
matchExpressions:
        vendor: {op: In, value: ["10de"]}

kubectl apply -f nfd-gpu-rule.yaml

# Add NVIDIA Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
# Install GPU Operator
helm install --wait gpu-operator -n gpu-operator nvidia/gpu-operator \
--set-json daemonsets.tolerations='[
{
"effect": "NoSchedule",
"key": "sku",
"operator": "Equal",
"value": "gpu"
}
]' \
--set nfd.enabled=false \
--set driver.enabled=false \
  --set operator.runtimeClass=nvidia-container-runtime

# Check all GPU Operator components
kubectl get pods -n gpu-operator

Expected Components:
nvidia-device-plugin-daemonset-xxxxx
nvidia-mig-manager-xxxxx
nvidia-dcgm-exporter-xxxxx
gpu-feature-discovery-xxxxx
nvidia-container-toolkit-daemonset-xxxxx
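Before moving on, it can help to wait for the key daemonsets to finish rolling out (a convenience check; the daemonset names match the pod prefixes listed above):

kubectl rollout status daemonset/nvidia-device-plugin-daemonset -n gpu-operator --timeout=10m
kubectl rollout status daemonset/nvidia-mig-manager -n gpu-operator --timeout=10m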
Multi-Instance GPU (MIG) allows partitioning the H100 into multiple isolated GPU instances.
# mig-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: mig-parted-config
namespace: gpu-operator
data:
config.yaml: |
version: v1
mig-configs:
# Configuration for 2x 3g.47gb instances (recommended)
all-3g.47gb:
- devices: [0]
mig-enabled: true
mig-devices:
"3g.47gb": 2
# Alternative: 3x 2g.24gb instances
all-2g.24gb:
- devices: [0]
mig-enabled: true
mig-devices:
"2g.24gb": 3
# Alternative: 7x 1g.12gb instances (maximum partitioning)
all-1g.12gb:
- devices: [0]
mig-enabled: true
mig-devices:
"1g.12gb": 7kubectl apply -f mig-config.yamlImportant: Different GPU models have different MIG profile names. Always check what's available on your specific GPU:
# Check available MIG instance profiles on your GPU
kubectl exec -n gpu-operator \
$(kubectl get pods -n gpu-operator -l app=nvidia-mig-manager -o jsonpath='{.items[0].metadata.name}') \
  -- nvidia-smi mig -lgip

Expected Output for H100 NVL:
+-------------------------------------------------------------------------------+
| GPU instance profiles: |
| GPU Name ID Instances Memory P2P SM DEC ENC |
| Free/Total GiB CE JPEG OFA |
|===============================================================================|
| 0 MIG 1g.12gb 19 7/7 10.75 No 16 1 0 |
| 0 MIG 2g.24gb 14 3/3 21.62 No 32 2 0 |
| 0 MIG 3g.47gb 9 2/2 46.38 No 60 3 0 | ← We use this for 2 deployments
| 0 MIG 7g.94gb 0 1/1 93.12 No 132 7 0 |
+-------------------------------------------------------------------------------+
Common MIG Profiles by GPU Model:
- H100 NVL: Uses 3g.47gb (46.38 GiB per instance)
- A100 80GB: Uses 3g.40gb (39.59 GiB per instance)
- H100 SXM: May vary; check with the command above
Use the profile name exactly as shown in your output when configuring MIG.
# Label the GPU node with the CORRECT MIG configuration for your GPU
# For H100 NVL, use "3g.47gb". For A100, use "3g.40gb"
az aks nodepool update \
--cluster-name ai-h100-cluster \
--resource-group ai-gpu-aks-rg \
--nodepool-name gpupool \
--labels "nvidia.com/mig.config"="all-3g.47gb"
# Configure MIG Manager for reboot (H100 requirement)
kubectl patch clusterpolicy/cluster-policy -n gpu-operator --type merge \
-p '{"spec": {"migManager": {"env": [{"name": "WITH_REBOOT", "value": "true"}]}}}'# Check MIG instances are created
kubectl exec -n gpu-operator $(kubectl get pods -n gpu-operator -l app=nvidia-mig-manager -o jsonpath='{.items[0].metadata.name}') -- nvidia-smi mig -lgi
# Verify GPU resources are available
kubectl describe node -l agentpool=gpupool | grep nvidia.com/gpu

Expected Output:
+---------------------------------------------------------+
| GPU instances: |
| GPU Name Profile Instance Placement |
| ID ID Start:Size |
|=========================================================|
| 0 MIG 3g.47gb 9 1 0:4 |
+---------------------------------------------------------+
| 0 MIG 3g.47gb 9 2 4:4 |
+---------------------------------------------------------+
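The node should now advertise two nvidia.com/gpu resources, one per 3g.47gb instance. A one-line check:

# Expect this to print 2
kubectl get node -l agentpool=gpupool \
  -o jsonpath='{.items[0].status.allocatable.nvidia\.com/gpu}{"\n"}'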
kubectl create namespace vllm

# bge-m3-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: bge-m3-mig
namespace: vllm
spec:
replicas: 1
selector:
matchLabels:
app: bge-m3-mig
template:
metadata:
labels:
app: bge-m3-mig
spec:
nodeSelector:
agentpool: gpupool
tolerations:
- key: nvidia.com/gpu
operator: Exists
effect: NoSchedule
containers:
- name: vllm-server
image: vllm/vllm-openai:latest
args:
- "--model"
- "BAAI/bge-m3"
- "--trust-remote-code"
- "--max-model-len"
- "4096"
- "--gpu-memory-utilization"
- "0.85"
- "--dtype"
- "float16"
- "--api-key"
- "token-abc123"
- "--port"
- "8001"
- "--host"
- "0.0.0.0"
- "--enable-prefix-caching"
- "--max-num-seqs"
- "256"
ports:
- containerPort: 8001
env:
- name: CUDA_VISIBLE_DEVICES
value: "0"
- name: VLLM_WORKER_MULTIPROC_METHOD
value: "spawn"
- name: OMP_NUM_THREADS
value: "4"
resources:
limits:
nvidia.com/gpu: 1
memory: "30Gi"
cpu: "6"
requests:
nvidia.com/gpu: 1
memory: "24Gi"
cpu: "4"
volumeMounts:
- name: shm
mountPath: /dev/shm
volumes:
- name: shm
emptyDir:
medium: Memory
sizeLimit: 8Gi
---
apiVersion: v1
kind: Service
metadata:
name: bge-m3-service
namespace: vllm
spec:
type: LoadBalancer
selector:
app: bge-m3-mig
ports:
- port: 8001
    targetPort: 8001

# granite-vision-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: granite-vision-mig
namespace: vllm
spec:
replicas: 1
selector:
matchLabels:
app: granite-vision-mig
template:
metadata:
labels:
app: granite-vision-mig
spec:
nodeSelector:
agentpool: gpupool
tolerations:
- key: nvidia.com/gpu
operator: Exists
effect: NoSchedule
containers:
- name: vllm-server
image: vllm/vllm-openai:latest
args:
- "--model"
- "ibm-granite/granite-vision-3.3-2b"
- "--trust-remote-code"
- "--max-model-len"
- "8192"
- "--gpu-memory-utilization"
- "0.85"
- "--dtype"
- "auto"
- "--api-key"
- "token-abc123"
- "--port"
- "8000"
- "--host"
- "0.0.0.0"
- "--enable-prefix-caching"
ports:
- containerPort: 8000
env:
- name: CUDA_VISIBLE_DEVICES
value: "0"
- name: VLLM_WORKER_MULTIPROC_METHOD
value: "spawn"
resources:
limits:
nvidia.com/gpu: 1
memory: "30Gi"
cpu: "6"
requests:
nvidia.com/gpu: 1
memory: "24Gi"
cpu: "4"
volumeMounts:
- name: shm
mountPath: /dev/shm
volumes:
- name: shm
emptyDir:
medium: Memory
sizeLimit: 8Gi
---
apiVersion: v1
kind: Service
metadata:
name: granite-vision-service
namespace: vllm
spec:
type: LoadBalancer
selector:
app: granite-vision-mig
ports:
- port: 8000
    targetPort: 8000

# Deploy BGE-M3
kubectl apply -f bge-m3-deployment.yaml
# Deploy Granite Vision
kubectl apply -f granite-vision-deployment.yaml
# Check deployment status
kubectl get pods,svc -n vllm

# Get external IPs
kubectl get svc -n vllm
# Test BGE-M3 embeddings
curl -X POST http://<BGE_EXTERNAL_IP>:8001/v1/embeddings \
-H "Content-Type: application/json" \
-H "Authorization: Bearer token-abc123" \
-d '{
"model": "BAAI/bge-m3",
"input": "Enterprise AI infrastructure with MIG technology"
}' | jq '.data[0].embedding | length'
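If the LoadBalancer IPs are still provisioning, the same endpoint can be reached over a port-forward instead (a convenience sketch using the bge-m3-service defined above):

# Forward the service locally, then list the served models through the OpenAI-compatible API
kubectl -n vllm port-forward svc/bge-m3-service 8001:8001 &
curl -s http://localhost:8001/v1/models -H "Authorization: Bearer token-abc123" | jq .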
# Test Granite Vision chat
curl -X POST http://<GRANITE_EXTERNAL_IP>:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer token-abc123" \
-d '{
"model": "ibm-granite/granite-vision-3.3-2b",
"messages": [{"role": "user", "content": "Explain MIG technology benefits"}],
"max_tokens": 150
}' | jq .

# Check GPU utilization across MIG instances
kubectl exec -n gpu-operator \
$(kubectl get pods -n gpu-operator -l app=nvidia-mig-manager -o jsonpath='{.items[0].metadata.name}') \
  -- nvidia-smi

# Monitor pod resource usage
kubectl top pods -n vllm
# Check application logs
kubectl logs -f deployment/bge-m3-mig -n vllm
kubectl logs -f deployment/granite-vision-mig -n vllm
# Monitor service health
kubectl get endpoints -n vllm

The GPU Operator includes DCGM exporter for Prometheus monitoring:
# Access DCGM metrics
kubectl port-forward -n gpu-operator svc/nvidia-dcgm-exporter 9400:9400
# Sample metrics endpoint
curl http://localhost:9400/metrics | grep -i gpu
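Two counters that are often useful here are framebuffer memory in use and GPU utilization (metric names as exported by recent dcgm-exporter releases; with MIG enabled, some utilization counters may be reported differently or omitted, so check which ones appear in your output):

curl -s http://localhost:9400/metrics | grep -E 'DCGM_FI_DEV_FB_USED|DCGM_FI_DEV_GPU_UTIL'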
Symptoms:
- nvidia.com/gpu: 1 instead of 2 in node allocatable
- Models sharing the same GPU without isolation
Solution:
# Check MIG configuration
kubectl describe node -l agentpool=gpupool | grep mig.config
# Restart MIG manager if needed
kubectl delete pod -l app=nvidia-mig-manager -n gpu-operator
# Verify GPU processes
kubectl exec -n gpu-operator <mig-manager-pod> -- nvidia-smi

Symptoms:
- Pods stuck in Pending state
- Event: Insufficient nvidia.com/gpu
Solution:
# Check GPU resource availability
kubectl describe node -l agentpool=gpupool | grep -A10 Allocatable
# Verify device plugin is running
kubectl get pods -n gpu-operator -l app=nvidia-device-plugin-daemonset

Symptoms:
- vLLM containers crashing during model load
- OOM (out of memory) errors
Solution:
# Check available memory per MIG instance
kubectl exec -n gpu-operator <mig-manager-pod> -- nvidia-smi mig -lgi
# Adjust gpu-memory-utilization in deployment
# Reduce from 0.85 to 0.7 if needed
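One way to apply that change is to edit the deployment in place and watch the rollout (standard kubectl workflow; shown for the BGE-M3 deployment, the same applies to granite-vision-mig):

# Change the --gpu-memory-utilization arg in the pod spec, save, and exit
kubectl -n vllm edit deployment/bge-m3-mig
# Watch the replacement pod come up with the new setting
kubectl -n vllm rollout status deployment/bge-m3-mig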
Symptoms:
- Services not accessible externally
- LoadBalancer external IP stuck in <pending>
Solution:
# Check service status
kubectl describe svc -n vllm
# Verify network policies
kubectl get networkpolicy -n vllm
# Check AKS load balancer configuration
az aks show -n ai-h100-cluster -g ai-gpu-aks-rg --query networkProfile

# Comprehensive cluster health check
kubectl get nodes,pods,svc --all-namespaces
kubectl top nodes
kubectl describe node -l agentpool=gpupool
# GPU-specific diagnostics
kubectl get pods -n gpu-operator
kubectl logs -l app=nvidia-device-plugin-daemonset -n gpu-operator
kubectl exec -n gpu-operator <mig-manager-pod> -- nvidia-smi -q
# Application diagnostics
kubectl describe pods -n vllm
kubectl logs -f <pod-name> -n vllm

| Resource | Type | Quantity | Monthly Cost (USD) |
|---|---|---|---|
| AKS Cluster | Management | 1 | Free |
| System Node Pool | Standard_D4s_v5 | 1 | ~$120 |
| GPU Node Pool | Standard_NC40ads_H100_v5 | 1 | ~$2,400 |
| Managed Disks | Premium SSD | 300GB | ~$60 |
| Load Balancer | Standard | 2 | ~$40 |
| Total | | | ~$2,620 |
Without MIG (2 separate GPU nodes):
- 2x Standard_NC40ads_H100_v5 = ~$4,800/month
With MIG (1 GPU node, 2 isolated instances):
- 1x Standard_NC40ads_H100_v5 = ~$2,400/month
- Savings: ~$2,400/month (50% reduction)
- Hardware Utilization: 90%+ GPU utilization vs 40-60% without MIG
- Operational Efficiency: Single node management vs multiple nodes
- Development Velocity: Faster iteration with shared infrastructure
- Scalability: Easy reconfiguration of MIG profiles as needs change
Note: The basic setup works without additional security configurations. These are optional hardening measures for production environments.
The deployment includes basic security out-of-the-box:
- ✅ API Authentication: Bearer token required (token-abc123)
- ✅ Hardware Isolation: MIG provides GPU-level isolation
- ✅ Network Isolation: Kubernetes namespace separation
- TLS: not enabled by default; the LoadBalancer services expose plain HTTP, so add TLS termination (for example, an ingress controller) for HTTPS endpoints
Optional security configurations:
# Optional: Restrict network traffic between namespaces
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: vllm-network-policy
namespace: vllm
spec:
podSelector: {}
policyTypes:
- Ingress
- Egress
ingress:
- ports:
- protocol: TCP
port: 8000
- protocol: TCP
port: 8001
egress:
- to: []
ports:
- protocol: TCP
port: 443 # HTTPS
- protocol: UDP
    port: 53 # DNS

# Optional: Create dedicated service account
apiVersion: v1
kind: ServiceAccount
metadata:
name: vllm-sa
  namespace: vllm

- Change API Keys: Replace token-abc123 with secure tokens (see the sketch after this list)
- Rate Limiting: Add ingress controllers with rate limits
- Input Validation: Implement request/response validation
- Audit Logging: Enable Azure Monitor for comprehensive logging
- Private Endpoints: Use private AKS clusters for sensitive workloads
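A minimal sketch for the API-key item above, assuming you store the token in a Kubernetes Secret and wire it into the vLLM deployments yourself (the secret and key names here are illustrative, not part of the original setup):

# Generate a random token and store it as a Secret in the vllm namespace
kubectl -n vllm create secret generic vllm-api-key \
  --from-literal=api-key="$(openssl rand -hex 24)"

# Reference it from the deployments (e.g., via an env var with valueFrom.secretKeyRef)
# instead of the hard-coded --api-key value, then roll the pods to pick it up:
kubectl -n vllm rollout restart deployment/bge-m3-mig deployment/granite-vision-mig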
- Data Residency: All processing within your chosen Azure region
- No Data Persistence: Models don't store request/response data
- Hardware Isolation: MIG ensures complete GPU-level separation
- Encrypted Communication: terminate TLS/HTTPS (for example, at an ingress or gateway) for all external API calls
- Right-sizing: Monitor actual usage and adjust resource requests/limits
- Node Affinity: Use node selectors to ensure GPU workloads run on GPU nodes
- Horizontal Scaling: Plan for multiple replicas with additional GPU nodes
- Vertical Scaling: Adjust MIG profiles based on workload requirements
- GitOps: Store all configurations in version control
- CI/CD Integration: Automate deployments with proper testing
- Monitoring: Implement comprehensive monitoring and alerting
- Backup/Recovery: Regular backup of configuration and state
- Model Caching: Use persistent volumes for model caching
- Batch Processing: Optimize batch sizes for throughput
- Memory Management: Fine-tune GPU memory utilization
- Network Optimization: Use cluster-internal services where possible
This implementation provides organizations with:
- Cost-Effective AI Infrastructure: 50% cost reduction through GPU sharing
- Production-Ready Platform: Automated management and monitoring
- Scalable Architecture: Easy to extend with additional models/nodes
- Enterprise Security: Comprehensive security and compliance features
- Operational Excellence: Full observability and troubleshooting capabilities
The MIG-enabled AKS cluster successfully demonstrates how modern GPU virtualization can optimize AI workload deployments while maintaining strict isolation and performance guarantees.
- Production Readiness: Implement comprehensive monitoring and alerting
- Model Expansion: Add additional AI models as business requires
- Automation: Develop CI/CD pipelines for model deployment
- Optimization: Continuous performance tuning based on usage patterns
- Scaling: Plan for multi-node GPU clusters as demand grows
Document Status: ✅ Complete
Last Updated: August 2025
Review Cycle: Quarterly
Next Review: November 2025