@surajssd
Created March 5, 2026 13:34
DeepSeek R1 (671B) on Azure H100: Storage Options Comparison

Premium SSD vs Premium Files vs Premium Blob vs Azure Managed Lustre vs Run:AI Model Streamer

Deployment Context

| Parameter | Value |
| --- | --- |
| Model | DeepSeek R1, 671B parameters (MoE, 37B activated) |
| Model Size on Disk | ~689 GB (163 safetensor files, BF16) |
| VM SKU | Standard_ND96isr_H100_v5 |
| GPUs per Node | 8× NVIDIA H100 80GB (640 GB total GPU memory) |
| vCPUs / RAM | 96 vCPUs / 1,900 GiB |
| Network Bandwidth | 80 Gbps Ethernet (~10 GB/s) + 8× 400 Gbps InfiniBand (3.2 Tbps aggregate for GPU-direct RDMA) |
| VM Uncached Disk Throughput | 612 MB/s (VM-level cap for remote storage) |
| Local NVMe | 8× NVMe disks, 28 TiB total |
| VM Price (Pay-as-you-go) | $98.32/hr ($71,790/mo) per node |
| Nodes Required | Minimum 2 nodes (16× H100) for BF16 full-precision inference: the model needs ~1.34 TB of GPU memory in BF16, and 2 nodes give 1.28 TB, which is tight but workable with MoE sparsity since only 37B params are active. More commonly 2–4 nodes for comfortable serving with KV-cache headroom. |
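
The 2-node minimum comes down to simple arithmetic. A sketch of the memory math (illustrative only; it ignores KV cache, activations, and runtime overhead):

```python
# Back-of-envelope GPU-memory check for BF16 DeepSeek R1 on
# ND96isr_H100_v5 nodes, using the figures from the table above.
# Ignores KV cache, activations, and framework overhead.

PARAMS = 671e9                 # total parameters (MoE)
BYTES_PER_PARAM = 2            # BF16
GPU_MEM_PER_NODE_GB = 8 * 80   # 8x H100 80GB per node

weights_gb = PARAMS * BYTES_PER_PARAM / 1e9   # ~1342 GB (~1.34 TB)
capacity_gb = 2 * GPU_MEM_PER_NODE_GB         # 1280 GB across 2 nodes

print(f"BF16 weights: ~{weights_gb:.0f} GB; 2-node GPU memory: {capacity_gb} GB")
# Raw weights slightly exceed 2-node capacity, which is why the table
# calls 2 nodes "tight but workable" and leans on MoE sparsity.
```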

Note: All storage pricing below is for East US region, Premium / LRS tiers, pay-as-you-go unless noted.


1. Five-Way Storage Comparison

1.1 Architecture & Protocol

| | Azure Disk CSI | Azure File CSI | Azure Blob CSI | Azure Managed Lustre (AMLFS) | Run:AI Model Streamer |
| --- | --- | --- | --- | --- | --- |
| Provisioner | `disk.csi.azure.com` | `file.csi.azure.com` | `blob.csi.azure.com` | `azurelustre.csi.azure.com` | N/A (SDK in vLLM) |
| Protocol | Block device (virtual SCSI/NVMe) | SMB 3.0 / NFS 4.1 | BlobFuse2 (FUSE) or NFS 3.0 | Lustre (kernel client) | Azure Blob SDK (REST API) |
| Access Mode | RWO only | RWX ✅ | RWX ✅ | RWX ✅ | N/A (no mount) |
| Data Path | Blob → Managed Disk → block device → VFS → vLLM → GPU | Azure Files → SMB/NFS → kernel VFS → vLLM → GPU | Blob → FUSE/NFS → VFS → vLLM → GPU | Blob (hydrated via HSM) → Lustre OSTs (SSD) → kernel client → vLLM → GPU | Blob → SDK → direct to GPU memory |
| Kernel-level I/O | ✅ Yes | ✅ Yes (NFS) / partial (SMB) | ❌ No (FUSE adds context switches) | ✅ Yes (native Lustre client) | ❌ No (userspace SDK) |
| Multi-node sharing | ❌ Single node only | ✅ Yes, but throughput is shared across clients | ✅ Yes, but throughput is shared | ✅ Designed for this (parallel striping) | ✅ Each pod streams independently |

1.2 Performance with Premium Tiers (H100 VM Context)

| | Premium SSD (P80) | Premium Files (NFS) | Premium Blob (NFS) | AMLFS (500 MB/s tier) | Run:AI Streamer |
| --- | --- | --- | --- | --- | --- |
| Max Throughput (per disk/share) | 900 MB/s (P80) | ~300 MB/s per share (Premium) | ~1–2 GB/s (Premium NFS, per storage account) | 500 MB/s per TiB (4 TiB = 2 GB/s, 16 TiB = 8 GB/s) | Limited by VM network: ~10 GB/s (80 Gbps Ethernet) |
| Effective Throughput on H100 VM | Capped at 612 MB/s (VM uncached disk limit) | ~300 MB/s (share-level limit) | Capped at 612 MB/s (VM remote-storage limit) | Up to 8+ GB/s (Lustre bypasses disk throughput caps via network mount) | ~5–10 GB/s (parallel blob SDK streams, limited by VM Ethernet) |
| Latency | Low (directly attached block device) | Medium-high (network + SMB/NFS overhead) | Medium (FUSE) to medium-low (NFS 3.0) | Very low (kernel Lustre client, parallel striping) | Low-medium (network only, no FS layer) |
| Est. Time to Load 689 GB | ~19 min (at 612 MB/s cap) | ~38 min (at 300 MB/s) | ~19 min (at 612 MB/s cap) | ~2–6 min (at 2–8 GB/s depending on cluster size) | ~1–2 min (at 5–10 GB/s parallel streams) |
| Cold Start Behavior | Must clone/attach disk first (~1–3 min overhead) | Read-through from network share; slow for large models | BlobFuse2: full download to local cache first; NFS: read-through | First access triggers hydration from blob; subsequent reads at full SSD speed | Streams immediately, no pre-download |
| Warm Start (pod restart) | Fast (data on attached disk) | Re-reads from network share | BlobFuse2: fast (local cache); NFS: re-reads | Very fast (data on Lustre SSDs, kernel cache) | Re-downloads every time |
| Multi-node Scaling | ❌ Each node needs its own disk copy | ✅ Shared, but throughput degrades per client | ✅ Shared, but throughput degrades | ✅ Best: aggregate throughput scales with cluster size | ✅ Each pod independent (N pods = N× blob bandwidth) |

Important VM Bottleneck: The Standard_ND96isr_H100_v5 has an uncached remote storage throughput cap of 612 MB/s. This means Premium SSD, Azure Files, and Blob CSI (all remote storage) are all bottlenecked at this limit regardless of the storage tier's theoretical maximum. Only AMLFS (network-mounted, not through the remote storage path) and Run:AI Streamer (pure network SDK) bypass this cap.
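
To make the load-time figures above concrete, here is the arithmetic behind them. The 2 GB/s and 8 GB/s values are assumptions drawn from the ranges in the table, not measured numbers:

```python
# Minutes to read ~689 GB of weights at a sustained throughput in MB/s.
MODEL_GB = 689

def load_minutes(throughput_mb_s: float) -> float:
    return MODEL_GB * 1000 / throughput_mb_s / 60

for name, mb_s in [
    ("VM remote-storage cap (Premium SSD / Blob CSI)", 612),
    ("Premium Files share limit", 300),
    ("AMLFS at an assumed 2 GB/s", 2000),
    ("Run:AI Streamer at an assumed 8 GB/s", 8000),
]:
    print(f"{name}: ~{load_minutes(mb_s):.0f} min")
```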

1.3 Operational Complexity

| | Premium SSD | Premium Files | Premium Blob | AMLFS | Run:AI Streamer |
| --- | --- | --- | --- | --- | --- |
| Setup Effort | Low: built-in SC in AKS | Low: built-in SC in AKS | Medium: enable blob CSI driver | High: provision AMLFS cluster, VNet peering, install CSI driver, configure blob HSM integration | High: custom vLLM Docker image, monkey-patch model_weights, workload identity for Azure Blob |
| vLLM Changes | None | None | None | None | Yes: custom image, `--load-format runai_streamer`, `az://` scheme patch |
| Model Format | Any | Any | Any | Any | Safetensors only |
| Infrastructure | Managed disks (always attached) | Managed file shares | Blob storage account | Dedicated SSD cluster (always running) | Blob storage account only |
| Maintenance | None | None | None | Quarterly maintenance windows, manage Lustre lifecycle | Maintain custom vLLM image/fork |

2. Cost Analysis

2.1 Storage Costs (for ~689 GB model, Premium LRS, East US)

| Storage Option | Size to Provision | Monthly Cost | Notes |
| --- | --- | --- | --- |
| Premium SSD (P60) | 8 TiB (smallest tier ≥ 689 GB with decent throughput) | $946/mo per disk | RWO only: need one per node. 2 nodes = $1,892/mo |
| Premium SSD (P70) | 16 TiB | $1,802/mo per disk | Higher throughput (750 MB/s). 2 nodes = $3,604/mo |
| Premium SSD (P80) | 32 TiB | $3,604/mo per disk | Max throughput (900 MB/s). 2 nodes = $7,208/mo |
| Premium Files (LRS) | 1 TiB (700 GiB provisioned) | $0.16/GiB/mo × 700 ≈ $112/mo | Shared across nodes. Low throughput (~300 MB/s). |
| Premium Block Blob | 689 GB | $0.15/GB/mo × 689 ≈ $103/mo | Cheapest storage. Used with Blob CSI or Run:AI Streamer. |
| AMLFS (40 MB/s/TiB) | 48 TiB (minimum) | $0.000114/GiB/hr × 49,152 GiB × 730 hrs ≈ $4,093/mo | Overkill capacity, lower per-TiB throughput (1.9 GB/s aggregate). |
| AMLFS (125 MB/s/TiB) | 16 TiB (minimum) | $0.000198/GiB/hr × 16,384 GiB × 730 hrs ≈ $2,369/mo | Good balance: 2 GB/s throughput. |
| AMLFS (250 MB/s/TiB) | 8 TiB (minimum) | $0.000287/GiB/hr × 8,192 GiB × 730 hrs ≈ $1,716/mo | 2 GB/s throughput, smaller footprint. |
| AMLFS (500 MB/s/TiB) | 4 TiB (minimum) | $0.000466/GiB/hr × 4,096 GiB × 730 hrs ≈ $1,394/mo | 2 GB/s throughput, highest per-TiB cost but smallest minimum. |
| Run:AI Streamer | 689 GB (Blob storage) | ~$103/mo (blob) + network egress | Cheapest. Egress: ~$0.087/GB × 689 GB ≈ $60 per full model load. |
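
The AMLFS rows can be reproduced from the hourly rates. A quick sketch (totals land within a few dollars of the table because the published per-GiB-hour rates are rounded):

```python
# Reproduce the AMLFS monthly-cost figures from the table above:
# hourly per-GiB rate x provisioned GiB x 730 hours/month.

HOURS_PER_MONTH = 730
GIB_PER_TIB = 1024

tiers = {  # tier: (USD per GiB-hour, minimum provisioned TiB)
    "40 MB/s/TiB":  (0.000114, 48),
    "125 MB/s/TiB": (0.000198, 16),
    "250 MB/s/TiB": (0.000287, 8),
    "500 MB/s/TiB": (0.000466, 4),
}

monthly = {
    tier: rate * tib * GIB_PER_TIB * HOURS_PER_MONTH
    for tier, (rate, tib) in tiers.items()
}

for tier, cost in monthly.items():
    print(f"AMLFS {tier} ({tiers[tier][1]} TiB min): ${cost:,.0f}/mo")
```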

2.2 Total Cost of Ownership (2-Node Deployment)

Assumptions

  • 2× Standard_ND96isr_H100_v5 nodes = $143,580/mo compute ($71,790 each)
  • Running 24/7 for model serving

Short-Term (1 Month — Experimentation / PoC)

| Solution | Compute | Storage | Total/Month | Notes |
| --- | --- | --- | --- | --- |
| Premium SSD (P60 × 2) | $143,580 | $1,892 | $145,472 | Simple but RWO: must duplicate the disk per node |
| Premium Files | $143,580 | $112 | $143,692 | Cheap storage, but painfully slow loading (~38 min) |
| Premium Blob CSI | $143,580 | $103 | $143,683 | Cheapest, moderate speed, FUSE overhead |
| AMLFS (500 MB/s) | $143,580 | $1,394 | $144,974 | Fast shared reads, but high setup effort for a PoC |
| Run:AI Streamer | $143,580 | $103 + ~$60 egress | $143,743 | Fastest cold start, minimal infra, but custom image needed |

Long-Term (12 Months — Production)

| Solution | Compute (12 mo) | Storage (12 mo) | Total/Year | Model Load Time | Verdict |
| --- | --- | --- | --- | --- | --- |
| Premium SSD (P60 × 2) | $1,722,960 | $22,704 | $1,745,664 | ~19 min | ❌ RWO limitation, disk per node, doesn't scale |
| Premium Files | $1,722,960 | $1,344 | $1,724,304 | ~38 min | ❌ Too slow for production model loading |
| Premium Blob CSI | $1,722,960 | $1,236 | $1,724,196 | ~19 min | 🟡 OK on cost, but FUSE overhead + VM cap |
| AMLFS (500 MB/s, 4 TiB) | $1,722,960 | $16,728 | $1,739,688 | ~2–6 min | ✅ Best for multi-node shared reads at scale |
| AMLFS (250 MB/s, 8 TiB) | $1,722,960 | $20,592 | $1,743,552 | ~2–6 min | ✅ Same throughput, more capacity for multiple models |
| Run:AI Streamer | $1,722,960 | $1,236 + ~$720 egress (~1 load/month) | $1,724,916 | ~1–2 min | ✅ Best cold start, lowest storage cost |
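
A quick check of how small the storage line items are relative to compute, using the yearly totals above:

```python
# Storage as a share of total 12-month spend for the 2-node deployment,
# using the yearly figures from the table above.

compute_year = 143_580 * 12  # $1,722,960 for two ND96isr_H100_v5 nodes

storage_year = {
    "Premium SSD (P60 x 2)": 22_704,
    "Premium Files": 1_344,
    "Premium Blob CSI": 1_236,
    "AMLFS (500 MB/s, 4 TiB)": 16_728,
    "Run:AI Streamer": 1_236 + 720,  # blob storage + budgeted egress
}

share = {
    name: cost / (compute_year + cost) * 100
    for name, cost in storage_year.items()
}

for name, pct in share.items():
    print(f"{name}: {pct:.2f}% of annual spend")
```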

3. Key Insights

Storage Cost is Negligible vs. Compute

At $143,580/mo for just 2 H100 nodes, storage amounts to less than 2% of total spend in every scenario (at most ~1.3%, for duplicated P60 disks). The decision should be driven by performance and operational fit, not storage cost.

The Real Bottleneck: VM Remote Storage Cap (612 MB/s)

The Standard_ND96isr_H100_v5 caps uncached remote disk throughput at 612 MB/s. This means:

  • Premium SSD, Azure Files, and Blob CSI are all capped at roughly the same effective throughput regardless of their storage tier.
  • Only AMLFS (Lustre network mount) and Run:AI Streamer (pure SDK/network) bypass this cap.
  • This makes the performance difference between Premium SSD, Files, and Blob CSI largely irrelevant — they're all bottlenecked by the VM.

Multi-Node Model Loading: The Critical Factor

DeepSeek R1 at 671B requires multiple nodes. This changes the calculus:

| Approach | Behavior at 2+ Nodes |
| --- | --- |
| Premium SSD | Must clone the disk per node. 2 nodes = 2 disks = 2× cost. Can't share. |
| Premium Files | Shared, but the ~300 MB/s total is split across all nodes; gets slower as nodes are added. |
| Premium Blob CSI | Shared via Blob, but each node FUSE-caches independently. 612 MB/s per-node cap. |
| AMLFS | True parallel filesystem. Each node reads from Lustre independently; aggregate throughput scales with Lustre cluster size. |
| Run:AI Streamer | Each node streams independently from Blob. Good parallelism, but each node consumes separate bandwidth. No shared-cache benefit. |
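
A toy model of this scaling behavior, under loudly stated assumptions: the Files share's ~300 MB/s is split evenly across nodes, the 612 MB/s VM cap applies per node for Blob CSI, and the streamer runs at an assumed 8 GB/s per node. AMLFS is omitted because its scaling depends on the provisioned Lustre cluster size.

```python
# Toy model of per-node weight-load time as the cluster grows, based on
# the behaviors in the table above. Assumptions, not Azure guarantees.

MODEL_GB = 689

def minutes(mb_s: float) -> float:
    return MODEL_GB * 1000 / mb_s / 60

def per_node_load_minutes(approach: str, nodes: int) -> float:
    if approach == "premium_files":
        return minutes(300 / nodes)   # one share, bandwidth split evenly
    if approach == "blob_csi":
        return minutes(612)           # VM remote-storage cap, per node
    if approach == "runai_streamer":
        return minutes(8000)          # independent parallel blob streams
    raise ValueError(approach)

for n in (1, 2, 4):
    times = {a: per_node_load_minutes(a, n)
             for a in ("premium_files", "blob_csi", "runai_streamer")}
    summary = ", ".join(f"{a}: ~{m:.0f} min" for a, m in times.items())
    print(f"{n} node(s): {summary}")
```

The Files row is the one to watch: load time grows linearly with node count, while the per-node-capped options stay flat.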

4. Recommendations

🏆 Best Overall for Production: AMLFS (500 MB/s tier, 4 TiB)

  • Why: True parallel shared filesystem, bypasses VM remote storage cap, kernel-level I/O, data hydrated from Blob stays warm on Lustre SSDs.
  • Cost: ~$1,394/mo (< 1% of compute cost).
  • Best when: Serving DeepSeek R1 across multiple nodes in steady-state production, especially if you add more models or replicas later.
  • Trade-off: Highest setup complexity (VNet peering, CSI driver, Blob HSM integration, maintenance windows).

🥈 Best for Fast Auto-Scaling / Spot: Run:AI Model Streamer

  • Why: Fastest cold start (~1-2 min), zero infrastructure beyond Blob storage, streams directly to GPU memory.
  • Cost: ~$103/mo storage + ~$60/model load egress.
  • Best when: Pods come and go (auto-scaling, spot instances), you want minimal infrastructure, and cold start time is critical.
  • Trade-off: Re-downloads on every restart, requires custom vLLM image with monkey-patching, safetensors format only.
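
Because the streamer re-downloads on every restart, egress grows with pod churn. A sketch of the monthly egress bill at the appendix's ~$0.087/GB rate (whether egress applies at all depends on topology; same-region traffic to a storage account is typically free, so treat this as the budgeted worst case):

```python
# Worst-case monthly egress cost for Run:AI Streamer re-downloads.
MODEL_GB = 689
EGRESS_PER_GB = 0.087  # first-100-TB rate from the appendix

def monthly_egress(loads_per_month: int) -> float:
    """Egress cost if every load re-reads the full model from Blob."""
    return MODEL_GB * EGRESS_PER_GB * loads_per_month

print(f"1 load/month: ~${monthly_egress(1):,.0f}")
print(f"1 load/day:   ~${monthly_egress(30):,.0f}")
```

At one load a month this is pocket change; at aggressive auto-scaling rates it can exceed the blob storage bill many times over.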

🥉 Best Budget Option: Premium Blob CSI (NFS)

  • Why: Cheapest storage ($103/mo), standard Kubernetes PVC, no custom images needed.
  • Best when: Cost-sensitive, moderate load times acceptable, development/staging environments.
  • Trade-off: Capped at 612 MB/s by VM, FUSE overhead with BlobFuse2 (NFS 3.0 is better but still limited).

❌ Avoid for This Use Case

  • Premium SSD: RWO only — cannot share across nodes. Must duplicate 689 GB per node. Doesn't scale.
  • Premium Files: Too slow (~300 MB/s shared). Acceptable only for config files and tokenizers, not model weights.

5. Hybrid Strategy (Recommended for Production)

The optimal production setup combines approaches:

  1. AMLFS for your core production model (DeepSeek R1) — always warm, shared across all nodes, fast loading.
  2. Run:AI Streamer for experimental models or burst capacity — no infrastructure commitment, streams on demand.
  3. Premium Blob as the backing store for both — cheapest long-term storage, AMLFS hydrates from it, Run:AI streams from it.
```text
┌──────────────────────────────────────────────────────┐
│                  Azure Blob Storage                  │
│            (Premium Block Blob, ~$103/mo)            │
│           ┌───────────────────────────┐              │
│           │  deepseek-r1/  (689 GB)   │              │
│           └─────────┬─────────┬───────┘              │
│                     │         │                      │
└─────────────────────┼─────────┼──────────────────────┘
                      │         │
          ┌───────────┘         └───────────┐
          │ HSM Hydration                   │ Direct SDK Stream
          ▼                                 ▼
┌───────────────────┐           ┌───────────────────────┐
│  Azure Managed    │           │   Run:AI Streamer     │
│  Lustre (4 TiB)   │           │   (in custom vLLM)    │
│  $1,394/mo        │           │   $0 infra            │
│                   │           │                       │
│  Kernel mount     │           │  Blob SDK → GPU mem   │
│  on all nodes     │           │  per-pod streaming    │
└────────┬──────────┘           └───────────┬───────────┘
         │                                  │
         ▼                                  ▼
┌──────────────────┐            ┌──────────────────────┐
│  Production pods │            │  Experimental /      │
│  (steady-state)  │            │  burst / spot pods   │
│  2-4 H100 nodes  │            │  on-demand H100s     │
└──────────────────┘            └──────────────────────┘
```

Appendix: Pricing Sources

All prices are for East US, LRS, pay-as-you-go as of March 2026.

| Item | Price | Source |
| --- | --- | --- |
| Standard_ND96isr_H100_v5 | $98.32/hr | Azure VM Pricing |
| Premium SSD P60 (8 TiB) | $946.08/mo | Azure Retail Prices API |
| Premium SSD P70 (16 TiB) | $1,802.06/mo | Azure Retail Prices API |
| Premium SSD P80 (32 TiB) | $3,604.11/mo | Azure Retail Prices API |
| Premium Files (LRS) | $0.16/GiB/mo | Azure Retail Prices API |
| Premium Block Blob (LRS) | $0.15/GB/mo | Azure Retail Prices API |
| AMLFS 40 MB/s/TiB | $0.000114/GiB/hr | Azure Retail Prices API |
| AMLFS 125 MB/s/TiB | $0.000198/GiB/hr | Azure Retail Prices API |
| AMLFS 250 MB/s/TiB | $0.000287/GiB/hr | Azure Retail Prices API |
| AMLFS 500 MB/s/TiB | $0.000466/GiB/hr | Azure Retail Prices API |
| Data egress (first 100 TB) | ~$0.087/GB | Azure Bandwidth Pricing |
