# DeepSeek R1 (671B) on Azure H100: Storage Options Comparison — Premium SSD vs Premium Files vs Premium Blob vs Azure Managed Lustre vs Run:AI Model Streamer
Minimum 2 nodes (16× H100) for BF16 full-precision inference: the model needs ~1.34 TB of GPU memory for weights in BF16, while 2 nodes provide 16 × 80 GB = 1.28 TB. That is tight but workable with MoE sparsity, since only 37B parameters are active per token. More commonly, 2–4 nodes are used for comfortable serving with KV-cache headroom.
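The sizing above is simple arithmetic; a quick sketch (the 8× 80 GB H100 per node figure assumes the Standard_ND96isr_H100_v5 used throughout this document):

```python
# Rough GPU-memory arithmetic behind the 2-node minimum (BF16 = 2 bytes/param).
params = 671e9                 # total DeepSeek R1 parameters
bf16_bytes = 2
weights_tb = params * bf16_bytes / 1e12        # ~1.34 TB of weights alone

gpus_per_node, hbm_per_gpu_gb = 8, 80          # assumed: 8× H100 80 GB per node
nodes = 2
capacity_tb = nodes * gpus_per_node * hbm_per_gpu_gb / 1000   # 1.28 TB

print(f"weights: {weights_tb:.2f} TB, 2-node HBM: {capacity_tb:.2f} TB")
```

The weights alone slightly exceed two nodes' aggregate HBM, which is why 2 nodes is described as "tight" and 2–4 as comfortable.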
*Note: All storage pricing below is for the East US region, Premium/LRS tiers, pay-as-you-go unless noted.*
*All network paths are ultimately limited by the VM network: ~10 GB/s (80 Gbps Ethernet).*
| Metric | Premium SSD | Premium Files | Premium Blob | AMLFS | Run:AI Model Streamer |
|---|---|---|---|---|---|
| Effective Throughput on H100 VM | Capped at 612 MB/s (VM uncached disk limit) | ~300 MB/s (share-level limit) | ~1–2 GB/s (but VM capped at 612 MB/s for remote storage) | Up to 8+ GB/s (Lustre bypasses disk throughput caps via network mount) | ~5–10 GB/s (parallel blob SDK streams, limited by VM Ethernet) |
| Latency | Low (locally attached SSD) | Medium–High (network + SMB/NFS overhead) | Medium (FUSE) to Medium–Low (NFS 3.0) | Very Low (kernel Lustre client, parallel striping) | Low–Medium (network only, no FS layer) |
| Est. Time to Load 689 GB | ~19 min (at 612 MB/s cap) | ~38 min (at 300 MB/s) | ~19 min (at 612 MB/s cap) | ~2–6 min (at 2–8 GB/s depending on cluster size) | ~1–2 min (at 5–10 GB/s parallel streams) |
| Cold Start Behavior | Must clone/attach disk first (~1–3 min overhead) | Read-through from network share; slow for large models | BlobFuse2: full download to local cache first; NFS: read-through | First access triggers hydration from Blob; subsequent reads at full SSD speed | Streams immediately; no pre-download |
| Warm Start (pod restart) | Fast (data on attached disk) | Re-reads from network share | BlobFuse2: fast (local cache); NFS: re-reads | Very fast (data on Lustre SSDs, kernel cache) | Re-downloads every time |
| Multi-node Scaling | ❌ Each node needs its own disk copy | ✅ Shared, but throughput degrades per client | ✅ Shared, but throughput degrades | ✅ Best: aggregate throughput scales with cluster size | ✅ Each pod independent (N pods = N× blob bandwidth) |
**Important VM bottleneck:** the Standard_ND96isr_H100_v5 caps uncached remote storage throughput at 612 MB/s. Premium SSD, Azure Files, and Blob CSI (all remote storage) are therefore bottlenecked at this limit regardless of the storage tier's theoretical maximum. Only AMLFS (network-mounted, not on the remote-storage path) and Run:AI Streamer (pure network SDK) bypass the cap.
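The load-time estimates in the comparison follow directly from checkpoint size divided by effective throughput; a quick sketch (the 689 GB checkpoint size and throughput figures come from this document, not measurements):

```python
# Back-of-envelope load-time estimates for a 689 GB checkpoint at each
# storage path's effective throughput.
MODEL_GB = 689

def load_minutes(throughput_mb_s: float, size_gb: float = MODEL_GB) -> float:
    """Minutes to read `size_gb` gigabytes at `throughput_mb_s` MB/s."""
    return (size_gb * 1000 / throughput_mb_s) / 60

for name, mb_s in [
    ("Premium SSD / Blob CSI (612 MB/s VM cap)", 612),
    ("Premium Files (300 MB/s share limit)", 300),
    ("AMLFS (2 GB/s aggregate)", 2000),
    ("Run:AI Streamer (10 GB/s parallel)", 10000),
]:
    print(f"{name}: ~{load_minutes(mb_s):.1f} min")
```

This reproduces the ~19 min, ~38 min, ~2–6 min, and ~1–2 min figures quoted in the table.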
| Solution | Compute (1 mo) | Storage (1 mo) | Total/Month | Verdict |
|---|---|---|---|---|
| Premium Files | $143,580 | — | — | Cheapest storage, but painfully slow loading (~38 min) |
| Premium Blob CSI | $143,580 | $103 | $143,683 | Cheapest, moderate speed, FUSE overhead |
| AMLFS (500 MB/s) | $143,580 | $1,394 | $144,974 | Fast shared reads, but high setup effort for a PoC |
| Run:AI Streamer | $143,580 | $103 + ~$60 egress | $143,743 | Fastest cold start, minimal infra, but custom image needed |
### Long-Term (12 Months, Production)

| Solution | Compute (12 mo) | Storage (12 mo) | Total/Year | Model Load Time | Verdict |
|---|---|---|---|---|---|
| Premium SSD (P60 × 2) | $1,722,960 | $22,704 | $1,745,664 | ~19 min | ❌ RWO limitation, disk per node, doesn't scale |
| Premium Files | $1,722,960 | $1,344 | $1,724,304 | ~38 min | ❌ Too slow for production model loading |
| Premium Blob CSI | $1,722,960 | $1,236 | $1,724,196 | ~19 min | 🟡 OK for cost, but FUSE overhead + VM cap |
| AMLFS (500 MB/s, 4 TiB) | $1,722,960 | $16,728 | $1,739,688 | ~2–6 min | ✅ Best for multi-node shared reads at scale |
| AMLFS (250 MB/s, 8 TiB) | $1,722,960 | $20,592 | $1,743,552 | ~2–6 min | ✅ Same aggregate throughput, more capacity for multiple models |
| Run:AI Streamer | $1,722,960 | $1,236 + ~$720 egress (1 load/day) | $1,724,916 | ~1–2 min | ✅ Best cold start, lowest storage cost |
## 3. Key Insights

### Storage Cost Is Negligible vs. Compute
At $143,580/month in compute for just 2 H100 nodes, storage is under 1.5% of total spend in every scenario above (at most ~1.3%, for Premium SSD). The decision should be driven by performance and operational fit, not storage cost.
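The "negligible" claim is easy to verify from the 12-month table above:

```python
# Storage as a share of total annual spend, using the 12-month figures.
compute = 1_722_960  # 2 nodes × $143,580/mo × 12

storage = {
    "Premium SSD (P60 × 2)": 22_704,
    "Premium Files": 1_344,
    "Premium Blob CSI": 1_236,
    "AMLFS (500 MB/s, 4 TiB)": 16_728,
    "Run:AI Streamer": 1_236 + 720,   # blob + egress
}

for name, cost in storage.items():
    pct = 100 * cost / (compute + cost)
    print(f"{name}: {pct:.2f}% of total spend")
```

Even the worst case (Premium SSD) is ~1.3% of total annual spend.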
### The Real Bottleneck: VM Remote Storage Cap (612 MB/s)

The Standard_ND96isr_H100_v5 caps uncached remote disk throughput at 612 MB/s. This means:

- Premium SSD, Azure Files, and Blob CSI are all capped at roughly the same effective throughput regardless of their storage tier.
- Only AMLFS (Lustre network mount) and Run:AI Streamer (pure SDK/network) bypass this cap.
- The performance difference between Premium SSD, Files, and Blob CSI is therefore largely irrelevant; they're all bottlenecked by the VM.
Multi-Node Model Loading: The Critical Factor
DeepSeek R1 at 671B requires multiple nodes. This changes the calculus:
| Approach | Behavior at 2+ Nodes |
|---|---|
| Premium SSD | Must clone disk per node. 2 nodes = 2 disks = 2× cost. Can't share. |
| Premium Files | Shared, but ~300 MB/s total across all nodes. Slower as nodes increase. |
| Premium Blob CSI | Shared via Blob, but each node FUSE-caches independently. 612 MB/s per-node cap. |
| AMLFS | True parallel filesystem. Each node reads from Lustre at full speed independently. Aggregate throughput scales with Lustre cluster size. |
| Run:AI Streamer | Each node streams independently from Blob. Good parallelism, but each node consumes separate bandwidth. No shared cache benefit. |
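The streamer approach amounts to splitting a checkpoint into byte ranges and fetching them concurrently, so each node can saturate its NIC instead of a single serial stream. A minimal sketch of the idea, using local file reads as a stand-in for blob range GETs (chunk size and worker count are illustrative, not Run:AI's actual defaults):

```python
import os
from concurrent.futures import ThreadPoolExecutor

def read_range(path: str, offset: int, length: int) -> tuple[int, bytes]:
    """One 'stream': fetch a single byte range (stand-in for a blob range GET)."""
    with open(path, "rb") as f:
        f.seek(offset)
        return offset, f.read(length)

def parallel_load(path: str, chunk_size: int = 1 << 20, workers: int = 8) -> bytes:
    """Fetch all ranges concurrently, then reassemble them in offset order."""
    size = os.path.getsize(path)
    offsets = range(0, size, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(lambda off: read_range(path, off, chunk_size), offsets))
    return b"".join(chunk for _, chunk in sorted(parts))
```

In the real streamer case the per-range reads go straight from Blob into pinned buffers, which is why N pods get N× blob bandwidth with no shared cache.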
## 4. Recommendations

### 🏆 Best Overall for Production: AMLFS (500 MB/s tier, 4 TiB)

- **Why:** True parallel shared filesystem, bypasses the VM remote storage cap, kernel-level I/O, and data hydrated from Blob stays warm on Lustre SSDs.
- **Cost:** ~$1,394/mo (< 1% of compute cost).
- **Best when:** Serving DeepSeek R1 across multiple nodes in steady-state production, especially if you add more models or replicas later.
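To consume an AMLFS filesystem from AKS pods as a shared `ReadWriteMany` volume, static provisioning against an existing AMLFS cluster looks roughly like the sketch below. The driver name and `volumeAttributes` keys follow my recollection of the kubernetes-sigs `azurelustre-csi-driver`, and the filesystem name and MGS IP are placeholders; verify all of them against the driver version you deploy.

```yaml
# Sketch: static PV/PVC for an existing AMLFS cluster via the azurelustre CSI
# driver. Driver/attribute names are assumptions; check the driver's README.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: amlfs-models
spec:
  capacity:
    storage: 4Ti
  accessModes: ["ReadWriteMany"]       # shared across all inference nodes
  csi:
    driver: azurelustre.csi.azure.com
    volumeHandle: amlfs-models         # any unique ID
    volumeAttributes:
      fs-name: lustrefs                # placeholder: AMLFS filesystem name
      mgs-ip-address: 10.0.0.4         # placeholder: MGS IP of the AMLFS resource
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: amlfs-models-claim
spec:
  accessModes: ["ReadWriteMany"]
  storageClassName: ""
  volumeName: amlfs-models
  resources:
    requests:
      storage: 4Ti
```

Every inference pod then mounts `amlfs-models-claim`, so all nodes read the same hydrated copy of the checkpoint in parallel.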