# DeepSeek R1 (671B) on Azure H100: Storage Options Comparison — Premium SSD vs Premium Files vs Premium Blob vs Azure Managed Lustre vs Run:AI Model Streamer
Minimum 2 nodes (16× H100) for BF16 full-precision inference: the model needs ~1.34 TB of GPU memory for weights in BF16, while 2 nodes provide 16 × 80 GB = 1.28 TB. That is tight but workable with MoE sparsity, since only 37B parameters are active per token. More commonly, 2–4 nodes are used for comfortable serving with KV-cache headroom.
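The sizing above is simple arithmetic; a quick sketch (the 8× 80 GB H100 per node figure assumes the Standard_ND96isr_H100_v5 used throughout this document):

```python
# Rough GPU-memory arithmetic behind the 2-node minimum (BF16 = 2 bytes/param).
params = 671e9                 # total DeepSeek R1 parameters
bf16_bytes = 2
weights_tb = params * bf16_bytes / 1e12        # ~1.34 TB of weights alone

gpus_per_node, hbm_per_gpu_gb = 8, 80          # assumed: 8× H100 80 GB per node
nodes = 2
capacity_tb = nodes * gpus_per_node * hbm_per_gpu_gb / 1000   # 1.28 TB

print(f"weights: {weights_tb:.2f} TB, 2-node HBM: {capacity_tb:.2f} TB")
```

The weights alone slightly exceed two nodes' aggregate HBM, which is why 2 nodes is described as "tight" and 2–4 as comfortable.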
*Note: All storage pricing below is for the East US region, Premium/LRS tiers, pay-as-you-go unless noted.*
*All network paths are ultimately limited by the VM network: ~10 GB/s (80 Gbps Ethernet).*
| Metric | Premium SSD | Premium Files | Premium Blob | AMLFS | Run:AI Model Streamer |
|---|---|---|---|---|---|
| Effective Throughput on H100 VM | Capped at 612 MB/s (VM uncached disk limit) | ~300 MB/s (share-level limit) | ~1–2 GB/s (but VM capped at 612 MB/s for remote storage) | Up to 8+ GB/s (Lustre bypasses disk throughput caps via network mount) | ~5–10 GB/s (parallel blob SDK streams, limited by VM Ethernet) |
| Latency | Low (locally attached SSD) | Medium–High (network + SMB/NFS overhead) | Medium (FUSE) to Medium–Low (NFS 3.0) | Very Low (kernel Lustre client, parallel striping) | Low–Medium (network only, no FS layer) |
| Est. Time to Load 689 GB | ~19 min (at 612 MB/s cap) | ~38 min (at 300 MB/s) | ~19 min (at 612 MB/s cap) | ~2–6 min (at 2–8 GB/s depending on cluster size) | ~1–2 min (at 5–10 GB/s parallel streams) |
| Cold Start Behavior | Must clone/attach disk first (~1–3 min overhead) | Read-through from network share; slow for large models | BlobFuse2: full download to local cache first; NFS: read-through | First access triggers hydration from Blob; subsequent reads at full SSD speed | Streams immediately; no pre-download |
| Warm Start (pod restart) | Fast (data on attached disk) | Re-reads from network share | BlobFuse2: fast (local cache); NFS: re-reads | Very fast (data on Lustre SSDs, kernel cache) | Re-downloads every time |
| Multi-node Scaling | ❌ Each node needs its own disk copy | ✅ Shared, but throughput degrades per client | ✅ Shared, but throughput degrades | ✅ Best: aggregate throughput scales with cluster size | ✅ Each pod independent (N pods = N× blob bandwidth) |
**Important VM bottleneck:** the Standard_ND96isr_H100_v5 caps uncached remote storage throughput at 612 MB/s. Premium SSD, Azure Files, and Blob CSI (all remote storage) are therefore bottlenecked at this limit regardless of the storage tier's theoretical maximum. Only AMLFS (network-mounted, not on the remote-storage path) and Run:AI Streamer (pure network SDK) bypass the cap.
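The load-time estimates in the comparison follow directly from checkpoint size divided by effective throughput; a quick sketch (the 689 GB checkpoint size and throughput figures come from this document, not measurements):

```python
# Back-of-envelope load-time estimates for a 689 GB checkpoint at each
# storage path's effective throughput.
MODEL_GB = 689

def load_minutes(throughput_mb_s: float, size_gb: float = MODEL_GB) -> float:
    """Minutes to read `size_gb` gigabytes at `throughput_mb_s` MB/s."""
    return (size_gb * 1000 / throughput_mb_s) / 60

for name, mb_s in [
    ("Premium SSD / Blob CSI (612 MB/s VM cap)", 612),
    ("Premium Files (300 MB/s share limit)", 300),
    ("AMLFS (2 GB/s aggregate)", 2000),
    ("Run:AI Streamer (10 GB/s parallel)", 10000),
]:
    print(f"{name}: ~{load_minutes(mb_s):.1f} min")
```

This reproduces the ~19 min, ~38 min, ~2–6 min, and ~1–2 min figures quoted in the table.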
| Solution | Compute (1 mo) | Storage (1 mo) | Total/Month | Verdict |
|---|---|---|---|---|
| Premium Files | $143,580 | — | — | Cheapest storage, but painfully slow loading (~38 min) |
| Premium Blob CSI | $143,580 | $103 | $143,683 | Cheapest, moderate speed, FUSE overhead |
| AMLFS (500 MB/s) | $143,580 | $1,394 | $144,974 | Fast shared reads, but high setup effort for a PoC |
| Run:AI Streamer | $143,580 | $103 + ~$60 egress | $143,743 | Fastest cold start, minimal infra, but custom image needed |
### Long-Term (12 Months, Production)

| Solution | Compute (12 mo) | Storage (12 mo) | Total/Year | Model Load Time | Verdict |
|---|---|---|---|---|---|
| Premium SSD (P60 × 2) | $1,722,960 | $22,704 | $1,745,664 | ~19 min | ❌ RWO limitation, disk per node, doesn't scale |
| Premium Files | $1,722,960 | $1,344 | $1,724,304 | ~38 min | ❌ Too slow for production model loading |
| Premium Blob CSI | $1,722,960 | $1,236 | $1,724,196 | ~19 min | 🟡 OK for cost, but FUSE overhead + VM cap |
| AMLFS (500 MB/s, 4 TiB) | $1,722,960 | $16,728 | $1,739,688 | ~2–6 min | ✅ Best for multi-node shared reads at scale |
| AMLFS (250 MB/s, 8 TiB) | $1,722,960 | $20,592 | $1,743,552 | ~2–6 min | ✅ Same aggregate throughput, more capacity for multiple models |
| Run:AI Streamer | $1,722,960 | $1,236 + ~$720 egress (1 load/day) | $1,724,916 | ~1–2 min | ✅ Best cold start, lowest storage cost |
## 3. Key Insights

### Storage Cost Is Negligible vs. Compute
At $143,580/month in compute for just 2 H100 nodes, storage is under 1.5% of total spend in every scenario above (at most ~1.3%, for Premium SSD). The decision should be driven by performance and operational fit, not storage cost.
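The "negligible" claim is easy to verify from the 12-month table above:

```python
# Storage as a share of total annual spend, using the 12-month figures.
compute = 1_722_960  # 2 nodes × $143,580/mo × 12

storage = {
    "Premium SSD (P60 × 2)": 22_704,
    "Premium Files": 1_344,
    "Premium Blob CSI": 1_236,
    "AMLFS (500 MB/s, 4 TiB)": 16_728,
    "Run:AI Streamer": 1_236 + 720,   # blob + egress
}

for name, cost in storage.items():
    pct = 100 * cost / (compute + cost)
    print(f"{name}: {pct:.2f}% of total spend")
```

Even the worst case (Premium SSD) is ~1.3% of total annual spend.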
### The Real Bottleneck: VM Remote Storage Cap (612 MB/s)

The Standard_ND96isr_H100_v5 caps uncached remote disk throughput at 612 MB/s. This means:

- Premium SSD, Azure Files, and Blob CSI are all capped at roughly the same effective throughput regardless of their storage tier.
- Only AMLFS (Lustre network mount) and Run:AI Streamer (pure SDK/network) bypass this cap.
- The performance difference between Premium SSD, Files, and Blob CSI is therefore largely irrelevant; they're all bottlenecked by the VM.
Multi-Node Model Loading: The Critical Factor
DeepSeek R1 at 671B requires multiple nodes. This changes the calculus:
| Approach | Behavior at 2+ Nodes |
|---|---|
| Premium SSD | Must clone disk per node. 2 nodes = 2 disks = 2× cost. Can't share. |
| Premium Files | Shared, but ~300 MB/s total across all nodes. Slower as nodes increase. |
| Premium Blob CSI | Shared via Blob, but each node FUSE-caches independently. 612 MB/s per-node cap. |
| AMLFS | True parallel filesystem. Each node reads from Lustre at full speed independently. Aggregate throughput scales with Lustre cluster size. |
| Run:AI Streamer | Each node streams independently from Blob. Good parallelism, but each node consumes separate bandwidth. No shared cache benefit. |
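The streamer approach amounts to splitting a checkpoint into byte ranges and fetching them concurrently, so each node can saturate its NIC instead of a single serial stream. A minimal sketch of the idea, using local file reads as a stand-in for blob range GETs (chunk size and worker count are illustrative, not Run:AI's actual defaults):

```python
import os
from concurrent.futures import ThreadPoolExecutor

def read_range(path: str, offset: int, length: int) -> tuple[int, bytes]:
    """One 'stream': fetch a single byte range (stand-in for a blob range GET)."""
    with open(path, "rb") as f:
        f.seek(offset)
        return offset, f.read(length)

def parallel_load(path: str, chunk_size: int = 1 << 20, workers: int = 8) -> bytes:
    """Fetch all ranges concurrently, then reassemble them in offset order."""
    size = os.path.getsize(path)
    offsets = range(0, size, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(lambda off: read_range(path, off, chunk_size), offsets))
    return b"".join(chunk for _, chunk in sorted(parts))
```

In the real streamer case the per-range reads go straight from Blob into pinned buffers, which is why N pods get N× blob bandwidth with no shared cache.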
## 4. Recommendations

### 🏆 Best Overall for Production: AMLFS (500 MB/s tier, 4 TiB)

- **Why:** True parallel shared filesystem, bypasses the VM remote storage cap, kernel-level I/O, and data hydrated from Blob stays warm on Lustre SSDs.
- **Cost:** ~$1,394/mo (< 1% of compute cost).
- **Best when:** Serving DeepSeek R1 across multiple nodes in steady-state production, especially if you add more models or replicas later.
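To consume an AMLFS filesystem from AKS pods as a shared `ReadWriteMany` volume, static provisioning against an existing AMLFS cluster looks roughly like the sketch below. The driver name and `volumeAttributes` keys follow my recollection of the kubernetes-sigs `azurelustre-csi-driver`, and the filesystem name and MGS IP are placeholders; verify all of them against the driver version you deploy.

```yaml
# Sketch: static PV/PVC for an existing AMLFS cluster via the azurelustre CSI
# driver. Driver/attribute names are assumptions; check the driver's README.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: amlfs-models
spec:
  capacity:
    storage: 4Ti
  accessModes: ["ReadWriteMany"]       # shared across all inference nodes
  csi:
    driver: azurelustre.csi.azure.com
    volumeHandle: amlfs-models         # any unique ID
    volumeAttributes:
      fs-name: lustrefs                # placeholder: AMLFS filesystem name
      mgs-ip-address: 10.0.0.4         # placeholder: MGS IP of the AMLFS resource
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: amlfs-models-claim
spec:
  accessModes: ["ReadWriteMany"]
  storageClassName: ""
  volumeName: amlfs-models
  resources:
    requests:
      storage: 4Ti
```

Every inference pod then mounts `amlfs-models-claim`, so all nodes read the same hydrated copy of the checkpoint in parallel.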