ramdisk is a scaling primitive for local agent orchestration

local development used to be human-paced: one developer, one editor, occasional builds and tests. consumer ssd endurance assumptions were built around that pattern.

agent-driven development changes the load profile. when you run 4-32 local agents in parallel, each doing build, test, validation, and coding loops, write pressure scales horizontally just like cpu and memory demand.

graph LR;
    a["agent count"] --> e["daily host writes"];
    b["cycles per agent per day"] --> e;
    c["logical writes per cycle"] --> e;
    d["host overhead factor"] --> e;
    e --> f["annual host writes"];
    f --> g["TBW budget check"];
    g -->|over| h["move hot paths to ramdisk"];
    g -->|within| i["safe headroom to scale"];

scaling math

use a simple model:

daily_host_writes_GB = agents * cycles_per_agent_per_day * logical_writes_per_cycle_GB * host_overhead_factor

optional (for nand-wear reasoning, not TBW comparison):

daily_nand_writes_GB = daily_host_writes_GB * device_wa_factor

for continuous operation, define:

cycles_per_agent_per_day = cycles_per_hour * 24

typical write ranges per cycle in real dev loops:

| task | write range per cycle |
| --- | --- |
| build artifacts and object files | 1-6 GB |
| test temp files and coverage/log output | 0.2-2 GB |
| ml numerical correctness validation (intermediate tensors, traces, eval output) | 1-10 GB |
| coding overhead (indexing, logs, git/object churn) | 0.05-0.3 GB |

moderate orchestrated setup:

  • agents = 8
  • cycles_per_agent_per_day = 20
  • logical_writes_per_cycle_GB = 5
  • host_overhead_factor = 1.3 (filesystem metadata, journaling, copy-on-write overhead)

result:

  • daily_host_writes_GB = 8 * 20 * 5 * 1.3 = 1040 GB/day
  • annual_host_writes_TB ~= 380 TB/year
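
to make the arithmetic easy to rerun, here is a minimal python sketch of the model above (function names are illustrative, not from any library); plugging in the moderate setup reproduces the same numbers:

# minimal sketch of the write model; names are illustrative
def daily_host_writes_gb(agents, cycles_per_agent_per_day,
                         logical_writes_per_cycle_gb, host_overhead_factor):
    return (agents * cycles_per_agent_per_day
            * logical_writes_per_cycle_gb * host_overhead_factor)

def annual_host_writes_tb(daily_gb):
    # decimal units: 1 TB = 1000 GB
    return daily_gb * 365 / 1000

if __name__ == "__main__":
    daily = daily_host_writes_gb(agents=8, cycles_per_agent_per_day=20,
                                 logical_writes_per_cycle_gb=5,
                                 host_overhead_factor=1.3)
    print(f"daily host writes: {daily:.0f} GB/day")                            # 1040 GB/day
    print(f"annual host writes: {annual_host_writes_tb(daily):.0f} TB/year")   # ~380 TB/year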

compare that with endurance. if a 1 TB drive is rated at 600 TBW, a 10-year budget is:

600 TB / 3650 days ~= 0.164 TB/day ~= 164 GB/day

TBW means terabytes written: the total cumulative host writes the drive is rated to absorb over its warranted endurance life.

TBW is a warranty/rating figure, not a hard failure cliff. real endurance varies with workload and operating conditions, but TBW is still a useful planning budget.

consumer warranties are often much shorter than 10 years; the 10-year view is a planning horizon, not a warranty promise.

explicit 24/7 lifetime math

to make the continuous effect concrete, assume:

  • each agent runs 2 cycles/hour continuously
  • each cycle writes 2.5 GB
  • host_overhead_factor = 1.3
  • drive endurance is 600 TBW (common consumer class)

then:

  • cycles_per_agent_per_day = 2 * 24 = 48
  • daily_host_writes_per_agent_GB = 48 * 2.5 * 1.3 = 156 GB/day
  • daily_host_writes_GB = agents * daily_host_writes_per_agent_GB
  • ssd_lifetime_years = (drive_TBW * 1000) / (daily_host_writes_GB * 365)

note: calculations here use decimal storage units (the same convention used by drive vendors): 1 TB = 1000 GB.

when comparing to TBW, use host writes. do not multiply by device_wa_factor for the TBW check.

| agents (24/7) | daily writes | annual writes | lifetime for 600 TBW ssd |
| --- | --- | --- | --- |
| 1 | 156 GB/day | 56.9 TB/year | 10.5 years |
| 4 | 624 GB/day | 227.8 TB/year | 2.6 years |
| 8 | 1248 GB/day | 455.5 TB/year | 1.3 years |
| 16 | 2496 GB/day | 911.0 TB/year | 0.66 years (~8 months) |
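
the table can be regenerated with a short python sketch using exactly the stated assumptions (constants below are those assumptions, not measurements):

# reproduce the 24/7 lifetime table; constants are the stated assumptions
CYCLES_PER_HOUR = 2
WRITES_PER_CYCLE_GB = 2.5
HOST_OVERHEAD_FACTOR = 1.3
DRIVE_TBW = 600  # terabytes written rating, decimal TB

per_agent_daily_gb = CYCLES_PER_HOUR * 24 * WRITES_PER_CYCLE_GB * HOST_OVERHEAD_FACTOR  # 156 GB/day

for agents in (1, 4, 8, 16):
    daily_gb = agents * per_agent_daily_gb
    annual_tb = daily_gb * 365 / 1000
    lifetime_years = DRIVE_TBW * 1000 / (daily_gb * 365)
    print(f"{agents:>2} agents: {daily_gb:.0f} GB/day, "
          f"{annual_tb:.1f} TB/year, {lifetime_years:.2f} years to 600 TBW")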

the key point: under a fixed per-agent workload, lifetime scales inversely with agent count (if agents double, expected lifetime roughly halves). contention, cache behavior, and io throttling can bend real-world results above or below this baseline.

why ramdisk matters

ramdisk (tmpfs) shifts high-churn ephemeral writes into dram:

  • near-zero ssd wear for throwaway artifacts
  • lower latency for compile/test loops
  • less io queue contention under concurrent agents

this is no longer a micro-optimization. it is a durability and throughput control. tmpfs is volatile and memory-backed, so size it conservatively to avoid swap pressure that can reintroduce ssd writes.

implicit vs explicit ramdisk

both ubuntu and macos already use an implicit in-memory file cache (page cache / unified buffer cache). this helps performance, but it is not the same as an explicit ramdisk mount.

implicit (os-managed cache):

  • reads and writes are often served from ram first
  • dirty pages are often flushed later (writeback, journal checkpoints, fsync), especially for longer-lived files
  • eviction is global and workload-agnostic under memory pressure
  • files deleted before writeback may avoid full data flush, but metadata/journal traffic still tends to persist
  • sustained high-churn paths can still produce substantial long-term TBW consumption

explicit (tmpfs or mounted ramdisk):

  • writes to that mount are memory-backed by design
  • per-path control is deterministic (TMPDIR, CARGO_TARGET_DIR, SCCACHE_DIR)
  • easy per-agent isolation (/mnt/ramdisk/agent-<id>)
  • hard quotas via per-agent mounts (tmpfs -o size=...) or cgroup/systemd memory limits
  • predictable cleanup by unmount/delete

why explicit ramdisk with sccache:

  • page cache can mask latency, but it does not guarantee write avoidance on persistent filesystems
  • the biggest write sink is often build output materialization (CARGO_TARGET_DIR) even on cache hits
  • with many continuous agents, local sccache metadata/object churn can still create steady writeback pressure
  • setting SCCACHE_DIR to explicit ramdisk makes local cache-write avoidance deterministic for hot entries

practical guidance:

  • if reboot persistence matters most, keep SCCACHE_DIR on ssd (or remote backend), and put build/test scratch on ramdisk
  • if minimizing local ssd wear matters most, place SCCACHE_DIR on ramdisk with a bounded size (SCCACHE_CACHE_SIZE) and accept cache loss on reboot
  • best of both: keep a hot local ramdisk tier and use a remote sccache backend for durability/sharing
graph TB;
    subgraph p[persistent ssd tier]
      s1["source repos"];
      s2["dependency cache kept across reboots"];
      s3["final artifacts"];
    end

    subgraph r[ramdisk tmpfs tier]
      r1["build scratch"];
      r2["test temp and coverage scratch"];
      r3["ml intermediates and throwaway checkpoints"];
      r4["agent temp and logs"];
    end

    a1["agent 1"] --> r1;
    a2["agent 2"] --> r2;
    r1 --> s3;
    r2 --> s3;

practical storage tiering

keep persistent/reproducible state on ssd:

  • source repos
  • dependency caches you want to keep
  • final artifacts

move high-churn ephemeral paths to ramdisk:

  • build scratch (target, dist, temp object dirs)
  • test temp + coverage scratch
  • ml validation intermediates
  • agent-local temp/log workdirs

minimal setup pattern:

AGENT_ID=${AGENT_ID:-0}

# mount a dedicated tmpfs; keep the size well below free ram to avoid swap pressure
sudo mkdir -p /mnt/ramdisk
sudo mount -t tmpfs -o size=24G,mode=1777 tmpfs /mnt/ramdisk

# per-agent scratch directories and env overrides
mkdir -p /mnt/ramdisk/agent-$AGENT_ID/{tmp,target,sccache}
export TMPDIR=/mnt/ramdisk/agent-$AGENT_ID/tmp
export CARGO_TARGET_DIR=/mnt/ramdisk/agent-$AGENT_ID/target

# optional: memory-backed local sccache (bounded size, lost on reboot)
export SCCACHE_DIR=/mnt/ramdisk/agent-$AGENT_ID/sccache
# export SCCACHE_CACHE_SIZE=10G

for multiple agents, set a unique AGENT_ID per worker to prevent cross-agent contention.

note: some systems already mount /tmp as tmpfs. a dedicated mount is still useful for deterministic sizing and per-agent isolation.

verify with real counters

validate the model with host-write counters during a normal agent run:

  • nvme: compare data_units_written before/after (nvme smart-log)
  • smart: compare host-write attributes before/after (smartctl -A)

using counter deltas helps calibrate host_overhead_factor for your actual workload.
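
a small helper sketch for the nvme case: it converts data_units_written deltas to GB (the nvme convention is 1000 x 512-byte units per count) and derives an overhead factor; the counter values and the 100 GB logical figure below are placeholders, not measurements:

# sketch: turn nvme data_units_written deltas into GB and calibrate host_overhead_factor
# nvme smart-log reports data_units_written in units of 1000 * 512 bytes (512,000 bytes)
NVME_DATA_UNIT_BYTES = 1000 * 512

def host_writes_gb(data_units_before, data_units_after):
    return (data_units_after - data_units_before) * NVME_DATA_UNIT_BYTES / 1e9

def calibrate_overhead_factor(measured_host_gb, logical_writes_gb):
    # logical_writes_gb: what the agents believe they wrote (build/test/validation output)
    return measured_host_gb / logical_writes_gb

if __name__ == "__main__":
    # placeholder counter readings; take them from nvme smart-log before and after a run
    before, after = 1_000_000, 1_250_000
    measured = host_writes_gb(before, after)   # 128 GB of host writes
    print(f"measured host writes: {measured:.1f} GB")
    print(f"host_overhead_factor ~= {calibrate_overhead_factor(measured, 100.0):.2f}")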

rule of thumb

before increasing local agent count, compute projected writes and compare with a 10-year daily budget:

daily_budget_GB ~= (drive_TBW * 1000) / 3650

if projected writes are already a large fraction of that budget, add ramdisk first and then scale agents.
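
the same check as a tiny python sketch (the 0.5 "large fraction" threshold is illustrative, not a standard):

# sketch of the pre-scale check: projected writes vs a 10-year daily TBW budget
def daily_budget_gb(drive_tbw):
    return drive_tbw * 1000 / 3650       # decimal units, 10-year horizon

def add_ramdisk_first(projected_daily_gb, drive_tbw, max_fraction=0.5):
    # max_fraction is an illustrative threshold for "a large fraction of the budget"
    return projected_daily_gb > max_fraction * daily_budget_gb(drive_tbw)

print(f"{daily_budget_gb(600):.0f} GB/day budget")   # ~164 GB/day
print(add_ramdisk_first(1040, 600))                   # True: tier storage before scaling agents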

in agent-native development, horizontal scaling without storage tiering is a hidden reliability bug.
