A phase-gated bootstrap guide for running large models across two Apple Silicon nodes using EXO. Battle-tested on Mac Studio M4 Max (128 GB unified) pairs over Thunderbolt 5.
Reference performance (Qwen3.5-122B-A10B-4bit):
- ~52 tok/s sustained at 512–1024 token range
- Stable at concurrency=2 (p95 ~10.37s)
Use this whenever a new model drops on Hugging Face. It is the fastest path when you want to avoid repeating filesystem + startup churn.
./model-drop organization/model-name

This runs:
- Pre-ring artifact audit (`release_orchestrator.sh`)
- Two-node bootstrap handoff when the audit passes
One-time setup for defaults:
- Edit `config/model_drop.env` once
Speed options:
- `./model-drop organization/model-name --fast` (strict-only checks become warnings)
- `./model-drop organization/model-name --resume-bootstrap` (skip the re-audit, rerun the handoff)
- `./model-drop-latest --resume-bootstrap` (reuse the most recent model id automatically)
- `./model-drop organization/model-name --online` (force `EXO_OFFLINE=false` for cluster join)
- Prepare both machines (tools, paths, disk).
- Download model directly on Node 1.
- Verify Node 1 is complete + path-correct.
- Run Node 1 local readiness canary.
- Download model directly on Node 2.
- Verify Node 2 is complete + path-correct.
- Run two-node EXO startup and transport gates.
- Create instance and only then run the first chat.
This catches the biggest source of failure first: one node is incomplete or misconfigured while the other appears fine.
- One direct download per node. No rsync/sync-to-other-node handoffs.
- Each phase must pass before moving on.
- No transport/inference work until both nodes pass file integrity checks.
- Use explicit commands and pass/fail checkpoints, not process-count heuristics.
If you are a new agent/operator inheriting this run, do this in order:
- Confirm target nodes + model id + model dirs are correct on both hosts.
- Confirm model artifacts are complete on both hosts before any EXO create/placement.
- Confirm cluster join (`topology.nodes=2`) before any benchmark.
- Confirm serving truth via an actual `/v1/chat/completions` call, not only `/state` runner maps.
- Only then run long benchmarks.
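The join check in this list can be separated into a pure predicate that is testable without a live cluster. A minimal sketch, assuming the `/state` payload exposes a `topology.nodes` count as referenced above (helper names are illustrative):

```python
import json
import urllib.request

def cluster_joined(state: dict, expected_nodes: int = 2) -> bool:
    """True only when the reported topology has at least the expected node count."""
    return state.get("topology", {}).get("nodes", 0) >= expected_nodes

def fetch_state(base: str = "http://127.0.0.1:52415") -> dict:
    """Fetch /state from the local EXO API (port should match EXO_API_PORT)."""
    with urllib.request.urlopen(base + "/state", timeout=8) as r:
        return json.loads(r.read())

# The pure predicate is testable offline:
print(cluster_joined({"topology": {"nodes": 2}}))  # True
print(cluster_joined({"topology": {"nodes": 1}}))  # False: isolated node, do not benchmark
```

Gate benchmarks on `cluster_joined(fetch_state())` holding through the >60 s isolation window before starting any sweep.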
Exit criteria for "working":
- Both nodes reachable and in same namespace
- Model artifacts complete and parity-passing on both nodes
- Chat request succeeds from both nodes
- Benchmark artifacts are written incrementally (phase checkpoints)
Fast stop criteria:
- Missing HF binary on either node
- Model dir mismatch (e.g. `~/Models/...` vs `~/.exo/models/...`)
- Cluster isolated (`nodes=1`) for >60s
- Repeated create with `nodeToRunner=0` and no successful chat
- **Download + transfer was a major time sink.** Direct download on each node is faster and more reliable than downloading once and transferring.
- **Path drift caused repeated false failures.** Using both `~/Models/...` and `~/.exo/models/...` at different times causes chaos. Treat the model dir as a strict input and audit that exact directory on each host.
- **Manifest parity must ignore local HF cache metadata noise.** `.cache/huggingface/download/*.metadata` differs naturally across nodes. Compare serving artifacts (actual model/config/index files), not cache metadata files.
- **Tooling asymmetry blocked bootstrap unexpectedly.** Example: missing HF binary path on one node. Preflight must assert the exact binary path per node before long runs.
- **Cluster health is not proven by process count.** `pgrep` and running processes are weak signals. Strong signals: `topology.nodes=2`, a successful chat completion, and stable latency.
- **`/state` runner mapping can lag or mislead.** A successful chat is possible while `nodeToRunner` remains `0`. Serving truth is an actual completion response from the target model.
- **Startup flags can silently isolate nodes.** `--no-api` on one node, or `EXO_OFFLINE=true` during join checks, can make the cluster appear broken. For join validation, use consistent flags and `EXO_OFFLINE=false`.
- **Cross-node code/runtime drift creates non-deterministic behavior.** Discovery/transport settings differed between nodes during debugging. Keep both nodes on matching code + bindings before transport triage.
- **Long runs need checkpointed writes.** A crash late in a run can lose all metrics if results are written only at the end. Save phase artifacts (`token_sweep`, `thinking`, `concurrency`, `tasks`) as they complete.
- **Benchmark time can look like a "hang" without heartbeat logs.** Print progress every fixed request count and include the current phase/timestamp.
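The last two lessons combine into one small pattern: persist each phase's artifact the moment the phase completes, and emit a heartbeat at a fixed request interval. A minimal sketch (directory layout and function names are illustrative, not the actual benchmark harness):

```python
import json
import time
from pathlib import Path

def save_phase_artifact(run_dir: Path, phase: str, results: dict) -> Path:
    """Write one phase's results immediately so a late crash loses at most one phase."""
    run_dir.mkdir(parents=True, exist_ok=True)
    out = run_dir / f"{phase}.json"
    out.write_text(json.dumps({"phase": phase, "ts": time.time(), "results": results}, indent=2))
    return out

def heartbeat(phase: str, done: int, total: int, every: int = 5) -> None:
    """Print progress on a fixed cadence so long sweeps never look like a hang."""
    if done % every == 0 or done == total:
        print(f"[{time.strftime('%H:%M:%S')}] phase={phase} progress={done}/{total}")

# Each phase lands on disk as soon as it finishes:
run_dir = Path("artifacts/bootstrap-runs/demo-run")
for phase in ("token_sweep", "thinking", "concurrency", "tasks"):
    save_phase_artifact(run_dir, phase, {"completed": True})
```

A crash during `tasks` then still leaves `token_sweep.json`, `thinking.json`, and `concurrency.json` recoverable.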
Yes. With this gated order and known failure signatures, reaching a working state is materially faster:
- Before: high churn from mixed paths, transfer retries, and late detection
- Now: early hard gates catch path/asset/tooling errors in minutes
- Expected: first reliable serveability signal quickly, then benchmarking
What still dominates time:
- Large-token benchmark sweeps
- True transport/runtime bugs (if present), not download mechanics
Keep this exact section order and update only the indicated blocks each run:
- `90-Second Handoff` (stable)
- `Current Run Header` (mutable): model id, nodes, dirs, known blockers
- `Phase Gates` (stable): pass/fail criteria
- `Failure Signatures` (stable): symptom → likely cause → next command
- `Run Ledger` (mutable): timestamped outcomes per phase
- `Hand-off Block` (mutable): shortest context for next agent
MODEL:
NODES:
MODEL_DIRS:
LAST_PASS_PHASE:
CURRENT_PHASE:
KNOWN_BLOCKER:
STRONGEST_EVIDENCE:
NEXT_COMMAND:
DO_NOT_REPEAT:
MODEL_ID="organization/model-name"
MODEL_INSTANCE_ID="organization--model-name"
NODE1="user1@node1-host"
NODE2="user2@node2-host"
MODEL_DIR_NODE1="/Users/user1/Models/${MODEL_INSTANCE_ID}"
MODEL_DIR_NODE2="/Users/user2/Models/${MODEL_INSTANCE_ID}"
HF_CMD_NODE1="/path/to/user1/hf" # e.g. ~/hf-cli-env/bin/hf
HF_CMD_NODE2="/path/to/user2/hf" # e.g. ~/Library/Python/3.x/bin/hf
EXPECTED_SAFE_TENSOR_COUNT="" # optional: e.g. 39
EXPECTED_GGUF_COUNT="" # optional: e.g. 3
EXO_API_PORT="52415"
EXO_NAMESPACE="my-cluster"
EXO_LIBP2P_PORT="51001"
EXO_LOG_DIR_NODE1="~/exo-bootstrap-logs"
EXO_LOG_DIR_NODE2="~/exo-bootstrap-logs"
PROFILE_PATH="model_profiles.json"
RUN_REPORT_DIR="artifacts/bootstrap-runs"

Before placement or ring setup, run the artifact audit across 1–4 nodes:
export MODEL_ID="organization/model-name"
export NODES="user1@host1,user2@host2" # up to 4 nodes
./scripts/release_orchestrator.sh

PASS means:
- No partial download markers (`.incomplete`, `.part`)
- Deterministic manifest parity across all nodes (path + size)
- Consistent artifact format across nodes (safetensors or GGUF)
- Required files for safetensors mode are present (`config.json`, and index when sharded)
Only proceed to EXO placement/runners after RING_INPUT_READY.
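The manifest-parity PASS criterion above (path + size, ignoring HF cache noise) can be sketched as a pure comparison over path → size manifests. The exclusion patterns below reflect the cache-metadata lesson earlier in this guide; they are assumptions, not the orchestrator's actual filter list:

```python
from fnmatch import fnmatch

# Paths that legitimately differ across nodes and must not fail parity.
IGNORED_PATTERNS = ("*.incomplete", "*.part", ".cache/huggingface/*")

def serving_manifest(manifest: dict) -> dict:
    """Keep only serving artifacts (shards, config, index), dropping cache noise."""
    return {path: size for path, size in manifest.items()
            if not any(fnmatch(path, pat) for pat in IGNORED_PATTERNS)}

def parity_issues(node1: dict, node2: dict) -> list:
    """Return mismatches between two path->size manifests; empty means parity."""
    a, b = serving_manifest(node1), serving_manifest(node2)
    issues = [f"only-on-node1:{p}" for p in sorted(a.keys() - b.keys())]
    issues += [f"only-on-node2:{p}" for p in sorted(b.keys() - a.keys())]
    issues += [f"size-mismatch:{p}" for p in sorted(a.keys() & b.keys()) if a[p] != b[p]]
    return issues
```

Differing `.metadata` files under `.cache/huggingface/` report no issues, while a short shard on one node reports a `size-mismatch` entry.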
Add a row for each model after first successful bootstrap. Keep this section as the source of truth for fast re-runs.
The repo also has a machine-readable profile: `model_profiles.json`.
- Canonical model metadata and status used by `scripts/bootstrap_two_node_model.sh`
- `status` values:
  - `active`: ready to run
  - `blocked`: stop by default; requires `ALLOW_BLOCKED_MODEL=true`
  - `watch`: active but risky/known to drift
  - `deprecated`: no longer use
| Model ID | Instance ID | HF include | Safetensor shard count | GGUF count | Preflight disk threshold | Notes |
|---|---|---|---|---|---|---|
| `<model-id>` | `<instance-id>` | `<exact --include flags>` | TBD | TBD | `<model-specific minimum>` | `<topology-specific notes>` |
- Run `find "$MODEL_DIR" -type f | wc -l` and record shard counts after a clean first run.
- Check and record:
  - Main variant file prefix naming pattern (e.g. `model.safetensors-xxxxxx.safetensors`)
  - Total folder size (`du -sk`, in the GB/TB range)
  - Any required non-shard artifacts (`config.json`, tokenizer, index files)
- Set `EXPECTED_SAFE_TENSOR_COUNT`/`EXPECTED_GGUF_COUNT` only when deterministic counts are confirmed.
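The recording steps above can be scripted once per node and the output pasted into the table. A sketch that scans a model directory and reports the counts this table needs (function name and output keys are illustrative):

```python
from pathlib import Path

def profile_model_dir(model_dir: str) -> dict:
    """Collect shard counts, total size, and required-artifact flags for the profile table."""
    root = Path(model_dir)
    files = [p for p in root.rglob("*") if p.is_file()]
    return {
        "safetensor_shards": sum(1 for p in files if p.suffix == ".safetensors"),
        "gguf_files": sum(1 for p in files if p.suffix == ".gguf"),
        "total_bytes": sum(p.stat().st_size for p in files),
        "has_config": (root / "config.json").is_file(),
        "has_index": any(p.name.endswith(".safetensors.index.json") for p in files),
    }
```

Run it against `$MODEL_DIR` on each node; set the `EXPECTED_*` counts only after clean runs report identical numbers.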
ssh "$NODE1" "mkdir -p $EXO_LOG_DIR_NODE1; python3 --version; $HF_CMD_NODE1 --help >/tmp/hf_help.log; command -v rg || true; df -h ~ ~/Models"
ssh "$NODE2" "mkdir -p $EXO_LOG_DIR_NODE2; python3 --version; $HF_CMD_NODE2 --help >/tmp/hf_help.log; command -v rg || true; df -h ~ ~/Models"

PASS if:
- HF CLI help works on both nodes.
- SSH access works both directions.
- Disk appears adequate for the model and caches.
STOP if:
- One node fails CLI check.
- One node is unreachable.
- Disk is clearly insufficient.
ssh "$NODE1" "mkdir -p '$MODEL_DIR_NODE1'; \
$HF_CMD_NODE1 download '$MODEL_ID' --local-dir '$MODEL_DIR_NODE1' [--include \"...\" ]"
ssh "$NODE1" "find '$MODEL_DIR_NODE1' -type f \\( -name '*.incomplete' -o -name '*.part' \\) -print | sed -n '1,40p'; \
du -sk '$MODEL_DIR_NODE1'; ls -lah '$MODEL_DIR_NODE1' | sed -n '1,220p'"

PASS if:
- No `.incomplete`/`.part` markers.
- Files exist for the expected variant.
- Size is plausible for the selected model.
- Optional count checks match if configured:
  - `ls -1 '$MODEL_DIR_NODE1'/model.safetensors-*.safetensors | wc -l` equals `EXPECTED_SAFE_TENSOR_COUNT` (if set)
  - `ls -1 '$MODEL_DIR_NODE1'/*.gguf | wc -l` equals `EXPECTED_GGUF_COUNT` (if set)
STOP if:
- Any incomplete marker remains after wait and retry.
- Counts fail.
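The PASS/STOP rules for this phase reduce to one gate function, which keeps Node 1 and Node 2 on identical criteria. A sketch, with the expected counts taken from the config block (`None` means the optional check is skipped):

```python
from pathlib import Path
from typing import List, Optional

def completeness_gate(model_dir: str,
                      expected_safetensors: Optional[int] = None,
                      expected_gguf: Optional[int] = None) -> List[str]:
    """Return reasons to STOP; an empty list means the phase passes."""
    files = [p for p in Path(model_dir).rglob("*") if p.is_file()]
    stops = []
    # Any partial-download marker is an immediate STOP after wait-and-retry.
    partial = [p for p in files if p.suffix in (".incomplete", ".part")]
    if partial:
        stops.append(f"partial_markers:{len(partial)}")
    if expected_safetensors is not None:
        n = sum(1 for p in files if p.suffix == ".safetensors")
        if n != expected_safetensors:
            stops.append(f"safetensor_count:{n}!={expected_safetensors}")
    if expected_gguf is not None:
        n = sum(1 for p in files if p.suffix == ".gguf")
        if n != expected_gguf:
            stops.append(f"gguf_count:{n}!={expected_gguf}")
    return stops
```

Run the same function against `$MODEL_DIR_NODE1` and later `$MODEL_DIR_NODE2` so both nodes are gated by one code path.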
Bring up only Node 1 EXO for API + model visibility.
ssh "$NODE1" "pkill -9 -f '/opt/homebrew/bin/exo|python -m exo|EXO.app/Contents/MacOS/EXO' || true; \
nohup EXO_OFFLINE=true EXO_LIBP2P_NAMESPACE=$EXO_NAMESPACE EXO_LIBP2P_PORT=$EXO_LIBP2P_PORT \
/opt/homebrew/bin/exo --api-port $EXO_API_PORT --verbose --no-fast-synch --no-downloads >$EXO_LOG_DIR_NODE1/node1_canary.log 2>&1 & \
sleep 4; \
python3 - <<PY
import urllib.request, json, sys
base='http://127.0.0.1:${EXO_API_PORT}'
state=urllib.request.urlopen(base+'/state',timeout=8).read()
models=urllib.request.urlopen(base+'/v1/models?status=downloaded',timeout=8).read()
print('state', bool(state), 'models', bool(models))
print(state.decode(errors='ignore')[:300])
print(models.decode(errors='ignore')[:300])
PY

PASS if:
- `/state` is reachable.
- Target model id appears in `/v1/models?status=downloaded`.
If canary fails, fix Node 1 now and restart from Phase 1.
Stop canary before moving forward:
ssh "$NODE1" "pkill -9 -f 'EXO.app/Contents/MacOS/EXO|python -m exo|/opt/homebrew/bin/exo' || true"
ssh "$NODE2" "mkdir -p '$MODEL_DIR_NODE2'; \
$HF_CMD_NODE2 download '$MODEL_ID' --local-dir '$MODEL_DIR_NODE2' [--include \"...\" ]"
ssh "$NODE2" "find '$MODEL_DIR_NODE2' -type f \\( -name '*.incomplete' -o -name '*.part' \\) -print | sed -n '1,40p'; \
du -sk '$MODEL_DIR_NODE2'; ls -lah '$MODEL_DIR_NODE2' | sed -n '1,220p'"

PASS criteria mirror Phase 1A for Node 2 (same format and counts).
ssh "$NODE1" "find '$MODEL_DIR_NODE1' -type f | sort | sed -n '1,240p'"
ssh "$NODE2" "find '$MODEL_DIR_NODE2' -type f | sort | sed -n '1,240p'"

PASS if:
- File trees match where variant naming is deterministic.
- Expected file classes exist on both nodes.
- Both nodes have non-empty expected main shards.
STOP if:
- Shard count mismatch.
- One node missing index/config/manifest files.
ssh "$NODE1" "pkill -9 -f '/opt/homebrew/bin/exo|python -m exo|EXO.app/Contents/MacOS/EXO' || true; \
nohup EXO_OFFLINE=true EXO_LIBP2P_NAMESPACE=$EXO_NAMESPACE EXO_LIBP2P_PORT=$EXO_LIBP2P_PORT \
/opt/homebrew/bin/exo --api-port $EXO_API_PORT --verbose --no-fast-synch --no-downloads >$EXO_LOG_DIR_NODE1/node1.log 2>&1 &"
ssh "$NODE2" "pkill -9 -f '/opt/homebrew/bin/exo|python -m exo|EXO.app/Contents/MacOS/EXO' || true; \
nohup EXO_OFFLINE=true EXO_LIBP2P_NAMESPACE=$EXO_NAMESPACE EXO_LIBP2P_PORT=$EXO_LIBP2P_PORT \
/opt/homebrew/bin/exo --api-port $EXO_API_PORT --verbose --no-fast-synch --no-downloads >$EXO_LOG_DIR_NODE2/node2.log 2>&1 &"
sleep 5
ssh "$NODE1" "pgrep -af 'exo|EXO' | head"
ssh "$NODE2" "pgrep -af 'exo|EXO' | head"
curl -sS "http://127.0.0.1:$EXO_API_PORT/state" | python3 -m json.tool | head -n 40

PASS if:
- Both processes are up.
- API state responds.
STOP if:
- Immediate crash loop.
- API unreachable.
Run from a node with local API access:
python3 - <<PY
import json, urllib.request, urllib.parse, time
base = "http://127.0.0.1:${EXO_API_PORT}"
model = "${MODEL_ID}"
def get(path, timeout=8):
with urllib.request.urlopen(base + path, timeout=timeout) as r:
return json.loads(r.read())
def post(path, obj, timeout=30):
data = json.dumps(obj).encode()
req = urllib.request.Request(base + path, data=data, headers={"Content-Type": "application/json"}, method="POST")
with urllib.request.urlopen(req, timeout=timeout) as r:
return r.getcode(), json.loads(r.read())
m = get("/v1/models?status=downloaded")
ids = [x.get("id") for x in m.get("data", []) if isinstance(x, dict)]
if model not in ids:
raise SystemExit(f"model_not_in_downloaded:{model}")
q = urllib.parse.urlencode({"model_id": model, "min_nodes": "2", "sharding":"Pipeline", "instance_meta":"MlxRingInstance"})
placement = get("/instance/placement?" + q)
code, resp = post("/instance", {"instance": placement}, timeout=30)
print("create_status", code)
print("command_id", resp.get("command_id"))
for i in range(90):
s = get("/state", timeout=8)
inst = s.get("instances", {})
if not inst:
print("t", i, "instances=0")
time.sleep(1)
continue
iid = next(iter(inst))
entry = inst[iid]
payload = entry[next(iter(entry.keys()))]
n2r = payload.get("nodeToRunner", {})
print("t", i, "iid", iid, "nodeToRunner", len(n2r), "r2s", len(payload.get("runnerToShard", {})))
if len(n2r) >= 2:
print("PLACEMENT_READY")
break
time.sleep(1)
else:
raise SystemExit("placement_not_ready_in_window")
PY

PASS if:
- Model appears in downloaded list.
- Instance create request succeeds.
- `nodeToRunner` reaches 2 and stays there.
STOP if:
- `nodeToRunner` remains 0.
- Repeated churn or no runner attachment.
ssh "$NODE1" "grep -Eo '[0-9]+\\.[0-9]+\\.[0-9]+\\.[0-9]+:[0-9]+' $EXO_LOG_DIR_NODE1/node1.log | tail -n 40"
ssh "$NODE2" "grep -Eo '[0-9]+\\.[0-9]+\\.[0-9]+\\.[0-9]+:[0-9]+' $EXO_LOG_DIR_NODE2/node2.log | tail -n 40"

Use each announced `<ip>:<port>` and validate listeners/reachability:
ssh "$NODE1" "lsof -nP -iTCP:<port> -sTCP:LISTEN || true"
ssh "$NODE1" "nc -vz -w 2 <peer-ip> <port> ; echo rc=$?"

PASS if:
- No `error: 60/65` in logs.
- Listed ports are actually listening.
- Connectivity checks succeed in both directions used by the ring/JACCL path.
STOP if:
- `connection refused` on declared transport ports.
- Repeated backoff connect errors.
python3 - <<PY
import json, urllib.request, time
url = "http://127.0.0.1:${EXO_API_PORT}/v1/chat/completions"
payload = {
"model": "${MODEL_ID}",
"messages": [{"role":"user","content":"Reply with exactly OK"}],
"max_tokens": 8,
"stream": False,
"temperature": 0.0,
}
start = time.time()
try:
req = urllib.request.Request(url, data=json.dumps(payload).encode(), headers={"Content-Type":"application/json"}, method="POST")
with urllib.request.urlopen(req, timeout=120) as r:
body = r.read().decode(errors="ignore")
print("status", r.getcode())
print("lat_s", round(time.time()-start, 2))
print(body[:500])
except Exception as e:
print(type(e).__name__, round(time.time()-start, 2), str(e))
PY

PASS if:
- Returns before timeout.
- Status 200.
- Payload includes `choices`.
STOP if:
- Timeout.
- Broken transport message.
- Incomplete response.
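Serving truth from the PASS list above reduces to a single predicate on the completion response. A sketch that accepts only a 200 with non-empty `choices` content (field access follows the OpenAI-style schema this endpoint serves):

```python
def is_serving_truth(status: int, body: dict) -> bool:
    """A chat response proves serving only if it is a 200 with non-empty choices content."""
    if status != 200:
        return False
    choices = body.get("choices") or []
    if not choices:
        return False
    content = (choices[0].get("message") or {}).get("content", "")
    return bool(content and content.strip())

print(is_serving_truth(200, {"choices": [{"message": {"content": "OK"}}]}))  # True
print(is_serving_truth(200, {"choices": []}))  # False: empty choices is not serving truth
```

This is the check to run from both nodes, since a green `/state` alone does not prove serving.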
# Set only these:
export MODEL_ID="organization/model-name"
export MODEL_INSTANCE_ID="organization--model-name" # optional if deriving is not desired
export PROFILE_PATH="model_profiles.json"
export NODE1="user1@node1-host"
export NODE2="user2@node2-host"
export ALLOW_BLOCKED_MODEL=false
# Run the script (it executes the same phases):
./scripts/bootstrap_two_node_model.sh

Optional runtime flags:
- `HF_INCLUDE` (for shard filtering)
- `EXPECTED_SAFE_TENSOR_COUNT` / `EXPECTED_GGUF_COUNT`
- `EXO_API_PORT`, `EXO_NAMESPACE`, `EXO_LIBP2P_PORT`, `RUN_REPORT_DIR`
If the model is marked `blocked` in `model_profiles.json`, set `ALLOW_BLOCKED_MODEL=true` only once you explicitly accept the risk.
If Phase 5 or 6 fails, do not move to Phase 7. Fix the failed phase and re-run the script.
- If Phase 1/1A fails → fix Node 1 download and path.
- If Phase 1B fails → stop and fix Node 1 environment/startup.
- If Phase 2/2A fails → fix Node 2 download and path.
- If Phase 3 fails → do not start distributed EXO.
- If Phase 4/5 fails → do not run transport-dependent benchmarks.
- If Phase 6 fails → fix network/transport before repeating benchmark.
- When all phases pass → keep the artifacts and logs before moving to the next model.
- New model appears: you start with download and completeness, not inference.
- Do not skip the canary or parity gate.
- A green `has_model` or `status=downloaded` without both-node parity is not enough.