@trevin-creator
Created February 27, 2026 15:08
Qwen3.5-122B-A10B on 2× Mac Studio M4 Max via Exo + Thunderbolt 5 RDMA – Resilient Day-0 Bootstrap Guide

What This Is

A phase-gated bootstrap guide for running large models across two Apple Silicon nodes using EXO. Battle-tested on Mac Studio M4 Max (128 GB unified) pairs over Thunderbolt 5.

Reference performance (Qwen3.5-122B-A10B-4bit):

  • ~52 tok/s sustained in the 512–1024 output-token range
  • Stable at concurrency=2 (p95 ~10.37s)

Two-Node Model Bootstrap (Ruthless, Phase-Gated)

Use this whenever a new model drops on Hugging Face. It is the fastest path to a working deployment and avoids repeating filesystem and startup churn.

One-Command Default (Recommended)

./model-drop organization/model-name

This runs:

  1. Pre-ring artifact audit (release_orchestrator.sh)
  2. Two-node bootstrap handoff when audit passes

One-time setup for defaults:

  • Edit config/model_drop.env once

Speed options:

  • ./model-drop organization/model-name --fast (strict-only checks become warnings)
  • ./model-drop organization/model-name --resume-bootstrap (skip re-audit, rerun handoff)
  • ./model-drop-latest --resume-bootstrap (reuse most recent model id automatically)
  • ./model-drop organization/model-name --online (force EXO_OFFLINE=false for cluster join)

Fastest Repeatable Sequence

  1. Prepare both machines (tools, paths, disk).
  2. Download model directly on Node 1.
  3. Verify Node 1 is complete + path-correct.
  4. Run Node 1 local readiness canary.
  5. Download model directly on Node 2.
  6. Verify Node 2 is complete + path-correct.
  7. Run two-node EXO startup and transport gates.
  8. Create instance and only then run the first chat.

This catches the biggest source of failure first: one node is incomplete or misconfigured while the other appears fine.
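The sequence above reduces to a simple driver pattern: run each phase's check in order and hard-stop at the first failure, so later phases never run against a broken node. A minimal sketch, with stubbed checks standing in for the real ssh/find/curl commands (phase names here are illustrative):

```python
# Phase-gated driver pattern: run checks in order, stop at the first failure.
from typing import Callable

def run_gated(phases: list[tuple[str, Callable[[], bool]]]) -> str:
    """Run phases in order; return the first failing phase name, or 'ALL_PASS'."""
    for name, check in phases:
        if not check():
            return name  # hard stop: later phases must not run
    return "ALL_PASS"

# Stubbed checks; real ones would shell out to ssh/find/curl:
phases = [
    ("node1_download_complete", lambda: True),
    ("node1_canary", lambda: True),
    ("node2_download_complete", lambda: False),  # simulate an incomplete Node 2
    ("parity_gate", lambda: True),
]
print(run_gated(phases))  # → node2_download_complete
```

The return value is the resume point: fix the named phase and rerun from there, never from the end.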

Core Rules

  • One direct download per node. No rsync/sync-to-other-node handoffs.
  • Each phase must pass before moving on.
  • No transport/inference work until both nodes pass file integrity checks.
  • Use explicit commands and pass/fail checkpoints, not process-count heuristics.

90-Second Handoff (Read This First)

If you are a new agent/operator inheriting this run, do this in order:

  1. Confirm target nodes + model id + model dirs are correct on both hosts.
  2. Confirm model artifacts are complete on both hosts before any EXO create/placement.
  3. Confirm cluster join (topology.nodes=2) before any benchmark.
  4. Confirm serving truth via actual /v1/chat/completions call, not only /state runner maps.
  5. Only then run long benchmarks.

Exit criteria for "working":

  • Both nodes reachable and in same namespace
  • Model artifacts complete and parity-passing on both nodes
  • Chat request succeeds from both nodes
  • Benchmark artifacts are written incrementally (phase checkpoints)

Fast stop criteria:

  • Missing HF binary on either node
  • Model dir mismatch (e.g. ~/Models/... vs ~/.exo/models/...)
  • Cluster isolated (nodes=1) for >60s
  • Repeated create with nodeToRunner=0 and no successful chat

Lessons Absorbed from Two-Node Bootstrap Attempts

  1. Download + transfer was a major time sink. Direct download on each node is faster and more reliable than downloading once and transferring.

  2. Path drift caused repeated false failures. Using both ~/Models/... and ~/.exo/models/... at different times causes chaos. Treat model dir as a strict input and audit that exact directory on each host.

  3. Manifest parity must ignore local HF cache metadata noise. .cache/huggingface/download/*.metadata differs naturally across nodes. Compare serving artifacts (actual model/config/index files), not cache metadata files.

  4. Tooling asymmetry blocked bootstrap unexpectedly. Example: missing HF binary path on one node. Preflight must assert exact binary path per node before long runs.

  5. Cluster health is not proven by process count. pgrep and running processes are weak signals. Strong signals: topology.nodes=2, successful chat completion, and stable latency.

  6. /state runner mapping can lag or mislead. Successful chat is possible while nodeToRunner remains 0. Serving truth is an actual completion response from the target model.

  7. Startup flags can silently isolate nodes. --no-api on one node or EXO_OFFLINE=true during join checks can make cluster appear broken. For join validation, use consistent flags and EXO_OFFLINE=false.

  8. Cross-node code/runtime drift creates non-deterministic behavior. Discovery/transport settings differed between nodes during debugging. Keep both nodes on matching code + bindings before transport triage.

  9. Long runs need checkpointed writes. A crash late in a run can lose all metrics if results are only written at end. Save phase artifacts (token_sweep, thinking, concurrency, tasks) as they complete.

  10. Benchmark time can look like "hang" without heartbeat logs. Print progress every fixed request count and include current phase/timestamp.
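Lessons 9 and 10 combine into one pattern: write each benchmark phase's results to disk the moment the phase completes, and print a timestamped heartbeat at a fixed request interval. A minimal sketch (the phase list matches the artifacts named above; file names and the placeholder metric are illustrative):

```python
# Checkpointed writes + heartbeat: a crash in a later phase cannot lose
# earlier phases' data, and long runs never look like a silent hang.
import json
import pathlib
import time

def run_phase(name: str, n_requests: int, out_dir: pathlib.Path, heartbeat_every: int = 5):
    out_dir.mkdir(parents=True, exist_ok=True)
    results = []
    for i in range(n_requests):
        results.append({"request": i, "lat_s": 0.1})  # placeholder metric
        if (i + 1) % heartbeat_every == 0:
            # heartbeat with phase + timestamp, per lesson 10
            print(f"[{time.strftime('%H:%M:%S')}] phase={name} {i + 1}/{n_requests}")
    # checkpoint this phase immediately, per lesson 9
    (out_dir / f"{name}.json").write_text(json.dumps(results))

for phase in ["token_sweep", "thinking", "concurrency", "tasks"]:
    run_phase(phase, n_requests=10, out_dir=pathlib.Path("artifacts/bench"))
```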

If You Repeated from Scratch, Would It Be Faster?

Yes. With this gated order and known failure signatures, reaching a working state is materially faster:

  • Before: high churn from mixed paths, transfer retries, and late detection
  • Now: early hard gates catch path/asset/tooling errors in minutes
  • Expected: a reliable signal that the model is servable arrives within minutes, then benchmarking

What still dominates time:

  • Large-token benchmark sweeps
  • True transport/runtime bugs (if present), not download mechanics

Document Structure for Agent/Operator Absorption

Keep this exact section order and update only the indicated blocks each run:

  1. 90-Second Handoff (stable)
  2. Current Run Header (mutable): model id, nodes, dirs, known blockers
  3. Phase Gates (stable): pass/fail criteria
  4. Failure Signatures (stable): symptom → likely cause → next command
  5. Run Ledger (mutable): timestamped outcomes per phase
  6. Hand-off Block (mutable): shortest context for next agent

Hand-off Block Template (Paste at Top of Active Thread)

MODEL:
NODES:
MODEL_DIRS:
LAST_PASS_PHASE:
CURRENT_PHASE:
KNOWN_BLOCKER:
STRONGEST_EVIDENCE:
NEXT_COMMAND:
DO_NOT_REPEAT:

Inputs

MODEL_ID="organization/model-name"
MODEL_INSTANCE_ID="organization--model-name"
NODE1="user1@node1-host"
NODE2="user2@node2-host"
MODEL_DIR_NODE1="/Users/user1/Models/${MODEL_INSTANCE_ID}"
MODEL_DIR_NODE2="/Users/user2/Models/${MODEL_INSTANCE_ID}"
HF_CMD_NODE1="/path/to/user1/hf"           # e.g. ~/hf-cli-env/bin/hf
HF_CMD_NODE2="/path/to/user2/hf"           # e.g. ~/Library/Python/3.x/bin/hf
EXPECTED_SAFE_TENSOR_COUNT=""               # optional: e.g. 39
EXPECTED_GGUF_COUNT=""                      # optional: e.g. 3
EXO_API_PORT="52415"
EXO_NAMESPACE="my-cluster"
EXO_LIBP2P_PORT="51001"
EXO_LOG_DIR_NODE1="~/exo-bootstrap-logs"
EXO_LOG_DIR_NODE2="~/exo-bootstrap-logs"
PROFILE_PATH="model_profiles.json"
RUN_REPORT_DIR="artifacts/bootstrap-runs"

Pre-Ring Audit (Run This First for New Drops)

Before placement or ring setup, run artifact audit across 1–4 nodes:

export MODEL_ID="organization/model-name"
export NODES="user1@host1,user2@host2"   # up to 4 nodes
./scripts/release_orchestrator.sh

PASS means:

  • No partial download markers (.incomplete, .part)
  • Deterministic manifest parity across all nodes (path + size)
  • Consistent artifact format across nodes (safetensors or GGUF)
  • Required files for safetensors mode are present (config.json, and index when sharded)

Only proceed to EXO placement/runners after RING_INPUT_READY.
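The "required files for safetensors mode" rule can be sketched as a standalone check. This assumes the usual Hugging Face layout (config.json plus *.safetensors shards, with model.safetensors.index.json required only when sharded); the actual audit in release_orchestrator.sh may check more:

```python
# Required-files gate for a safetensors-mode model directory.
import pathlib

def safetensors_ready(model_dir: str) -> tuple[bool, str]:
    d = pathlib.Path(model_dir)
    if not (d / "config.json").is_file():
        return (False, "missing config.json")
    shards = sorted(d.glob("*.safetensors"))
    if not shards:
        return (False, "no .safetensors files")
    # sharded models need the weight index; single-file layouts do not
    if len(shards) > 1 and not (d / "model.safetensors.index.json").is_file():
        return (False, "sharded but missing index")
    return (True, f"ok ({len(shards)} shard(s))")
```

Run it against the exact model dir on each node; any False result is a hard stop before ring setup.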

Model Registry (Capture Once, Reuse Every New Drop)

Add a row for each model after first successful bootstrap. Keep this section as the source of truth for fast re-runs.

The repo also has a machine-readable profile:

  • model_profiles.json
  • Canonical model metadata and status used by scripts/bootstrap_two_node_model.sh
  • status values:
    • active: ready to run
    • blocked: stop by default, requires ALLOW_BLOCKED_MODEL=true
    • watch: active but risky/known to drift
    • deprecated: no longer use
| Model ID   | Instance ID   | HF include              | Safetensor shard count | GGUF count | Preflight disk threshold | Notes                     |
|------------|---------------|-------------------------|------------------------|------------|--------------------------|---------------------------|
| <model-id> | <instance-id> | <exact --include flags> | TBD                    | TBD        | <model-specific minimum> | <topology-specific notes> |
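The status gate can be sketched as a small lookup against the profile file. The schema here (model id → entry with a "status" field) is an assumption about model_profiles.json, not its confirmed layout:

```python
# Status gate over a model_profiles.json-style dict; schema is assumed.
import os

def check_profile(profiles: dict, model_id: str) -> str:
    entry = profiles.get(model_id)
    if entry is None:
        return "UNKNOWN_MODEL"
    status = entry.get("status")
    if status == "blocked" and os.environ.get("ALLOW_BLOCKED_MODEL") != "true":
        return "STOP_BLOCKED"  # stop by default, per the status table
    if status == "deprecated":
        return "STOP_DEPRECATED"
    if status == "watch":
        return "PROCEED_WITH_CAUTION"
    return "PROCEED"

profiles = {"org/model-a": {"status": "active"}, "org/model-b": {"status": "blocked"}}
print(check_profile(profiles, "org/model-b"))  # STOP_BLOCKED unless ALLOW_BLOCKED_MODEL=true
```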

How to Derive Counts and Notes (Before Reusing a Model)

  • Run find "$MODEL_DIR" -type f | wc -l and record shard counts after a clean first run.
  • Check and record:
    • Main variant file prefix naming pattern (e.g. model.safetensors-xxxxxx.safetensors)
    • Total folder size (du -sk in GB/TB range)
    • Any required non-shard artifacts (config.json, tokenizer, index files)
  • Set EXPECTED_SAFE_TENSOR_COUNT / EXPECTED_GGUF_COUNT only when deterministic counts are confirmed.
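The derivation steps above can be collected into one helper that reports everything the registry row needs. A sketch (glob patterns assume shards live at the top of the model dir):

```python
# Derive registry counts and sizes from a completed model directory.
import pathlib

def derive_counts(model_dir: str) -> dict:
    d = pathlib.Path(model_dir)
    files = [p for p in d.rglob("*") if p.is_file()]
    return {
        "total_files": len(files),                            # find ... | wc -l
        "safetensor_shards": len(list(d.glob("*.safetensors"))),
        "gguf_files": len(list(d.glob("*.gguf"))),
        "total_bytes": sum(p.stat().st_size for p in files),  # du -sk equivalent
    }
```

Record the output after a clean first run, then set EXPECTED_SAFE_TENSOR_COUNT / EXPECTED_GGUF_COUNT from it.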

Phase 0 — Preflight on Both Nodes

ssh "$NODE1" "mkdir -p $EXO_LOG_DIR_NODE1; python3 --version; $HF_CMD_NODE1 --help >/tmp/hf_help.log; command -v rg || true; df -h ~ ~/Models"
ssh "$NODE2" "mkdir -p $EXO_LOG_DIR_NODE2; python3 --version; $HF_CMD_NODE2 --help >/tmp/hf_help.log; command -v rg || true; df -h ~ ~/Models"

PASS if:

  • HF CLI help works on both nodes.
  • SSH access works both directions.
  • Disk appears adequate for the model and caches.

STOP if:

  • One node fails CLI check.
  • One node is unreachable.
  • Disk is clearly insufficient.
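Run locally on each node (the guide drives the equivalent over ssh), the preflight asserts can be sketched as a function returning a failure list; an empty list is PASS. The free-disk threshold is illustrative, not a verified requirement:

```python
# Local-node preflight: exact hf binary path exists, and disk is adequate.
import pathlib
import shutil

def preflight(hf_cmd: str, min_free_gb: float = 300.0) -> list[str]:
    failures = []
    # assert the exact binary path per node (lesson 4: tooling asymmetry)
    if not pathlib.Path(hf_cmd).is_file() and shutil.which(hf_cmd) is None:
        failures.append(f"hf binary not found: {hf_cmd}")
    free_gb = shutil.disk_usage(pathlib.Path.home()).free / 1e9
    if free_gb < min_free_gb:
        failures.append(f"only {free_gb:.0f} GB free, need {min_free_gb:.0f}")
    return failures  # empty list == PASS
```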

Phase 1 — Download on Node 1 (Primary)

ssh "$NODE1" "mkdir -p '$MODEL_DIR_NODE1'; \
  $HF_CMD_NODE1 download '$MODEL_ID' --local-dir '$MODEL_DIR_NODE1' [--include \"...\" ]"

Phase 1A — Node 1 File Completion Gate

ssh "$NODE1" "find '$MODEL_DIR_NODE1' -type f \\( -name '*.incomplete' -o -name '*.part' \\) -print | sed -n '1,40p'; \
  du -sk '$MODEL_DIR_NODE1'; ls -lah '$MODEL_DIR_NODE1' | sed -n '1,220p'"

PASS if:

  • No .incomplete/.part markers.
  • Files exist for expected variant.
  • Size is plausible for selected model.
  • Optional count checks match if configured:
    • ls -1 '$MODEL_DIR_NODE1'/model.safetensors-*.safetensors | wc -l == EXPECTED_SAFE_TENSOR_COUNT (if set)
    • ls -1 '$MODEL_DIR_NODE1'/*.gguf | wc -l == EXPECTED_GGUF_COUNT (if set)

STOP if:

  • Any incomplete marker remains after wait and retry.
  • Counts fail.

Phase 1B — Node 1 Local Canary

Bring up only Node 1 EXO for API + model visibility.

ssh "$NODE1" "pkill -9 -f '/opt/homebrew/bin/exo|python -m exo|EXO.app/Contents/MacOS/EXO' || true; \
  nohup EXO_OFFLINE=true EXO_LIBP2P_NAMESPACE=$EXO_NAMESPACE EXO_LIBP2P_PORT=$EXO_LIBP2P_PORT \
  /opt/homebrew/bin/exo --api-port $EXO_API_PORT --verbose --no-fast-synch --no-downloads >$EXO_LOG_DIR_NODE1/node1_canary.log 2>&1 &"
sleep 4
ssh "$NODE1" python3 - <<PY
import urllib.request
base = 'http://127.0.0.1:${EXO_API_PORT}'
state = urllib.request.urlopen(base + '/state', timeout=8).read()
models = urllib.request.urlopen(base + '/v1/models?status=downloaded', timeout=8).read()
print('state', bool(state), 'models', bool(models))
print(state.decode(errors='ignore')[:300])
print(models.decode(errors='ignore')[:300])
PY

PASS if:

  • /state is reachable.
  • Target model id appears in /v1/models?status=downloaded.

If canary fails, fix Node 1 now and restart from Phase 1.

Stop canary before moving forward:

ssh "$NODE1" "pkill -9 -f 'EXO.app/Contents/MacOS/EXO|python -m exo|/opt/homebrew/bin/exo' || true"

Phase 2 — Download on Node 2

ssh "$NODE2" "mkdir -p '$MODEL_DIR_NODE2'; \
  $HF_CMD_NODE2 download '$MODEL_ID' --local-dir '$MODEL_DIR_NODE2' [--include \"...\" ]"

Phase 2A — Node 2 File Completion Gate

ssh "$NODE2" "find '$MODEL_DIR_NODE2' -type f \\( -name '*.incomplete' -o -name '*.part' \\) -print | sed -n '1,40p'; \
  du -sk '$MODEL_DIR_NODE2'; ls -lah '$MODEL_DIR_NODE2' | sed -n '1,220p'"

PASS criteria mirror Phase 1A for Node 2 (same format and counts).


Phase 3 — Cross-Node Completion Parity Gate (Mandatory)

ssh "$NODE1" "find '$MODEL_DIR_NODE1' -type f | sort | sed -n '1,240p'"
ssh "$NODE2" "find '$MODEL_DIR_NODE2' -type f | sort | sed -n '1,240p'"

PASS if:

  • File trees match where variant naming is deterministic.
  • Expected file classes exist on both nodes.
  • Both nodes have non-empty expected main shards.

STOP if:

  • Shard count mismatch.
  • One node missing index/config/manifest files.
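Applying the manifest-parity lesson above, the gate can be sketched as a comparison of relative path + size, skipping HF cache metadata under .cache/ that naturally differs across nodes. Run the manifest function on each node (e.g. over ssh) and diff the results:

```python
# Deterministic manifest parity: relative path + size, cache metadata excluded.
import pathlib

def manifest(model_dir: str) -> dict[str, int]:
    d = pathlib.Path(model_dir)
    return {
        str(p.relative_to(d)): p.stat().st_size
        for p in d.rglob("*")
        if p.is_file() and ".cache" not in p.relative_to(d).parts
    }

def parity(dir1: str, dir2: str) -> list[str]:
    m1, m2 = manifest(dir1), manifest(dir2)
    diffs = [f"only on node1: {k}" for k in m1.keys() - m2.keys()]
    diffs += [f"only on node2: {k}" for k in m2.keys() - m1.keys()]
    diffs += [f"size mismatch: {k}" for k in m1.keys() & m2.keys() if m1[k] != m2[k]]
    return diffs  # empty == PASS
```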

Phase 4 — Full Two-Node EXO Startup (Clean)

ssh "$NODE1" "pkill -9 -f '/opt/homebrew/bin/exo|python -m exo|EXO.app/Contents/MacOS/EXO' || true; \
nohup EXO_OFFLINE=true EXO_LIBP2P_NAMESPACE=$EXO_NAMESPACE EXO_LIBP2P_PORT=$EXO_LIBP2P_PORT \
  /opt/homebrew/bin/exo --api-port $EXO_API_PORT --verbose --no-fast-synch --no-downloads >$EXO_LOG_DIR_NODE1/node1.log 2>&1 &"

ssh "$NODE2" "pkill -9 -f '/opt/homebrew/bin/exo|python -m exo|EXO.app/Contents/MacOS/EXO' || true; \
nohup EXO_OFFLINE=true EXO_LIBP2P_NAMESPACE=$EXO_NAMESPACE EXO_LIBP2P_PORT=$EXO_LIBP2P_PORT \
  /opt/homebrew/bin/exo --api-port $EXO_API_PORT --verbose --no-fast-synch --no-downloads >$EXO_LOG_DIR_NODE2/node2.log 2>&1 &"

sleep 5
ssh "$NODE1" "pgrep -af 'exo|EXO' | head"
ssh "$NODE2" "pgrep -af 'exo|EXO' | head"
ssh "$NODE1" "curl -sS http://127.0.0.1:$EXO_API_PORT/state" | python3 -m json.tool | head -n 40

PASS if:

  • Both processes are up.
  • API state responds.

STOP if:

  • Immediate crash loop.
  • API unreachable.
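Instead of a fixed sleep before the state check, a bounded poll gives a cleaner pass/fail signal: either /state answers within the window or the phase fails. A sketch (window length is illustrative):

```python
# Poll /state until it answers 200 or the window expires.
import time
import urllib.error
import urllib.request

def wait_for_state(base: str, window_s: int = 30) -> bool:
    deadline = time.time() + window_s
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(base + "/state", timeout=3) as r:
                if r.getcode() == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # not up yet; retry until the deadline
        time.sleep(1)
    return False
```

A False return here maps directly to the STOP criteria: immediate crash loop or API unreachable.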

Phase 5 — Placement and Runner Readiness

Run from a node with local API access:

python3 - <<PY
import json, urllib.request, urllib.parse, time
base = "http://127.0.0.1:${EXO_API_PORT}"
model = "${MODEL_ID}"

def get(path, timeout=8):
    with urllib.request.urlopen(base + path, timeout=timeout) as r:
        return json.loads(r.read())

def post(path, obj, timeout=30):
    data = json.dumps(obj).encode()
    req = urllib.request.Request(base + path, data=data, headers={"Content-Type": "application/json"}, method="POST")
    with urllib.request.urlopen(req, timeout=timeout) as r:
        return r.getcode(), json.loads(r.read())

m = get("/v1/models?status=downloaded")
ids = [x.get("id") for x in m.get("data", []) if isinstance(x, dict)]
if model not in ids:
    raise SystemExit(f"model_not_in_downloaded:{model}")

q = urllib.parse.urlencode({"model_id": model, "min_nodes": "2", "sharding":"Pipeline", "instance_meta":"MlxRingInstance"})
placement = get("/instance/placement?" + q)
code, resp = post("/instance", {"instance": placement}, timeout=30)
print("create_status", code)
print("command_id", resp.get("command_id"))

for i in range(90):
    s = get("/state", timeout=8)
    inst = s.get("instances", {})
    if not inst:
        print("t", i, "instances=0")
        time.sleep(1)
        continue
    iid = next(iter(inst))
    entry = inst[iid]
    payload = entry[next(iter(entry.keys()))]
    n2r = payload.get("nodeToRunner", {})
    print("t", i, "iid", iid, "nodeToRunner", len(n2r), "r2s", len(payload.get("runnerToShard", {})))
    if len(n2r) >= 2:
        print("PLACEMENT_READY")
        break
    time.sleep(1)
else:
    raise SystemExit("placement_not_ready_in_window")
PY

PASS if:

  • Model appears in downloaded list.
  • Instance create request succeeds.
  • nodeToRunner reaches 2 and stays.

STOP if:

  • nodeToRunner remains 0.
  • Repeated churn or no runner attachment.

Phase 6 — Transport Health Check Before Chat

ssh "$NODE1" "grep -Eo '[0-9]+\\.[0-9]+\\.[0-9]+\\.[0-9]+:[0-9]+' $EXO_LOG_DIR_NODE1/node1.log | tail -n 40"
ssh "$NODE2" "grep -Eo '[0-9]+\\.[0-9]+\\.[0-9]+\\.[0-9]+:[0-9]+' $EXO_LOG_DIR_NODE2/node2.log | tail -n 40"

Use each announced <ip>:<port> and validate listeners/reachability:

ssh "$NODE1" "lsof -nP -iTCP:<port> -sTCP:LISTEN || true"
ssh "$NODE1" "nc -vz -w 2 <peer-ip> <port>; echo rc=\$?"

PASS if:

  • No error: 60 (operation timed out) or error: 65 (no route to host) entries in the logs.
  • Ports listed are actually listening.
  • Connectivity checks reach both directions used by ring/JACCL path.

STOP if:

  • connection refused on declared transport ports.
  • Repeated backoff connect errors.
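The same reachability check can be done with Python sockets instead of nc, which makes it easy to sweep every announced endpoint in both directions. A sketch (the endpoint IPs below are illustrative TEST-NET addresses, not real cluster values):

```python
# TCP connect probe for each announced <ip>:<port> transport endpoint.
import socket

def tcp_reachable(host: str, port: int, timeout_s: float = 2.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:  # refused, timed out, or no route
        return False

# Probe every announced endpoint before declaring the transport healthy:
endpoints = [("192.0.2.10", 51001), ("192.0.2.11", 51001)]  # illustrative
for host, port in endpoints:
    print(host, port, "reachable" if tcp_reachable(host, port) else "REFUSED/TIMEOUT")
```

Any False result on a declared transport port is the same hard stop as a connection refused from nc.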

Phase 7 — Bounded Smoke Inference

python3 - <<PY
import json, urllib.request, time
url = "http://127.0.0.1:${EXO_API_PORT}/v1/chat/completions"
payload = {
    "model": "${MODEL_ID}",
    "messages": [{"role":"user","content":"Reply with exactly OK"}],
    "max_tokens": 8,
    "stream": False,
    "temperature": 0.0,
}
start = time.time()
try:
    req = urllib.request.Request(url, data=json.dumps(payload).encode(), headers={"Content-Type":"application/json"}, method="POST")
    with urllib.request.urlopen(req, timeout=120) as r:
        body = r.read().decode(errors="ignore")
        print("status", r.getcode())
        print("lat_s", round(time.time()-start, 2))
        print(body[:500])
except Exception as e:
    print(type(e).__name__, round(time.time()-start, 2), str(e))
PY

PASS if:

  • Returns before timeout.
  • Status 200.
  • Payload includes choices.

STOP if:

  • Timeout.
  • Broken transport message.
  • Incomplete response.

Quick Start Command Card for a New Model

# Set only these:
export MODEL_ID="organization/model-name"
export MODEL_INSTANCE_ID="organization--model-name"      # optional; the script can derive this if unset
export PROFILE_PATH="model_profiles.json"
export NODE1="user1@node1-host"
export NODE2="user2@node2-host"
export ALLOW_BLOCKED_MODEL=false

# Run the script (it executes the same phases):
./scripts/bootstrap_two_node_model.sh

Optional runtime flags:

  • HF_INCLUDE (for shard filtering)
  • EXPECTED_SAFE_TENSOR_COUNT / EXPECTED_GGUF_COUNT
  • EXO_API_PORT, EXO_NAMESPACE, EXO_LIBP2P_PORT, RUN_REPORT_DIR

If model is marked blocked in model_profiles.json, set ALLOW_BLOCKED_MODEL=true once you explicitly accept the risk.

If Phase 5 or 6 fails, do not move to Phase 7. Fix the failed phase and re-run the script.

Decision Logic (State Machine)

  • If Phase 1/1A fails → fix Node 1 download and path.
  • If Phase 1B fails → stop and fix Node 1 environment/startup.
  • If Phase 2/2A fails → fix Node 2 download and path.
  • If Phase 3 fails → do not start distributed EXO.
  • If Phase 4/5 fails → do not run transport-dependent benchmarks.
  • If Phase 6 fails → fix network/transport before repeating benchmark.
  • As each phase passes → keep its artifacts and logs before moving to the next model.
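The decision list above is effectively a lookup table a wrapper script could consult on failure. A sketch (phase keys and action strings mirror the list; all illustrative):

```python
# Failed phase -> next action, mirroring the decision logic above.
NEXT_ACTION = {
    "phase1": "fix Node 1 download and path",
    "phase1a": "fix Node 1 download and path",
    "phase1b": "fix Node 1 environment/startup",
    "phase2": "fix Node 2 download and path",
    "phase2a": "fix Node 2 download and path",
    "phase3": "do not start distributed EXO",
    "phase4": "do not run transport-dependent benchmarks",
    "phase5": "do not run transport-dependent benchmarks",
    "phase6": "fix network/transport before repeating benchmark",
}

def on_failure(phase: str) -> str:
    return NEXT_ACTION.get(phase, "unknown phase: stop and inspect logs")

print(on_failure("phase3"))  # → do not start distributed EXO
```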

Failure-First Mindset for Speed

  • New model appears: you start with download and completeness, not inference.
  • Do not skip the canary or parity gate.
  • A green has_model or status=downloaded without both-node parity is not enough.