User: Use a rayon::ThreadPool to parallelize the handling of MlMsg::Compute and MlMsg::Compile, shared compiled_map in an arc std rw lock, clone manager_tx.
Assistant: Completed: Refactor ML worker parallelism (2/3)
Loaded AGENTS.md and README.md.
Implemented in lib.rs and Cargo.toml:
- Added `rayon` to the `webnn` crate.
- Reworked `ml_loop()` to use a `rayon::ThreadPool`.
- Moved the ML cache to `Arc<RwLock<HashMap<(ContextId, GraphId), MlCacheEntry>>>`.
- Split `MlMsg::Compute` and `MlMsg::Compile` handling into parallel helpers.
- Cloned `manager_tx` per pool task so each task can report `WebNNMsg` results independently.
- Changed `MlCacheEntry` to hold shared `Arc` data so compute tasks can read cache entries without holding the lock during execution.
- Updated shutdown to always join the ML dispatcher thread and wait for in-flight pool work.
Validation:
`./mach check` passed.
Made changes.
User: why did you put Arc around members for MLCacheEntry? The cache itself is shared with arc, and the members are not cloned, are they?
Assistant: That cloned() clones the whole MlCacheEntry. Because graph_info and converted_program are wrapped in Arc, that clone is shallow for those fields: it increments refcounts instead of deep-copying the GraphInfo and Vec<u8>.
So the reason for the inner Arcs was:
- release the `RwLock` quickly,
- move an owned cache entry into the rayon task logic,
- avoid deep-cloning large cached data each time a compute task reads it.

Without those inner `Arc`s, `.cloned()` would deep-clone `GraphInfo` and `Vec<u8>` on every compute lookup.
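A minimal sketch of that difference, with a `Vec<u8>` standing in for the cached program data (the helper name is made up for illustration):

```rust
use std::sync::Arc;

// Returns (same_allocation, refcount) observed after a shallow Arc clone.
fn shallow_clone_stats(data: &Arc<Vec<u8>>) -> (bool, usize) {
    let clone = Arc::clone(data); // bumps the refcount, copies no bytes
    (Arc::ptr_eq(data, &clone), Arc::strong_count(data))
}

fn main() {
    let converted_program = Arc::new(vec![0u8; 1024]);
    let (same_alloc, count) = shallow_clone_stats(&converted_program);
    assert!(same_alloc); // both handles point at one allocation
    assert_eq!(count, 2); // only the refcount changed

    // A plain Vec clone, by contrast, copies every byte:
    let deep: Vec<u8> = (*converted_program).clone();
    assert_eq!(deep.len(), 1024); // separate 1 KiB allocation
}
```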
That said, the shape is not the only option. A cleaner alternative is:
- store `Arc<MlCacheEntry>` in the map,
- then clone just the outer entry pointer from the map.
That would avoid wrapping individual members and make the ownership model clearer.
User: but why do you clone the return value of read_compiled_map?
Assistant: It is not cloning the return value of read_compiled_map(). It is cloning the map entry fetched through the guard in lib.rs:
read_compiled_map(&compiled_map)returns anRwLockReadGuard.get(&(ctx_id, key))returnsOption<&MlCacheEntry>.cloned()turns that intoOption<MlCacheEntry>
The clone is there to break the borrow from the lock guard before the expensive compute starts.
Without `.cloned()`, `entry` would be a reference tied to the `RwLockReadGuard`, so one of two bad things would happen:
- the read lock would need to stay held across the whole compute, blocking writers like `Compile`, or
- Rust would reject using the borrowed entry after the temporary guard is dropped.
So `.cloned()` is used to:
- copy out an owned snapshot of the cache entry,
- release the read lock immediately,
- let compute proceed without holding the lock.
Because lib.rs wraps the heavy fields in `Arc`, that clone is cheap for `graph_info` and `converted_program`.
If preferred, this can be made clearer by storing `Arc<MlCacheEntry>` in the map and cloning that pointer instead.
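A minimal sketch of that clone-out-of-the-guard pattern, with hypothetical stand-ins for the types in lib.rs (`u32` keys, `String`/`Vec<u8>` payloads):

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

// Stand-in for the real cache entry; the heavy fields are wrapped in
// Arc, so cloning the entry is shallow for them.
#[derive(Clone)]
struct MlCacheEntry {
    graph_info: Arc<String>,
    converted_program: Arc<Vec<u8>>,
}

type CompiledMap = Arc<RwLock<HashMap<(u32, u32), MlCacheEntry>>>;

// Clone the entry out through the read guard so the lock is released
// before any expensive compute work begins.
fn get_entry(map: &CompiledMap, key: (u32, u32)) -> Option<MlCacheEntry> {
    map.read().expect("poisoned lock").get(&key).cloned()
    // The temporary RwLockReadGuard is dropped at the end of this
    // expression; the returned entry owns only Arc handles.
}

fn main() {
    let map: CompiledMap = Arc::new(RwLock::new(HashMap::new()));
    map.write().unwrap().insert(
        (1, 7),
        MlCacheEntry {
            graph_info: Arc::new("graph".into()),
            converted_program: Arc::new(vec![0u8; 4]),
        },
    );
    let entry = get_entry(&map, (1, 7)).expect("entry present");
    // The lock is free again: a writer (e.g. Compile) is not blocked
    // while compute keeps using `entry`.
    assert!(map.try_write().is_ok());
    assert_eq!(entry.converted_program.len(), 4);
}
```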
User: Ok I see.
But it's kind of weird to clone the arc inside a structure that is protected by a lock, and then to drop the lock and keep using the arc. It compiles, but logically it's weird and could lead to unexpected behavior
Assistant: Agreed.
It is memory-safe, but the ownership model is awkward. The current pattern in lib.rs effectively says:
- the map is protected by `RwLock`
- each entry also has independently shared lifetime via inner `Arc`s
- compute reads an entry, detaches from the map, and keeps using the detached data
That is valid for immutable cache data, but it weakens the mental model of “the lock protects the cache contents”. The lock ends up protecting the map structure more than the entry data.
A clearer design is:
- store `HashMap<(ContextId, GraphId), Arc<MlCacheEntry>>`
- keep `MlCacheEntry` fields plain, without inner `Arc`s
- clone the `Arc<MlCacheEntry>` out of the map
- drop the lock
- use the entry through that single outer `Arc`
That preserves the same performance property, but makes the sharing boundary explicit and easier to reason about.
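A sketch of that alternative shape, again with assumed placeholder key and field types:

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

// Plain fields, no inner Arcs: sharing happens only through the outer Arc.
struct MlCacheEntry {
    graph_info: String,
    converted_program: Vec<u8>,
}

type CompiledMap = Arc<RwLock<HashMap<(u32, u32), Arc<MlCacheEntry>>>>;

// Cloning Arc<MlCacheEntry> bumps one refcount; the read guard drops
// at the end of the expression, so the lock never outlives the lookup.
fn get_entry(map: &CompiledMap, key: (u32, u32)) -> Option<Arc<MlCacheEntry>> {
    map.read().expect("poisoned lock").get(&key).cloned()
}

fn main() {
    let map: CompiledMap = Arc::new(RwLock::new(HashMap::new()));
    let stored = Arc::new(MlCacheEntry {
        graph_info: "graph".into(),
        converted_program: vec![0u8; 4],
    });
    map.write().unwrap().insert((1, 7), Arc::clone(&stored));

    let entry = get_entry(&map, (1, 7)).expect("entry present");
    assert!(Arc::ptr_eq(&entry, &stored)); // one shared allocation
    assert_eq!(entry.converted_program.len(), 4);
}
```

The sharing boundary is now the single outer `Arc`, so "the lock protects the map" and "the entry is immutable shared data" are cleanly separated.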
If wanted, this can be refactored in lib.rs.
User: No I think what you should do is rather do Arc<RwLock<HashMap<(ContextId, GraphId), Arc<Mutex>>>>; and remove the inner arcs.
User: also import everything top of file (isn't that clear from the guidance?)
Assistant:
Changes:
- Normalized imports at the top of lib.rs.
- Replaced remaining fully-qualified `std::sync` and `rustnn::graph` uses with imported names.
- Kept the cache as `Arc<RwLock<HashMap<(ContextId, GraphId), Arc<Mutex<MlCacheEntry>>>>>` in lib.rs.
- Added a component note in README.md to keep shared/sync types imported at module scope.
Validation:
`./mach check` passed.
Made changes.
User:
- if the lock is poisoned, go into error handling mode (see existing code that zeroes outputs and so on)
- instead of read/write map helpers, just have a `get_cache_entry` helper that internally locks the map, clones the entry, and then returns an optional guard to it (none if errors). If not found, assert_debug. Writing an entry can also be done in a single helper call.
- I think we now also have introduced a race condition, because if a compute request comes in while the compilation is ongoing, then the compute will fail because the cache is empty.
So we need to turn MlCacheEntry into an enum, with one Compiling and a Compiled variant, where compiled is basically the current struct, and we need to associate the inner mutex with a condvar, so that the compute requests waits on the condvar until compilation is finished. Actually, we also need a Destroyed variant, for now only set it when compilation fails. If the compute sees that, it goes into error handling mode.
Assistant:
What changed:
- Replaced the cache payload with an `MlCacheEntry` state enum in lib.rs: `Compiling`, `Compiled`, `Destroyed`.
- Changed cache storage to a per-entry `Arc<(Mutex<MlCacheEntry>, Condvar)>` behind the shared map in lib.rs.
- Removed the separate map read/write guard helpers and replaced them with:
  - `get_cache_entry()` in lib.rs
  - `insert_cache_entry()` in lib.rs
- Made poisoned map/mutex/condvar paths enter compute error handling mode and emit zero/empty output results instead of continuing normally.
- Inserted a `Compiling` cache entry before dispatching async compile work in lib.rs.
- Made compute wait on the entry `Condvar` while the state is `Compiling` in lib.rs.
- On compile failure, transition the entry to `Destroyed`, notify waiters, and let later compute requests fall into error handling mode in lib.rs.
One note: `get_cache_entry()` returns the cloned shared entry handle, not a `MutexGuard` directly. With standard-library guard lifetimes, returning a guard after cloning the entry out of the map is awkward; this keeps the single map-read helper while still dropping the map lock immediately.
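A sketch of the waiting shape this describes, using std `Mutex`/`Condvar` (variant payloads and function names are placeholders, not the real lib.rs code):

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

enum MlCacheEntry {
    Compiling,
    Compiled { converted_program: Vec<u8> },
    Destroyed,
}

type Entry = Arc<(Mutex<MlCacheEntry>, Condvar)>;

// Compute-side wait: block while Compiling; Destroyed (or a poisoned
// lock) routes into error handling mode.
fn wait_for_compiled(entry: &Entry) -> Result<Vec<u8>, String> {
    let (lock, cvar) = &**entry;
    let guard = lock.lock().map_err(|_| "poisoned".to_string())?;
    // wait_while re-checks the predicate before sleeping, so a notify
    // that fires before we wait cannot be lost.
    let guard = cvar
        .wait_while(guard, |e| matches!(e, MlCacheEntry::Compiling))
        .map_err(|_| "poisoned".to_string())?;
    match &*guard {
        MlCacheEntry::Compiled { converted_program } => Ok(converted_program.clone()),
        MlCacheEntry::Destroyed => Err("compilation failed".to_string()),
        MlCacheEntry::Compiling => unreachable!("wait_while returned during Compiling"),
    }
}

fn main() {
    let entry: Entry = Arc::new((Mutex::new(MlCacheEntry::Compiling), Condvar::new()));
    let publisher = Arc::clone(&entry);
    // Compile task: publish the result, then wake all waiting computes.
    let t = thread::spawn(move || {
        let (lock, cvar) = &*publisher;
        *lock.lock().unwrap() = MlCacheEntry::Compiled { converted_program: vec![1, 2, 3] };
        cvar.notify_all();
    });
    assert_eq!(wait_for_compiled(&entry).unwrap(), vec![1, 2, 3]);
    t.join().unwrap();
}
```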
Validation:
`./mach check` passed.
User: Ok remove those helper functions, just do the locking right inside the compile and compute stuff, and yes you can switch to parking lot.
Also remove the seen_graphs thing, replace that with a compilation task waiting for the entry to be compiled(or destroyed), and then sending a response but not doing compilation if it already started.
Assistant:
Updated:
- lib.rs
- Cargo.toml
- README.md
What changed:
- Switched the ML cache locking to `parking_lot` in lib.rs and added the dependency in Cargo.toml.
- Removed `seen_graphs` from `Context` in lib.rs and from `get_or_compile()` logic in lib.rs.
- Removed the cache helper functions entirely.
- Inlined cache map/entry locking directly inside `handle_ml_compute()` and `ml_loop()`/`handle_ml_compile()` in lib.rs.
- Changed `MlCacheEntry` to use: `Compiling`, `Compiled { ... }`, `Destroyed(String)`.
- Made duplicate compile requests wait on the existing cache entry instead of starting a second compile in lib.rs.
- Kept compute waiting on `Compiling`, and treating `Destroyed(...)` as error-mode output handling in lib.rs.
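A sketch of the claim-or-wait decision that replaces `seen_graphs`. It uses std primitives so it is self-contained; with `parking_lot` the shape is the same, except `lock()` returns the guard directly (no poisoning) and `Condvar::wait` takes `&mut` guard. All names here are placeholders:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Condvar, Mutex};

enum MlCacheEntry {
    Compiling,
    Compiled { converted_program: Vec<u8> },
    Destroyed(String),
}

type Entry = Arc<(Mutex<MlCacheEntry>, Condvar)>;
type CompiledMap = Arc<Mutex<HashMap<u32, Entry>>>;

// Returns the entry plus whether this caller should actually compile.
// A duplicate request finds an existing entry, so it only waits on it
// (until Compiled or Destroyed) and then sends its response.
fn claim_compile(map: &CompiledMap, graph: u32) -> (Entry, bool) {
    let mut cache = map.lock().expect("poisoned map lock");
    if let Some(existing) = cache.get(&graph) {
        (Arc::clone(existing), false)
    } else {
        let fresh: Entry = Arc::new((Mutex::new(MlCacheEntry::Compiling), Condvar::new()));
        cache.insert(graph, Arc::clone(&fresh));
        (fresh, true)
    }
}

fn main() {
    let map: CompiledMap = Arc::new(Mutex::new(HashMap::new()));
    let (first, should_compile) = claim_compile(&map, 42);
    assert!(should_compile); // first request runs the compilation

    let (second, runs_compile) = claim_compile(&map, 42);
    assert!(!runs_compile); // duplicate just waits on the shared entry
    assert!(Arc::ptr_eq(&first, &second));
}
```

Because the check and the `Compiling` insert happen under one map lock, two racing compile requests can never both decide to compile.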
Validation:
`./mach check` passed.
Made changes.
User: The whole should_compile logic should also move to inside handle_ml_compile (since all logic for compute is also inside the equivalent function)
Assistant: Made changes.
User: Ok can you confirm logically that we indeed had a race condition before making the changes that added the enum with the Compiling variant?
Assistant: Yes.
Before the MlCacheEntry::Compiling state was introduced, there was a real race.
The problematic sequence was:
- Manager sends `MlMsg::Compile` for graph `g`.
- ML side schedules compile work, but the cache entry is not inserted until compilation finishes.
- Before compile finishes, manager sends `MlMsg::Compute` for the same graph `g`.
- Compute handler looks in the cache map.
- Cache lookup misses, because compile has started but has not yet published anything.
- Compute falls into the "missing cache entry" path and returns zero/empty outputs.
That is a classic check-before-publication race:
- compile work was in progress,
- but the shared state did not yet represent “in progress”,
- so compute could incorrectly conclude “not present”.
In other words, the cache used to encode only:
- “compiled”
- “not compiled / not present”
but it needed a third state:
- “currently compiling”
Adding `Compiling` fixed exactly that gap, and the condvar made compute block until the state transitions to either:
- `Compiled`, or
- `Destroyed`.
So yes, logically, the race existed before that change.
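The miss can be replayed deterministically with a two-state cache (entry present = compiled), a simplified single-threaded model of the pre-`Compiling` code; names and types are illustrative only:

```rust
use std::collections::HashMap;

// Two-state cache: an entry exists only once compilation has finished.
fn compute(cache: &HashMap<u32, Vec<u8>>, graph: u32) -> Result<Vec<u8>, &'static str> {
    match cache.get(&graph) {
        Some(program) => Ok(program.clone()),
        // Nothing distinguishes "never compiled" from "still compiling",
        // so an in-flight compile still lands in the error path.
        None => Err("missing cache entry: zeroed outputs"),
    }
}

fn main() {
    let mut cache: HashMap<u32, Vec<u8>> = HashMap::new();
    let g = 7;

    // 1. Compile for graph g is scheduled; nothing is published yet.
    // 2. Compute for g arrives before the compile finishes:
    assert!(compute(&cache, g).is_err()); // wrongly treated as absent

    // 3. Compile finishes and publishes, too late for the compute above.
    cache.insert(g, vec![1, 2, 3]);
    assert!(compute(&cache, g).is_ok());
}
```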