User: Use a rayon::ThreadPool to parallelize the handling of MlMsg::Compute and MlMsg::Compile, shared compiled_map in an arc std rw lock, clone manager_tx.
Assistant: Completed: Refactor ML worker parallelism (2/3)
Loaded AGENTS.md and README.md.
Implemented in lib.rs and Cargo.toml:
- Added `rayon` to the `webnn` crate.
- Reworked `ml_loop()` to use a `rayon::ThreadPool`.
- Moved the ML cache to `Arc<RwLock<HashMap<(ContextId, GraphId), MlCacheEntry>>>`.
- Split `MlMsg::Compute` and `MlMsg::Compile` handling into parallel helpers.
- Cloned `manager_tx` per pool task so each task can report `WebNNMsg` results independently.
- Changed `MlCacheEntry` to hold shared `Arc` data so compute tasks can read cache entries without holding the lock during execution.
- Updated shutdown to always join the ML dispatcher thread and wait for in-flight pool work.
Validation:
`./mach check` passed.
Made changes.
User: why did you put Arc around members for MLCacheEntry? The cache itself is shared with arc, and the members are not cloned, are they?
Assistant: That cloned() clones the whole MlCacheEntry. Because graph_info and converted_program are wrapped in Arc, that clone is shallow for those fields: it increments refcounts instead of deep-copying the GraphInfo and Vec<u8>.
So the reason for the inner Arcs was:
- release the `RwLock` quickly,
- move an owned cache entry into the rayon task logic,
- avoid deep-cloning large cached data each time a compute task reads it.

Without those inner `Arc`s, `.cloned()` would deep-clone `GraphInfo` and `Vec<u8>` on every compute lookup.
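A minimal sketch of that difference, with a `Vec<u8>` standing in for the cached program data (the helper name is made up for illustration):

```rust
use std::sync::Arc;

// Returns (same_allocation, refcount) observed after a shallow Arc clone.
fn shallow_clone_stats(data: &Arc<Vec<u8>>) -> (bool, usize) {
    let clone = Arc::clone(data); // bumps the refcount, copies no bytes
    (Arc::ptr_eq(data, &clone), Arc::strong_count(data))
}

fn main() {
    let converted_program = Arc::new(vec![0u8; 1024]);
    let (same_alloc, count) = shallow_clone_stats(&converted_program);
    assert!(same_alloc); // both handles point at one allocation
    assert_eq!(count, 2); // only the refcount changed

    // A plain Vec clone, by contrast, copies every byte:
    let deep: Vec<u8> = (*converted_program).clone();
    assert_eq!(deep.len(), 1024); // separate 1 KiB allocation
}
```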
That said, the shape is not the only option. A cleaner alternative is:
- store `Arc<MlCacheEntry>` in the map,
- then clone just the outer entry pointer from the map.
That would avoid wrapping individual members and make the ownership model clearer.
User: but why do you clone the return value of read_compiled_map?
Assistant: It is not cloning the return value of read_compiled_map(). It is cloning the map entry fetched through the guard in lib.rs:
read_compiled_map(&compiled_map)returns anRwLockReadGuard.get(&(ctx_id, key))returnsOption<&MlCacheEntry>.cloned()turns that intoOption<MlCacheEntry>
The clone is there to break the borrow from the lock guard before the expensive compute starts.
Without `.cloned()`, `entry` would be a reference tied to the `RwLockReadGuard`, so one of two bad things would happen:
- the read lock would need to stay held across the whole compute, blocking writers like `Compile`, or
- Rust would reject using the borrowed entry after the temporary guard is dropped.
So `.cloned()` is used to:
- copy out an owned snapshot of the cache entry,
- release the read lock immediately,
- let compute proceed without holding the lock.
Because lib.rs wraps the heavy fields in `Arc`, that clone is cheap for `graph_info` and `converted_program`.
If preferred, this can be made clearer by storing `Arc<MlCacheEntry>` in the map and cloning that pointer instead.
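A minimal sketch of that clone-out-of-the-guard pattern, with hypothetical stand-ins for the types in lib.rs (`u32` keys, `String`/`Vec<u8>` payloads):

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

// Stand-in for the real cache entry; the heavy fields are wrapped in
// Arc, so cloning the entry is shallow for them.
#[derive(Clone)]
struct MlCacheEntry {
    graph_info: Arc<String>,
    converted_program: Arc<Vec<u8>>,
}

type CompiledMap = Arc<RwLock<HashMap<(u32, u32), MlCacheEntry>>>;

// Clone the entry out through the read guard so the lock is released
// before any expensive compute work begins.
fn get_entry(map: &CompiledMap, key: (u32, u32)) -> Option<MlCacheEntry> {
    map.read().expect("poisoned lock").get(&key).cloned()
    // The temporary RwLockReadGuard is dropped at the end of this
    // expression; the returned entry owns only Arc handles.
}

fn main() {
    let map: CompiledMap = Arc::new(RwLock::new(HashMap::new()));
    map.write().unwrap().insert(
        (1, 7),
        MlCacheEntry {
            graph_info: Arc::new("graph".into()),
            converted_program: Arc::new(vec![0u8; 4]),
        },
    );
    let entry = get_entry(&map, (1, 7)).expect("entry present");
    // The lock is free again: a writer (e.g. Compile) is not blocked
    // while compute keeps using `entry`.
    assert!(map.try_write().is_ok());
    assert_eq!(entry.converted_program.len(), 4);
}
```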
User: Ok I see.
But it's kind of weird to clone the arc inside a structure that is protected by a lock, and then to drop the lock and keep using the arc. It compiles, but logically it's weird and could lead to unexpected behavior
Assistant: Agreed.
It is memory-safe, but the ownership model is awkward. The current pattern in lib.rs effectively says:
- the map is protected by `RwLock`
- each entry also has independently shared lifetime via inner `Arc`s
- compute reads an entry, detaches from the map, and keeps using the detached data
That is valid for immutable cache data, but it weakens the mental model of “the lock protects the cache contents”. The lock ends up protecting the map structure more than the entry data.
A clearer design is:
- store `HashMap<(ContextId, GraphId), Arc<MlCacheEntry>>`
- keep `MlCacheEntry` fields plain, without inner `Arc`s
- clone the `Arc<MlCacheEntry>` out of the map
- drop the lock
- use the entry through that single outer `Arc`
That preserves the same performance property, but makes the sharing boundary explicit and easier to reason about.
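A sketch of that alternative shape, again with assumed placeholder key and field types:

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

// Plain fields, no inner Arcs: sharing happens only through the outer Arc.
struct MlCacheEntry {
    graph_info: String,
    converted_program: Vec<u8>,
}

type CompiledMap = Arc<RwLock<HashMap<(u32, u32), Arc<MlCacheEntry>>>>;

// Cloning Arc<MlCacheEntry> bumps one refcount; the read guard drops
// at the end of the expression, so the lock never outlives the lookup.
fn get_entry(map: &CompiledMap, key: (u32, u32)) -> Option<Arc<MlCacheEntry>> {
    map.read().expect("poisoned lock").get(&key).cloned()
}

fn main() {
    let map: CompiledMap = Arc::new(RwLock::new(HashMap::new()));
    let stored = Arc::new(MlCacheEntry {
        graph_info: "graph".into(),
        converted_program: vec![0u8; 4],
    });
    map.write().unwrap().insert((1, 7), Arc::clone(&stored));

    let entry = get_entry(&map, (1, 7)).expect("entry present");
    assert!(Arc::ptr_eq(&entry, &stored)); // one shared allocation
    assert_eq!(entry.converted_program.len(), 4);
}
```

The sharing boundary is now the single outer `Arc`, so "the lock protects the map" and "the entry is immutable shared data" are cleanly separated.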
If wanted, this can be refactored in lib.rs.
User: No I think what you should do is rather do Arc<RwLock<HashMap<(ContextId, GraphId), Arc<Mutex>>>>; and remove the inner arcs.
User: also import everything top of file (isn't that clear from the guidance?)
Assistant:
Changes:
- Normalized imports at the top of lib.rs.
- Replaced remaining fully-qualified `std::sync` and `rustnn::graph` uses with imported names.
- Kept the cache as `Arc<RwLock<HashMap<(ContextId, GraphId), Arc<Mutex<MlCacheEntry>>>>>` in lib.rs.
- Added a component note in README.md to keep shared/sync types imported at module scope.
Validation:
`./mach check` passed.
Made changes.
User:
- if the lock is poisoned, go into error handling mode (see existing code that zeroes outputs and so on)
- instead of read/write map helpers, just have a `get_cache_entry` helper that internally locks the map, clones the entry, and then returns an optional guard to it (none if errors). If not found, assert_debug. Writing an entry can also be done in a single helper call.
- I think we now also have introduced a race condition, because if a compute request comes in while the compilation is ongoing, then the compute will fail because the cache is empty.
So we need to turn MlCacheEntry into an enum, with one Compiling and a Compiled variant, where compiled is basically the current struct, and we need to associate the inner mutex with a condvar, so that the compute requests waits on the condvar until compilation is finished. Actually, we also need a Destroyed variant, for now only set it when compilation fails. If the compute sees that, it goes into error handling mode.
Assistant:
What changed:
- Replaced the cache payload with an `MlCacheEntry` state enum in lib.rs: `Compiling`, `Compiled`, `Destroyed`.
- Changed cache storage to a per-entry `Arc<(Mutex<MlCacheEntry>, Condvar)>` behind the shared map in lib.rs.
- Removed the separate map read/write guard helpers and replaced them with:
  - `get_cache_entry()` in lib.rs
  - `insert_cache_entry()` in lib.rs
- Made poisoned map/mutex/condvar paths enter compute error handling mode and emit zero/empty output results instead of continuing normally.
- Inserted a `Compiling` cache entry before dispatching async compile work in lib.rs.
- Made compute wait on the entry `Condvar` while the state is `Compiling` in lib.rs.
- On compile failure, transition the entry to `Destroyed`, notify waiters, and let later compute requests fall into error handling mode in lib.rs.
One note: `get_cache_entry()` returns the cloned shared entry handle, not a `MutexGuard` directly. With standard-library guard lifetimes, returning a guard after cloning the entry out of the map is awkward; this keeps the single map-read helper while still dropping the map lock immediately.
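A sketch of the waiting shape this describes, using std `Mutex`/`Condvar` (variant payloads and function names are placeholders, not the real lib.rs code):

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

enum MlCacheEntry {
    Compiling,
    Compiled { converted_program: Vec<u8> },
    Destroyed,
}

type Entry = Arc<(Mutex<MlCacheEntry>, Condvar)>;

// Compute-side wait: block while Compiling; Destroyed (or a poisoned
// lock) routes into error handling mode.
fn wait_for_compiled(entry: &Entry) -> Result<Vec<u8>, String> {
    let (lock, cvar) = &**entry;
    let guard = lock.lock().map_err(|_| "poisoned".to_string())?;
    // wait_while re-checks the predicate before sleeping, so a notify
    // that fires before we wait cannot be lost.
    let guard = cvar
        .wait_while(guard, |e| matches!(e, MlCacheEntry::Compiling))
        .map_err(|_| "poisoned".to_string())?;
    match &*guard {
        MlCacheEntry::Compiled { converted_program } => Ok(converted_program.clone()),
        MlCacheEntry::Destroyed => Err("compilation failed".to_string()),
        MlCacheEntry::Compiling => unreachable!("wait_while returned during Compiling"),
    }
}

fn main() {
    let entry: Entry = Arc::new((Mutex::new(MlCacheEntry::Compiling), Condvar::new()));
    let publisher = Arc::clone(&entry);
    // Compile task: publish the result, then wake all waiting computes.
    let t = thread::spawn(move || {
        let (lock, cvar) = &*publisher;
        *lock.lock().unwrap() = MlCacheEntry::Compiled { converted_program: vec![1, 2, 3] };
        cvar.notify_all();
    });
    assert_eq!(wait_for_compiled(&entry).unwrap(), vec![1, 2, 3]);
    t.join().unwrap();
}
```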
Validation:
`./mach check` passed.
User: Ok remove those helper functions, just do the locking right inside the compile and compute stuff, and yes you can switch to parking lot.
Also remove the seen_graphs thing, replace that with a compilation task waiting for the entry to be compiled(or destroyed), and then sending a response but not doing compilation if it already started.
Assistant:
Updated:
- lib.rs
- Cargo.toml
- README.md
What changed:
- Switched the ML cache locking to `parking_lot` in lib.rs and added the dependency in Cargo.toml.
- Removed `seen_graphs` from `Context` in lib.rs and from `get_or_compile()` logic in lib.rs.
- Removed the cache helper functions entirely.
- Inlined cache map/entry locking directly inside `handle_ml_compute()` and `ml_loop()`/`handle_ml_compile()` in lib.rs.
- Changed `MlCacheEntry` to use: `Compiling`, `Compiled { ... }`, `Destroyed(String)`.
- Made duplicate compile requests wait on the existing cache entry instead of starting a second compile in lib.rs.
- Kept compute waiting on `Compiling`, and treating `Destroyed(...)` as error-mode output handling in lib.rs.
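A sketch of the claim-or-wait decision that replaces `seen_graphs`. It uses std primitives so it is self-contained; with `parking_lot` the shape is the same, except `lock()` returns the guard directly (no poisoning) and `Condvar::wait` takes `&mut` guard. All names here are placeholders:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Condvar, Mutex};

enum MlCacheEntry {
    Compiling,
    Compiled { converted_program: Vec<u8> },
    Destroyed(String),
}

type Entry = Arc<(Mutex<MlCacheEntry>, Condvar)>;
type CompiledMap = Arc<Mutex<HashMap<u32, Entry>>>;

// Returns the entry plus whether this caller should actually compile.
// A duplicate request finds an existing entry, so it only waits on it
// (until Compiled or Destroyed) and then sends its response.
fn claim_compile(map: &CompiledMap, graph: u32) -> (Entry, bool) {
    let mut cache = map.lock().expect("poisoned map lock");
    if let Some(existing) = cache.get(&graph) {
        (Arc::clone(existing), false)
    } else {
        let fresh: Entry = Arc::new((Mutex::new(MlCacheEntry::Compiling), Condvar::new()));
        cache.insert(graph, Arc::clone(&fresh));
        (fresh, true)
    }
}

fn main() {
    let map: CompiledMap = Arc::new(Mutex::new(HashMap::new()));
    let (first, should_compile) = claim_compile(&map, 42);
    assert!(should_compile); // first request runs the compilation

    let (second, runs_compile) = claim_compile(&map, 42);
    assert!(!runs_compile); // duplicate just waits on the shared entry
    assert!(Arc::ptr_eq(&first, &second));
}
```

Because the check and the `Compiling` insert happen under one map lock, two racing compile requests can never both decide to compile.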
Validation:
`./mach check` passed.
Made changes.
User: The whole should_compile logic should also move to inside handle_ml_compile (since all logic for compute is also inside the equivalent function)
Assistant: Made changes.
User: Ok can you confirm logically that we indeed had a race condition before making the changes that added the enum with the Compiling variant?
Assistant: Yes.
Before the MlCacheEntry::Compiling state was introduced, there was a real race.
The problematic sequence was:
- Manager sends `MlMsg::Compile` for graph `g`.
- ML side schedules compile work, but the cache entry is not inserted until compilation finishes.
- Before compile finishes, manager sends `MlMsg::Compute` for the same graph `g`.
- Compute handler looks in the cache map.
- Cache lookup misses, because compile has started but has not yet published anything.
- Compute falls into the "missing cache entry" path and returns zero/empty outputs.
That is a classic check-before-publication race:
- compile work was in progress,
- but the shared state did not yet represent “in progress”,
- so compute could incorrectly conclude “not present”.
In other words, the cache used to encode only:
- “compiled”
- “not compiled / not present”
but it needed a third state:
- “currently compiling”
Adding `Compiling` fixed exactly that gap, and the condvar made compute block until the state transitions to either:
- `Compiled`, or
- `Destroyed`.
So yes, logically, the race existed before that change.
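The miss can be replayed deterministically with a two-state cache (entry present = compiled), a simplified single-threaded model of the pre-`Compiling` code; names and types are illustrative only:

```rust
use std::collections::HashMap;

// Two-state cache: an entry exists only once compilation has finished.
fn compute(cache: &HashMap<u32, Vec<u8>>, graph: u32) -> Result<Vec<u8>, &'static str> {
    match cache.get(&graph) {
        Some(program) => Ok(program.clone()),
        // Nothing distinguishes "never compiled" from "still compiling",
        // so an in-flight compile still lands in the error path.
        None => Err("missing cache entry: zeroed outputs"),
    }
}

fn main() {
    let mut cache: HashMap<u32, Vec<u8>> = HashMap::new();
    let g = 7;

    // 1. Compile for graph g is scheduled; nothing is published yet.
    // 2. Compute for g arrives before the compile finishes:
    assert!(compute(&cache, g).is_err()); // wrongly treated as absent

    // 3. Compile finishes and publishes, too late for the compute above.
    cache.insert(g, vec![1, 2, 3]);
    assert!(compute(&cache, g).is_ok());
}
```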