lilith/safe-parallel-gist.md

## safe-parallel-gist.md

      
How zenjxl-decoder achieves #![forbid(unsafe_code)] in parallel decoding

The problem

JPEG XL decoding is embarrassingly parallel at the tile level. A 4K image has dozens of independent tiles that can decode simultaneously. The hard part is the output: all those tiles write to the same output buffer, and Rust's borrow checker won't let multiple threads hold &mut references to the same buffer — even if they write to non-overlapping regions.
We wanted full tile-level parallelism with #![forbid(unsafe_code)].
The core insight

If you split a buffer into disjoint pieces before handing them to threads, Rust's type system guarantees safety for free. No raw pointers, no unsafe impl Send, no runtime checks. The challenge is doing this without copying.
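A minimal sketch of the idea, not code from zenjxl-decoder: `fill_halves` is a hypothetical helper, and `std::thread::scope` from the standard library stands in for Rayon so the example has no dependencies.

```rust
use std::thread;

/// Fill the two halves of `buf` from two threads at once.
/// `split_at_mut` returns two non-overlapping `&mut [u8]`,
/// so each thread owns its half exclusively.
fn fill_halves(buf: &mut [u8], left_value: u8, right_value: u8) {
    let mid = buf.len() / 2;
    let (left, right) = buf.split_at_mut(mid);

    // Scoped threads may borrow from the caller's stack frame;
    // `scope` joins both threads before it returns.
    thread::scope(|s| {
        s.spawn(move || left.fill(left_value));
        s.spawn(move || right.fill(right_value));
    });
}
```

Both threads write straight into the caller's buffer. Nothing is copied, and the borrow checker accepts it because the buffer was split before the threads were spawned.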
BufferStorage: two representations, one interface

```rust
pub(crate) enum BufferStorage<'a> {
    Contiguous {
        data: &'a mut [u8],
        bytes_between_rows: usize,
    },
    Fragmented {
        rows: Vec<&'a mut [u8]>,
    },
}
```
A Contiguous buffer is the normal case: one flat allocation, rows at fixed stride. A Fragmented buffer is a collection of per-row mutable slices that may come from different locations in memory. Both expose the same row_mut(row) -> &mut [u8] interface.
Why two? When you split a buffer by rows, the sub-buffers are still contiguous (each is a sub-slice of the original). When you split by columns, they're not: columns 0-255 of row 0 and columns 0-255 of row 1 aren't adjacent in a row-major buffer. So column fragments use Fragmented storage: a Vec of per-row slices, each pointing into the original buffer.
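As a rough illustration of how the two representations can share one accessor, here is a simplified copy of the enum with a possible `row_mut`. The method body and the `split_columns` helper are assumptions made for this sketch; in particular, treating a contiguous row as `bytes_between_rows` bytes long is a simplification, and the real crate may track row width separately.

```rust
/// Simplified copy of the enum above, for illustration only.
enum BufferStorage<'a> {
    Contiguous { data: &'a mut [u8], bytes_between_rows: usize },
    Fragmented { rows: Vec<&'a mut [u8]> },
}

impl<'a> BufferStorage<'a> {
    /// Return the bytes of one row, whichever representation is in use.
    fn row_mut(&mut self, row: usize) -> &mut [u8] {
        match self {
            BufferStorage::Contiguous { data, bytes_between_rows } => {
                // Assumption: a row spans one full stride.
                let start = row * *bytes_between_rows;
                let end = (start + *bytes_between_rows).min(data.len());
                &mut data[start..end]
            }
            BufferStorage::Fragmented { rows } => &mut *rows[row],
        }
    }
}

/// Hypothetical helper: split a row-major buffer at byte column `col`.
/// The two results are not contiguous in memory, so each one is a list
/// of per-row slices pointing into the original allocation.
fn split_columns<'a>(
    data: &'a mut [u8],
    bytes_between_rows: usize,
    col: usize,
) -> (BufferStorage<'a>, BufferStorage<'a>) {
    let mut left = Vec::new();
    let mut right = Vec::new();
    for row in data.chunks_mut(bytes_between_rows) {
        let (l, r) = row.split_at_mut(col);
        left.push(l);
        right.push(r);
    }
    (
        BufferStorage::Fragmented { rows: left },
        BufferStorage::Fragmented { rows: right },
    )
}
```

Callers only ever see `row_mut`, so rendering code does not need to know whether it was handed a whole buffer, a row band, or a column fragment.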
split_into_tile_grid: the key decomposition

```rust
pub(crate) fn split_into_tile_grid(
    &mut self,
    split_rows: &[usize],             // row boundaries between bands
    split_cols_per_band: &[&[usize]],  // column boundaries per band
) -> Vec<Vec<JxlOutputBuffer<'_>>>
```
This splits a single output buffer into a 2D grid of fragments — rows split into bands, then each band split into column ranges. Every fragment is a JxlOutputBuffer with exclusive ownership of its region.
The implementation is a chain of split_at_mut() calls. split_at_mut is Rust's primitive for dividing one &mut [T] into two non-overlapping &mut [T]s. By chaining these, we decompose the entire buffer into arbitrarily many disjoint fragments, each provably non-aliasing at compile time.
The result: each tile gets its own JxlOutputBuffer fragment. Rayon processes all tiles in parallel. Each thread writes directly to the final output — no temporary buffers, no copy-back phase.
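The chaining can be sketched on a plain byte buffer. This is not the crate's implementation: `split_grid` and `Fragment` are hypothetical names, the column boundaries are shared by every band (the real signature takes them per band), and boundaries are assumed to be sorted and in range.

```rust
/// One tile's view of the output: one mutable slice per row.
type Fragment<'a> = Vec<&'a mut [u8]>;

/// Split a row-major byte buffer into bands of rows, then split every
/// band into column fragments. `split_rows` lists the rows where a new
/// band starts; `split_cols` lists the byte columns where a new
/// fragment starts.
fn split_grid<'a>(
    mut data: &'a mut [u8],
    stride: usize,
    split_rows: &[usize],
    split_cols: &[usize],
) -> Vec<Vec<Fragment<'a>>> {
    let total_rows = data.len() / stride;
    let mut grid = Vec::new();
    let mut band_start = 0;
    // Visit every band boundary, then the end of the buffer for the last band.
    for &band_end in split_rows.iter().chain(std::iter::once(&total_rows)) {
        // Peel this band's rows off the front of the remaining buffer.
        let (band, rest) = data.split_at_mut((band_end - band_start) * stride);
        data = rest;

        let mut fragments: Vec<Fragment<'a>> =
            (0..=split_cols.len()).map(|_| Vec::new()).collect();
        for mut row in band.chunks_mut(stride) {
            let mut col_start = 0;
            for (i, &col_end) in split_cols.iter().enumerate() {
                // Peel one column range off the front of the row.
                let (head, tail) = row.split_at_mut(col_end - col_start);
                fragments[i].push(head);
                row = tail;
                col_start = col_end;
            }
            // Whatever is left belongs to the last fragment of the band.
            fragments[split_cols.len()].push(row);
        }
        grid.push(fragments);
        band_start = band_end;
    }
    grid
}
```

Every slice in the result came out of a `split_at_mut` or `chunks_mut` call, so no two fragments can refer to the same byte.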
The parallel render path

Before spawning parallel work:

1. Group tiles by gy (row band) and sort by gx (column) within each band
2. Compute row split points at band boundaries and column split points at tile boundaries
3. Call split_into_tile_grid() on each output channel buffer
4. Move each fragment to its corresponding tile via Option::take()
5. Track column offsets for coordinate mapping (needed when progressive decoding delivers partial tile batches)

Then par_iter_mut() over the fragments. Each thread renders its tile directly into the output. Zero copies.
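The hand-off and the parallel loop might look roughly like this. `TileJob` and `render_all` are hypothetical, "rendering" is reduced to filling bytes, tiles are simple fixed-length chunks rather than a 2D grid, and scoped threads from the standard library stand in for Rayon.

```rust
use std::thread;

/// Hypothetical tile awaiting render: the value it will write and,
/// once assigned, the output fragment it owns.
struct TileJob<'a> {
    value: u8,
    output: Option<&'a mut [u8]>,
}

/// Hand each tile its fragment, then render all tiles concurrently.
fn render_all(out: &mut [u8], tile_len: usize, values: &[u8]) {
    // Split first: every fragment is a disjoint `&mut [u8]`.
    let mut fragments: Vec<Option<&mut [u8]>> =
        out.chunks_mut(tile_len).map(Some).collect();

    // Move each fragment out of its slot with `Option::take`.
    let mut jobs: Vec<TileJob<'_>> = values
        .iter()
        .zip(fragments.iter_mut())
        .map(|(&value, slot)| TileJob { value, output: slot.take() })
        .collect();

    // With Rayon this loop would be a `par_iter_mut()` over `jobs`.
    thread::scope(|s| {
        for job in jobs.iter_mut() {
            s.spawn(move || {
                if let Some(buf) = job.output.as_deref_mut() {
                    buf.fill(job.value); // writes land in `out` directly
                }
            });
        }
    });
}
```

`Option::take` leaves `None` behind in the slot, which lets a fragment be moved out of a shared collection without cloning or unsafe code.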
Two-phase fallback

When tiles can't be cleanly band-split (e.g., a single-band image), we fall back to allocating per-tile Vec<u8> buffers, rendering in parallel, then copying back sequentially.
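A sketch of that two-phase shape, under the same simplifications as before: `render_two_phase` is a hypothetical function, each tile is described by an assumed `(offset, length, value)` triple, and scoped threads replace Rayon.

```rust
use std::thread;

/// Two-phase render: tiles are rendered into private buffers in
/// parallel, then copied into `out` one after another.
/// Each tile is (byte offset in `out`, length, fill value).
fn render_two_phase(out: &mut [u8], tiles: &[(usize, usize, u8)]) {
    // Phase 1: every thread fills a freshly allocated Vec<u8>.
    let rendered: Vec<Vec<u8>> = thread::scope(|s| {
        let handles: Vec<_> = tiles
            .iter()
            .map(|&(_, len, value)| s.spawn(move || vec![value; len]))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("tile thread panicked"))
            .collect()
    });

    // Phase 2: sequential copy-back, so only one `&mut` to `out` is live.
    for (&(offset, len, _), tile) in tiles.iter().zip(&rendered) {
        out[offset..offset + len].copy_from_slice(tile);
    }
}
```

The extra allocation and the serial copy are the costs that the direct-write path avoids.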
Coordinate mapping for progressive decoding

In progressive/chunked decoding, a batch might contain only tiles gx=3..6 of a row, not gx=0..2. The first fragment covers columns [0, col_start_of_gx3), but tile gx=3 needs to write starting at its column offset within that fragment. We track item_col_offsets per fragment and adjust: rect.origin.0 = tile_origin.0 - fragment_col_offset.
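The adjustment itself is a single subtraction. The `Rect` type and the function name below are assumptions for illustration; only the formula comes from the text above.

```rust
/// Hypothetical rectangle type: `origin` is (column, row).
#[derive(Debug, Clone, Copy, PartialEq)]
struct Rect {
    origin: (usize, usize),
    size: (usize, usize),
}

/// Translate a tile rectangle from image coordinates into the
/// coordinates of the fragment that contains it.
fn to_fragment_coords(mut rect: Rect, fragment_col_offset: usize) -> Rect {
    rect.origin.0 = rect
        .origin
        .0
        .checked_sub(fragment_col_offset)
        .expect("tile starts left of its fragment");
    rect
}
```

For example, with 256-pixel tiles, a tile at gx=3 starts at column 768. Inside a fragment that starts at column 0 it keeps origin 768, while inside a fragment that starts at column 512 its origin becomes 256.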
Performance

All numbers are from 4K JPEG XL images decoded with 8 threads, best of 3 repetitions after 1 warmup, on the same machine in the same session (WSL2, 32 logical cores). The table compares the safe fragment implementation against the previous unsafe baseline, which used raw pointers for shared output access.


| Image | Unsafe baseline (MP/s) | Safe fragments (MP/s) | Change |
|---|---|---|---|
| city_4k_q75 | 209.2 | 206.8 | -1.1% |
| city_4k_q90 | 206.5 | 205.2 | -0.7% |
| forest_4k_q90 | 220.5 | 216.1 | -2.0% |
| landscape_4k_q90 | 231.7 | 231.4 | -0.1% |
| portrait_4k_q75 | 244.3 | 249.1 | +2.0% |
| portrait_4k_q90 | 239.2 | 241.8 | +1.1% |
| Average | | | -0.1% |


Portrait images are faster with the safe approach. Wide images show a small regression from the enum dispatch in row_mut (one extra branch per row access). On average: effectively zero cost.
We tried several approaches before landing here:

- Mutex + copy-back: -13.7% average
- Band-split only (no column fragments): -10.2%
- Hybrid (direct-write for tall, two-phase for wide): -4.0%
- Fragment-based (final): -0.1%

What changed in Cargo.toml

```toml
# Before
threads = ["rayon", "allow-unsafe"]

# After
threads = ["rayon"]
```
The allow-unsafe feature existed solely because the parallel output path needed unsafe. With fragments, it doesn't. The crate now compiles with #![forbid(unsafe_code)] regardless of whether threading is enabled.
The takeaway

The key is decomposing your data before parallelism, not trying to share it during. split_at_mut() is the primitive that makes this work: its signature tells the borrow checker that the two returned references cannot alias, so your own code needs no unsafe.
The cost is some upfront work to compute split points and build the fragment grid. For large images with many tiles, this setup cost is negligible compared to the actual decode work. For tiny images with few tiles, you're not benefiting from parallelism anyway.
All output formats (U8, U16, F16, F32) go through the same parallel path. JxlOutputBuffer works in bytes; bytes_per_row accounts for sample width. No format-specific serial fallbacks.