JPEG XL decoding is embarrassingly parallel at the tile level. A 4K image has dozens of independent tiles that can decode simultaneously. The hard part is the output: all those tiles write to the same output buffer, and Rust's borrow checker won't let multiple threads hold &mut references to the same buffer — even if they write to non-overlapping regions.
We wanted full tile-level parallelism with #![forbid(unsafe_code)].
If you split a buffer into disjoint pieces before handing them to threads, Rust's type system guarantees safety for free. No raw pointers, no unsafe impl Send, no runtime checks. The challenge is doing this without copying.
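As a minimal sketch of that idea, a single split_at_mut call is enough to give two threads their own half of a buffer. The function name fill_halves is made up for illustration, and std::thread::scope stands in for a real thread pool:

```rust
use std::thread;

/// Illustrative only: fills the left half with 1 and the right half with 2,
/// each half on its own thread. No `unsafe`, no locks.
pub fn fill_halves(buf: &mut [u8]) {
    let mid = buf.len() / 2;
    // After this call `left` and `right` are two disjoint `&mut [u8]`;
    // the borrow checker knows they cannot alias.
    let (left, right) = buf.split_at_mut(mid);
    thread::scope(|s| {
        s.spawn(move || left.fill(1));
        s.spawn(move || right.fill(2));
    });
    // `thread::scope` joins both threads before returning,
    // so `buf` is usable again here.
}
```

The split happens before any thread starts, so there is nothing left for the compiler to object to when the halves move into the closures.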

    pub(crate) enum BufferStorage<'a> {
        Contiguous {
            data: &'a mut [u8],
            bytes_between_rows: usize,
        },
        Fragmented {
            rows: Vec<&'a mut [u8]>,
        },
    }

A Contiguous buffer is the normal case: one flat allocation, rows at fixed stride. A Fragmented buffer is a collection of per-row mutable slices that may come from different locations in memory. Both expose the same row_mut(row) -> &mut [u8] interface.
Why two? When you split a buffer by rows, the sub-buffers are still contiguous (each is a sub-slice of the original). When you split by columns, they're not: columns 0-255 of row 0 and columns 0-255 of row 1 aren't adjacent in a row-major buffer. So column fragments use Fragmented storage: a Vec of per-row slices, each pointing into the original buffer.
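To make the shared interface concrete, here is one way row_mut could be written over the two variants. The enum mirrors the one above; the method body and the fill_rows helper are assumptions for illustration, not the crate's actual code. In this sketch the contiguous case returns the full stride of a row, padding included.

```rust
pub enum BufferStorage<'a> {
    Contiguous {
        data: &'a mut [u8],
        bytes_between_rows: usize,
    },
    Fragmented {
        rows: Vec<&'a mut [u8]>,
    },
}

impl<'a> BufferStorage<'a> {
    /// Assumed sketch: one branch on the variant, then a slice lookup.
    pub fn row_mut(&mut self, row: usize) -> &mut [u8] {
        match self {
            BufferStorage::Contiguous { data, bytes_between_rows } => {
                let start = row * *bytes_between_rows;
                // The last row may be shorter than the stride.
                let end = (start + *bytes_between_rows).min(data.len());
                &mut data[start..end]
            }
            BufferStorage::Fragmented { rows } => &mut *rows[row],
        }
    }
}

/// Hypothetical caller: writes the row index into every byte of each row.
/// It does not care which variant it was handed.
pub fn fill_rows(storage: &mut BufferStorage<'_>, num_rows: usize) {
    for row in 0..num_rows {
        storage.row_mut(row).fill(row as u8);
    }
}
```

Code that renders a tile only ever sees row_mut, so it works unchanged on a whole buffer, a row band, or a column fragment.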

    pub(crate) fn split_into_tile_grid(
        &mut self,
        split_rows: &[usize],             // row boundaries between bands
        split_cols_per_band: &[&[usize]], // column boundaries per band
    ) -> Vec<Vec<JxlOutputBuffer<'_>>>

This splits a single output buffer into a 2D grid of fragments: rows are split into bands, then each band is split into column ranges. Every fragment is a JxlOutputBuffer with exclusive access to its region.
The implementation is a chain of split_at_mut() calls. split_at_mut is Rust's primitive for dividing one &mut [T] into two non-overlapping &mut [T]s. By chaining these, we decompose the entire buffer into arbitrarily many disjoint fragments, each provably non-aliasing at compile time.
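The chaining can be sketched on a plain byte buffer. This is not the crate's split_into_tile_grid: the function split_grid, the Fragment alias, and the convention that split points are interior boundaries (rows for bands, byte columns within a band) are all assumptions made for the example.

```rust
/// One fragment: a mutable slice for each row of its region.
pub type Fragment<'a> = Vec<&'a mut [u8]>;

/// Splits `data` (rows of `stride` bytes) into bands at `split_rows`, then
/// splits every row of band `b` at the byte columns in `split_cols_per_band[b]`.
pub fn split_grid<'a>(
    data: &'a mut [u8],
    stride: usize,
    split_rows: &[usize],
    split_cols_per_band: &[&[usize]],
) -> Vec<Vec<Fragment<'a>>> {
    assert!(stride > 0 && data.len() % stride == 0);
    assert_eq!(split_cols_per_band.len(), split_rows.len() + 1);
    let height = data.len() / stride;
    let mut rest = data;
    let mut band_start = 0;
    let mut grid = Vec::new();
    // Band end rows: every interior boundary, then the image height.
    let band_ends = split_rows.iter().copied().chain(std::iter::once(height));
    for (band, band_end) in band_ends.enumerate() {
        // Peel this band off the front of what is left of the buffer.
        let (band_data, tail) =
            std::mem::take(&mut rest).split_at_mut((band_end - band_start) * stride);
        rest = tail;
        band_start = band_end;

        let col_splits = split_cols_per_band[band];
        let mut fragments: Vec<Fragment<'a>> =
            (0..=col_splits.len()).map(|_| Vec::new()).collect();
        for mut row in band_data.chunks_exact_mut(stride) {
            let mut col_start = 0;
            for (i, &col_end) in col_splits.iter().enumerate() {
                // Peel one column range off the front of the row.
                let (left, right) =
                    std::mem::take(&mut row).split_at_mut(col_end - col_start);
                fragments[i].push(left);
                row = right;
                col_start = col_end;
            }
            // Whatever remains belongs to the last column fragment.
            fragments[col_splits.len()].push(row);
        }
        grid.push(fragments);
    }
    grid
}
```

Every slice in the result descends from a split_at_mut (or chunks_exact_mut) of the original buffer, so the whole grid can be handed out to different threads without a single runtime aliasing check.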
The result: each tile gets its own JxlOutputBuffer fragment. Rayon processes all tiles in parallel. Each thread writes directly to the final output — no temporary buffers, no copy-back phase.
Before spawning parallel work:
- Group tiles by gy (row band) and sort by gx (column) within each band
- Compute row split points at band boundaries and column split points at tile boundaries
- Call split_into_tile_grid() on each output channel buffer
- Move each fragment to its corresponding tile via Option::take()
- Track column offsets for coordinate mapping (needed when progressive decoding delivers partial tile batches)
Then par_iter_mut() over the fragments. Each thread renders its tile directly into the output. Zero copies.
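The hand-off can be sketched end to end on a single-channel byte buffer. Everything named here (Tile, render_tile, render_parallel) is hypothetical, the grid is built with chunks_mut rather than the crate's split_into_tile_grid, and std::thread::scope with one thread per tile replaces Rayon so the example needs no dependencies:

```rust
use std::thread;

/// Hypothetical tile record: grid position plus the fragment it writes to.
pub struct Tile<'a> {
    pub gx: usize,
    pub gy: usize,
    pub out: Vec<&'a mut [u8]>, // one slice per tile row
}

/// Stand-in for real tile decoding: fills the tile with a value
/// derived from its grid position.
fn render_tile(tile: &mut Tile<'_>) {
    let value = (tile.gy * 16 + tile.gx) as u8;
    for row in tile.out.iter_mut() {
        row.fill(value);
    }
}

pub fn render_parallel(buf: &mut [u8], stride: usize, tile_w: usize, tile_h: usize) {
    assert!(stride > 0 && tile_w > 0 && tile_h > 0);
    let tiles_per_band = (stride + tile_w - 1) / tile_w;

    // Build grid[gy][gx]. Each slot is an Option so that a fragment
    // can be moved out exactly once.
    let mut grid = Vec::new();
    for band in buf.chunks_mut(stride * tile_h) {
        let mut frags: Vec<Vec<&mut [u8]>> =
            (0..tiles_per_band).map(|_| Vec::new()).collect();
        for row in band.chunks_mut(stride) {
            for (gx, piece) in row.chunks_mut(tile_w).enumerate() {
                frags[gx].push(piece);
            }
        }
        let slots: Vec<Option<Vec<&mut [u8]>>> = frags.into_iter().map(Some).collect();
        grid.push(slots);
    }

    // Move each fragment to its tile via Option::take().
    let mut tiles: Vec<Tile<'_>> = Vec::new();
    for (gy, band) in grid.iter_mut().enumerate() {
        for (gx, slot) in band.iter_mut().enumerate() {
            let out = slot.take().expect("fragment already taken");
            tiles.push(Tile { gx, gy, out });
        }
    }

    // Each thread writes straight into the final buffer through its fragment.
    thread::scope(|s| {
        for tile in tiles.iter_mut() {
            s.spawn(move || render_tile(tile));
        }
    });
}
```

With Rayon in scope, the thread::scope block would collapse to something like tiles.par_iter_mut().for_each(render_tile), which schedules tiles on a pool instead of spawning a thread for each one.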
When tiles can't be cleanly band-split (e.g., a single-band image), we fall back to allocating per-tile Vec<u8> buffers, rendering in parallel, then copying back sequentially.
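For contrast, the two-phase fallback looks roughly like this. It is a sketch under assumptions: the function name is invented, tiles are column strips of a single band, the "rendering" is a fill, and std::thread::scope again stands in for Rayon.

```rust
use std::thread;

/// Illustrative two-phase path: render each tile into its own Vec<u8>
/// in parallel, then copy the results into `buf` one tile at a time.
pub fn render_with_copy_back(buf: &mut [u8], stride: usize, tile_w: usize) {
    assert!(stride > 0 && tile_w > 0);
    let height = buf.len() / stride;
    if height == 0 {
        return;
    }
    let num_tiles = (stride + tile_w - 1) / tile_w;

    // Phase 1: one scratch buffer per tile (the last tile may be narrower).
    let mut scratch: Vec<Vec<u8>> = (0..num_tiles)
        .map(|gx| vec![0u8; tile_w.min(stride - gx * tile_w) * height])
        .collect();
    thread::scope(|s| {
        for (gx, tile) in scratch.iter_mut().enumerate() {
            // Stand-in for decoding: tile gx is filled with gx + 1.
            s.spawn(move || tile.fill(gx as u8 + 1));
        }
    });

    // Phase 2: sequential copy-back, row by row.
    for (gx, tile) in scratch.iter().enumerate() {
        let w = tile.len() / height;
        let x0 = gx * tile_w;
        for (y, src) in tile.chunks_exact(w).enumerate() {
            let start = y * stride + x0;
            buf[start..start + w].copy_from_slice(src);
        }
    }
}
```

The extra allocation and the serial second phase are exactly what the fragment path avoids.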
In progressive/chunked decoding, a batch might contain only tiles gx=3..6 of a row, not gx=0..2. The first fragment covers columns [0, col_start_of_gx3), but tile gx=3 needs to write starting at its column offset within that fragment. We track item_col_offsets per fragment and adjust: rect.origin.0 = tile_origin.0 - fragment_col_offset.
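The adjustment itself is one subtraction. A small sketch, where the Rect type and the function name are hypothetical and origin is assumed to be (column, row):

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Rect {
    pub origin: (usize, usize), // (column, row), assumed layout
    pub size: (usize, usize),   // (width, height)
}

/// Translates a tile rectangle from image coordinates into the coordinates
/// of the fragment that starts at column `fragment_col_offset`.
/// Returns None if the tile starts to the left of the fragment.
pub fn to_fragment_coords(tile: Rect, fragment_col_offset: usize) -> Option<Rect> {
    let local_col = tile.origin.0.checked_sub(fragment_col_offset)?;
    Some(Rect {
        origin: (local_col, tile.origin.1),
        size: tile.size,
    })
}
```

For example, with 256-pixel tiles, tile gx=3 starts at column 768. Inside a fragment that begins at column 0 it keeps origin column 768; inside a fragment that begins at column 768 it maps to column 0.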
All numbers are from 4K JPEG XL images decoded with 8 threads, best of 3 reps after 1 warmup, on the same machine in the same session (WSL2, 32 logical cores). The table compares the safe fragment implementation against the previous unsafe baseline, which used raw pointers for shared output access.
| Image | Unsafe baseline (MP/s) | Safe fragments (MP/s) | Change |
|---|---|---|---|
| city_4k_q75 | 209.2 | 206.8 | -1.1% |
| city_4k_q90 | 206.5 | 205.2 | -0.7% |
| forest_4k_q90 | 220.5 | 216.1 | -2.0% |
| landscape_4k_q90 | 231.7 | 231.4 | -0.1% |
| portrait_4k_q75 | 244.3 | 249.1 | +2.0% |
| portrait_4k_q90 | 239.2 | 241.8 | +1.1% |
| Average | | | -0.1% |
Portrait images are faster with the safe approach. Wide images show a small regression from the enum dispatch in row_mut (one extra branch per row access). On average: effectively zero cost.
We tried several approaches before landing here:
- Mutex + copy-back: -13.7% average
- Band-split only (no column fragments): -10.2%
- Hybrid (direct-write for tall, two-phase for wide): -4.0%
- Fragment-based (final): -0.1%

    # Before
    threads = ["rayon", "allow-unsafe"]
    # After
    threads = ["rayon"]

The allow-unsafe feature existed solely because the parallel output path needed unsafe. With fragments, it doesn't. The crate now compiles with #![forbid(unsafe_code)] regardless of whether threading is enabled.
The key is decomposing your data before parallelism, not trying to share it during. split_at_mut() is the primitive that makes this work: it's the compiler-verified proof that two references don't alias.
The cost is some upfront work to compute split points and build the fragment grid. For large images with many tiles, this setup cost is negligible compared to the actual decode work. For tiny images with few tiles, you're not benefiting from parallelism anyway.
All output formats (U8, U16, F16, F32) go through the same parallel path. JxlOutputBuffer works in bytes; bytes_per_row accounts for sample width. No format-specific serial fallbacks.
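The sample-width arithmetic behind that is simple. The sketch below assumes interleaved channels with no row padding; the SampleFormat enum and packed_row_bytes are illustrative names, not the crate's API.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SampleFormat {
    U8,
    U16,
    F16,
    F32,
}

impl SampleFormat {
    /// Size of one sample in bytes.
    pub fn bytes_per_sample(self) -> usize {
        match self {
            SampleFormat::U8 => 1,
            SampleFormat::U16 | SampleFormat::F16 => 2,
            SampleFormat::F32 => 4,
        }
    }
}

/// Bytes occupied by one tightly packed row of `width` pixels.
pub fn packed_row_bytes(format: SampleFormat, width: usize, channels: usize) -> usize {
    width * channels * format.bytes_per_sample()
}
```

Because the splitting logic only ever sees byte offsets, a column boundary at pixel x becomes byte x * channels * bytes_per_sample, and the same fragment code serves every format.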