@lilith
Last active March 3, 2026 11:24
How zenjxl-decoder achieves #![forbid(unsafe_code)] in parallel decoding — zero-cost safe Rust vs raw pointers

The problem

JPEG XL decoding is embarrassingly parallel at the tile level. A 4K image has dozens of independent tiles that can decode simultaneously. The hard part is the output: all those tiles write to the same output buffer, and Rust's borrow checker won't let multiple threads hold &mut references to the same buffer — even if they write to non-overlapping regions.

We wanted full tile-level parallelism with #![forbid(unsafe_code)].

The core insight

If you split a buffer into disjoint pieces before handing them to threads, Rust's type system guarantees safety for free. No raw pointers, no unsafe impl Send, no runtime checks. The challenge is doing this without copying.
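A minimal sketch of the insight (not zenjxl code): split the buffer into disjoint halves before spawning threads, so each thread owns a `&mut [u8]` that the borrow checker knows cannot alias the other. The function name is made up for illustration.

```rust
// Split first, then parallelize: each scoped thread captures exactly
// one half, and the compiler verifies the halves never overlap.
fn parallel_fill(buf: &mut [u8]) {
    let mid = buf.len() / 2;
    let (lo, hi) = buf.split_at_mut(mid); // two non-overlapping &mut [u8]
    std::thread::scope(|s| {
        s.spawn(|| lo.fill(1)); // writes only to the first half
        s.spawn(|| hi.fill(2)); // writes only to the second half
    });
}
```

No `unsafe`, no locks: the split happens once, up front, and every subsequent write is an ordinary exclusive borrow.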

BufferStorage: two representations, one interface

pub(crate) enum BufferStorage<'a> {
    Contiguous {
        data: &'a mut [u8],
        bytes_between_rows: usize,
    },
    Fragmented {
        rows: Vec<&'a mut [u8]>,
    },
}

A Contiguous buffer is the normal case: one flat allocation, rows at fixed stride. A Fragmented buffer is a collection of per-row mutable slices that may come from different locations in memory. Both expose the same row_mut(row) -> &mut [u8] interface.

Why two? When you split a buffer by rows, the sub-buffers are still contiguous (each is a sub-slice of the original). When you split by columns, they're not — columns 0-255 of row 0 and columns 0-255 of row 1 aren't adjacent in a row-major buffer. So column fragments use Fragmented storage: a Vec of per-row slices, each pointing into the original buffer.
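A hypothetical sketch of the shared accessor over both variants — field names follow the enum above, but this is not the crate's exact code:

```rust
enum BufferStorage<'a> {
    Contiguous { data: &'a mut [u8], bytes_between_rows: usize },
    Fragmented { rows: Vec<&'a mut [u8]> },
}

impl<'a> BufferStorage<'a> {
    // One match per row access: this is the enum-dispatch cost
    // visible in the wide-image benchmarks.
    fn row_mut(&mut self, row: usize) -> &mut [u8] {
        match self {
            BufferStorage::Contiguous { data, bytes_between_rows } => {
                let start = row * *bytes_between_rows;
                &mut data[start..start + *bytes_between_rows]
            }
            BufferStorage::Fragmented { rows } => &mut rows[row][..],
        }
    }
}
```

Callers never see which representation they hold; tile rendering code writes through `row_mut` either way.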

split_into_tile_grid: the key decomposition

pub(crate) fn split_into_tile_grid(
    &mut self,
    split_rows: &[usize],              // row boundaries between bands
    split_cols_per_band: &[&[usize]],  // column boundaries per band
) -> Vec<Vec<JxlOutputBuffer<'_>>>

This splits a single output buffer into a 2D grid of fragments — rows split into bands, then each band split into column ranges. Every fragment is a JxlOutputBuffer with exclusive ownership of its region.

The implementation is a chain of split_at_mut() calls. split_at_mut is Rust's primitive for dividing one &mut [T] into two non-overlapping &mut [T]s. By chaining these, we decompose the entire buffer into arbitrarily many disjoint fragments, each provably non-aliasing at compile time.
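The band-split half of that chain can be sketched like this (names and signature are assumptions, not the crate's code): peel bands off the front of the buffer, one split_at_mut per boundary.

```rust
// Split a row-major buffer into horizontal bands at the given row
// boundaries. Every returned slice is a disjoint view into `data`.
fn split_rows<'a>(
    mut data: &'a mut [u8],
    bytes_per_row: usize,
    boundaries: &[usize],
) -> Vec<&'a mut [u8]> {
    let mut bands = Vec::new();
    let mut prev = 0;
    for &row in boundaries {
        let (band, rest) = data.split_at_mut((row - prev) * bytes_per_row);
        bands.push(band);
        data = rest; // keep splitting the remainder
        prev = row;
    }
    bands.push(data); // final band: everything after the last boundary
    bands
}
```

The real grid split then applies the same idea per band along columns, collecting the column pieces into Fragmented storage.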

The result: each tile gets its own JxlOutputBuffer fragment. Rayon processes all tiles in parallel. Each thread writes directly to the final output — no temporary buffers, no copy-back phase.

The parallel render path

Before spawning parallel work:

  1. Group tiles by gy (row band) and sort by gx (column) within each band
  2. Compute row split points at band boundaries and column split points at tile boundaries
  3. Call split_into_tile_grid() on each output channel buffer
  4. Move each fragment to its corresponding tile via Option::take()
  5. Track column offsets for coordinate mapping (needed when progressive decoding delivers partial tile batches)

Then par_iter_mut() over the fragments. Each thread renders its tile directly into the output. Zero copies.
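The parallel step can be illustrated with std scoped threads standing in for rayon's par_iter_mut (the iteration pattern is the same); the fill call is a stand-in for the actual tile decode:

```rust
// Each fragment is a disjoint &mut [u8], so every spawned task may
// write to its own fragment with no synchronization at all.
fn render_tiles(fragments: &mut [&mut [u8]]) {
    std::thread::scope(|s| {
        for (tile_id, frag) in fragments.iter_mut().enumerate() {
            s.spawn(move || frag.fill(tile_id as u8)); // stand-in for decode
        }
    });
}
```

With rayon the loop body becomes `fragments.par_iter_mut().enumerate().for_each(...)`, but the ownership story is identical: one exclusive fragment per task.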

Two-phase fallback

When tiles can't be cleanly band-split (e.g., a single-band image), we fall back to allocating per-tile Vec<u8> buffers, rendering in parallel, then copying back sequentially.

Coordinate mapping for progressive decoding

In progressive/chunked decoding, a batch might contain only tiles gx=3..6 of a row, not gx=0..2. The first fragment covers columns [0, col_start_of_gx3), but tile gx=3 needs to write starting at its column offset within that fragment. We track item_col_offsets per fragment and adjust: rect.origin.0 = tile_origin.0 - fragment_col_offset.
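A worked example of the adjustment, with hypothetical numbers (256 px tiles, a batch delivering gx = 3..6): tile gx=3 has absolute origin 768, and its fragment's recorded column offset is also 768, so it writes starting at local x = 0.

```rust
// Map a tile's absolute x origin to its x position inside the
// fragment it was assigned. Mirrors the adjustment in the text:
// rect.origin.0 = tile_origin.0 - fragment_col_offset.
fn local_x(tile_origin_x: usize, fragment_col_offset: usize) -> usize {
    tile_origin_x - fragment_col_offset
}
```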

Performance

All numbers are 8-thread, 4K JPEG XL images, best-of-3 reps with 1 warmup, on the same machine in the same session (WSL2, 32 logical cores). Comparing the safe fragment implementation against the previous unsafe baseline that used raw pointers for shared output access.

Image             Unsafe baseline   Safe fragments   Change
city_4k_q75       209.2 MP/s        206.8 MP/s       -1.1%
city_4k_q90       206.5 MP/s        205.2 MP/s       -0.7%
forest_4k_q90     220.5 MP/s        216.1 MP/s       -2.0%
landscape_4k_q90  231.7 MP/s        231.4 MP/s       -0.1%
portrait_4k_q75   244.3 MP/s        249.1 MP/s       +2.0%
portrait_4k_q90   239.2 MP/s        241.8 MP/s       +1.1%
Average                                              -0.1%

Portrait images are faster with the safe approach. Wide images show a small regression from the enum dispatch in row_mut (one extra branch per row access). On average: effectively zero cost.

We tried several approaches before landing here:

  • Mutex + copy-back: -13.7% average
  • Band-split only (no column fragments): -10.2%
  • Hybrid (direct-write for tall, two-phase for wide): -4.0%
  • Fragment-based (final): -0.1%

What changed in Cargo.toml

# Before
threads = ["rayon", "allow-unsafe"]

# After
threads = ["rayon"]

The allow-unsafe feature existed solely because the parallel output path needed unsafe. With fragments, it doesn't. The crate now compiles with #![forbid(unsafe_code)] regardless of whether threading is enabled.

The takeaway

The key is decomposing your data before parallelism, not trying to share it during. split_at_mut() is the primitive that makes this work: it's the compiler-verified proof that two references don't alias.

The cost is some upfront work to compute split points and build the fragment grid. For large images with many tiles, this setup cost is negligible compared to the actual decode work. For tiny images with few tiles, you're not benefiting from parallelism anyway.

All output formats (U8, U16, F16, F32) go through the same parallel path. JxlOutputBuffer works in bytes; bytes_per_row accounts for sample width. No format-specific serial fallbacks.
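A sketch of that byte-oriented calculation, with assumed names (the crate's actual types may differ): the only format-specific quantity is bytes per sample.

```rust
// Hypothetical output sample formats, mirroring the four listed above.
enum Format { U8, U16, F16, F32 }

// Row size in bytes for a given pixel width, channel count, and format.
// Everything downstream of this works purely in bytes.
fn bytes_per_row(width_px: usize, channels: usize, format: Format) -> usize {
    let bytes_per_sample = match format {
        Format::U8 => 1,
        Format::U16 | Format::F16 => 2,
        Format::F32 => 4,
    };
    width_px * channels * bytes_per_sample
}
```

Because the fragment machinery never interprets samples, one parallel path serves all four formats.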
