@msaroufim
Last active February 24, 2026 05:37
Benchmarking

Offline

Each LLM can override:

  • Max sequence length
  • Extend tokens: the amount of fresh compute per forward pass; in decode, each step usually contributes 1 extend token per request
  • CUDA graph max batch size: static buffers are captured per batch size from 1 to max, so larger values increase startup time and memory usage

Online benchmarking

In online benchmarking, traces are scaled to simulate requests arriving close together (high load) vs far apart (low load).

Docs page

Key components

  • API server: entry point for an OpenAI-compatible API
  • Tokenizer worker
  • Detokenizer worker
  • Scheduler worker: in TP mode, there is one scheduler per GPU

Because these are separate workers, they communicate using ZeroMQ for control messages and NCCL to exchange tensors.

Browsing important files

  • core.py defines core runtime structures: SamplingParams, Req, Batch, and Context.
  • Req is one user generation request and keeps per-request mutable state (lengths, cache progress, IDs, sampling policy).
  • Batch is a temporary grouping of multiple Req objects for a single model forward pass.
  • Batch.reqs is the list of requests processed together in that step.
  • SamplingParams is per request (Req.sampling_params), so requests in the same batch can have different decoding settings.
  • Context is global, stateful runtime context (shared backends/page table + active batch during forward), not a request.
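A minimal sketch of how these structures might relate. Field names here are illustrative, not the repo's exact API:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SamplingParams:
    temperature: float = 1.0
    top_p: float = 1.0
    max_tokens: int = 128

@dataclass
class Req:
    uid: int
    input_ids: List[int]
    sampling_params: SamplingParams   # per-request decoding settings
    cached_len: int = 0               # how much KV is already cached
    output_ids: List[int] = field(default_factory=list)

@dataclass
class Batch:
    reqs: List[Req]                   # requests processed together this step

    @property
    def is_decode(self) -> bool:
        # decode batches extend every request by exactly one token
        return all(r.cached_len >= len(r.input_ids) for r in self.reqs)

# Two requests with different sampling settings can share one batch
a = Req(0, [1, 2, 3], SamplingParams(temperature=0.0))
b = Req(1, [1, 2], SamplingParams(temperature=0.8))
batch = Batch(reqs=[a, b])
```

Because SamplingParams hangs off each Req rather than the Batch, mixing greedy and sampled requests in one forward pass is natural.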

Utils.py

This file is mostly utilities and logging; the most useful parts are helpers for managing ZMQ queues.

Tokenizer

This does not contain a tokenizer model implementation. It includes utilities (for example, handling Chinese-character streaming behavior), but mostly contains logic for scheduling tokenizer and detokenizer workers.

Server

When the server launches, it starts scheduler worker(s), tokenizer worker(s), and a detokenizer worker.

A key note: detokenization is stateful and requires a uid -> DecodeStatus map to stitch partial streamed text correctly. Detokenization is also typically fast (one token per step), whereas tokenization encodes the full prompt (system + user).
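Why streaming detokenization needs per-request state can be sketched in plain Python. The DecodeStatus name follows the text above; its fields and the toy vocabulary are assumptions:

```python
class DecodeStatus:
    """Tracks how much text has already been streamed for one request."""
    def __init__(self):
        self.token_ids = []
        self.sent_len = 0   # characters already emitted to the client

# toy "detokenizer": join token strings (a real one re-decodes token IDs,
# which is why the delta cannot be computed from the last token alone)
VOCAB = {0: "Hel", 1: "lo", 2: ", wor", 3: "ld"}

statuses = {}  # uid -> DecodeStatus

def stream_step(uid, new_token_id):
    st = statuses.setdefault(uid, DecodeStatus())
    st.token_ids.append(new_token_id)
    full = "".join(VOCAB[t] for t in st.token_ids)
    delta, st.sent_len = full[st.sent_len:], len(full)
    return delta

chunks = [stream_step(42, t) for t in [0, 1, 2, 3]]
```

The uid -> DecodeStatus map is exactly what lets interleaved tokens from many concurrent requests each stitch into their own correct stream.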

There is also a shell mode where users can enter prompts and run interactive sessions with the model (similar in spirit to a local chat REPL).

Naming note

Calling this repo "minimal" can be misleading unless we separate ownership clearly.

  • In repo: orchestration and glue code (scheduler, batching, KV/page-table bookkeeping, worker topology, backend selection).
  • Outside repo: heavy compute kernels and low-level CUDA ops come from external packages.

Attention

  • In repo:
    • python/minisgl/attention/__init__.py selects the backend (fa, fi, or hybrid fa,fi).
    • python/minisgl/attention/fa.py and python/minisgl/attention/fi.py build metadata and call into backend kernels.
    • python/minisgl/layers/attention.py calls ctx.attn_backend.forward(...).
  • Outside repo:
    • fa path calls sgl_kernel.flash_attn.flash_attn_with_kvcache (external sgl_kernel package).
    • fi path calls FlashInfer wrappers (external flashinfer-python package).
  • Key point: the repo owns control flow and metadata prep; external deps own the optimized kernel implementations.

Benchmark

perf.py uses CUDA graphs when benchmarking a single function, removing launch overhead so microkernel timings reflect kernel cost rather than dispatch cost.

In benchmark/client.py metrics:

  • first_times: per-request time to first output chunk/token (deltas[0]) -> TTFT (Time to First Token).
  • accum_times: per-token times after the first token (deltas[1:]) -> TPOT (Time Per Output Token).
  • e2e_times: total request time (tics[-1] - tics[0]) -> end-to-end latency.
  • They differ mainly by which interval is measured on the same request timeline: TTFT (start->first token), TPOT (between later tokens), E2E (start->last token).
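The three metrics fall out of one timestamp list, as a sketch (timestamp values are made up):

```python
# tics: wall-clock timestamps for one request:
# tics[0] = request start, tics[i] = arrival of the i-th output token
tics = [0.0, 0.35, 0.40, 0.46, 0.51]
deltas = [b - a for a, b in zip(tics, tics[1:])]

ttft = deltas[0]                           # start -> first token
tpot = sum(deltas[1:]) / len(deltas[1:])   # mean gap between later tokens
e2e = tics[-1] - tics[0]                   # start -> last token
```

Note that TTFT is dominated by prefill, while TPOT reflects steady-state decode speed; E2E combines both.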

Distributed (distributed/impl.py)

  • Provides a tiny abstraction for TP collectives: all_reduce and all_gather.
  • Default backend uses torch.distributed.
  • Optional PyNCCL backend can be enabled and overrides the default via plugin dispatch.
  • TP size 1 is a fast no-op path.

Kernel

  • python/minisgl/kernel/ contains local wrappers around custom Triton/CUDA/C++ ops.
  • These are invoked from model/runtime paths, not just tests.

fused_moe_kernel_triton + moe_sum_reduce_triton:

  • Call path:
    • python/minisgl/moe/fused.py -> fused_experts_impl(...)
    • runs fused_moe_kernel_triton(...) twice (for w1 then w2)
    • then moe_sum_reduce_triton(...) to reduce across top-k experts
  • Inputs prepared before the kernel:
    • top-k routing (fused_topk)
    • token/expert block alignment (moe_align_block_size) to produce sorted_token_ids, expert_ids, and padded token count
  • What is fused in the Triton kernel:
    • token dispatch by expert + expert matmul + optional routed-weight multiply + write to output cache
  • Important boundary:
    • router/top-k selection is not done inside fused_moe_kernel; it happens before launch.
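The pre-launch routing step can be sketched in plain Python. This is a toy stand-in for fused_topk and the token grouping that moe_align_block_size enables; the real code operates on GPU tensors:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def topk_route(router_logits, k):
    """Per-token top-k expert IDs and normalized routing weights."""
    routed = []
    for logits in router_logits:
        probs = softmax(logits)
        top = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)[:k]
        w = [probs[e] for e in top]
        norm = sum(w)
        routed.append((top, [x / norm for x in w]))
    return routed

def group_by_expert(routed):
    """Token dispatch: expert -> list of (token_idx, weight) it must process."""
    groups = {}
    for tok, (experts, weights) in enumerate(routed):
        for e, w in zip(experts, weights):
            groups.setdefault(e, []).append((tok, w))
    return groups

# 2 tokens, 4 experts, top-2 routing
logits = [[0.1, 2.0, 0.3, 1.5], [3.0, 0.0, 0.2, 0.1]]
routed = topk_route(logits, k=2)
groups = group_by_expert(routed)
```

Everything above happens before the Triton kernel launches; the kernel then consumes the per-expert token groups and fuses dispatch, matmul, and the routed-weight multiply.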

indexing (python/minisgl/kernel/index.py):

  • Primary runtime use:
    • python/minisgl/layers/embedding.py -> VocabParallelEmbedding.forward(...)
    • this replaces embedding gather (F.embedding) with custom gather
  • TP behavior:
    • in TP mode, passes vocab_range so out-of-range indices are zeroed in-kernel for vocab-parallel shards
    • result is then TP all-reduced
  • Why this kernel exists:
    • warp-level contiguous copy for large embedding rows + split-warp mode for wider rows
    • avoids generic advanced indexing overhead in hot embedding path
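Vocab-parallel gather with in-range masking can be illustrated with lists. The toy per-rank tables and the all-reduce simulated as a sum across ranks are assumptions, not the kernel's actual layout:

```python
def shard_gather(ids, shard, vocab_start, vocab_end, dim):
    """Gather rows for ids inside [vocab_start, vocab_end); zero rows otherwise."""
    out = []
    for i in ids:
        if vocab_start <= i < vocab_end:
            out.append(shard[i - vocab_start])
        else:
            out.append([0.0] * dim)   # out-of-range id -> zeroed in-kernel
    return out

def all_reduce_sum(rank_outputs):
    """Element-wise sum across TP ranks (stand-in for the real all-reduce)."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*rank_outputs)]

# vocab of 4 split across 2 ranks, embedding dim 2
rank0 = [[1.0, 1.0], [2.0, 2.0]]   # rows for ids 0..1
rank1 = [[3.0, 3.0], [4.0, 4.0]]   # rows for ids 2..3
ids = [0, 3, 2]

partial = [
    shard_gather(ids, rank0, 0, 2, dim=2),
    shard_gather(ids, rank1, 2, 4, dim=2),
]
embedded = all_reduce_sum(partial)
```

Because each out-of-range id contributes a zero row, the sum across ranks reconstructs the full embedding without any rank holding the whole vocabulary.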

store_cache (python/minisgl/kernel/store.py):

  • Primary runtime use:
    • python/minisgl/kvcache/mha_pool.py -> MHAKVCache.store_kv(...)
    • writes current step K/V tensors into the backing KV cache via out_loc indices
  • Why this kernel exists:
    • one kernel performs indexed K and V writes with warp-level copy
    • optimized for KV cache update pattern in decode/prefill loops

fast_compare_key (python/minisgl/kernel/radix.py):

  • Primary runtime use:
    • python/minisgl/kvcache/radix_manager.py -> RadixTreeNode.get_match_len(...)
    • used while walking radix tree to find shared prefix length with incoming token IDs
  • Why this kernel exists:
    • C++ contiguous mismatch scan removes Python-level loop overhead in repeated prefix checks
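The scan itself is simple in concept; a pure-Python equivalent of what the C++ kernel does contiguously:

```python
def get_match_len(cached_ids, incoming_ids):
    """Length of the shared prefix between two token-ID sequences."""
    n = 0
    for a, b in zip(cached_ids, incoming_ids):
        if a != b:
            break
        n += 1
    return n

# prefix [7, 8, 9] is shared; first mismatch at position 3
shared = get_match_len([7, 8, 9, 1], [7, 8, 9, 2, 5])
```

The kernel exists because this loop runs once per radix-tree node on every incoming request, so Python-level per-element overhead adds up quickly.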

init_pynccl (python/minisgl/kernel/pynccl.py):

  • Primary runtime use:
    • python/minisgl/distributed/impl.py -> enable_pynccl_distributed(...)
    • installs PyNCCLDistributedImpl plugin used by DistributedCommunicator
  • Runtime effect:
    • all_reduce and all_gather dispatch to PyNCCL communicator instead of torch distributed when enabled
    • uses a CPU process group once to broadcast NCCL unique ID, then performs collectives on GPU communicator

Layers, Models, and MoE

Layers, models, and moe are basically where the PyTorch model-specific code is defined.

Message

  • python/minisgl/message/ defines what is sent between workers (typed message schema/protocol).
  • Queue logic in python/minisgl/utils/mp.py defines how messages are sent (ZMQ PUSH/PULL/PUB/SUB + msgpack transport).
  • In short: message = semantics/payload types; queue utils = delivery mechanism.

Engine

  • python/minisgl/engine/engine.py is the runtime wrapper around model/layers.
  • It initializes execution infrastructure: device/stream, distributed comms, model weights, KV cache, page table, attention/MoE backends, sampler, and CUDA-graph runner.
  • At runtime it executes batched forward (forward_batch), then samples next tokens and returns GPU/CPU token outputs.
  • It is not the model definition itself; models/ and layers/ define network structure, while engine sets up and runs that structure efficiently.
  • CUDA-graph path (decode fast path):
    • captures multiple fixed decode batch sizes at init, then replays when batch.is_decode and size is supported; otherwise falls back to eager forward.
    • scheduler pads decode batches to nearest captured size; real data is copied into fixed graph buffers before replay.
  • CUDA graph memory pool reuse:
    • capture loop reuses one pool handle across sizes (pool = graph.pool() then pass pool=...).
    • benefit: lower memory overhead/fragmentation when capturing many sizes.
    • tradeoff: less strict per-graph memory isolation than separate pools.
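The replay-or-fallback decision can be sketched as follows. The captured sizes and the round-up search are illustrative, not the repo's exact code:

```python
import bisect

CAPTURED_SIZES = [1, 2, 4, 8, 16]   # decode batch sizes captured at init

def pick_graph_size(batch_size, is_decode):
    """Return the captured size to pad the batch to, or None for eager fallback."""
    if not is_decode or batch_size > CAPTURED_SIZES[-1]:
        return None                  # prefill or oversized batch: eager forward
    i = bisect.bisect_left(CAPTURED_SIZES, batch_size)
    return CAPTURED_SIZES[i]         # nearest captured size >= batch_size
```

Padding a decode batch of 3 up to the captured size 4 wastes a little compute on dummy slots but lets the fixed-buffer graph replay, which is usually a net win for small decode steps.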

Scheduler

  • python/minisgl/scheduler/ is the runtime control loop between incoming user requests and model forward execution.
  • It owns request lifecycle + batching policy, while engine owns actual model compute.
  • Core idea: keep GPU busy by continuously selecting the next runnable batch from:
    • prefill queue (PrefillManager): ingest new prompts / chunked prompt extension
    • decode queue (DecodeManager): 1-step next-token generation for active requests
  • One-line memory summary:
    • PagedAttention = how KV is stored/allocated.
    • RadixManager = how KV is found/shared (which past KV to reuse).

Terminology (to avoid ambiguity):

  • KV cache pages: physical storage locations for K/V states.
  • CacheManager: allocator/bookkeeper for KV cache pages (free list, eviction, prefix-cache locks).
  • CacheManager scope: global shared structure per scheduler/worker process; it manages KV memory across batches/requests/sessions (not per-batch).
  • TableManager: per-request slot allocator (table_idx) + token/page mapping helpers.
  • page_table: indirection map from (table_idx, token_pos) -> kv_page_idx.
  • token_pool: scheduler-side staging of token IDs indexed by table_idx and token position.
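The page_table indirection can be shown directly. Names follow the terminology above; the page size and table contents are made up:

```python
PAGE_SIZE = 4  # tokens per KV page

# page_table[table_idx][logical_page] -> kv_page_idx (physical page)
page_table = {
    0: [12, 3, 40],   # request in slot 0 owns three physical pages
}

def kv_location(table_idx, token_pos):
    """Map (request slot, token position) -> (physical page, in-page offset)."""
    logical_page, offset = divmod(token_pos, PAGE_SIZE)
    return page_table[table_idx][logical_page], offset
```

The point of the indirection is that a request's tokens live at arbitrary physical pages, so pages can be allocated, shared, and evicted independently of token order.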

High-level data flow:

  • Tokenizer/server sends UserMsg -> scheduler validates lengths and enqueues into prefill.
  • Scheduler builds a Batch, prepares metadata (positions, input/write mappings), allocates KV/page slots, and calls engine.forward_batch(...).
  • Next tokens are written into token pool/page table bookkeeping, converted into DetokenizeMsg, and sent back to detokenizer/tokenizer worker.
  • Finished or aborted requests free table slots and return KV pages into cache manager.

Files and responsibilities:

  • scheduler.py
    • main loop + message handling (UserMsg, AbortBackendMsg, ExitMsg)
    • scheduling policy: currently prefill-first, then decode fallback
    • stream overlap mode: overlaps "process last batch output" with "run next batch" to hide CPU-side latency
    • request finalization: EOS/max-token checks + resource free
  • prefill.py
    • pending request queue
    • budgeted prefill packing via PrefillAdder (token budget + estimated cache pressure)
    • chunked prefill (ChunkedReq) when prompt extension exceeds current prefill budget
  • cache.py
    • allocator over free KV page slots + evictable KV cache pages
    • prefix cache integration (match_prefix, insert_prefix) via cache backend manager
    • lock/unlock handles to prevent eviction races for in-flight requests
  • table.py
    • request slot allocator (table_idx) for max concurrent running requests
    • per-request token pool backing tensor (mirrors page-table indexing layout)
  • io.py
    • scheduler <-> tokenizer/detokenizer transport
    • single-rank and TP multi-rank message fanout/collection over ZMQ + CPU process-group sync
    • rank-0 is authoritative for ingress/egress in multi-rank mode
  • utils.py
    • lightweight scheduling data structures (PendingReq)

Important mechanics:

  • Prefill admission is constrained by both:
    • prefill token budget (max_extend_tokens)
    • estimated KV capacity after reserving in-flight decode tokens
  • Prefix cache reuse:
    • new request attempts prefix match on all but final prompt token
    • matched prefix KV pages are copied into the request's page-table entries, reducing fresh compute/storage
  • Resource ownership:
    • TableManager tracks request slots + page-table/token-pool indexing
    • CacheManager tracks KV-page free/evictable capacity and finished-request reinsertion
  • Idle behavior:
    • scheduler can block on incoming message and periodically run cache integrity checks

Chunked prefill:

  • Meaning: if a prompt's uncached prefix is larger than current prefill budget, scheduler pre-fills only a chunk now, then continues remaining prompt tokens in later prefill rounds.
  • Implementation hint: PrefillAdder creates a ChunkedReq when chunk_size < remain_len.
  • Why: prevents one very long prompt from monopolizing prefill budget and lets other requests make progress.
  • Effect: chunked prefill improves fairness/latency under load, at the cost of extra scheduling steps for long prompts.
  • Example: if remaining prompt tokens are 1200 and budget is 256, scheduler may process 256 now and leave 944 for later prefill iterations.
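The example above can be sketched as a chunking plan (a simplification: the real budget is shared across all pending requests each round):

```python
def chunk_plan(remain_len, budget):
    """Token counts prefilled per scheduling round for one long prompt."""
    chunks = []
    while remain_len > 0:
        step = min(budget, remain_len)
        chunks.append(step)
        remain_len -= step
    return chunks
```

So a 1200-token remainder with a 256-token budget takes five prefill rounds, and other requests can be admitted between them.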

Overlap vs normal loop:

  • Normal loop: receive -> schedule -> forward -> process outputs (serial).
  • Overlap loop: while current batch runs on engine stream, process previous batch outputs on scheduler stream.
  • This pipelining improves utilization when CPU-side post-processing / message handling is non-trivial.

KVCache

python/minisgl/kvcache/ has two distinct layers:

  • Physical KV storage (BaseKVCache / MHAKVCache): where K/V tensors actually live on GPU.
  • Logical prefix-cache manager (BaseCacheManager + implementations): metadata/index structure used to reuse and evict cached prefixes.
  • Important scope note: the cache manager is global shared runtime state (per scheduler/worker process), not per-user state; it must arbitrate KV memory across all active and recent sessions.

Physical KV storage (mha_pool.py):

  • MHAKVCache allocates one large GPU tensor shaped roughly like:
    • [K_or_V, layer, page, page_size, local_kv_heads, head_dim]
  • During forward, kernels call store_kv(...) with out_loc indices; this writes newly produced K/V rows into page locations.
  • In TP mode, each rank stores only its local shard of KV heads (local_kv_heads = num_kv_heads / tp_size).

Prefix cache manager API (base.py):

  • match_prefix(input_ids) -> find longest reusable prefix and return:
    • a handle (for lock lifetime),
    • page indices for matched tokens.
  • lock_handle(handle) / unlock: protects matched prefix from eviction while request is in flight.
  • insert_prefix(input_ids, indices): add newly completed prefix mapping into cache.
  • evict(size): free page indices by evicting unprotected cached prefixes.

Two manager implementations:

  • naive_manager.py:
    • no prefix reuse, no eviction logic (mostly "always miss" behavior).
    • useful as a simple baseline.
  • radix_manager.py:
    • real prefix-sharing cache using a radix tree over token IDs.
    • this is what "radix caching" means here.

What radix caching actually is (in serving terms):

  • A radix tree stores token sequences by shared prefixes.
  • If many requests begin with the same token prefix (system prompt, template, repeated context), those prefix tokens map to one shared set of KV pages.
  • New requests "walk" the tree to find the longest matching cached prefix, then only compute the suffix (cache miss part).
  • If a partial mismatch happens mid-node, the node is split (_split_at) so the tree remains prefix-correct.
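The walk and the mid-node split can be sketched with a minimal radix tree over token IDs. The structure and the `_split_at` name follow the text above, but this is a toy, not the repo's implementation:

```python
class Node:
    def __init__(self, key, value):
        self.key = key          # token-ID segment on the edge into this node
        self.value = value      # KV page indices for that segment
        self.children = {}      # first token of child key -> child node

def _match_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def _split_at(node, n):
    """Split node so only its first n key tokens remain; the rest moves down."""
    child = Node(node.key[n:], node.value[n:])
    child.children = node.children
    node.key, node.value = node.key[:n], node.value[:n]
    node.children = {child.key[0]: child}

def match_prefix(root, ids):
    """Walk the tree; return KV page indices for the longest cached prefix."""
    pages, node = [], root
    while ids:
        child = node.children.get(ids[0])
        if child is None:
            break
        n = _match_len(child.key, ids)
        pages += child.value[:n]
        if n < len(child.key):
            break               # partial hit ends inside this node
        node, ids = child, ids[n:]
    return pages

def insert_prefix(root, ids, pages):
    node = root
    while ids:
        child = node.children.get(ids[0])
        if child is None:
            node.children[ids[0]] = Node(list(ids), list(pages))
            return
        n = _match_len(child.key, ids)
        if n < len(child.key):
            _split_at(child, n)  # mid-node mismatch: split to stay prefix-correct
        node, ids, pages = child, ids[n:], pages[n:]

root = Node([], [])
insert_prefix(root, [5, 6, 7, 8], [100, 101, 102, 103])
insert_prefix(root, [5, 6, 9], [100, 101, 200])   # forces a split at depth 2
```

After the second insert, the shared segment [5, 6] is one node with pages [100, 101], and both suffixes hang beneath it, which is exactly the sharing that makes prefix reuse cheap.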

Why radix helps multi-user serving:

  • Many users share prompt boilerplate (chat template/system instructions), so prefix reuse reduces prefill FLOPs and KV writes.
  • Less prefill work means lower TTFT under load and more budget for decode.
  • Shared prefixes are reference-counted; active requests protect nodes from eviction.

Eviction and protection model (important for large multi-user setups):

  • Each radix node tracks:
    • ref_count (protected if > 0),
    • timestamp (recently touched time),
    • value (KV page indices for that token segment).
  • Eviction chooses leaf nodes with ref_count == 0, using a timestamp heap (LRU-like over leaves).
  • Protected nodes (in-flight requests) are not evicted; only evictable nodes contribute to freeable capacity.
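The leaf-eviction policy can be sketched with a timestamp heap. Leaf fields mirror the bullets above; the rest is illustrative:

```python
import heapq

class Leaf:
    def __init__(self, name, ref_count, timestamp, pages):
        self.name, self.ref_count = name, ref_count
        self.timestamp, self.pages = timestamp, pages

def evict(leaves, need_pages):
    """Free pages from unprotected leaves, oldest-touched first (LRU-like)."""
    heap = [(l.timestamp, l.name, l) for l in leaves if l.ref_count == 0]
    heapq.heapify(heap)
    freed, evicted = 0, []
    while heap and freed < need_pages:
        _, _, leaf = heapq.heappop(heap)
        freed += len(leaf.pages)
        evicted.append(leaf.name)
    return freed, evicted

leaves = [
    Leaf("old_private", 0, 10.0, [1, 2, 3]),
    Leaf("shared_active", 2, 5.0, [4, 5]),     # protected: ref_count > 0
    Leaf("recent_private", 0, 20.0, [6]),
]
freed, evicted = evict(leaves, need_pages=4)
```

Note that shared_active is the oldest node yet survives, because ref_count > 0 excludes it from the candidate heap entirely.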

How this maps to scheduler behavior:

  • Admission path (PrefillAdder):
    • calls match_prefix(...),
    • locks handle before using matched indices,
    • reserves KV capacity before admitting request/chunk.
  • Finish path (_free_req_resources -> free_and_cache_finished_req):
    • inserts final prefix mapping,
    • frees duplicate indices already in cache,
    • unlocks old handle.

Mental model for "KV cache in large multi-user setups":

  • Think of a shared KV page arena plus an index layer:
    • arena = physical KV tensors,
    • index = radix tree + page_table mappings + request slots.
  • Requests do not own contiguous buffers forever; they own logical mappings while active.
  • Under pressure, unprotected old prefixes are evicted; hot/shared prefixes stay resident longer.
  • Throughput/latency come from balancing:
    • prefill admission budget,
    • decode in-flight reservation,
    • prefix-hit rate,
    • eviction pressure.

Concrete multi-user example (radix + scheduler):

  • Suppose tokenized prompts are:
    • User A: [SYS, You, are, helpful, Q1]
    • User B: [SYS, You, are, helpful, Q2]
    • User C: [SYS, You, are, helpful, Q3]
  • Step 1 (User A arrives, cold cache):
    • match_prefix misses (prefix_len = 0), scheduler computes full prefill for A.
    • on finish, insert_prefix stores mapping from that token prefix to KV page indices.
  • Step 2 (User B arrives):
    • match_prefix finds long prefix [SYS, You, are, helpful].
    • scheduler locks the handle, reuses those matched KV pages, and computes only suffix token(s) for B (e.g., Q2 and beyond).
    • B's active request protects shared nodes (ref_count increments), so they cannot be evicted mid-flight.
  • Step 3 (User C arrives while A/B still decoding):
    • C gets same shared-prefix hit, so TTFT is mostly the non-shared suffix work.
    • decode manager keeps all runnable reqs active; prefill budget is not dominated by repeated system prompt work.
  • Step 4 (memory pressure event):
    • allocator needs pages; cache manager asks radix manager to evict unprotected leaves.
    • old/private suffixes with ref_count == 0 are evicted first; active shared prefix remains protected.
  • Step 5 (requests finish):
    • scheduler free path inserts any newly useful prefix segments, unlocks handles, and returns duplicate-or-unused pages to free pool.
    • once no request references a shared prefix, it becomes evictable again.

Practical takeaway from the example:

  • Radix caching converts "N users with same template" from near-duplicate prefill work into shared-prefix reuse.
  • In multi-user serving, correctness is maintained by lock/unlock + ref counts; efficiency comes from prefix hits + selective eviction.

Scheduler / KV cache manager: decides which blocks a sequence owns/uses (or shares), and builds/extends that sequence’s block table.

Paged-attention kernels: use that block table + the token positions to compute the exact (block, offset) to write new K/V, and to read all past K/V during attention.
