Each LLM can override:
- Max sequence length
- Extend tokens: amount of fresh compute; in decode, each step is usually 1 extend token
- CUDA graph max batch size: static buffers are captured per batch size from 1 to max, so larger values increase startup time and memory usage
In online benchmarking, traces are scaled to simulate requests arriving close together (high load) vs far apart (low load).
- API server: entry point for an OpenAI-compatible API
- Tokenizer worker
- Detokenizer worker
- Scheduler worker: in TP mode, there is one scheduler per GPU
Because these are separate workers, they communicate using ZeroMQ for control messages and NCCL to exchange tensors.
`core.py` defines core runtime structures: `SamplingParams`, `Req`, `Batch`, and `Context`. `Req` is one user generation request and keeps per-request mutable state (lengths, cache progress, IDs, sampling policy). `Batch` is a temporary grouping of multiple `Req` objects for a single model forward pass. `Batch.reqs` is the list of requests processed together in that step. `SamplingParams` is per request (`Req.sampling_params`), so requests in the same batch can have different decoding settings. `Context` is the global, stateful runtime context (shared backends/page table + the active batch during forward), not a request.
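A minimal sketch of how these structures might fit together (field names here are illustrative, not the repo's exact API):

```python
from dataclasses import dataclass, field

@dataclass
class SamplingParams:
    # Per-request decoding settings; requests in one batch may differ.
    temperature: float = 1.0
    max_new_tokens: int = 128

@dataclass
class Req:
    # One user generation request with per-request mutable state.
    uid: int
    input_ids: list
    sampling_params: SamplingParams = field(default_factory=SamplingParams)
    cached_len: int = 0      # prompt tokens whose KV is already cached
    output_ids: list = field(default_factory=list)

@dataclass
class Batch:
    # Temporary grouping of Reqs for a single forward pass.
    reqs: list

    @property
    def is_decode(self) -> bool:
        # Decode step: every request extends by exactly one token.
        return all(r.cached_len >= len(r.input_ids) for r in self.reqs)
```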
This file is mostly utilities and logging; the most useful parts are helpers for managing ZMQ queues.
This does not contain a tokenizer model implementation. It includes utilities (for example, handling Chinese-character streaming behavior), but mostly contains logic for scheduling tokenizer and detokenizer workers.
When the server launches, it starts scheduler worker(s), tokenizer worker(s), and a detokenizer worker.
A key note: detokenization is stateful and requires a uid -> DecodeStatus map to stitch partial streamed text correctly. Detokenization is also typically fast (one token per step), whereas tokenization encodes the full prompt (system + user).
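Why detokenization needs a uid -> DecodeStatus map can be shown with a toy sketch: each step decodes the full accumulated token list and emits only the newly produced text (the toy `decode` callable and field names are assumptions, not the repo's API):

```python
from dataclasses import dataclass, field

@dataclass
class DecodeStatus:
    # Per-request state needed to stitch streamed text correctly.
    token_ids: list = field(default_factory=list)
    sent_text_len: int = 0  # characters already emitted to the client

def detokenize_step(status: DecodeStatus, new_token: int, decode) -> str:
    """Return only the text newly produced by this step's token."""
    status.token_ids.append(new_token)
    full = decode(status.token_ids)
    delta = full[status.sent_text_len:]
    status.sent_text_len = len(full)
    return delta
```

Keeping one `DecodeStatus` per uid is what makes streaming correct when several requests interleave.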
There is also a shell mode where users can enter prompts and run interactive sessions with the model (similar in spirit to a local chat REPL).
Calling this repo "minimal" can be misleading unless we separate ownership clearly.
- In repo: orchestration and glue code (scheduler, batching, KV/page-table bookkeeping, worker topology, backend selection).
- Outside repo: heavy compute kernels and low-level CUDA ops come from external packages.
- In repo:
  - `python/minisgl/attention/__init__.py` selects the backend (`fa`, `fi`, or a hybrid of `fa` and `fi`).
  - `python/minisgl/attention/fa.py` and `python/minisgl/attention/fi.py` build metadata and call into backend kernels.
  - `python/minisgl/layers/attention.py` calls `ctx.attn_backend.forward(...)`.
- Outside repo:
  - The `fa` path calls `sgl_kernel.flash_attn.flash_attn_with_kvcache` (external `sgl_kernel` package).
  - The `fi` path calls FlashInfer wrappers (external `flashinfer-python` package).
- Key point: the repo owns control flow and metadata prep; external deps own the optimized kernel implementations.
`perf.py` uses CUDA graphs to benchmark individual microkernel functions, removing per-launch overhead from the measurement.
In benchmark/client.py metrics:
- `first_times`: per-request time to first output chunk/token (`deltas[0]`) -> TTFT (Time to First Token).
- `accum_times`: per-token times after the first token (`deltas[1:]`) -> TPOT (Time Per Output Token).
- `e2e_times`: total request time (`tics[-1] - tics[0]`) -> end-to-end latency.
- They differ mainly by which interval is measured on the same request timeline: TTFT (start -> first token), TPOT (between later tokens), E2E (start -> last token).
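A sketch of how all three metrics fall out of one request timeline of chunk-arrival timestamps (function name is hypothetical):

```python
def request_metrics(tics):
    """Derive TTFT/TPOT/E2E from per-chunk arrival timestamps.

    tics[0] is the request start; tics[1:] are output-chunk times.
    """
    deltas = [b - a for a, b in zip(tics, tics[1:])]
    ttft = deltas[0]          # start -> first token
    tpot = deltas[1:]         # gaps between later tokens
    e2e = tics[-1] - tics[0]  # start -> last token
    return ttft, tpot, e2e
```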
- Provides a tiny abstraction for TP collectives: `all_reduce` and `all_gather`.
- Default backend uses `torch.distributed`.
- Optional PyNCCL backend can be enabled and overrides the default via plugin dispatch.
- TP size 1 is a fast no-op path.
- `python/minisgl/kernel/` contains local wrappers around custom Triton/CUDA/C++ ops.
- These are invoked from model/runtime paths, not just tests.
`fused_moe_kernel_triton` + `moe_sum_reduce_triton`:
- Call path: `python/minisgl/moe/fused.py` -> `fused_experts_impl(...)`
  - runs `fused_moe_kernel_triton(...)` twice (for `w1`, then `w2`)
  - then `moe_sum_reduce_triton(...)` to reduce across top-k experts
- Inputs prepared before the kernel:
  - top-k routing (`fused_topk`)
  - token/expert block alignment (`moe_align_block_size`) to produce `sorted_token_ids`, `expert_ids`, and a padded token count
- What is fused in the Triton kernel:
  - token dispatch by expert + expert matmul + optional routed-weight multiply + write to output cache
- Important boundary:
  - router/top-k selection is not done inside `fused_moe_kernel`; it happens before launch.
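A pure-Python reference of what the pre-launch top-k routing computes, assuming a softmax-over-selected-logits variant (real routers differ in where they normalize; the name echoes `fused_topk` but this is only a sketch):

```python
import math

def fused_topk_ref(gate_logits, top_k):
    """Per token: pick top_k experts by gate logit and compute
    normalized routing weights over just those experts."""
    topk_ids, topk_weights = [], []
    for logits in gate_logits:  # one row per token
        order = sorted(range(len(logits)), key=lambda e: -logits[e])[:top_k]
        exps = [math.exp(logits[e]) for e in order]
        s = sum(exps)
        topk_ids.append(order)
        topk_weights.append([x / s for x in exps])
    return topk_ids, topk_weights
```

The fused kernel then only consumes these ids/weights; it never runs the router itself.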
indexing (`python/minisgl/kernel/index.py`):
- Primary runtime use: `python/minisgl/layers/embedding.py` -> `VocabParallelEmbedding.forward(...)`
  - replaces the embedding gather (`F.embedding`) with a custom gather
- TP behavior:
  - in TP mode, passes `vocab_range` so out-of-range indices are zeroed in-kernel for vocab-parallel shards; the result is then TP all-reduced
- Why this kernel exists:
  - warp-level contiguous copy for large embedding rows + split-warp mode for wider rows
  - avoids generic advanced-indexing overhead in the hot embedding path
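A sketch of the vocab-parallel gather semantics, with the in-kernel zeroing replaced by a Python check; summing the per-rank outputs plays the role of the TP all-reduce (function name and list-of-lists layout are illustrative):

```python
def vocab_parallel_gather(ids, shard_weight, vocab_range):
    """Gather embedding rows for this rank's vocab shard.

    Indices outside [start, end) produce zero rows; summing
    (all-reducing) the per-rank outputs reconstructs the full lookup.
    """
    start, end = vocab_range
    dim = len(shard_weight[0])
    out = []
    for i in ids:
        if start <= i < end:
            out.append(list(shard_weight[i - start]))
        else:
            out.append([0.0] * dim)
    return out
```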
store_cache (`python/minisgl/kernel/store.py`):
- Primary runtime use: `python/minisgl/kvcache/mha_pool.py` -> `MHAKVCache.store_kv(...)`
  - writes the current step's K/V tensors into the backing KV cache via `out_loc` indices
- Why this kernel exists:
  - one kernel performs indexed K and V writes with warp-level copy
  - optimized for the KV cache update pattern in decode/prefill loops
fast_compare_key (`python/minisgl/kernel/radix.py`):
- Primary runtime use: `python/minisgl/kvcache/radix_manager.py` -> `RadixTreeNode.get_match_len(...)`
  - used while walking the radix tree to find the shared prefix length with incoming token IDs
- Why this kernel exists:
  - a C++ contiguous mismatch scan removes Python-level loop overhead in repeated prefix checks
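The scan the C++ kernel replaces is essentially this reference loop (the name matches the method it backs; the body is a sketch):

```python
def get_match_len(a, b):
    """Length of the common prefix of two token-ID sequences.

    The C++ kernel performs the same contiguous mismatch scan
    without per-element Python loop overhead."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n
```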
init_pynccl (`python/minisgl/kernel/pynccl.py`):
- Primary runtime use: `python/minisgl/distributed/impl.py` -> `enable_pynccl_distributed(...)`
  - installs the `PyNCCLDistributedImpl` plugin used by `DistributedCommunicator`
- Runtime effect:
  - `all_reduce` and `all_gather` dispatch to the PyNCCL communicator instead of torch distributed when enabled
  - uses a CPU process group once to broadcast the NCCL unique ID, then performs collectives on the GPU communicator
Layers, models, and moe are basically where the PyTorch model-specific pieces are defined.
- `python/minisgl/message/` defines what is sent between workers (typed message schema/protocol).
- Queue logic in `python/minisgl/utils/mp.py` defines how messages are sent (ZMQ PUSH/PULL/PUB/SUB + msgpack transport).
- In short: `message` = semantics/payload types; queue utils = delivery mechanism.
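The schema/transport split can be sketched with stdlib JSON standing in for msgpack-over-ZMQ (the message fields here are assumptions, not the repo's schema):

```python
import json
from dataclasses import dataclass, asdict

# Schema layer (WHAT is sent): a typed message, as in message/.
@dataclass
class UserMsg:
    uid: int
    text: str

# Transport layer (HOW it is sent): any byte pipe works; the repo
# uses msgpack over ZMQ, but the separation of concerns is the same.
def encode(msg) -> bytes:
    return json.dumps({"type": type(msg).__name__, **asdict(msg)}).encode()

def decode(data: bytes) -> UserMsg:
    d = json.loads(data)
    assert d.pop("type") == "UserMsg"
    return UserMsg(**d)
```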
- `python/minisgl/engine/engine.py` is the runtime wrapper around model/layers.
- It initializes execution infrastructure: device/stream, distributed comms, model weights, KV cache, page table, attention/MoE backends, sampler, and CUDA-graph runner.
- At runtime it executes the batched forward (`forward_batch`), then samples next tokens and returns GPU/CPU token outputs.
- It is not the model definition itself; `models/` and `layers/` define network structure, while the engine sets up and runs that structure efficiently.
- CUDA-graph path (decode fast path):
  - captures multiple fixed decode batch sizes at init, then replays when `batch.is_decode` and the size is supported; otherwise falls back to eager forward.
  - scheduler pads decode batches to the nearest captured size; real data is copied into fixed graph buffers before replay.
- CUDA graph memory pool reuse:
  - the capture loop reuses one pool handle across sizes (`pool = graph.pool()`, then pass `pool=...`).
  - benefit: lower memory overhead/fragmentation when capturing many sizes.
  - tradeoff: less strict per-graph memory isolation than separate pools.
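The pad-to-nearest-captured-size decision can be sketched like this (function name is hypothetical; the real runner also copies batch data into the fixed graph buffers before replay):

```python
import bisect

def pick_graph_size(batch_size, captured_sizes):
    """Return the smallest captured CUDA-graph batch size that fits
    this decode batch, or None to fall back to eager forward."""
    sizes = sorted(captured_sizes)
    i = bisect.bisect_left(sizes, batch_size)
    return sizes[i] if i < len(sizes) else None
```

Padding a batch of 3 to a captured size of 4 wastes a little compute but keeps the static graph buffers valid.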
- `python/minisgl/scheduler/` is the runtime control loop between incoming user requests and model forward execution.
- It owns the request lifecycle + batching policy, while `engine` owns the actual model compute.
- Core idea: keep the GPU busy by continuously selecting the next runnable batch from:
  - prefill queue (`PrefillManager`): ingests new prompts / chunked prompt extension
  - decode queue (`DecodeManager`): 1-step next-token generation for active requests
- One-line memory summary:
- PagedAttention = how KV is stored/allocated.
- RadixManager = how KV is found/shared (which past KV to reuse).
Terminology (to avoid ambiguity):
- KV cache pages: physical storage locations for K/V states.
- `CacheManager`: allocator/bookkeeper for KV cache pages (free list, eviction, prefix-cache locks).
- `CacheManager` scope: global shared structure per scheduler/worker process; it manages KV memory across batches/requests/sessions (not per-batch).
- `TableManager`: per-request slot allocator (`table_idx`) + token/page mapping helpers.
- `page_table`: indirection map from `(table_idx, token_pos)` -> `kv_page_idx`.
- `token_pool`: scheduler-side staging of token IDs indexed by `table_idx` and token position.
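A minimal sketch of the `page_table` indirection (class shape is illustrative; the real table is a GPU tensor, and page granularity may cover more than one token):

```python
class PageTable:
    """(table_idx, token_pos) -> kv_page_idx indirection, assuming
    one KV slot per token position for simplicity."""

    def __init__(self, max_reqs, max_tokens):
        self.table = [[-1] * max_tokens for _ in range(max_reqs)]

    def assign(self, table_idx, token_pos, kv_page_idx):
        self.table[table_idx][token_pos] = kv_page_idx

    def lookup(self, table_idx, token_pos):
        return self.table[table_idx][token_pos]
```

Two requests sharing a cached prefix simply point their early positions at the same physical pages.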
High-level data flow:
- Tokenizer/server sends `UserMsg` -> scheduler validates lengths and enqueues into prefill.
- Scheduler builds a `Batch`, prepares metadata (positions, input/write mappings), allocates KV/page slots, and calls `engine.forward_batch(...)`.
- Next tokens are written into token pool/page table bookkeeping, converted into `DetokenizeMsg`, and sent back to the detokenizer/tokenizer worker.
- Finished or aborted requests free their table slots and return KV pages to the cache manager.
Files and responsibilities:
- `scheduler.py`
  - main loop + message handling (`UserMsg`, `AbortBackendMsg`, `ExitMsg`)
  - scheduling policy: currently prefill-first, then decode fallback
  - stream overlap mode: overlaps "process last batch output" with "run next batch" to hide CPU-side latency
  - request finalization: EOS/max-token checks + resource free
- `prefill.py`
  - pending request queue
  - budgeted prefill packing via `PrefillAdder` (token budget + estimated cache pressure)
  - chunked prefill (`ChunkedReq`) when prompt extension exceeds the current prefill budget
- `cache.py`
  - allocator over free KV page slots + evictable KV cache pages
  - prefix cache integration (`match_prefix`, `insert_prefix`) via the cache backend manager
  - lock/unlock handles to prevent eviction races for in-flight requests
- `table.py`
  - request slot allocator (`table_idx`) for max concurrent running requests
  - per-request token pool backing tensor (mirrors page-table indexing layout)
- `io.py`
  - scheduler <-> tokenizer/detokenizer transport
  - single-rank and TP multi-rank message fanout/collection over ZMQ + CPU process-group sync
  - rank-0 is authoritative for ingress/egress in multi-rank mode
- `utils.py`
  - lightweight scheduling data structures (`PendingReq`)
Important mechanics:
- Prefill admission is constrained by both:
  - the prefill token budget (`max_extend_tokens`)
  - estimated KV capacity after reserving in-flight decode tokens
- Prefix cache reuse:
  - a new request attempts a prefix match on all but the final prompt token
  - matched prefix KV pages are copied into the request's page-table entries, reducing fresh compute/storage
- Resource ownership:
  - `TableManager` tracks request slots + page-table/token-pool indexing
  - `CacheManager` tracks KV-page free/evictable capacity and finished-request reinsertion
- Idle behavior:
  - the scheduler can block on an incoming message and periodically run cache integrity checks
Chunked prefill:
- Meaning: if a prompt's uncached prefix is larger than current prefill budget, scheduler pre-fills only a chunk now, then continues remaining prompt tokens in later prefill rounds.
- Implementation hint: `PrefillAdder` creates a `ChunkedReq` when `chunk_size < remain_len`.
- Why: prevents one very long prompt from monopolizing the prefill budget and lets other requests make progress.
- Effect: chunked prefill improves fairness/latency under load, at the cost of extra scheduling steps for long prompts.
- Example: if remaining prompt tokens are 1200 and budget is 256, scheduler may process 256 now and leave 944 for later prefill iterations.
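The chunking arithmetic from the example can be sketched as a scheduling-only helper (hypothetical name; the real `PrefillAdder` also accounts for cache pressure and competing requests):

```python
def plan_prefill_chunks(remain_len, budget):
    """Split a prompt's uncached tokens into per-round prefill chunks
    no larger than the per-round token budget."""
    chunks = []
    while remain_len > 0:
        chunk = min(budget, remain_len)
        chunks.append(chunk)
        remain_len -= chunk
    return chunks
```

For 1200 remaining tokens and a 256-token budget, 256 tokens go now and 944 are left for later rounds.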
Overlap vs normal loop:
- Normal loop: receive -> schedule -> forward -> process outputs (serial).
- Overlap loop: while current batch runs on engine stream, process previous batch outputs on scheduler stream.
- This pipelining improves utilization when CPU-side post-processing / message handling is non-trivial.
python/minisgl/kvcache/ has two distinct layers:
- Physical KV storage (`BaseKVCache`/`MHAKVCache`): where K/V tensors actually live on GPU.
- Logical prefix-cache manager (`BaseCacheManager` + implementations): metadata/index structure used to reuse and evict cached prefixes.
- Important scope note: the cache manager is global shared runtime state (per scheduler/worker process), not per-user state; it must arbitrate KV memory across all active and recent sessions.
Physical KV storage (mha_pool.py):
- `MHAKVCache` allocates one large GPU tensor shaped roughly like: `[K_or_V, layer, page, page_size, local_kv_heads, head_dim]`
- During forward, kernels call `store_kv(...)` with `out_loc` indices; this writes newly produced K/V rows into page locations.
- In TP mode, each rank stores only its local shard of KV heads (`local_kv_heads = num_kv_heads / tp_size`).
Prefix cache manager API (base.py):
- `match_prefix(input_ids)` -> find the longest reusable prefix and return:
  - a handle (for lock lifetime),
  - page indices for matched tokens.
- `lock_handle(handle)` / unlock: protects a matched prefix from eviction while the request is in flight.
- `insert_prefix(input_ids, indices)`: add a newly completed prefix mapping into the cache.
- `evict(size)`: free page indices by evicting unprotected cached prefixes.
Two manager implementations:
- `naive_manager.py`:
  - no prefix reuse, no eviction logic (mostly "always miss" behavior).
  - useful as a simple baseline.
- `radix_manager.py`:
  - a real prefix-sharing cache using a radix tree over token IDs.
  - this is what "radix caching" means here.
What radix caching actually is (in serving terms):
- A radix tree stores token sequences by shared prefixes.
- If many requests begin with the same token prefix (system prompt, template, repeated context), those prefix tokens map to one shared set of KV pages.
- New requests "walk" the tree to find the longest matching cached prefix, then only compute the suffix (cache miss part).
- If a partial mismatch happens mid-node, the node is split (`_split_at`) so the tree remains prefix-correct.
Why radix helps multi-user serving:
- Many users share prompt boilerplate (chat template/system instructions), so prefix reuse reduces prefill FLOPs and KV writes.
- Less prefill work means lower TTFT under load and more budget for decode.
- Shared prefixes are reference-counted; active requests protect nodes from eviction.
Eviction and protection model (important for large multi-user setups):
- Each radix node tracks:
  - `ref_count` (protected if > 0),
  - `timestamp` (recently touched time),
  - `value` (KV page indices for that token segment).
- Eviction chooses leaf nodes with `ref_count == 0`, using a timestamp heap (LRU-like over leaves).
- Protected nodes (in-flight requests) are not evicted; only evictable nodes contribute to freeable capacity.
How this maps to scheduler behavior:
- Admission path (`PrefillAdder`):
  - calls `match_prefix(...)`,
  - locks the handle before using matched indices,
  - reserves KV capacity before admitting the request/chunk.
- Finish path (`_free_req_resources` -> `free_and_cache_finished_req`):
  - inserts the final prefix mapping,
  - frees duplicate indices already in the cache,
  - unlocks the old handle.
Mental model for "KV cache in large multi-user setups":
- Think of a shared KV page arena plus an index layer:
- arena = physical KV tensors,
- index = radix tree + page_table mappings + request slots.
- Requests do not own contiguous buffers forever; they own logical mappings while active.
- Under pressure, unprotected old prefixes are evicted; hot/shared prefixes stay resident longer.
- Throughput/latency come from balancing:
- prefill admission budget,
- decode in-flight reservation,
- prefix-hit rate,
- eviction pressure.
Concrete multi-user example (radix + scheduler):
- Suppose the tokenized prompts are:
  - User A: `[SYS, You, are, helpful, Q1]`
  - User B: `[SYS, You, are, helpful, Q2]`
  - User C: `[SYS, You, are, helpful, Q3]`
- Step 1 (User A arrives, cold cache):
  - `match_prefix` misses (prefix_len = 0), so the scheduler computes a full prefill for A.
  - on finish, `insert_prefix` stores a mapping from that token prefix to KV page indices.
- Step 2 (User B arrives):
  - `match_prefix` finds the long prefix `[SYS, You, are, helpful]`.
  - the scheduler locks the handle, reuses those matched KV pages, and computes only the suffix token(s) for B (e.g., `Q2` and beyond).
  - B's active request protects the shared nodes (`ref_count` increments), so they cannot be evicted mid-flight.
- Step 3 (User C arrives while A/B still decoding):
- C gets same shared-prefix hit, so TTFT is mostly the non-shared suffix work.
- decode manager keeps all runnable reqs active; prefill budget is not dominated by repeated system prompt work.
- Step 4 (memory pressure event):
- allocator needs pages; cache manager asks radix manager to evict unprotected leaves.
  - old/private suffixes with `ref_count == 0` are evicted first; the active shared prefix remains protected.
- Step 5 (requests finish):
- scheduler free path inserts any newly useful prefix segments, unlocks handles, and returns duplicate-or-unused pages to free pool.
- once no request references a shared prefix, it becomes evictable again.
Practical takeaway from the example:
- Radix caching converts "N users with same template" from near-duplicate prefill work into shared-prefix reuse.
- In multi-user serving, correctness is maintained by lock/unlock + ref counts; efficiency comes from prefix hits + selective eviction.
Scheduler / KV cache manager: decides which blocks a sequence owns/uses (or shares), and builds/extends that sequence’s block table.
Paged-attention kernels: use that block table + the token positions to compute the exact (block, offset) to write new K/V, and to read all past K/V during attention.
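The (block, offset) arithmetic the kernel performs can be sketched directly (assuming a per-sequence block table and fixed block size; the function name is hypothetical):

```python
def kv_write_location(block_table, token_pos, block_size):
    """Map a sequence-local token position to a physical (block, offset)
    via the sequence's block table, as a paged-attention kernel would."""
    logical_block = token_pos // block_size
    offset = token_pos % block_size
    return block_table[logical_block], offset
```

Reads during attention walk every past position through the same mapping, so sequences need not own contiguous physical memory.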