Each LLM can override:
- Max sequence length
- Extend tokens: the amount of fresh compute per forward pass, i.e. how many new tokens are processed at once; during decode, each step usually contributes 1 extend token per request
- CUDA graph max batch size: CUDA graphs are captured with static buffers for each batch size from 1 up to this max, so larger values increase startup time and memory usage
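A minimal sketch of how these per-LLM overrides might be represented, and why the CUDA graph setting costs startup time and memory. The field names (`max_seq_len`, `max_extend_tokens`, `cuda_graph_max_bs`) are hypothetical illustrations, not the engine's actual config keys:

```python
from dataclasses import dataclass

@dataclass
class LLMOverrides:
    # Hypothetical names for the three overridable settings described above.
    max_seq_len: int = 4096        # longest prompt + output a request may use
    max_extend_tokens: int = 8192  # fresh tokens computed per forward pass;
                                   # in decode, ~1 extend token per request per step
    cuda_graph_max_bs: int = 32    # graphs are captured for batch sizes 1..max

    def graph_capture_sizes(self) -> list[int]:
        # One CUDA graph (with its own static input/output buffers) is
        # captured per batch size, so capture time and memory grow
        # linearly with this list's length.
        return list(range(1, self.cuda_graph_max_bs + 1))


cfg = LLMOverrides(cuda_graph_max_bs=4)
print(cfg.graph_capture_sizes())  # → [1, 2, 3, 4]
```

Keeping `cuda_graph_max_bs` close to the batch sizes you actually serve avoids paying capture cost for graphs that are never replayed.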