@kiya00
Created June 10, 2025 13:24
root@53acaad1b40e:/app/tensorrt_llm# trtllm-bench --model $MODEL_ID throughput --dataset /tmp/synthetic_128_128.txt --backend autodeploy
2025-06-10 13:11:02,295 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
[TensorRT-LLM] TensorRT-LLM version: 0.21.0rc0
[06/10/2025-13:11:02] [TRT-LLM] [I] Preparing to run throughput benchmark...
Parse safetensors files: 100%|███████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 22.03it/s]
[06/10/2025-13:11:03] [TRT-LLM] [I]
===========================================================
= DATASET DETAILS
===========================================================
Dataset Path: /tmp/synthetic_128_128.txt
Number of Sequences: 3000
-- Percentiles statistics ---------------------------------
Input Output Seq. Length
-----------------------------------------------------------
MIN: 128.0000 128.0000 256.0000
MAX: 128.0000 128.0000 256.0000
AVG: 128.0000 128.0000 256.0000
P50: 128.0000 128.0000 256.0000
P90: 128.0000 128.0000 256.0000
P95: 128.0000 128.0000 256.0000
P99: 128.0000 128.0000 256.0000
===========================================================
Fetching 17 files: 100%|████████████████████████████████████████████████████████████████████████████| 17/17 [00:00<00:00, 17951.45it/s]
Parse safetensors files: 100%|███████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 7.68it/s]
[06/10/2025-13:11:05] [TRT-LLM] [I] Validating KV Cache config against kv_cache_dtype="auto"
[06/10/2025-13:11:05] [TRT-LLM] [I] KV cache quantization set to "auto". Using checkpoint KV quantization.
[06/10/2025-13:11:05] [TRT-LLM] [I] Estimated engine size: 14.96 GB
[06/10/2025-13:11:05] [TRT-LLM] [I] Estimated total available memory for KV cache: 64.69 GB
[06/10/2025-13:11:05] [TRT-LLM] [I] Estimated total KV cache memory: 61.46 GB
[06/10/2025-13:11:05] [TRT-LLM] [I] Estimated max number of requests in KV cache memory: 1966.57
[06/10/2025-13:11:05] [TRT-LLM] [I] Estimated max batch size (after fine-tune): 2048
[06/10/2025-13:11:05] [TRT-LLM] [I] Estimated max num tokens (after fine-tune): 4096
[06/10/2025-13:11:05] [TRT-LLM] [I] Max batch size and max num tokens not provided. Using heuristics or pre-defined settings: max_batch_size=2048, max_num_tokens=4096.
[06/10/2025-13:11:05] [TRT-LLM] [I] Setting PyTorch max sequence length to 256
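### Sanity check of the KV-cache estimates above: a sketch only, assuming standard Llama-3.1-8B geometry
### (32 layers, 8 KV heads, head_dim 128) and a bf16 KV cache; the 61.46 GB figure is copied from the log.
layers, kv_heads, head_dim, dtype_bytes = 32, 8, 128, 2
max_seq_len = 256                                              # input 128 + output 128
per_token = 2 * layers * kv_heads * head_dim * dtype_bytes     # K and V per token: 128 KiB
per_request = per_token * max_seq_len                          # ~32 MiB of KV cache per 256-token sequence
kv_budget = 61.46 * 1024**3                                    # "Estimated total KV cache memory"
print(per_request / 2**20)                                     # ~32.0 MiB
print(kv_budget / per_request)                                 # ~1967, in line with the logged 1966.57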
[06/10/2025-13:11:05] [TRT-LLM] [I] Setting up throughput benchmark.
[06/10/2025-13:11:05] [TRT-LLM] [W] Using default gpus_per_node: 8
[06/10/2025-13:11:05] [TRT-LLM] [I] Set nccl_plugin to None.
[06/10/2025-13:11:05] [TRT-LLM] [I] AutoDeployConfig(extra_resource_managers={}, use_cuda_graph=True, cuda_graph_batch_sizes=[1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 256, 512, 1024, 2048], cuda_graph_max_batch_size=2048, cuda_graph_padding_enabled=True, disable_overlap_scheduler=True, moe_max_num_tokens=None, moe_load_balancer=None, attn_backend='FlashInfer', moe_backend='CUTLASS', mixed_sampler=False, enable_trtllm_sampler=False, kv_cache_dtype='auto', use_kv_cache=True, enable_iter_perf_stats=False, enable_iter_req_stats=False, print_iter_log=False, torch_compile_enabled=True, torch_compile_fullgraph=True, torch_compile_inductor_enabled=False, torch_compile_piecewise_cuda_graph=False, torch_compile_enable_userbuffers=True, autotuner_enabled=True, enable_layerwise_nvtx_marker=False, load_format=<LoadFormat.AUTO: 0>, model_factory='AutoModelForCausalLM', model_kwargs={'use_cache': False}, mla_backend='MultiHeadLatentAttention', skip_loading_weights=True, free_mem_ratio=0.8)
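### The cuda_graph_batch_sizes list above follows a simple pattern (a guess read off the log, not taken from the
### source): 1-8 as powers of two, 16 through 128 in steps of 8, then powers of two up to cuda_graph_max_batch_size.
sizes = [1, 2, 4, 8] + list(range(16, 129, 8)) + [256, 512, 1024, 2048]
assert sizes == [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 256, 512, 1024, 2048]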
rank 0 using MpiPoolSession to spawn MPI processes
[06/10/2025-13:11:05] [TRT-LLM] [I] Generating a new HMAC key for server proxy_request_queue
[06/10/2025-13:11:05] [TRT-LLM] [I] Generating a new HMAC key for server proxy_request_error_queue
[06/10/2025-13:11:05] [TRT-LLM] [I] Generating a new HMAC key for server proxy_result_queue
[06/10/2025-13:11:05] [TRT-LLM] [I] Generating a new HMAC key for server proxy_stats_queue
[06/10/2025-13:11:05] [TRT-LLM] [I] Generating a new HMAC key for server proxy_kv_cache_events_queue
2025-06-10 13:11:14,503 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
[TensorRT-LLM] TensorRT-LLM version: 0.21.0rc0
[TensorRT-LLM][INFO] Refreshed the MPI local session
[06/10/2025-13:11:14] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Initializing for: lib='OMPI', local_rank=0, world_size=1, port=40407
[06/10/2025-13:11:14] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] max_seq_len=256, max_batch_size=2048, tokens_per_block=32, max_num_tokens=4096
/usr/local/lib/python3.12/dist-packages/torch/backends/mkldnn/__init__.py:78: UserWarning: TF32 acceleration on top of oneDNN is available for Intel GPUs. The current Torch version does not have Intel GPU Support. (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/Context.cpp:148.)
torch._C._set_onednn_allow_tf32(_allow_tf32)
[06/10/2025-13:11:22] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] MoE Pattern Matching
[06/10/2025-13:11:23] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Match explicit(HF) style RoPE
[06/10/2025-13:11:23] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Match Complex style RoPE
[06/10/2025-13:11:24] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Match RoPE layout to bsnd
[06/10/2025-13:11:24] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Eliminating redundant transpose operations
[06/10/2025-13:11:24] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] RoPE optimization
[06/10/2025-13:11:24] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Sharding graph for TP
[06/10/2025-13:11:24] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Skipping sharding for single device
[06/10/2025-13:11:24] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Sharding graph for EP
[06/10/2025-13:11:24] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Skipping sharding for single device
[06/10/2025-13:11:24] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Sharding graph for BMM
[06/10/2025-13:11:24] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Skipping sharding for single device
[06/10/2025-13:11:26] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Loading and initializing weights.
[06/10/2025-13:11:27] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] MoE fusion
[06/10/2025-13:11:28] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Fusing allreduce, residual, and rmsnorm
[06/10/2025-13:11:28] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] GEMM+Collective fusion
[06/10/2025-13:11:28] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Replacing attn op attention.bsnd_grouped_sdpa with backend FlashInferAttention
************************************+++++++++
[06/10/2025-13:11:28] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Setting up caches + moving info args to device
[06/10/2025-13:11:28] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Free memory: 66589097984, Total memory: 84929347584
[06/10/2025-13:11:28] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Current cache size: 536870912, Current num pages: 128
[06/10/2025-13:11:28] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Free memory before forward pass: 66589097984
2025-06-10 13:11:29,856 - INFO - flashinfer.jit: Loading JIT ops: rope
2025-06-10 13:11:29,868 - INFO - flashinfer.jit: Finished loading JIT ops: rope
2025-06-10 13:11:29,870 - INFO - flashinfer.jit: Loading JIT ops: page
2025-06-10 13:11:29,879 - INFO - flashinfer.jit: Finished loading JIT ops: page
2025-06-10 13:11:29,885 - INFO - flashinfer.jit: Loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
2025-06-10 13:11:29,895 - INFO - flashinfer.jit: Finished loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
[06/10/2025-13:11:30] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Free memory after forward pass: 64806518784
[06/10/2025-13:11:30] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Memory for forward pass: 1782579200
[06/10/2025-13:11:33] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] After all_gather - new_num_pages: 12488
[06/10/2025-13:11:33] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Fusion before compiling...
[06/10/2025-13:11:33] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Compiling for torch-opt backend...
[06/10/2025-13:11:38] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 1
[06/10/2025-13:11:43] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 2
[06/10/2025-13:11:48] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 4
[06/10/2025-13:11:52] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 8
[06/10/2025-13:11:57] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 16
[06/10/2025-13:12:02] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 24
[06/10/2025-13:12:06] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 32
[06/10/2025-13:12:11] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 40
[rank0]:W0610 13:12:11.582000 117344 torch/_dynamo/convert_frame.py:961] [0/8] torch._dynamo hit config.recompile_limit (8)
[rank0]:W0610 13:12:11.582000 117344 torch/_dynamo/convert_frame.py:961] [0/8] function: 'forward' (<eval_with_key>.61:4)
[rank0]:W0610 13:12:11.582000 117344 torch/_dynamo/convert_frame.py:961] [0/8] last reason: 0/7: tensor 'L['input_ids']' size mismatch at index 0. expected 32, actual 40
[rank0]:W0610 13:12:11.582000 117344 torch/_dynamo/convert_frame.py:961] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
[rank0]:W0610 13:12:11.582000 117344 torch/_dynamo/convert_frame.py:961] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
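### The recompile-limit warning above is triggered by the changing batch dimension of input_ids across graph
### captures. A sketch of the standard torch._dynamo knobs for this situation; whether AutoDeploy exposes a hook
### to apply them at this point is an assumption, and batch sizes past the limit likely just fall back to eager.
import torch
torch._dynamo.config.recompile_limit = 32     # allow more specializations than the default 8
# ...or mark the batch dimension dynamic before compilation so one graph covers all sizes:
# torch._dynamo.mark_dynamic(input_ids, 0)    # `input_ids` here stands in for the compiled module's input tensor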
[06/10/2025-13:12:12] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 48
[06/10/2025-13:12:12] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 56
[06/10/2025-13:12:12] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 64
[06/10/2025-13:12:13] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 72
[06/10/2025-13:12:13] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 80
[06/10/2025-13:12:13] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 88
[06/10/2025-13:12:14] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 96
[06/10/2025-13:12:14] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 104
[06/10/2025-13:12:15] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 112
[06/10/2025-13:12:15] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 120
[06/10/2025-13:12:15] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 128
[06/10/2025-13:12:16] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 256
[06/10/2025-13:12:16] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 512
[06/10/2025-13:12:17] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 1024
[06/10/2025-13:12:17] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 2048
[06/10/2025-13:12:18] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Compile time with backend torch-opt: 44.524837 seconds
[06/10/2025-13:12:18] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Using fake cache manager with head_dim=0 and num pages: 12488
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 8 [window size=256]
[TensorRT-LLM][INFO] Number of tokens per block: 32.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 0.00 GiB for max tokens in paged KV cache (399616).
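### Quick cross-check of the paged-KV numbers above (arithmetic only, values copied from the log):
num_pages, tokens_per_block, max_seq_len = 12488, 32, 256
print(max_seq_len // tokens_per_block)    # 8 KV cache pages per sequence, as logged
print(num_pages * tokens_per_block)       # 399616 max tokens in the paged KV cache, as logged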
[06/10/2025-13:12:18] [TRT-LLM] [I] Setting up for warmup...
[06/10/2025-13:12:18] [TRT-LLM] [I] Running warmup.
[06/10/2025-13:12:18] [TRT-LLM] [I] Starting benchmarking async task.
[06/10/2025-13:12:18] [TRT-LLM] [I] Starting benchmark...
[06/10/2025-13:12:18] [TRT-LLM] [I] Request submission complete. [count=2, time=0.0000s, rate=166694.45 req/s]
[06/10/2025-13:12:20] [TRT-LLM] [I] Benchmark complete.
[06/10/2025-13:12:20] [TRT-LLM] [I] Stopping LLM backend.
[06/10/2025-13:12:20] [TRT-LLM] [I] Cancelling all 0 tasks to complete.
[06/10/2025-13:12:20] [TRT-LLM] [I] All tasks cancelled.
[06/10/2025-13:12:20] [TRT-LLM] [I] LLM Backend stopped.
[06/10/2025-13:12:20] [TRT-LLM] [I] Worker task cancelled.
[06/10/2025-13:12:20] [TRT-LLM] [I] Warmup done.
[06/10/2025-13:12:20] [TRT-LLM] [I] No log path provided, skipping logging.
[06/10/2025-13:12:20] [TRT-LLM] [I] Starting benchmarking async task.
[06/10/2025-13:12:20] [TRT-LLM] [I] Starting benchmark...
[06/10/2025-13:12:20] [TRT-LLM] [I] Request submission complete. [count=3000, time=0.0014s, rate=2126744.02 req/s]
Traceback (most recent call last):
File "/app/tensorrt_llm/TensorRT-LLM/tensorrt_llm/_torch/pyexecutor/py_executor.py", line 1681, in _update_requests
self.sampler.update_requests(sample_state)
File "/app/tensorrt_llm/TensorRT-LLM/tensorrt_llm/_torch/pyexecutor/sampler.py", line 240, in update_requests
state.sampler_event.synchronize()
File "/usr/local/lib/python3.12/dist-packages/torch/cuda/streams.py", line 227, in synchronize
super().synchronize()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
[06/10/2025-13:12:52] [TRT-LLM] [E] Encountered an error in sampling: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
[2025-06-10 13:12:53] ERROR base_events.py:1821: Task exception was never retrieved
future: <Task finished name='Task-2074' coro=<LlmManager.process_request() done, defined at /app/tensorrt_llm/TensorRT-LLM/tensorrt_llm/bench/benchmark/utils/asynchronous.py:44> exception=RequestError('CUDA error: an illegal memory access was encountered\nCUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1\nCompile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.\n')>
Traceback (most recent call last):
File "/app/tensorrt_llm/TensorRT-LLM/tensorrt_llm/bench/benchmark/utils/asynchronous.py", line 65, in process_request
response: RequestOutput = await output.aresult()
^^^^^^^^^^^^^^^^^^^^^^
File "/app/tensorrt_llm/TensorRT-LLM/tensorrt_llm/executor/result.py", line 491, in aresult
await self._aresult_step()
File "/app/tensorrt_llm/TensorRT-LLM/tensorrt_llm/executor/result.py", line 469, in _aresult_step
self._handle_response(response)
File "/usr/lib/python3.12/contextlib.py", line 81, in inner
return func(*args, **kwds)
^^^^^^^^^^^^^^^^^^^
File "/app/tensorrt_llm/TensorRT-LLM/tensorrt_llm/executor/result.py", line 358, in _handle_response
GenerationResultBase._handle_response(self, response)
File "/usr/lib/python3.12/contextlib.py", line 81, in inner
return func(*args, **kwds)
^^^^^^^^^^^^^^^^^^^
File "/app/tensorrt_llm/TensorRT-LLM/tensorrt_llm/executor/result.py", line 328, in _handle_response
handler(response.error_msg)
File "/app/tensorrt_llm/TensorRT-LLM/tensorrt_llm/executor/executor.py", line 260, in _handle_background_error
raise RequestError(error)
tensorrt_llm.executor.utils.RequestError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
### ...... the RequestError traceback above is repeated for further requests (truncated here)
[06/10/2025-13:12:53] [TRT-LLM] [I] Benchmark complete.
[06/10/2025-13:12:53] [TRT-LLM] [I] Stopping LLM backend.
[06/10/2025-13:12:53] [TRT-LLM] [I] Cancelling all 0 tasks to complete.
[06/10/2025-13:12:53] [TRT-LLM] [I] All tasks cancelled.
[06/10/2025-13:12:53] [TRT-LLM] [I] LLM Backend stopped.
[06/10/2025-13:12:53] [TRT-LLM] [I] Worker task cancelled.
[2025-06-10 13:12:53] ERROR base_events.py:1821: Task exception was never retrieved
future: <Task finished name='Task-3012' coro=<LlmManager.process_request() done, defined at /app/tensorrt_llm/TensorRT-LLM/tensorrt_llm/bench/benchmark/utils/asynchronous.py:44> exception=RequestError('CUDA error: an illegal memory access was encountered\nCUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1\nCompile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.\n')>
Traceback (most recent call last):
File "/app/tensorrt_llm/TensorRT-LLM/tensorrt_llm/bench/benchmark/utils/asynchronous.py", line 65, in process_request
response: RequestOutput = await output.aresult()
^^^^^^^^^^^^^^^^^^^^^^
File "/app/tensorrt_llm/TensorRT-LLM/tensorrt_llm/executor/result.py", line 491, in aresult
await self._aresult_step()
File "/app/tensorrt_llm/TensorRT-LLM/tensorrt_llm/executor/result.py", line 469, in _aresult_step
self._handle_response(response)
File "/usr/lib/python3.12/contextlib.py", line 81, in inner
return func(*args, **kwds)
^^^^^^^^^^^^^^^^^^^
File "/app/tensorrt_llm/TensorRT-LLM/tensorrt_llm/executor/result.py", line 358, in _handle_response
GenerationResultBase._handle_response(self, response)
File "/usr/lib/python3.12/contextlib.py", line 81, in inner
return func(*args, **kwds)
^^^^^^^^^^^^^^^^^^^
File "/app/tensorrt_llm/TensorRT-LLM/tensorrt_llm/executor/result.py", line 328, in _handle_response
handler(response.error_msg)
File "/app/tensorrt_llm/TensorRT-LLM/tensorrt_llm/executor/executor.py", line 260, in _handle_background_error
raise RequestError(error)
tensorrt_llm.executor.utils.RequestError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
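### As the error text suggests, re-running with CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous so the
### illegal access is reported at the offending call. A repro sketch of the same invocation ($MODEL_ID resolved
### to meta-llama/Llama-3.1-8B per the report below; paths assume the same container):
import os, subprocess
env = dict(os.environ, CUDA_LAUNCH_BLOCKING="1")
subprocess.run(
    ["trtllm-bench", "--model", "meta-llama/Llama-3.1-8B", "throughput",
     "--dataset", "/tmp/synthetic_128_128.txt", "--backend", "autodeploy"],
    env=env, check=False,
)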
[06/10/2025-13:12:53] [TRT-LLM] [I] Benchmark done. Reporting results...
[06/10/2025-13:12:53] [TRT-LLM] [I] Validating KV Cache config against kv_cache_dtype="auto"
[06/10/2025-13:12:53] [TRT-LLM] [I] KV cache quantization set to "auto". Using checkpoint KV quantization.
[06/10/2025-13:12:53] [TRT-LLM] [I]
===========================================================
= PYTORCH BACKEND
===========================================================
Model: meta-llama/Llama-3.1-8B
Model Path: None
TensorRT-LLM Version: 0.21.0rc0
Dtype: bfloat16
KV Cache Dtype: None
Quantization: None
===========================================================
= REQUEST DETAILS
===========================================================
Number of requests: 1861
Number of concurrent requests: 1175.6937
Average Input Length (tokens): 128.0000
Average Output Length (tokens): 128.0000
===========================================================
= WORLD + RUNTIME INFORMATION
===========================================================
TP Size: 1
PP Size: 1
EP Size: None
Max Runtime Batch Size: 2048
Max Runtime Tokens: 4096
Scheduling Policy: GUARANTEED_NO_EVICT
KV Memory Percentage: 90.00%
Issue Rate (req/sec): 1.2198E+15
===========================================================
= PERFORMANCE OVERVIEW
===========================================================
Request Throughput (req/sec): 58.0015
Total Output Throughput (tokens/sec): 7424.1968
Total Token Throughput (tokens/sec): 14848.3935
Total Latency (ms): 32085.3565
Average request latency (ms): 20270.0443
Per User Output Throughput [w/ ctx] (tps/user): 6.5900
Per GPU Output Throughput (tps/gpu): 7424.1968
-- Request Latency Breakdown (ms) -----------------------
[Latency] P50 : 19025.3826
[Latency] P90 : 29316.4204
[Latency] P95 : 29601.7633
[Latency] P99 : 29817.4049
[Latency] MINIMUM: 14846.8434
[Latency] MAXIMUM: 29838.9014
[Latency] AVERAGE: 20270.0443
===========================================================
= DATASET DETAILS
===========================================================
Dataset Path: /tmp/synthetic_128_128.txt
Number of Sequences: 3000
-- Percentiles statistics ---------------------------------
Input Output Seq. Length
-----------------------------------------------------------
MIN: 128.0000 128.0000 256.0000
MAX: 128.0000 128.0000 256.0000
AVG: 128.0000 128.0000 256.0000
P50: 128.0000 128.0000 256.0000
P90: 128.0000 128.0000 256.0000
P95: 128.0000 128.0000 256.0000
P99: 128.0000 128.0000 256.0000
===========================================================
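### Cross-check of the PERFORMANCE OVERVIEW above (arithmetic only; note that only 1861 of the 3000 submitted
### requests completed before the CUDA error, so the throughput figures cover the completed subset):
completed, total_latency_s, out_len = 1861, 32.0853565, 128
req_tput = completed / total_latency_s
print(req_tput)                  # ~58.00 req/s, matching "Request Throughput"
print(req_tput * out_len)        # ~7424 tok/s, matching "Total Output Throughput"
print(req_tput * out_len * 2)    # ~14848 tok/s, matching "Total Token Throughput" (input 128 + output 128)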
[06/10/2025-13:12:53] [TRT-LLM] [I] Thread proxy_dispatch_result_thread stopped.
[06/10/2025-13:12:53] [TRT-LLM] [I] Thread proxy_dispatch_kv_cache_events_thread stopped.
[06/10/2025-13:12:53] [TRT-LLM] [I] Thread proxy_dispatch_stats_thread stopped.
[06/10/2025-13:12:53] [TRT-LLM] [I] Thread await_response_thread stopped.
[06/10/2025-13:12:53] [TRT-LLM] [I] Thread dispatch_stats_thread stopped.
[06/10/2025-13:12:53] [TRT-LLM] [I] Thread dispatch_kv_cache_events_thread stopped.
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /opt/pytorch/pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7f5b281d55e8 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xe0 (0x7f5b2816a4a2 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3c2 (0x7f5b282a02a2 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0xb7d311 (0x7f5a7af23311 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xb794eb (0x7f5a7af1f4eb in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: <unknown function> + 0xb80c04 (0x7f5a7af26c04 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0x44c162 (0x7f5ade257162 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f5b281aff39 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #8: <unknown function> + 0x703468 (0x7f5ade50e468 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)
frame #9: <unknown function> + 0x703890 (0x7f5ade50e890 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)
frame #10: /usr/bin/python() [0x579cf2]
frame #11: /usr/bin/python() [0x59f0b9]
frame #12: /usr/bin/python() [0x579d52]
frame #13: /usr/bin/python() [0x59f0b9]
frame #14: /usr/bin/python() [0x5f7c29]
frame #15: /usr/bin/python() [0x5e3574]
frame #16: _PyEval_EvalFrameDefault + 0x1080 (0x5d79c0 in /usr/bin/python)
frame #17: /usr/bin/python() [0x54cd32]
frame #18: _PyEval_EvalFrameDefault + 0x4c1b (0x5db55b in /usr/bin/python)
frame #19: /usr/bin/python() [0x54cd32]
frame #20: /usr/bin/python() [0x6f826c]
frame #21: /usr/bin/python() [0x6b917c]
frame #22: <unknown function> + 0x9caa4 (0x7f5d61afcaa4 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #23: __clone + 0x44 (0x7f5d61b89a34 in /usr/lib/x86_64-linux-gnu/libc.so.6)
[53acaad1b40e:117344] *** Process received signal ***
[53acaad1b40e:117344] Signal: Aborted (6)
[53acaad1b40e:117344] Signal code: (-6)
[53acaad1b40e:117344] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x45330)[0x7f5d61aa5330]
[53acaad1b40e:117344] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x11c)[0x7f5d61afeb2c]
[53acaad1b40e:117344] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x1e)[0x7f5d61aa527e]
[53acaad1b40e:117344] [ 3] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0xdf)[0x7f5d61a888ff]
[53acaad1b40e:117344] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xa5ff5)[0x7f5b28663ff5]
[53acaad1b40e:117344] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbb0da)[0x7f5b286790da]
[53acaad1b40e:117344] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(__cxa_call_terminate+0x33)[0x7f5b286638e6]
[53acaad1b40e:117344] [ 7] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(__gxx_personality_v0+0x31a)[0x7f5b286788ba]
[53acaad1b40e:117344] [ 8] /usr/lib/x86_64-linux-gnu/libgcc_s.so.1(+0x22b06)[0x7f5b29e0cb06]
[53acaad1b40e:117344] [ 9] /usr/lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_Resume+0x12d)[0x7f5b29e0d5cd]
[53acaad1b40e:117344] [10] /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so(+0xb810b8)[0x7f5a7af270b8]
[53acaad1b40e:117344] [11] /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0x44c162)[0x7f5ade257162]
[53acaad1b40e:117344] [12] /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so(_ZN3c1010TensorImplD0Ev+0x9)[0x7f5b281aff39]
[53acaad1b40e:117344] [13] /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0x703468)[0x7f5ade50e468]
[53acaad1b40e:117344] [14] /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0x703890)[0x7f5ade50e890]
[53acaad1b40e:117344] [15] /usr/bin/python[0x579cf2]
[53acaad1b40e:117344] [16] /usr/bin/python[0x59f0b9]
[53acaad1b40e:117344] [17] /usr/bin/python[0x579d52]
[53acaad1b40e:117344] [18] /usr/bin/python[0x59f0b9]
[53acaad1b40e:117344] [19] /usr/bin/python[0x5f7c29]
[53acaad1b40e:117344] [20] /usr/bin/python[0x5e3574]
[53acaad1b40e:117344] [21] /usr/bin/python(_PyEval_EvalFrameDefault+0x1080)[0x5d79c0]
[53acaad1b40e:117344] [22] /usr/bin/python[0x54cd32]
[53acaad1b40e:117344] [23] /usr/bin/python(_PyEval_EvalFrameDefault+0x4c1b)[0x5db55b]
[53acaad1b40e:117344] [24] /usr/bin/python[0x54cd32]
[53acaad1b40e:117344] [25] /usr/bin/python[0x6f826c]
[53acaad1b40e:117344] [26] /usr/bin/python[0x6b917c]
[53acaad1b40e:117344] [27] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x9caa4)[0x7f5d61afcaa4]
[53acaad1b40e:117344] [28] /usr/lib/x86_64-linux-gnu/libc.so.6(__clone+0x44)[0x7f5d61b89a34]
[53acaad1b40e:117344] *** End of error message ***
--------------------------------------------------------------------------
Child job 2 terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
^C^C
Aborted!
^CException ignored in atexit callback: <function shutdown_compile_workers at 0x7f273c060b80>
Traceback (most recent call last):
File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/async_compile.py", line 113, in shutdown_compile_workers
pool.shutdown()
File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/compile_worker/subproc_pool.py", line 239, in shutdown
self.process.wait(300)
File "/usr/lib/python3.12/subprocess.py", line 1264, in wait
return self._wait(timeout=timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/subprocess.py", line 2047, in _wait
time.sleep(delay)
KeyboardInterrupt:
--------------------------------------------------------------------------
(null) noticed that process rank 0 with PID 0 on node 53acaad1b40e exited on signal 6 (Aborted).
--------------------------------------------------------------------------