Created: June 10, 2025 13:24
root@53acaad1b40e:/app/tensorrt_llm# trtllm-bench --model $MODEL_ID throughput --dataset /tmp/synthetic_128_128.txt --backend autodeploy
2025-06-10 13:11:02,295 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
[TensorRT-LLM] TensorRT-LLM version: 0.21.0rc0
[06/10/2025-13:11:02] [TRT-LLM] [I] Preparing to run throughput benchmark...
Parse safetensors files: 100%|███████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 22.03it/s]
[06/10/2025-13:11:03] [TRT-LLM] [I]
===========================================================
= DATASET DETAILS
===========================================================
Dataset Path: /tmp/synthetic_128_128.txt
Number of Sequences: 3000
-- Percentiles statistics ---------------------------------
          Input      Output   Seq. Length
-----------------------------------------------------------
MIN:   128.0000    128.0000      256.0000
MAX:   128.0000    128.0000      256.0000
AVG:   128.0000    128.0000      256.0000
P50:   128.0000    128.0000      256.0000
P90:   128.0000    128.0000      256.0000
P95:   128.0000    128.0000      256.0000
P99:   128.0000    128.0000      256.0000
===========================================================
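
Every row of the percentile table is identical because the dataset is synthetic and fixed-length: all 3000 requests have exactly 128 input and 128 output tokens. A minimal sketch of how such a table can be reproduced (plain numpy; the variable names are illustrative, not trtllm-bench internals):

    import numpy as np

    # 3000 fixed-length synthetic requests: 128 input + 128 output tokens each.
    isl = np.full(3000, 128.0)
    osl = np.full(3000, 128.0)
    seq = isl + osl

    for label, p in [("P50", 50), ("P90", 90), ("P95", 95), ("P99", 99)]:
        print(f"{label}: {np.percentile(isl, p):.4f} {np.percentile(osl, p):.4f} "
              f"{np.percentile(seq, p):.4f}")
    # With zero variance, MIN == MAX == AVG == every percentile: 128/128/256.
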
Fetching 17 files: 100%|████████████████████████████████████████████████████████████████████████████| 17/17 [00:00<00:00, 17951.45it/s]
Parse safetensors files: 100%|███████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 7.68it/s]
[06/10/2025-13:11:05] [TRT-LLM] [I] Validating KV Cache config against kv_cache_dtype="auto"
[06/10/2025-13:11:05] [TRT-LLM] [I] KV cache quantization set to "auto". Using checkpoint KV quantization.
[06/10/2025-13:11:05] [TRT-LLM] [I] Estimated engine size: 14.96 GB
[06/10/2025-13:11:05] [TRT-LLM] [I] Estimated total available memory for KV cache: 64.69 GB
[06/10/2025-13:11:05] [TRT-LLM] [I] Estimated total KV cache memory: 61.46 GB
[06/10/2025-13:11:05] [TRT-LLM] [I] Estimated max number of requests in KV cache memory: 1966.57
[06/10/2025-13:11:05] [TRT-LLM] [I] Estimated max batch size (after fine-tune): 2048
[06/10/2025-13:11:05] [TRT-LLM] [I] Estimated max num tokens (after fine-tune): 4096
[06/10/2025-13:11:05] [TRT-LLM] [I] Max batch size and max num tokens not provided. Using heuristics or pre-defined settings: max_batch_size=2048, max_num_tokens=4096.
[06/10/2025-13:11:05] [TRT-LLM] [I] Setting PyTorch max sequence length to 256
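
The "1966.57 requests" estimate follows directly from the model geometry. For Llama-3.1-8B (32 layers, 8 KV heads, head dim 128) in bf16, one token of KV cache costs 2 x 32 x 8 x 128 x 2 bytes = 128 KiB, so a full 256-token sequence costs 32 MiB, and 61.46 GiB / 32 MiB ≈ 1966.7. A back-of-the-envelope sketch (the model constants are assumed from the Llama-3.1-8B config, not read from the checkpoint here):

    # Assumed Llama-3.1-8B geometry; one K and one V tensor per layer, bf16 (2 B each).
    num_layers, num_kv_heads, head_dim = 32, 8, 128
    kv_tensors, elem_bytes = 2, 2
    max_seq_len = 256

    bytes_per_token = kv_tensors * num_layers * num_kv_heads * head_dim * elem_bytes  # 131072 B = 128 KiB
    bytes_per_request = bytes_per_token * max_seq_len                                 # 32 MiB
    kv_budget = 61.46 * 1024**3                                                       # the logged "61.46 GB", read as GiB
    print(kv_budget / bytes_per_request)  # ~1966.7, matching the logged 1966.57 up to rounding
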
[06/10/2025-13:11:05] [TRT-LLM] [I] Setting up throughput benchmark.
[06/10/2025-13:11:05] [TRT-LLM] [W] Using default gpus_per_node: 8
[06/10/2025-13:11:05] [TRT-LLM] [I] Set nccl_plugin to None.
[06/10/2025-13:11:05] [TRT-LLM] [I] AutoDeployConfig(extra_resource_managers={}, use_cuda_graph=True, cuda_graph_batch_sizes=[1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 256, 512, 1024, 2048], cuda_graph_max_batch_size=2048, cuda_graph_padding_enabled=True, disable_overlap_scheduler=True, moe_max_num_tokens=None, moe_load_balancer=None, attn_backend='FlashInfer', moe_backend='CUTLASS', mixed_sampler=False, enable_trtllm_sampler=False, kv_cache_dtype='auto', use_kv_cache=True, enable_iter_perf_stats=False, enable_iter_req_stats=False, print_iter_log=False, torch_compile_enabled=True, torch_compile_fullgraph=True, torch_compile_inductor_enabled=False, torch_compile_piecewise_cuda_graph=False, torch_compile_enable_userbuffers=True, autotuner_enabled=True, enable_layerwise_nvtx_marker=False, load_format=<LoadFormat.AUTO: 0>, model_factory='AutoModelForCausalLM', model_kwargs={'use_cache': False}, mla_backend='MultiHeadLatentAttention', skip_loading_weights=True, free_mem_ratio=0.8)
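
With cuda_graph_padding_enabled=True, a runtime batch whose size is not exactly one of cuda_graph_batch_sizes is padded up to the next captured size so that a pre-captured CUDA graph can be replayed; that is why only these 23 sizes are captured below. A minimal sketch of that bucketing logic (pad_to_graph_size is a hypothetical helper, not the actual TensorRT-LLM implementation):

    import bisect

    CUDA_GRAPH_BATCH_SIZES = [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88,
                              96, 104, 112, 120, 128, 256, 512, 1024, 2048]

    def pad_to_graph_size(batch_size: int) -> int:
        """Round a runtime batch size up to the nearest captured graph size."""
        i = bisect.bisect_left(CUDA_GRAPH_BATCH_SIZES, batch_size)
        if i == len(CUDA_GRAPH_BATCH_SIZES):
            raise ValueError("batch size exceeds cuda_graph_max_batch_size")
        return CUDA_GRAPH_BATCH_SIZES[i]

    print(pad_to_graph_size(130))  # -> 256: a 130-request batch replays the 256-wide graph
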
rank 0 using MpiPoolSession to spawn MPI processes
[06/10/2025-13:11:05] [TRT-LLM] [I] Generating a new HMAC key for server proxy_request_queue
[06/10/2025-13:11:05] [TRT-LLM] [I] Generating a new HMAC key for server proxy_request_error_queue
[06/10/2025-13:11:05] [TRT-LLM] [I] Generating a new HMAC key for server proxy_result_queue
[06/10/2025-13:11:05] [TRT-LLM] [I] Generating a new HMAC key for server proxy_stats_queue
[06/10/2025-13:11:05] [TRT-LLM] [I] Generating a new HMAC key for server proxy_kv_cache_events_queue
2025-06-10 13:11:14,503 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
[TensorRT-LLM] TensorRT-LLM version: 0.21.0rc0
[TensorRT-LLM][INFO] Refreshed the MPI local session
[06/10/2025-13:11:14] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Initializing for: lib='OMPI', local_rank=0, world_size=1, port=40407
[06/10/2025-13:11:14] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] max_seq_len=256, max_batch_size=2048, tokens_per_block=32, max_num_tokens=4096
/usr/local/lib/python3.12/dist-packages/torch/backends/mkldnn/__init__.py:78: UserWarning: TF32 acceleration on top of oneDNN is available for Intel GPUs. The current Torch version does not have Intel GPU Support. (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/Context.cpp:148.)
  torch._C._set_onednn_allow_tf32(_allow_tf32)
[06/10/2025-13:11:22] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] MoE Pattern Matching
[06/10/2025-13:11:23] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Match explicit(HF) style RoPE
[06/10/2025-13:11:23] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Match Complex style RoPE
[06/10/2025-13:11:24] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Match RoPE layout to bsnd
[06/10/2025-13:11:24] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Eliminating redundant transpose operations
[06/10/2025-13:11:24] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] RoPE optimization
[06/10/2025-13:11:24] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Sharding graph for TP
[06/10/2025-13:11:24] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Skipping sharding for single device
[06/10/2025-13:11:24] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Sharding graph for EP
[06/10/2025-13:11:24] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Skipping sharding for single device
[06/10/2025-13:11:24] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Sharding graph for BMM
[06/10/2025-13:11:24] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Skipping sharding for single device
[06/10/2025-13:11:26] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Loading and initializing weights.
[06/10/2025-13:11:27] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] MoE fusion
[06/10/2025-13:11:28] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Fusing allreduce, residual, and rmsnorm
[06/10/2025-13:11:28] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] GEMM+Collective fusion
[06/10/2025-13:11:28] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Replacing attn op attention.bsnd_grouped_sdpa with backend FlashInferAttention
************************************+++++++++
[06/10/2025-13:11:28] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Setting up caches + moving info args to device
[06/10/2025-13:11:28] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Free memory: 66589097984, Total memory: 84929347584
[06/10/2025-13:11:28] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Current cache size: 536870912, Current num pages: 128
[06/10/2025-13:11:28] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Free memory before forward pass: 66589097984
2025-06-10 13:11:29,856 - INFO - flashinfer.jit: Loading JIT ops: rope
2025-06-10 13:11:29,868 - INFO - flashinfer.jit: Finished loading JIT ops: rope
2025-06-10 13:11:29,870 - INFO - flashinfer.jit: Loading JIT ops: page
2025-06-10 13:11:29,879 - INFO - flashinfer.jit: Finished loading JIT ops: page
2025-06-10 13:11:29,885 - INFO - flashinfer.jit: Loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
2025-06-10 13:11:29,895 - INFO - flashinfer.jit: Finished loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
[06/10/2025-13:11:30] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Free memory after forward pass: 64806518784
[06/10/2025-13:11:30] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Memory for forward pass: 1782579200
[06/10/2025-13:11:33] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] After all_gather - new_num_pages: 12488
[06/10/2025-13:11:33] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Fusion before compiling...
[06/10/2025-13:11:33] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Compiling for torch-opt backend...
[06/10/2025-13:11:38] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 1
[06/10/2025-13:11:43] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 2
[06/10/2025-13:11:48] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 4
[06/10/2025-13:11:52] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 8
[06/10/2025-13:11:57] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 16
[06/10/2025-13:12:02] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 24
[06/10/2025-13:12:06] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 32
[06/10/2025-13:12:11] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 40
[rank0]:W0610 13:12:11.582000 117344 torch/_dynamo/convert_frame.py:961] [0/8] torch._dynamo hit config.recompile_limit (8)
[rank0]:W0610 13:12:11.582000 117344 torch/_dynamo/convert_frame.py:961] [0/8] function: 'forward' (<eval_with_key>.61:4)
[rank0]:W0610 13:12:11.582000 117344 torch/_dynamo/convert_frame.py:961] [0/8] last reason: 0/7: tensor 'L['input_ids']' size mismatch at index 0. expected 32, actual 40
[rank0]:W0610 13:12:11.582000 117344 torch/_dynamo/convert_frame.py:961] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
[rank0]:W0610 13:12:11.582000 117344 torch/_dynamo/convert_frame.py:961] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
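
This warning is expected in this setup: each new capture batch size (1, 2, 4, ..., 40) presents input_ids with a different static leading dimension, so torch.compile re-specializes on every size until it hits the recompile limit (8) and stops compiling further shapes for this frame; graph capture itself continues regardless, as the faster subsequent log lines show. A standalone illustration of the same behavior (generic PyTorch, not the autodeploy code path; the knob is named recompile_limit in this Torch build, cache_size_limit in older ones):

    import torch

    @torch.compile(dynamic=False)  # force static shapes, like per-batch-size graph capture
    def f(x):
        return x * 2

    # Each new batch size is a new static shape and triggers a fresh compile,
    # until torch._dynamo.config.recompile_limit is reached and dynamo gives up
    # compiling further shapes for this function.
    for bs in [1, 2, 4, 8, 16, 24, 32, 40, 48]:
        f(torch.ones(bs, 8))

Running with TORCH_LOGS="recompiles" prints the reason for each recompilation, as the warning itself suggests.
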
[06/10/2025-13:12:12] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 48
[06/10/2025-13:12:12] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 56
[06/10/2025-13:12:12] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 64
[06/10/2025-13:12:13] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 72
[06/10/2025-13:12:13] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 80
[06/10/2025-13:12:13] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 88
[06/10/2025-13:12:14] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 96
[06/10/2025-13:12:14] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 104
[06/10/2025-13:12:15] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 112
[06/10/2025-13:12:15] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 120
[06/10/2025-13:12:15] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 128
[06/10/2025-13:12:16] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 256
[06/10/2025-13:12:16] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 512
[06/10/2025-13:12:17] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 1024
[06/10/2025-13:12:17] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 2048
[06/10/2025-13:12:18] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Compile time with backend torch-opt: 44.524837 seconds
[06/10/2025-13:12:18] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Using fake cache manager with head_dim=0 and num pages: 12488
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 8 [window size=256]
[TensorRT-LLM][INFO] Number of tokens per block: 32.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 0.00 GiB for max tokens in paged KV cache (399616).
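
The paged-KV numbers are mutually consistent: 256 max sequence length / 32 tokens per block = 8 pages per sequence, 12488 pages x 32 tokens = 399,616 max tokens, and at the 128 KiB/token figure worked out above a 32-token page is 4 MiB, so the initial 128-page cache is exactly the 536,870,912 bytes reported earlier. As a quick check:

    tokens_per_block, max_seq_len, num_pages = 32, 256, 12488
    bytes_per_token = 131072  # 128 KiB/token, from the Llama-3.1-8B arithmetic above

    print(max_seq_len // tokens_per_block)           # 8 pages per sequence (window size 256)
    print(num_pages * tokens_per_block)              # 399616 max tokens in the paged KV cache
    print(128 * tokens_per_block * bytes_per_token)  # 536870912 B, the initial 128-page cache
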
[06/10/2025-13:12:18] [TRT-LLM] [I] Setting up for warmup...
[06/10/2025-13:12:18] [TRT-LLM] [I] Running warmup.
[06/10/2025-13:12:18] [TRT-LLM] [I] Starting benchmarking async task.
[06/10/2025-13:12:18] [TRT-LLM] [I] Starting benchmark...
[06/10/2025-13:12:18] [TRT-LLM] [I] Request submission complete. [count=2, time=0.0000s, rate=166694.45 req/s]
[06/10/2025-13:12:20] [TRT-LLM] [I] Benchmark complete.
[06/10/2025-13:12:20] [TRT-LLM] [I] Stopping LLM backend.
[06/10/2025-13:12:20] [TRT-LLM] [I] Cancelling all 0 tasks to complete.
[06/10/2025-13:12:20] [TRT-LLM] [I] All tasks cancelled.
[06/10/2025-13:12:20] [TRT-LLM] [I] LLM Backend stopped.
[06/10/2025-13:12:20] [TRT-LLM] [I] Worker task cancelled.
[06/10/2025-13:12:20] [TRT-LLM] [I] Warmup done.
[06/10/2025-13:12:20] [TRT-LLM] [I] No log path provided, skipping logging.
[06/10/2025-13:12:20] [TRT-LLM] [I] Starting benchmarking async task.
[06/10/2025-13:12:20] [TRT-LLM] [I] Starting benchmark...
[06/10/2025-13:12:20] [TRT-LLM] [I] Request submission complete. [count=3000, time=0.0014s, rate=2126744.02 req/s]
Traceback (most recent call last):
  File "/app/tensorrt_llm/TensorRT-LLM/tensorrt_llm/_torch/pyexecutor/py_executor.py", line 1681, in _update_requests
    self.sampler.update_requests(sample_state)
  File "/app/tensorrt_llm/TensorRT-LLM/tensorrt_llm/_torch/pyexecutor/sampler.py", line 240, in update_requests
    state.sampler_event.synchronize()
  File "/usr/local/lib/python3.12/dist-packages/torch/cuda/streams.py", line 227, in synchronize
    super().synchronize()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
[06/10/2025-13:12:52] [TRT-LLM] [E] Encountered an error in sampling: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
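
Because the illegal access is reported asynchronously, the traceback lands on the sampler's event synchronize() rather than on the faulting kernel. Rerunning with blocking kernel launches localizes the fault, exactly as the message suggests (TORCH_USE_CUDA_DSA additionally requires a PyTorch build compiled with device-side assertions). A sketch of rerunning the same benchmark under those flags (standard CUDA/PyTorch environment variables; same CLI as above):

    import os, subprocess

    env = dict(os.environ,
               CUDA_LAUNCH_BLOCKING="1",        # report the faulting kernel synchronously
               TORCH_SHOW_CPP_STACKTRACES="1")  # fuller C++ frames in the abort message

    subprocess.run(
        ["trtllm-bench", "--model", os.environ["MODEL_ID"], "throughput",
         "--dataset", "/tmp/synthetic_128_128.txt", "--backend", "autodeploy"],
        env=env, check=True)
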
[2025-06-10 13:12:53] ERROR base_events.py:1821: Task exception was never retrieved
future: <Task finished name='Task-2074' coro=<LlmManager.process_request() done, defined at /app/tensorrt_llm/TensorRT-LLM/tensorrt_llm/bench/benchmark/utils/asynchronous.py:44> exception=RequestError('CUDA error: an illegal memory access was encountered\nCUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1\nCompile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.\n')>
Traceback (most recent call last):
  File "/app/tensorrt_llm/TensorRT-LLM/tensorrt_llm/bench/benchmark/utils/asynchronous.py", line 65, in process_request
    response: RequestOutput = await output.aresult()
                              ^^^^^^^^^^^^^^^^^^^^^^
  File "/app/tensorrt_llm/TensorRT-LLM/tensorrt_llm/executor/result.py", line 491, in aresult
    await self._aresult_step()
  File "/app/tensorrt_llm/TensorRT-LLM/tensorrt_llm/executor/result.py", line 469, in _aresult_step
    self._handle_response(response)
  File "/usr/lib/python3.12/contextlib.py", line 81, in inner
    return func(*args, **kwds)
           ^^^^^^^^^^^^^^^^^^^
  File "/app/tensorrt_llm/TensorRT-LLM/tensorrt_llm/executor/result.py", line 358, in _handle_response
    GenerationResultBase._handle_response(self, response)
  File "/usr/lib/python3.12/contextlib.py", line 81, in inner
    return func(*args, **kwds)
           ^^^^^^^^^^^^^^^^^^^
  File "/app/tensorrt_llm/TensorRT-LLM/tensorrt_llm/executor/result.py", line 328, in _handle_response
    handler(response.error_msg)
  File "/app/tensorrt_llm/TensorRT-LLM/tensorrt_llm/executor/executor.py", line 260, in _handle_background_error
    raise RequestError(error)
tensorrt_llm.executor.utils.RequestError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
### ... (the same RequestError traceback repeats for the remaining in-flight requests)
[06/10/2025-13:12:53] [TRT-LLM] [I] Benchmark complete.
[06/10/2025-13:12:53] [TRT-LLM] [I] Stopping LLM backend.
[06/10/2025-13:12:53] [TRT-LLM] [I] Cancelling all 0 tasks to complete.
[06/10/2025-13:12:53] [TRT-LLM] [I] All tasks cancelled.
[06/10/2025-13:12:53] [TRT-LLM] [I] LLM Backend stopped.
[06/10/2025-13:12:53] [TRT-LLM] [I] Worker task cancelled.
[2025-06-10 13:12:53] ERROR base_events.py:1821: Task exception was never retrieved
future: <Task finished name='Task-3012' coro=<LlmManager.process_request() done, defined at /app/tensorrt_llm/TensorRT-LLM/tensorrt_llm/bench/benchmark/utils/asynchronous.py:44> exception=RequestError('CUDA error: an illegal memory access was encountered\nCUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1\nCompile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.\n')>
Traceback (most recent call last):
  File "/app/tensorrt_llm/TensorRT-LLM/tensorrt_llm/bench/benchmark/utils/asynchronous.py", line 65, in process_request
    response: RequestOutput = await output.aresult()
                              ^^^^^^^^^^^^^^^^^^^^^^
  File "/app/tensorrt_llm/TensorRT-LLM/tensorrt_llm/executor/result.py", line 491, in aresult
    await self._aresult_step()
  File "/app/tensorrt_llm/TensorRT-LLM/tensorrt_llm/executor/result.py", line 469, in _aresult_step
    self._handle_response(response)
  File "/usr/lib/python3.12/contextlib.py", line 81, in inner
    return func(*args, **kwds)
           ^^^^^^^^^^^^^^^^^^^
  File "/app/tensorrt_llm/TensorRT-LLM/tensorrt_llm/executor/result.py", line 358, in _handle_response
    GenerationResultBase._handle_response(self, response)
  File "/usr/lib/python3.12/contextlib.py", line 81, in inner
    return func(*args, **kwds)
           ^^^^^^^^^^^^^^^^^^^
  File "/app/tensorrt_llm/TensorRT-LLM/tensorrt_llm/executor/result.py", line 328, in _handle_response
    handler(response.error_msg)
  File "/app/tensorrt_llm/TensorRT-LLM/tensorrt_llm/executor/executor.py", line 260, in _handle_background_error
    raise RequestError(error)
tensorrt_llm.executor.utils.RequestError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
[06/10/2025-13:12:53] [TRT-LLM] [I] Benchmark done. Reporting results...
[06/10/2025-13:12:53] [TRT-LLM] [I] Validating KV Cache config against kv_cache_dtype="auto"
[06/10/2025-13:12:53] [TRT-LLM] [I] KV cache quantization set to "auto". Using checkpoint KV quantization.
[06/10/2025-13:12:53] [TRT-LLM] [I]
===========================================================
= PYTORCH BACKEND
===========================================================
Model: meta-llama/Llama-3.1-8B
Model Path: None
TensorRT-LLM Version: 0.21.0rc0
Dtype: bfloat16
KV Cache Dtype: None
Quantization: None
===========================================================
= REQUEST DETAILS
===========================================================
Number of requests: 1861
Number of concurrent requests: 1175.6937
Average Input Length (tokens): 128.0000
Average Output Length (tokens): 128.0000
===========================================================
= WORLD + RUNTIME INFORMATION
===========================================================
TP Size: 1
PP Size: 1
EP Size: None
Max Runtime Batch Size: 2048
Max Runtime Tokens: 4096
Scheduling Policy: GUARANTEED_NO_EVICT
KV Memory Percentage: 90.00%
Issue Rate (req/sec): 1.2198E+15
===========================================================
= PERFORMANCE OVERVIEW
===========================================================
Request Throughput (req/sec): 58.0015
Total Output Throughput (tokens/sec): 7424.1968
Total Token Throughput (tokens/sec): 14848.3935
Total Latency (ms): 32085.3565
Average request latency (ms): 20270.0443
Per User Output Throughput [w/ ctx] (tps/user): 6.5900
Per GPU Output Throughput (tps/gpu): 7424.1968
-- Request Latency Breakdown (ms) -----------------------
[Latency] P50 : 19025.3826
[Latency] P90 : 29316.4204
[Latency] P95 : 29601.7633
[Latency] P99 : 29817.4049
[Latency] MINIMUM: 14846.8434
[Latency] MAXIMUM: 29838.9014
[Latency] AVERAGE: 20270.0443
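
The overview is internally consistent: 1861 requests / 32.085 s ≈ 58.0 req/s; at 128 output and 256 total tokens per request that yields the 7424 and 14848 tokens/sec figures; and by Little's law, concurrency = throughput x average latency = 58.0015 x 20.270 s ≈ 1175.7, matching "Number of concurrent requests". Note that only 1861 of the 3000 submitted requests completed before the CUDA fault ended the run. A quick check:

    completed, total_s, avg_latency_s = 1861, 32.0853565, 20.2700443

    req_per_s = completed / total_s   # 58.0015 req/s
    print(req_per_s * 128)            # 7424.2  output tokens/sec
    print(req_per_s * 256)            # 14848.4 total tokens/sec
    print(req_per_s * avg_latency_s)  # 1175.69 concurrent requests (Little's law)
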
===========================================================
= DATASET DETAILS
===========================================================
Dataset Path: /tmp/synthetic_128_128.txt
Number of Sequences: 3000
-- Percentiles statistics ---------------------------------
          Input      Output   Seq. Length
-----------------------------------------------------------
MIN:   128.0000    128.0000      256.0000
MAX:   128.0000    128.0000      256.0000
AVG:   128.0000    128.0000      256.0000
P50:   128.0000    128.0000      256.0000
P90:   128.0000    128.0000      256.0000
P95:   128.0000    128.0000      256.0000
P99:   128.0000    128.0000      256.0000
===========================================================
[06/10/2025-13:12:53] [TRT-LLM] [I] Thread proxy_dispatch_result_thread stopped.
[06/10/2025-13:12:53] [TRT-LLM] [I] Thread proxy_dispatch_kv_cache_events_thread stopped.
[06/10/2025-13:12:53] [TRT-LLM] [I] Thread proxy_dispatch_stats_thread stopped.
[06/10/2025-13:12:53] [TRT-LLM] [I] Thread await_response_thread stopped.
[06/10/2025-13:12:53] [TRT-LLM] [I] Thread dispatch_stats_thread stopped.
[06/10/2025-13:12:53] [TRT-LLM] [I] Thread dispatch_kv_cache_events_thread stopped.
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /opt/pytorch/pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7f5b281d55e8 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xe0 (0x7f5b2816a4a2 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3c2 (0x7f5b282a02a2 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0xb7d311 (0x7f5a7af23311 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xb794eb (0x7f5a7af1f4eb in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: <unknown function> + 0xb80c04 (0x7f5a7af26c04 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0x44c162 (0x7f5ade257162 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f5b281aff39 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #8: <unknown function> + 0x703468 (0x7f5ade50e468 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)
frame #9: <unknown function> + 0x703890 (0x7f5ade50e890 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)
frame #10: /usr/bin/python() [0x579cf2]
frame #11: /usr/bin/python() [0x59f0b9]
frame #12: /usr/bin/python() [0x579d52]
frame #13: /usr/bin/python() [0x59f0b9]
frame #14: /usr/bin/python() [0x5f7c29]
frame #15: /usr/bin/python() [0x5e3574]
frame #16: _PyEval_EvalFrameDefault + 0x1080 (0x5d79c0 in /usr/bin/python)
frame #17: /usr/bin/python() [0x54cd32]
frame #18: _PyEval_EvalFrameDefault + 0x4c1b (0x5db55b in /usr/bin/python)
frame #19: /usr/bin/python() [0x54cd32]
frame #20: /usr/bin/python() [0x6f826c]
frame #21: /usr/bin/python() [0x6b917c]
frame #22: <unknown function> + 0x9caa4 (0x7f5d61afcaa4 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #23: __clone + 0x44 (0x7f5d61b89a34 in /usr/lib/x86_64-linux-gnu/libc.so.6)
[53acaad1b40e:117344] *** Process received signal ***
[53acaad1b40e:117344] Signal: Aborted (6)
[53acaad1b40e:117344] Signal code: (-6)
[53acaad1b40e:117344] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x45330)[0x7f5d61aa5330]
[53acaad1b40e:117344] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x11c)[0x7f5d61afeb2c]
[53acaad1b40e:117344] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x1e)[0x7f5d61aa527e]
[53acaad1b40e:117344] [ 3] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0xdf)[0x7f5d61a888ff]
[53acaad1b40e:117344] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xa5ff5)[0x7f5b28663ff5]
[53acaad1b40e:117344] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbb0da)[0x7f5b286790da]
[53acaad1b40e:117344] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(__cxa_call_terminate+0x33)[0x7f5b286638e6]
[53acaad1b40e:117344] [ 7] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(__gxx_personality_v0+0x31a)[0x7f5b286788ba]
[53acaad1b40e:117344] [ 8] /usr/lib/x86_64-linux-gnu/libgcc_s.so.1(+0x22b06)[0x7f5b29e0cb06]
[53acaad1b40e:117344] [ 9] /usr/lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_Resume+0x12d)[0x7f5b29e0d5cd]
[53acaad1b40e:117344] [10] /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so(+0xb810b8)[0x7f5a7af270b8]
[53acaad1b40e:117344] [11] /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0x44c162)[0x7f5ade257162]
[53acaad1b40e:117344] [12] /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so(_ZN3c1010TensorImplD0Ev+0x9)[0x7f5b281aff39]
[53acaad1b40e:117344] [13] /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0x703468)[0x7f5ade50e468]
[53acaad1b40e:117344] [14] /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so(+0x703890)[0x7f5ade50e890]
[53acaad1b40e:117344] [15] /usr/bin/python[0x579cf2]
[53acaad1b40e:117344] [16] /usr/bin/python[0x59f0b9]
[53acaad1b40e:117344] [17] /usr/bin/python[0x579d52]
[53acaad1b40e:117344] [18] /usr/bin/python[0x59f0b9]
[53acaad1b40e:117344] [19] /usr/bin/python[0x5f7c29]
[53acaad1b40e:117344] [20] /usr/bin/python[0x5e3574]
[53acaad1b40e:117344] [21] /usr/bin/python(_PyEval_EvalFrameDefault+0x1080)[0x5d79c0]
[53acaad1b40e:117344] [22] /usr/bin/python[0x54cd32]
[53acaad1b40e:117344] [23] /usr/bin/python(_PyEval_EvalFrameDefault+0x4c1b)[0x5db55b]
[53acaad1b40e:117344] [24] /usr/bin/python[0x54cd32]
[53acaad1b40e:117344] [25] /usr/bin/python[0x6f826c]
[53acaad1b40e:117344] [26] /usr/bin/python[0x6b917c]
[53acaad1b40e:117344] [27] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x9caa4)[0x7f5d61afcaa4]
[53acaad1b40e:117344] [28] /usr/lib/x86_64-linux-gnu/libc.so.6(__clone+0x44)[0x7f5d61b89a34]
[53acaad1b40e:117344] *** End of error message ***
--------------------------------------------------------------------------
Child job 2 terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
^C^C
Aborted!
^CException ignored in atexit callback: <function shutdown_compile_workers at 0x7f273c060b80>
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/async_compile.py", line 113, in shutdown_compile_workers
    pool.shutdown()
  File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/compile_worker/subproc_pool.py", line 239, in shutdown
    self.process.wait(300)
  File "/usr/lib/python3.12/subprocess.py", line 1264, in wait
    return self._wait(timeout=timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/subprocess.py", line 2047, in _wait
    time.sleep(delay)
KeyboardInterrupt:
--------------------------------------------------------------------------
(null) noticed that process rank 0 with PID 0 on node 53acaad1b40e exited on signal 6 (Aborted).
--------------------------------------------------------------------------