@kiya00
Created June 11, 2025 15:12
root@6e61d1d8b02e:/app/tensorrt_llm# trtllm-bench --model $MODEL_ID throughput --dataset /tmp/syntoy
2025-06-11 10:08:54,023 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
[TensorRT-LLM] TensorRT-LLM version: 0.21.0rc1
[06/11/2025-10:08:54] [TRT-LLM] [I] Preparing to run throughput benchmark...
Parse safetensors files: 100%|████████████████████████████████████████████████████████████████████
[06/11/2025-10:08:55] [TRT-LLM] [I]
===========================================================
= DATASET DETAILS
===========================================================
Dataset Path: /tmp/synthetic_128_128.txt
Number of Sequences: 3000
-- Percentiles statistics ---------------------------------
Input Output Seq. Length
-----------------------------------------------------------
MIN: 128.0000 128.0000 256.0000
MAX: 128.0000 128.0000 256.0000
AVG: 128.0000 128.0000 256.0000
P50: 128.0000 128.0000 256.0000
P90: 128.0000 128.0000 256.0000
P95: 128.0000 128.0000 256.0000
P99: 128.0000 128.0000 256.0000
===========================================================
Parse safetensors files: 100%|████████████████████████████████████████████████████████████████████
[06/11/2025-10:08:56] [TRT-LLM] [I] Validating KV Cache config against kv_cache_dtype="auto"
[06/11/2025-10:08:56] [TRT-LLM] [I] KV cache quantization set to "auto". Using checkpoint KV quant
[06/11/2025-10:08:56] [TRT-LLM] [I] Estimated engine size: 14.96 GB
[06/11/2025-10:08:56] [TRT-LLM] [I] Estimated total available memory for KV cache: 64.69 GB
[06/11/2025-10:08:56] [TRT-LLM] [I] Estimated total KV cache memory: 61.46 GB
[06/11/2025-10:08:56] [TRT-LLM] [I] Estimated max number of requests in KV cache memory: 1966.57
[06/11/2025-10:08:56] [TRT-LLM] [I] Estimated max batch size (after fine-tune): 2048
[06/11/2025-10:08:56] [TRT-LLM] [I] Estimated max num tokens (after fine-tune): 4096
[06/11/2025-10:08:56] [TRT-LLM] [I] Max batch size and max num tokens not provided. Using heuristi_size=2048, max_num_tokens=4096.
[06/11/2025-10:08:56] [TRT-LLM] [I] Setting PyTorch max sequence length to 256
[06/11/2025-10:08:56] [TRT-LLM] [I] Setting up throughput benchmark.
[06/11/2025-10:08:56] [TRT-LLM] [W] Using default gpus_per_node: 8
[06/11/2025-10:08:56] [TRT-LLM] [I] Set nccl_plugin to None.
[06/11/2025-10:08:56] [TRT-LLM] [I] model='meta-llama/Llama-3.1-8B' tokenizer=None tokenizer_mode=_remote_code=True tensor_parallel_size=1 dtype='auto' revision=None tokenizer_revision=None pipeliize=1 gpus_per_node=8 moe_cluster_parallel_size=-1 moe_tensor_parallel_size=-1 moe_expert_parallel_config={} load_format=<LoadFormat.AUTO: 0> enable_lora=False max_lora_rank=None max_loras=4 max_compt_adapter=False max_prompt_adapter_token=0 quant_config=QuantConfig(quant_algo=None, kv_cache_qhquant_val=0.5, clamp_val=None, use_meta_recipe=False, has_zero_point=False, pre_quant_scale=Falseig=KvCacheConfig(enable_block_reuse=False, max_tokens=None, max_attention_window=None, sink_token_=0.9, host_cache_size=None, onboard_blocks=True, cross_kv_cache_fraction=None, secondary_offload_mze=0, enable_partial_reuse=True, copy_on_partial_reuse=True) enable_chunked_prefill=False guided_drocessor=None iter_stats_max_iterations=None request_stats_max_iterations=None peft_cache_config=Ncapacity_scheduler_policy=<CapacitySchedulerPolicy.GUARANTEED_NO_EVICT: 'GUARANTEED_NO_EVICT'>, coatch_config=DynamicBatchConfig(enable_batch_size_tuning=True, enable_max_num_tokens_tuning=False, 28)) cache_transceiver_config=None speculative_config=None batching_type=<BatchingType.INFLIGHT: 'max_batch_size=2048 max_input_len=1024 max_seq_len=256 max_beam_width=1 max_num_tokens=4096 backenits=False num_postprocess_workers=0 postprocess_tokenizer_dir=None reasoning_parser=None decoding_nfig=BuildConfig(max_input_len=1024, max_seq_len=None, opt_batch_size=8, max_batch_size=2048, max_t_num_tokens=None, max_prompt_embedding_table_size=0, kv_cache_type=None, gather_context_logits=Fastrongly_typed=True, force_num_profiles=None, profiling_verbosity='layer_names_only', enable_debuglative_decoding_mode=<SpeculativeDecodingMode.NONE: 1>, use_refit=False, input_timing_cache=None, ra_config=LoraConfig(lora_dir=[], lora_ckpt_source='hf', max_lora_rank=64, lora_target_modules=[],_loras=4, max_cpu_loras=4), auto_parallel_config=AutoParallelConfig(world_size=1, gpus_per_node=8, sharding_cost_model=<CostModel.ALPHA_BETA: 'alpha_beta'>, comm_cost_model=<CostModel.ALPHA_BETA: elism=False, enable_shard_unbalanced_shape=False, enable_shard_dynamic_shape=False, enable_reduce_ug_mode=False, infer_shape=True, validation_mode=False, same_buffer_io={}, same_spec_io={}, sharde, parallel_config_cache=None, profile_cache=None, dump_path=None, debug_outputs=[]), weight_sparsigin_config=PluginConfig(_dtype='float16', _bert_attention_plugin='auto', _gpt_attention_plugin='auisable_gemm_plugin=False, _gemm_swiglu_plugin=None, _fp8_rowwise_gemm_plugin=None, _qserve_gemm_plcl_plugin=None, _lora_plugin=None, _dora_plugin=False, _weight_only_groupwise_quant_matmul_plugin=n=None, _smooth_quant_plugins=True, _smooth_quant_gemm_plugin=None, _layernorm_quantization_pluginone, _quantize_per_token_plugin=False, _quantize_tensor_plugin=False, _moe_plugin='auto', _mamba_cm_plugin=None, _low_latency_gemm_swiglu_plugin=None, _gemm_allreduce_plugin=None, _context_fmha=Tr, _paged_kv_cache=None, _remove_input_padding=True, _norm_quant_fusion=False, _reduce_fusion=Falseck=32, _use_paged_context_fmha=True, _use_fp8_context_fmha=True, _fuse_fp4_quant=False, _multiple_treamingllm=False, _manage_weights=False, _use_fused_mlp=True, _pp_reduce_scatter=False), use_stri024, dry_run=False, visualize_network=None, monitor_memory=False, use_mrope=False) use_cuda_graph= 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 256, 512, 1024, 2048] cuda_grading_enabled=True 
disable_overlap_scheduler=False moe_max_num_tokens=None moe_load_balancer=None a'CUTLASS' mixed_sampler=False enable_trtllm_sampler=False kv_cache_dtype='auto' use_kv_cache=True ter_req_stats=False print_iter_log=False torch_compile_enabled=True torch_compile_fullgraph=True torch_compile_piecewise_cuda_graph=False torch_compile_enable_userbuffers=True autotuner_enabled=Tr auto_deploy_config=None enable_min_latency=False model_factory='AutoModelForCausalLM' model_kwarga_backend='MultiHeadLatentAttention' skip_loading_weights=False free_mem_ratio=0.8 simple_shard_on_device=None extended_runtime_perf_knob_config=ExtendedRuntimePerfKnobConfig(multi_block_mode=True cuda_graph_mode=True, cuda_graph_cache_size=1000) parallel_config=_ParallelConfig(tp_size=1, pp_se_cluster_size=-1, moe_tp_size=-1, moe_ep_size=-1, cp_config={}, enable_attention_dp=False, auto_ps=None) model_format=<_ModelFormatKind.HF: 0> speculative_model=None
rank 0 using MpiPoolSession to spawn MPI processes
[06/11/2025-10:08:56] [TRT-LLM] [I] Generating a new HMAC key for server proxy_request_queue
[06/11/2025-10:08:56] [TRT-LLM] [I] Generating a new HMAC key for server worker_init_status_queue
[06/11/2025-10:08:56] [TRT-LLM] [I] Generating a new HMAC key for server proxy_result_queue
[06/11/2025-10:08:56] [TRT-LLM] [I] Generating a new HMAC key for server proxy_stats_queue
[06/11/2025-10:08:56] [TRT-LLM] [I] Generating a new HMAC key for server proxy_kv_cache_events_que
2025-06-11 10:09:05,257 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
[TensorRT-LLM] TensorRT-LLM version: 0.21.0rc1
[TensorRT-LLM][INFO] Refreshed the MPI local session
[06/11/2025-10:09:06] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Initializing for: lib='OMPI', local_rank=
[06/11/2025-10:09:06] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] max_seq_len=256, max_batch_size=2048, att
/root/.cache/huggingface/hub/models--meta-llama--Llama-3.1-8B/snapshots/d04e592bb4f6aa9cfee91e2e20
/usr/local/lib/python3.12/dist-packages/torch/backends/mkldnn/__init__.py:78: UserWarning: TF32 acceleration on top of oneDNN is available for Intel GPUs. The current Torch version does not have Intel GPU Support. (Triggered internally at .../ATen/Context.cpp:148.)
torch._C._set_onednn_allow_tf32(_allow_tf32)
[06/11/2025-10:09:12] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] No quantization to do.
[06/11/2025-10:09:12] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Found 0 MoE Patterns
[06/11/2025-10:09:13] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Found 64 repeat_kv patterns
[06/11/2025-10:09:13] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Found 0 eager attention patterns
[06/11/2025-10:09:13] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Found 32 grouped attention patterns
[06/11/2025-10:09:13] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Found 32 causal mask attention patterns
[06/11/2025-10:09:13] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Found and matched 32 attention layouts
[06/11/2025-10:09:14] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Found and matched 32 RoPE patterns
[06/11/2025-10:09:14] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Match RoPE layout to bsnd
[06/11/2025-10:09:14] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Found 32 RoPE layout matches
[06/11/2025-10:09:14] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Found and eliminated 192 redundant transp
[06/11/2025-10:09:14] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Found 32 RoPE optimizations
[06/11/2025-10:09:14] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Skipping sharding for single device
[06/11/2025-10:09:14] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Skipping sharding for single device
[06/11/2025-10:09:14] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Skipping sharding for single device
[06/11/2025-10:09:15] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Loading and initializing weights.
[06/11/2025-10:09:21] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Found 0 allreduce+residual+rmsnorm fusion
[06/11/2025-10:09:21] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Found 0 GEMM+Collective fusions
[06/11/2025-10:09:21] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Found 2 input nodes and 1 output nodes
[06/11/2025-10:09:21] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Added 4 new input nodes for cached attent
[06/11/2025-10:09:21] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Replaced 32 attention.bsnd_grouped_sdpa o_cache
[06/11/2025-10:09:21] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Initialized 65 caches for cached attentio
[06/11/2025-10:09:21] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Free memory ratio: 0.8
[06/11/2025-10:09:21] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Free memory (MB): 59648 , Total memory (M
[06/11/2025-10:09:21] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Current cache size: 536870912, Current nu
[06/11/2025-10:09:21] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Free memory before forward pass: 62545788
2025-06-11 10:09:22,734 - INFO - flashinfer.jit: Loading JIT ops: rope
2025-06-11 10:09:22,746 - INFO - flashinfer.jit: Finished loading JIT ops: rope
2025-06-11 10:09:22,748 - INFO - flashinfer.jit: Loading JIT ops: page
2025-06-11 10:09:22,757 - INFO - flashinfer.jit: Finished loading JIT ops: page
2025-06-11 10:09:22,763 - INFO - flashinfer.jit: Loading JIT ops: batch_prefill_with_kv_cache_dtyptype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_Fal
2025-06-11 10:09:22,773 - INFO - flashinfer.jit: Finished loading JIT ops: batch_prefill_with_kv_c_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_
[06/11/2025-10:09:22] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Free memory after forward pass: 624723886
[06/11/2025-10:09:22] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Memory for forward pass: 73400320
[06/11/2025-10:09:24] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] After all_gather - new_num_pages: 6021
[06/11/2025-10:09:24] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Fusion before compiling...
[06/11/2025-10:09:24] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Compiling for torch-opt backend...
[06/11/2025-10:09:29] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 1
[06/11/2025-10:09:34] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 2
[06/11/2025-10:09:39] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 4
[06/11/2025-10:09:43] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 8
[06/11/2025-10:09:48] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 16
[06/11/2025-10:09:53] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 24
[06/11/2025-10:09:58] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 32
[06/11/2025-10:10:02] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 40
[rank0]:W0611 10:10:02.995000 117966 torch/_dynamo/convert_frame.py:961] [0/8] torch._dynamo hit c
[rank0]:W0611 10:10:02.995000 117966 torch/_dynamo/convert_frame.py:961] [0/8] function: 'forwa
[rank0]:W0611 10:10:02.995000 117966 torch/_dynamo/convert_frame.py:961] [0/8] last reason: 0/7ch at index 0. expected 32, actual 40
[rank0]:W0611 10:10:02.995000 117966 torch/_dynamo/convert_frame.py:961] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
[rank0]:W0611 10:10:02.995000 117966 torch/_dynamo/convert_frame.py:961] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
[06/11/2025-10:10:03] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 48
[06/11/2025-10:10:03] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 56
[06/11/2025-10:10:04] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 64
[06/11/2025-10:10:04] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 72
[06/11/2025-10:10:05] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 80
[06/11/2025-10:10:05] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 88
[06/11/2025-10:10:05] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 96
[06/11/2025-10:10:06] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 104
[06/11/2025-10:10:06] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 112
[06/11/2025-10:10:07] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 120
[06/11/2025-10:10:07] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 128
[06/11/2025-10:10:08] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 256
[06/11/2025-10:10:08] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 512
[06/11/2025-10:10:08] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 1024
[06/11/2025-10:10:09] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 2048
[06/11/2025-10:10:09] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Compile time with backend torch-opt: 45.1
[06/11/2025-10:10:10] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Using fake cache manager with head_dim=0
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 4 [window size=256]
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 0.00 GiB for max tokens in paged KV cache (385344)
[06/11/2025-10:10:10] [TRT-LLM] [I] Setting up for warmup...
[06/11/2025-10:10:10] [TRT-LLM] [I] Running warmup.
[06/11/2025-10:10:10] [TRT-LLM] [I] Starting benchmarking async task.
[06/11/2025-10:10:10] [TRT-LLM] [I] Starting benchmark...
[06/11/2025-10:10:10] [TRT-LLM] [I] Request submission complete. [count=2, time=0.0000s, rate=1698
[06/11/2025-10:10:12] [TRT-LLM] [I] Benchmark complete.
[06/11/2025-10:10:12] [TRT-LLM] [I] Stopping LLM backend.
[06/11/2025-10:10:12] [TRT-LLM] [I] Cancelling all 0 tasks to complete.
[06/11/2025-10:10:12] [TRT-LLM] [I] All tasks cancelled.
[06/11/2025-10:10:12] [TRT-LLM] [I] LLM Backend stopped.
[06/11/2025-10:10:12] [TRT-LLM] [I] Worker task cancelled.
[06/11/2025-10:10:12] [TRT-LLM] [I] Warmup done.
[06/11/2025-10:10:12] [TRT-LLM] [I] No log path provided, skipping logging.
[06/11/2025-10:10:12] [TRT-LLM] [I] Starting benchmarking async task.
[06/11/2025-10:10:12] [TRT-LLM] [I] Starting benchmark...
[06/11/2025-10:10:12] [TRT-LLM] [I] Request submission complete. [count=3000, time=0.0013s, rate=2
Traceback (most recent call last):
File "/app/tensorrt_llm/TensorRT-LLM/tensorrt_llm/_torch/pyexecutor/py_executor.py", line 1612,
outputs = forward(scheduled_requests, self.resource_manager,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/nvtx/nvtx.py", line 122, in inner
result = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/app/tensorrt_llm/TensorRT-LLM/tensorrt_llm/_torch/pyexecutor/py_executor.py", line 1602,
return self.model_engine.forward(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/app/tensorrt_llm/TensorRT-LLM/tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py", line
last_logit_only = self._prepare_inputs(scheduled_requests, resource_manager, new_tokens)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/nvtx/nvtx.py", line 122, in inner
result = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/app/tensorrt_llm/TensorRT-LLM/tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py", line
new_tokens_list = new_tokens.cpu().tolist() if new_tokens is not None else None
^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
[06/11/2025-10:10:24] [TRT-LLM] [E] Encountered an error in forward function: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Traceback (most recent call last):
File "/app/tensorrt_llm/TensorRT-LLM/tensorrt_llm/_torch/pyexecutor/py_executor.py", line 1686,
self.sampler.update_requests(sample_state)
File "/app/tensorrt_llm/TensorRT-LLM/tensorrt_llm/_torch/pyexecutor/sampler.py", line 241, in up
state.sampler_event.synchronize()
File "/usr/local/lib/python3.12/dist-packages/torch/cuda/streams.py", line 227, in synchronize
super().synchronize()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
[06/11/2025-10:10:24] [TRT-LLM] [E] Encountered an error in sampling: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
....
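The illegal access is reported asynchronously: the traceback surfaces at the host copy new_tokens.cpu().tolist() in ad_executor.py and again at sampler_event.synchronize() in sampler.py, which are only the first synchronization points reached after the faulting kernel, not necessarily its source. Following the hint printed in the log, a rerun with synchronous kernel launches should pin the error to the actual call site. A minimal sketch, assuming $MODEL_ID is still exported and reusing the dataset path reported in the dataset details above:

# Force each kernel launch to synchronize so the illegal memory access is
# raised at the offending op instead of a later sync point. (Device-side
# assertions via TORCH_USE_CUDA_DSA, also suggested above, need a PyTorch
# build compiled with that flag.)
CUDA_LAUNCH_BLOCKING=1 trtllm-bench --model $MODEL_ID throughput --dataset /tmp/synthetic_128_128.txt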