Created June 11, 2025 15:12
| root@6e61d1d8b02e:/app/tensorrt_llm# trtllm-bench --model $MODEL_ID throughput --dataset /tmp/syntoy | |
| 2025-06-11 10:08:54,023 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend | |
| [TensorRT-LLM] TensorRT-LLM version: 0.21.0rc1 | |
| [06/11/2025-10:08:54] [TRT-LLM] [I] Preparing to run throughput benchmark... | |
| Parse safetensors files: 100%|████████████████████████████████████████████████████████████████████ | |
| [06/11/2025-10:08:55] [TRT-LLM] [I] | |
| =========================================================== | |
| = DATASET DETAILS | |
| =========================================================== | |
| Dataset Path: /tmp/synthetic_128_128.txt | |
| Number of Sequences: 3000 | |
| -- Percentiles statistics --------------------------------- | |
| Input Output Seq. Length | |
| ----------------------------------------------------------- | |
| MIN: 128.0000 128.0000 256.0000 | |
| MAX: 128.0000 128.0000 256.0000 | |
| AVG: 128.0000 128.0000 256.0000 | |
| P50: 128.0000 128.0000 256.0000 | |
| P90: 128.0000 128.0000 256.0000 | |
| P95: 128.0000 128.0000 256.0000 | |
| P99: 128.0000 128.0000 256.0000 | |
| =========================================================== | |
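[Note: the fixed 128/128 synthetic dataset summarized above is normally generated with the prepare_dataset.py helper shipped in the TensorRT-LLM repo. The invocation below is a reconstruction based on the benchmarking docs, not taken from this log; verify the flag names against benchmarks/cpp/prepare_dataset.py --help in the container before relying on them.]

python benchmarks/cpp/prepare_dataset.py --stdout \
    --tokenizer meta-llama/Llama-3.1-8B \
    token-norm-dist --num-requests 3000 \
    --input-mean 128 --input-stdev 0 --output-mean 128 --output-stdev 0 \
    > /tmp/synthetic_128_128.txt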
| Parse safetensors files: 100%|████████████████████████████████████████████████████████████████████ | |
| [06/11/2025-10:08:56] [TRT-LLM] [I] Validating KV Cache config against kv_cache_dtype="auto" | |
| [06/11/2025-10:08:56] [TRT-LLM] [I] KV cache quantization set to "auto". Using checkpoint KV quant | |
| [06/11/2025-10:08:56] [TRT-LLM] [I] Estimated engine size: 14.96 GB | |
| [06/11/2025-10:08:56] [TRT-LLM] [I] Estimated total available memory for KV cache: 64.69 GB | |
| [06/11/2025-10:08:56] [TRT-LLM] [I] Estimated total KV cache memory: 61.46 GB | |
| [06/11/2025-10:08:56] [TRT-LLM] [I] Estimated max number of requests in KV cache memory: 1966.57 | |
| [06/11/2025-10:08:56] [TRT-LLM] [I] Estimated max batch size (after fine-tune): 2048 | |
| [06/11/2025-10:08:56] [TRT-LLM] [I] Estimated max num tokens (after fine-tune): 4096 | |
| [06/11/2025-10:08:56] [TRT-LLM] [I] Max batch size and max num tokens not provided. Using heuristi_size=2048, max_num_tokens=4096. | |
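[Note: the "1966.57" estimate above is consistent with a back-of-the-envelope KV-cache calculation, assuming (not printed in this log) Llama-3.1-8B's 32 layers, 8 KV heads, head dim 128, and a 16-bit KV cache:
    per token:    2 (K and V) x 32 layers x 8 heads x 128 dim x 2 bytes = 128 KiB
    per sequence: 256 tokens (max_seq_len) x 128 KiB = 32 MiB
    requests:     61.46 GiB / 32 MiB ≈ 1966 ]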
| [06/11/2025-10:08:56] [TRT-LLM] [I] Setting PyTorch max sequence length to 256 | |
| [06/11/2025-10:08:56] [TRT-LLM] [I] Setting up throughput benchmark. | |
| [06/11/2025-10:08:56] [TRT-LLM] [W] Using default gpus_per_node: 8 | |
| [06/11/2025-10:08:56] [TRT-LLM] [I] Set nccl_plugin to None. | |
| [06/11/2025-10:08:56] [TRT-LLM] [I] model='meta-llama/Llama-3.1-8B' tokenizer=None tokenizer_mode=_remote_code=True tensor_parallel_size=1 dtype='auto' revision=None tokenizer_revision=None pipeliize=1 gpus_per_node=8 moe_cluster_parallel_size=-1 moe_tensor_parallel_size=-1 moe_expert_parallel_config={} load_format=<LoadFormat.AUTO: 0> enable_lora=False max_lora_rank=None max_loras=4 max_compt_adapter=False max_prompt_adapter_token=0 quant_config=QuantConfig(quant_algo=None, kv_cache_qhquant_val=0.5, clamp_val=None, use_meta_recipe=False, has_zero_point=False, pre_quant_scale=Falseig=KvCacheConfig(enable_block_reuse=False, max_tokens=None, max_attention_window=None, sink_token_=0.9, host_cache_size=None, onboard_blocks=True, cross_kv_cache_fraction=None, secondary_offload_mze=0, enable_partial_reuse=True, copy_on_partial_reuse=True) enable_chunked_prefill=False guided_drocessor=None iter_stats_max_iterations=None request_stats_max_iterations=None peft_cache_config=Ncapacity_scheduler_policy=<CapacitySchedulerPolicy.GUARANTEED_NO_EVICT: 'GUARANTEED_NO_EVICT'>, coatch_config=DynamicBatchConfig(enable_batch_size_tuning=True, enable_max_num_tokens_tuning=False, 28)) cache_transceiver_config=None speculative_config=None batching_type=<BatchingType.INFLIGHT: 'max_batch_size=2048 max_input_len=1024 max_seq_len=256 max_beam_width=1 max_num_tokens=4096 backenits=False num_postprocess_workers=0 postprocess_tokenizer_dir=None reasoning_parser=None decoding_nfig=BuildConfig(max_input_len=1024, max_seq_len=None, opt_batch_size=8, max_batch_size=2048, max_t_num_tokens=None, max_prompt_embedding_table_size=0, kv_cache_type=None, gather_context_logits=Fastrongly_typed=True, force_num_profiles=None, profiling_verbosity='layer_names_only', enable_debuglative_decoding_mode=<SpeculativeDecodingMode.NONE: 1>, use_refit=False, input_timing_cache=None, ra_config=LoraConfig(lora_dir=[], lora_ckpt_source='hf', max_lora_rank=64, lora_target_modules=[],_loras=4, max_cpu_loras=4), auto_parallel_config=AutoParallelConfig(world_size=1, gpus_per_node=8, sharding_cost_model=<CostModel.ALPHA_BETA: 'alpha_beta'>, comm_cost_model=<CostModel.ALPHA_BETA: elism=False, enable_shard_unbalanced_shape=False, enable_shard_dynamic_shape=False, enable_reduce_ug_mode=False, infer_shape=True, validation_mode=False, same_buffer_io={}, same_spec_io={}, sharde, parallel_config_cache=None, profile_cache=None, dump_path=None, debug_outputs=[]), weight_sparsigin_config=PluginConfig(_dtype='float16', _bert_attention_plugin='auto', _gpt_attention_plugin='auisable_gemm_plugin=False, _gemm_swiglu_plugin=None, _fp8_rowwise_gemm_plugin=None, _qserve_gemm_plcl_plugin=None, _lora_plugin=None, _dora_plugin=False, _weight_only_groupwise_quant_matmul_plugin=n=None, _smooth_quant_plugins=True, _smooth_quant_gemm_plugin=None, _layernorm_quantization_pluginone, _quantize_per_token_plugin=False, _quantize_tensor_plugin=False, _moe_plugin='auto', _mamba_cm_plugin=None, _low_latency_gemm_swiglu_plugin=None, _gemm_allreduce_plugin=None, _context_fmha=Tr, _paged_kv_cache=None, _remove_input_padding=True, _norm_quant_fusion=False, _reduce_fusion=Falseck=32, _use_paged_context_fmha=True, _use_fp8_context_fmha=True, _fuse_fp4_quant=False, _multiple_treamingllm=False, _manage_weights=False, _use_fused_mlp=True, _pp_reduce_scatter=False), use_stri024, dry_run=False, visualize_network=None, monitor_memory=False, use_mrope=False) use_cuda_graph= 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 256, 512, 1024, 2048] cuda_grading_enabled=True 
disable_overlap_scheduler=False moe_max_num_tokens=None moe_load_balancer=None a'CUTLASS' mixed_sampler=False enable_trtllm_sampler=False kv_cache_dtype='auto' use_kv_cache=True ter_req_stats=False print_iter_log=False torch_compile_enabled=True torch_compile_fullgraph=True torch_compile_piecewise_cuda_graph=False torch_compile_enable_userbuffers=True autotuner_enabled=Tr auto_deploy_config=None enable_min_latency=False model_factory='AutoModelForCausalLM' model_kwarga_backend='MultiHeadLatentAttention' skip_loading_weights=False free_mem_ratio=0.8 simple_shard_on_device=None extended_runtime_perf_knob_config=ExtendedRuntimePerfKnobConfig(multi_block_mode=True cuda_graph_mode=True, cuda_graph_cache_size=1000) parallel_config=_ParallelConfig(tp_size=1, pp_se_cluster_size=-1, moe_tp_size=-1, moe_ep_size=-1, cp_config={}, enable_attention_dp=False, auto_ps=None) model_format=<_ModelFormatKind.HF: 0> speculative_model=None | |
| rank 0 using MpiPoolSession to spawn MPI processes | |
| [06/11/2025-10:08:56] [TRT-LLM] [I] Generating a new HMAC key for server proxy_request_queue | |
| [06/11/2025-10:08:56] [TRT-LLM] [I] Generating a new HMAC key for server worker_init_status_queue | |
| [06/11/2025-10:08:56] [TRT-LLM] [I] Generating a new HMAC key for server proxy_result_queue | |
| [06/11/2025-10:08:56] [TRT-LLM] [I] Generating a new HMAC key for server proxy_stats_queue | |
| [06/11/2025-10:08:56] [TRT-LLM] [I] Generating a new HMAC key for server proxy_kv_cache_events_que | |
| 2025-06-11 10:09:05,257 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend | |
| [TensorRT-LLM] TensorRT-LLM version: 0.21.0rc1 | |
| [TensorRT-LLM][INFO] Refreshed the MPI local session | |
| [06/11/2025-10:09:06] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Initializing for: lib='OMPI', local_rank= | |
| [06/11/2025-10:09:06] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] max_seq_len=256, max_batch_size=2048, att | |
| /root/.cache/huggingface/hub/models--meta-llama--Llama-3.1-8B/snapshots/d04e592bb4f6aa9cfee91e2e20 | |
| /usr/local/lib/python3.12/dist-packages/torch/backends/mkldnn/__init__.py:78: UserWarning: TF32 acble for Intel GPUs. The current Torch version does not have Intel GPU Support. (Triggered internalTen/Context.cpp:148.) | |
| torch._C._set_onednn_allow_tf32(_allow_tf32) | |
| [06/11/2025-10:09:12] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] No quantization to do. | |
| [06/11/2025-10:09:12] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Found 0 MoE Patterns | |
| [06/11/2025-10:09:13] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Found 64 repeat_kv patterns | |
| [06/11/2025-10:09:13] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Found 0 eager attention patterns | |
| [06/11/2025-10:09:13] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Found 32 grouped attention patterns | |
| [06/11/2025-10:09:13] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Found 32 causal mask attention patterns | |
| [06/11/2025-10:09:13] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Found and matched 32 attention layouts | |
[the UserWarning from torch/backends/mkldnn/__init__.py:78 about TF32/oneDNN and Intel GPU support, together with the torch._C._set_onednn_allow_tf32(_allow_tf32) line, repeats another 35 times here; the verbatim duplicates are omitted]
| [06/11/2025-10:09:14] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Found and matched 32 RoPE patterns | |
| [06/11/2025-10:09:14] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Match RoPE layout to bsnd | |
| [06/11/2025-10:09:14] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Found 32 RoPE layout matches | |
| [06/11/2025-10:09:14] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Found and eliminated 192 redundant transp | |
| [06/11/2025-10:09:14] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Found 32 RoPE optimizations | |
| [06/11/2025-10:09:14] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Skipping sharding for single device | |
| [06/11/2025-10:09:14] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Skipping sharding for single device | |
| [06/11/2025-10:09:14] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Skipping sharding for single device | |
| [06/11/2025-10:09:15] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Loading and initializing weights. | |
| [06/11/2025-10:09:21] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Found 0 allreduce+residual+rmsnorm fusion | |
| [06/11/2025-10:09:21] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Found 0 GEMM+Collective fusions | |
| [06/11/2025-10:09:21] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Found 2 input nodes and 1 output nodes | |
| [06/11/2025-10:09:21] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Added 4 new input nodes for cached attent | |
| [06/11/2025-10:09:21] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Replaced 32 attention.bsnd_grouped_sdpa o_cache | |
| [06/11/2025-10:09:21] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Initialized 65 caches for cached attentio | |
| [06/11/2025-10:09:21] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Free memory ratio: 0.8 | |
| [06/11/2025-10:09:21] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Free memory (MB): 59648 , Total memory (M | |
| [06/11/2025-10:09:21] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Current cache size: 536870912, Current nu | |
| [06/11/2025-10:09:21] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Free memory before forward pass: 62545788 | |
| 2025-06-11 10:09:22,734 - INFO - flashinfer.jit: Loading JIT ops: rope | |
| 2025-06-11 10:09:22,746 - INFO - flashinfer.jit: Finished loading JIT ops: rope | |
| 2025-06-11 10:09:22,748 - INFO - flashinfer.jit: Loading JIT ops: page | |
| 2025-06-11 10:09:22,757 - INFO - flashinfer.jit: Finished loading JIT ops: page | |
| 2025-06-11 10:09:22,763 - INFO - flashinfer.jit: Loading JIT ops: batch_prefill_with_kv_cache_dtyptype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_Fal | |
| 2025-06-11 10:09:22,773 - INFO - flashinfer.jit: Finished loading JIT ops: batch_prefill_with_kv_c_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_ | |
| [06/11/2025-10:09:22] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Free memory after forward pass: 624723886 | |
| [06/11/2025-10:09:22] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Memory for forward pass: 73400320 | |
| [06/11/2025-10:09:24] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] After all_gather - new_num_pages: 6021 | |
| [06/11/2025-10:09:24] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Fusion before compiling... | |
| [06/11/2025-10:09:24] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Compiling for torch-opt backend... | |
| [06/11/2025-10:09:29] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 1 | |
| [06/11/2025-10:09:34] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 2 | |
| [06/11/2025-10:09:39] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 4 | |
| [06/11/2025-10:09:43] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 8 | |
| [06/11/2025-10:09:48] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 16 | |
| [06/11/2025-10:09:53] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 24 | |
| [06/11/2025-10:09:58] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 32 | |
| [06/11/2025-10:10:02] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 40 | |
| [rank0]:W0611 10:10:02.995000 117966 torch/_dynamo/convert_frame.py:961] [0/8] torch._dynamo hit c | |
| [rank0]:W0611 10:10:02.995000 117966 torch/_dynamo/convert_frame.py:961] [0/8] function: 'forwa | |
| [rank0]:W0611 10:10:02.995000 117966 torch/_dynamo/convert_frame.py:961] [0/8] last reason: 0/7ch at index 0. expected 32, actual 40 | |
| [rank0]:W0611 10:10:02.995000 117966 torch/_dynamo/convert_frame.py:961] [0/8] To log all recompililes". | |
| [rank0]:W0611 10:10:02.995000 117966 torch/_dynamo/convert_frame.py:961] [0/8] To diagnose recompig/docs/main/torch.compiler_troubleshooting.html. | |
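[Note: the (truncated) torch._dynamo warning above fires once graph capture reaches batch size 40: the guard on the batch dimension ("expected 32, actual 40") fails, and after 8 recompilations the default recompile cache limit is exhausted, so torch.compile stops recompiling this frame; the much faster capture times from batch size 48 onward are consistent with a fallback to the uncompiled path. To log every recompilation reason, the warning suggests re-running with TORCH_LOGS set; a plausible re-run, reusing the invocation from the top of this log and the full dataset path from the DATASET DETAILS block, would be:]

TORCH_LOGS="recompiles" trtllm-bench --model $MODEL_ID throughput --dataset /tmp/synthetic_128_128.txt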
| [06/11/2025-10:10:03] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 48 | |
| [06/11/2025-10:10:03] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 56 | |
| [06/11/2025-10:10:04] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 64 | |
| [06/11/2025-10:10:04] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 72 | |
| [06/11/2025-10:10:05] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 80 | |
| [06/11/2025-10:10:05] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 88 | |
| [06/11/2025-10:10:05] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 96 | |
| [06/11/2025-10:10:06] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 104 | |
| [06/11/2025-10:10:06] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 112 | |
| [06/11/2025-10:10:07] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 120 | |
| [06/11/2025-10:10:07] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 128 | |
| [06/11/2025-10:10:08] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 256 | |
| [06/11/2025-10:10:08] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 512 | |
| [06/11/2025-10:10:08] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 1024 | |
| [06/11/2025-10:10:09] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Capturing graph for batch size: 2048 | |
| [06/11/2025-10:10:09] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Compile time with backend torch-opt: 45.1 | |
| [06/11/2025-10:10:10] [TRT-LLM AUTO-DEPLOY] [RANK 0] [I] Using fake cache manager with head_dim=0 | |
| [TensorRT-LLM][INFO] Max KV cache pages per sequence: 4 [window size=256] | |
| [TensorRT-LLM][INFO] Number of tokens per block: 64. | |
| [TensorRT-LLM][INFO] [MemUsageChange] Allocated 0.00 GiB for max tokens in paged KV cache (385344) | |
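[Note: the paged-KV numbers above are internally consistent: a 256-token window at 64 tokens per block needs 256 / 64 = 4 pages per sequence, and the 385344 max tokens equal the 6021 pages reported earlier ("new_num_pages: 6021") x 64 tokens per block.]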
| [06/11/2025-10:10:10] [TRT-LLM] [I] Setting up for warmup... | |
| [06/11/2025-10:10:10] [TRT-LLM] [I] Running warmup. | |
| [06/11/2025-10:10:10] [TRT-LLM] [I] Starting benchmarking async task. | |
| [06/11/2025-10:10:10] [TRT-LLM] [I] Starting benchmark... | |
| [06/11/2025-10:10:10] [TRT-LLM] [I] Request submission complete. [count=2, time=0.0000s, rate=1698 | |
| [06/11/2025-10:10:12] [TRT-LLM] [I] Benchmark complete. | |
| [06/11/2025-10:10:12] [TRT-LLM] [I] Stopping LLM backend. | |
| [06/11/2025-10:10:12] [TRT-LLM] [I] Cancelling all 0 tasks to complete. | |
| [06/11/2025-10:10:12] [TRT-LLM] [I] All tasks cancelled. | |
| [06/11/2025-10:10:12] [TRT-LLM] [I] LLM Backend stopped. | |
| [06/11/2025-10:10:12] [TRT-LLM] [I] Worker task cancelled. | |
| [06/11/2025-10:10:12] [TRT-LLM] [I] Warmup done. | |
| [06/11/2025-10:10:12] [TRT-LLM] [I] No log path provided, skipping logging. | |
| [06/11/2025-10:10:12] [TRT-LLM] [I] Starting benchmarking async task. | |
| [06/11/2025-10:10:12] [TRT-LLM] [I] Starting benchmark... | |
| [06/11/2025-10:10:12] [TRT-LLM] [I] Request submission complete. [count=3000, time=0.0013s, rate=2 | |
| Traceback (most recent call last): | |
| File "/app/tensorrt_llm/TensorRT-LLM/tensorrt_llm/_torch/pyexecutor/py_executor.py", line 1612, | |
| outputs = forward(scheduled_requests, self.resource_manager, | |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
| File "/usr/local/lib/python3.12/dist-packages/nvtx/nvtx.py", line 122, in inner | |
| result = func(*args, **kwargs) | |
| ^^^^^^^^^^^^^^^^^^^^^ | |
| File "/app/tensorrt_llm/TensorRT-LLM/tensorrt_llm/_torch/pyexecutor/py_executor.py", line 1602, | |
| return self.model_engine.forward( | |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
| File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate | |
| return func(*args, **kwargs) | |
| ^^^^^^^^^^^^^^^^^^^^^ | |
| File "/app/tensorrt_llm/TensorRT-LLM/tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py", line | |
| last_logit_only = self._prepare_inputs(scheduled_requests, resource_manager, new_tokens) | |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
| File "/usr/local/lib/python3.12/dist-packages/nvtx/nvtx.py", line 122, in inner | |
| result = func(*args, **kwargs) | |
| ^^^^^^^^^^^^^^^^^^^^^ | |
| File "/app/tensorrt_llm/TensorRT-LLM/tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py", line | |
| new_tokens_list = new_tokens.cpu().tolist() if new_tokens is not None else None | |
| ^^^^^^^^^^^^^^^^ | |
| RuntimeError: CUDA error: an illegal memory access was encountered | |
| CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace belo | |
| For debugging consider passing CUDA_LAUNCH_BLOCKING=1 | |
| Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions. | |
| [06/11/2025-10:10:24] [TRT-LLM] [E] Encountered an error in forward function: CUDA error: an illeg | |
| CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace belo | |
| For debugging consider passing CUDA_LAUNCH_BLOCKING=1 | |
| Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions. | |
| Traceback (most recent call last): | |
| File "/app/tensorrt_llm/TensorRT-LLM/tensorrt_llm/_torch/pyexecutor/py_executor.py", line 1686, | |
| self.sampler.update_requests(sample_state) | |
| File "/app/tensorrt_llm/TensorRT-LLM/tensorrt_llm/_torch/pyexecutor/sampler.py", line 241, in up | |
| state.sampler_event.synchronize() | |
| File "/usr/local/lib/python3.12/dist-packages/torch/cuda/streams.py", line 227, in synchronize | |
| super().synchronize() | |
| RuntimeError: CUDA error: an illegal memory access was encountered | |
| CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace belo | |
| For debugging consider passing CUDA_LAUNCH_BLOCKING=1 | |
| Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions. | |
| [06/11/2025-10:10:24] [TRT-LLM] [E] Encountered an error in sampling: CUDA error: an illegal memor | |
| CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace belo | |
| For debugging consider passing CUDA_LAUNCH_BLOCKING=1 | |
| Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions. | |
| .... |
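[Note: the benchmark aborts on an asynchronous CUDA illegal-memory-access error that first surfaces at new_tokens.cpu().tolist() in the auto-deploy executor and again when the sampler event synchronizes; because such errors are reported asynchronously, the stack traces above do not necessarily point at the faulting kernel. Following the hint printed in the log, a minimal re-run to localize the fault (synchronous kernel launches change only error reporting, not the bug itself) would be along the lines of:]

CUDA_LAUNCH_BLOCKING=1 trtllm-bench --model $MODEL_ID throughput --dataset /tmp/synthetic_128_128.txt

[TORCH_USE_CUDA_DSA, also mentioned in the error text, enables device-side assertions but is a PyTorch build-time option rather than a runtime environment variable.]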