@namgyu-youn
Created January 8, 2026 18:08
[TorchAO] SmoothQuant benchmark error in vLLM
./benchmarks/quantization/measure_accuracy_and_performance.sh smoothquant_int8 meta-llama/Llama-3.2-1B
Skipping import of cpp extensions due to incompatible torch version 2.9.0+cu128 for torchao version 0.15.0 Please see https://github.com/pytorch/ao/issues/2919 for more info
torch.__version__='2.9.0+cu128'
torch.cuda.get_device_name()='NVIDIA A100 80GB PCIe MIG 2g.20gb'
torchao.__version__='0.15.0'
vllm.__version__='0.13.0'
processing quant_recipe smoothquant_int8
Skipping import of cpp extensions due to incompatible torch version 2.9.0+cu128 for torchao version 0.15.0 Please see https://github.com/pytorch/ao/issues/2919 for more info
Running model_id='meta-llama/Llama-3.2-1B' with quant_recipe_name='smoothquant_int8'
Quantizing model with config: SmoothQuantConfig(base_config=Int8DynamicActivationInt8WeightConfig(layout=PlainLayout(), act_mapping_type=<MappingType.SYMMETRIC: 1>, weight_only_decode=False, granularity=PerRow(dim=-1), set_inductor_config=True, version=2), step='prepare_for_loading', alpha=0.5)
[2026-01-08 18:04:08] INFO modeling.py:987: We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` to a higher value to use more memory (at your own risk).
/home/elicer/ao/.venv/lib/python3.10/site-packages/torch/__init__.py:1617: UserWarning: Please use the new API settings to control TF32 behavior, such as torch.backends.cudnn.conv.fp32_precision = 'tf32' or torch.backends.cuda.matmul.fp32_precision = 'ieee'. Old settings, e.g., torch.backends.cuda.matmul.allow_tf32 = True, torch.backends.cudnn.allow_tf32 = True, allowTF32CuDNN() and allowTF32CuBLAS() will be deprecated after PyTorch 2.9. Please see https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /pytorch/aten/src/ATen/Context.cpp:80.)
  _C._set_float32_matmul_precision(precision)
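For reference, this recipe can be reproduced outside the benchmark script with torchao's quantize_ API. A minimal sketch mirroring the printed config above; the SmoothQuantConfig import path is an assumption (it has lived under torchao.prototype.smoothquant in some releases), and bf16 loading simply matches the dtype used elsewhere in this log:

import torch
from transformers import AutoModelForCausalLM
from torchao.quantization import quantize_, Int8DynamicActivationInt8WeightConfig
from torchao.prototype.smoothquant import SmoothQuantConfig  # path may vary by torchao release

# Load the base model (bf16 matches the dtype used in the vLLM runs below)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B", torch_dtype=torch.bfloat16, device_map="cuda"
)

# Mirror the logged config: int8 dynamic-activation / int8-weight base recipe,
# SmoothQuant smoothing with alpha=0.5, prepared for checkpoint loading
config = SmoothQuantConfig(
    base_config=Int8DynamicActivationInt8WeightConfig(version=2),
    step="prepare_for_loading",
    alpha=0.5,
)
quantize_(model, config)  # swaps eligible Linear weights in place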
LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 2048)
    (layers): ModuleList(
      (0-15): 16 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (up_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (down_proj): Linear(in_features=8192, out_features=2048, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((2048,), eps=1e-05)
    (rotary_emb): LlamaRotaryEmbedding()
  )
  (lm_head): Linear(in_features=2048, out_features=128256, bias=False)
)
saved model_id='meta-llama/Llama-3.2-1B', quant_recipe_name='smoothquant_int8' to model_output_dir='benchmarks/data/quantized_model/meta-llama/Llama-3.2-1B-smoothquant_int8/'
checkpoint size: 2.488941689 GB
Skipping import of cpp extensions due to incompatible torch version 2.9.0+cu128 for torchao version 0.15.0 Please see https://github.com/pytorch/ao/issues/2919 for more info
[2026-01-08 18:04:30] WARNING __main__.py:369: --limit SHOULD ONLY BE USED FOR TESTING. REAL METRICS SHOULD NOT BE COMPUTED USING LIMIT.
[2026-01-08 18:04:30] INFO __main__.py:465: Selected Tasks: ['winogrande']
[2026-01-08 18:04:30] INFO evaluator.py:202: Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
[2026-01-08 18:04:30] INFO evaluator.py:240: Initializing hf model, with arguments: {'pretrained': 'benchmarks/data/quantized_model/meta-llama/Llama-3.2-1B-smoothquant_int8/'}
[2026-01-08 18:04:30] INFO huggingface.py:158: Using device 'cuda:0'
The tokenizer you are loading from 'benchmarks/data/quantized_model/meta-llama/Llama-3.2-1B-smoothquant_int8/' has an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
/home/elicer/ao/torchao/core/config.py:253: UserWarning: Stored version is not the same as current default version of the config: stored_version=2, current_default_version=1, please check the deprecation warning
  warnings.warn(
[2026-01-08 18:04:31] INFO huggingface.py:420: Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda:0'}
/home/elicer/ao/.venv/lib/python3.10/site-packages/transformers/quantizers/auto.py:239: UserWarning: You passed `quantization_config` or equivalent parameters to `from_pretrained` but the model you're loading already has a `quantization_config` attribute. The `quantization_config` from the model will be used.
  warnings.warn(warning_msg)
The module name (originally ) is not a valid Python identifier. Please rename the original module to avoid import issues.
Generating train split: 100%|██████████| 40398/40398 [00:00<00:00, 1741218.89 examples/s]
Generating test split: 100%|██████████| 1767/1767 [00:00<00:00, 598508.86 examples/s]
Generating validation split: 100%|██████████| 1267/1267 [00:00<00:00, 453622.12 examples/s]
[2026-01-08 18:04:44] INFO __init__.py:695: Selected tasks:
[2026-01-08 18:04:44] INFO __init__.py:686: Task: winogrande (winogrande/default.yaml)
[2026-01-08 18:04:44] INFO task.py:434: Building contexts for winogrande on rank 0...
100%|██████████| 100/100 [00:00<00:00, 51482.80it/s]
[2026-01-08 18:04:44] INFO evaluator.py:574: Running loglikelihood requests
Running loglikelihood requests: 100%|██████████| 200/200 [00:03<00:00, 54.01it/s]
[2026-01-08 18:04:49] INFO evaluation_tracker.py:209: Saving results aggregated
hf (pretrained=benchmarks/data/quantized_model/meta-llama/Llama-3.2-1B-smoothquant_int8/), gen_kwargs: (None), limit: 100.0, num_fewshot: None, batch_size: 1
| Tasks |Version|Filter|n-shot|Metric| |Value| |Stderr|
|----------|------:|------|-----:|------|---|----:|---|-----:|
|winogrande| 1|none | 0|acc |↑ | 0.63|± |0.0485|
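The winogrande score above comes from lm-eval run against the saved quantized checkpoint. A roughly equivalent standalone invocation, reconstructed from the logged arguments (task, limit 100, batch size 1, pretrained path); treat the exact flags as an assumption about how the script drives lm-eval:

lm_eval --model hf \
  --model_args pretrained=benchmarks/data/quantized_model/meta-llama/Llama-3.2-1B-smoothquant_int8/ \
  --tasks winogrande --limit 100 --batch_size 1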
benchmarking vllm prefill performance with --num_prompts 8 --input_len 1024 --output_len 32 --max_model_len 1056
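Per the traceback below, this step goes through vLLM's throughput benchmark CLI (vllm/entrypoints/cli/benchmark/throughput.py). A roughly equivalent standalone command, reconstructed from the logged non-default args; the exact flag spellings are assumptions:

vllm bench throughput \
  --model benchmarks/data/quantized_model/meta-llama/Llama-3.2-1B-smoothquant_int8/ \
  --dtype bfloat16 --num-prompts 8 --input-len 1024 --output-len 32 --max-model-len 1056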
Skipping import of cpp extensions due to incompatible torch version 2.9.0+cu128 for torchao version 0.15.0 Please see https://github.com/pytorch/ao/issues/2919 for more info
The tokenizer you are loading from 'benchmarks/data/quantized_model/meta-llama/Llama-3.2-1B-smoothquant_int8/' has an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
When dataset path is not set, it will default to random dataset
INFO 01-08 18:05:02 [datasets.py:612] Sampling input_len from [1023, 1023] and output_len from [32, 32]
INFO 01-08 18:05:02 [utils.py:253] non-default args: {'tokenizer': 'benchmarks/data/quantized_model/meta-llama/Llama-3.2-1B-smoothquant_int8/', 'dtype': 'bfloat16', 'max_model_len': 1056, 'enable_lora': None, 'reasoning_parser_plugin': '', 'model': 'benchmarks/data/quantized_model/meta-llama/Llama-3.2-1B-smoothquant_int8/'}
INFO 01-08 18:05:12 [model.py:514] Resolved architecture: LlamaForCausalLM
INFO 01-08 18:05:12 [model.py:1661] Using max model len 1056
INFO 01-08 18:05:13 [scheduler.py:230] Chunked prefill is enabled with max_num_batched_tokens=8192.
/home/elicer/ao/torchao/core/config.py:253: UserWarning: Stored version is not the same as current default version of the config: stored_version=2, current_default_version=1, please check the deprecation warning
  warnings.warn(
The tokenizer you are loading from 'benchmarks/data/quantized_model/meta-llama/Llama-3.2-1B-smoothquant_int8/' has an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
Skipping import of cpp extensions due to incompatible torch version 2.9.0+cu128 for torchao version 0.15.0 Please see https://github.com/pytorch/ao/issues/2919 for more info
(EngineCore_DP0 pid=3286) INFO 01-08 18:05:23 [core.py:93] Initializing a V1 LLM engine (v0.13.0) with config: model='benchmarks/data/quantized_model/meta-llama/Llama-3.2-1B-smoothquant_int8/', speculative_config=None, tokenizer='benchmarks/data/quantized_model/meta-llama/Llama-3.2-1B-smoothquant_int8/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=1056, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=torchao, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False), seed=0, served_model_name=benchmarks/data/quantized_model/meta-llama/Llama-3.2-1B-smoothquant_int8/, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False}, 'local_cache_dir': None}
(EngineCore_DP0 pid=3286) INFO 01-08 18:05:23 [parallel_state.py:1203] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.0.2.100:33409 backend=nccl
(EngineCore_DP0 pid=3286) INFO 01-08 18:05:23 [parallel_state.py:1411] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=3286) INFO 01-08 18:05:25 [gpu_model_runner.py:3562] Starting to load model benchmarks/data/quantized_model/meta-llama/Llama-3.2-1B-smoothquant_int8/...
(EngineCore_DP0 pid=3286) /home/elicer/ao/.venv/lib/python3.10/site-packages/torch/__init__.py:1617: UserWarning: Please use the new API settings to control TF32 behavior, such as torch.backends.cudnn.conv.fp32_precision = 'tf32' or torch.backends.cuda.matmul.fp32_precision = 'ieee'. Old settings, e.g., torch.backends.cuda.matmul.allow_tf32 = True, torch.backends.cudnn.allow_tf32 = True, allowTF32CuDNN() and allowTF32CuBLAS() will be deprecated after PyTorch 2.9. Please see https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /pytorch/aten/src/ATen/Context.cpp:80.)
(EngineCore_DP0 pid=3286)   _C._set_float32_matmul_precision(precision)
(EngineCore_DP0 pid=3286) INFO 01-08 18:05:50 [cuda.py:351] Using FLASH_ATTN attention backend out of potential backends: ('FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION')
(EngineCore_DP0 pid=3286) /home/elicer/ao/torchao/core/config.py:253: UserWarning: Stored version is not the same as current default version of the config: stored_version=2, current_default_version=1, please check the deprecation warning
(EngineCore_DP0 pid=3286)   warnings.warn(
Loading pt checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866] EngineCore failed to start.
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866] Traceback (most recent call last):
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866] File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 857, in run_engine_core
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866] engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866] File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 637, in __init__
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866] super().__init__(
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866] File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 102, in __init__
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866] self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866] File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/executor/abstract.py", line 101, in __init__
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866] self._init_executor()
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866] File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/executor/uniproc_executor.py", line 48, in _init_executor
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866] self.driver_worker.load_model()
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866] File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 289, in load_model
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866] self.model_runner.load_model(eep_scale_up=eep_scale_up)
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866] File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3581, in load_model
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866] self.model = model_loader.load_model(
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866] File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/model_executor/model_loader/base_loader.py", line 55, in load_model
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866] self.load_weights(model, model_config)
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866] File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/model_executor/model_loader/default_loader.py", line 305, in load_weights
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866] loaded_weights = model.load_weights(self.get_all_weights(model_config, model))
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866] File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 640, in load_weights
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866] return loader.load_weights(
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866] File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/model_executor/model_loader/online_quantization.py", line 173, in patched_model_load_weights
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866] return original_load_weights(auto_weight_loader, weights, mapper=mapper)
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866] File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/model_executor/models/utils.py", line 335, in load_weights
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866] autoloaded_weights = set(self._load_module("", self.module, weights))
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866] File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/model_executor/models/utils.py", line 288, in _load_module
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866] yield from self._load_module(
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866] File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/model_executor/models/utils.py", line 261, in _load_module
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866] loaded_params = module_load_weights(weights)
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866] File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 497, in load_weights
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866] weight_loader(param, loaded_weight, shard_id)
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866] File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/model_executor/layers/linear.py", line 1238, in weight_loader
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866] param_data.copy_(loaded_weight)
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866] File "/home/elicer/ao/torchao/utils.py", line 634, in _dispatch__torch_function__
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866] return func(*args, **kwargs)
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866] File "/home/elicer/ao/torchao/utils.py", line 650, in _dispatch__torch_dispatch__
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866] return cls._ATEN_OP_TABLE[cls][func](func, types, args, kwargs)
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866] File "/home/elicer/ao/torchao/utils.py", line 413, in wrapper
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866] return _func(f, types, args, kwargs)
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866] File "/home/elicer/ao/torchao/utils.py", line 544, in _
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866] if _same_metadata(self, src):
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866] File "/home/elicer/ao/torchao/utils.py", line 503, in _same_metadata
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866] _tensor_shape_match = all(
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866] File "/home/elicer/ao/torchao/utils.py", line 504, in <genexpr>
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866] getattr(self, t_name).shape == getattr(src, t_name).shape
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866] AttributeError: 'Tensor' object has no attribute 'qdata'
(EngineCore_DP0 pid=3286) Process EngineCore_DP0:
(EngineCore_DP0 pid=3286) Traceback (most recent call last):
(EngineCore_DP0 pid=3286) File "/usr/local/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=3286) self.run()
(EngineCore_DP0 pid=3286) File "/usr/local/lib/python3.10/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=3286) self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=3286) File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 870, in run_engine_core
(EngineCore_DP0 pid=3286) raise e
(EngineCore_DP0 pid=3286) File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 857, in run_engine_core
(EngineCore_DP0 pid=3286) engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=3286) File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 637, in __init__
(EngineCore_DP0 pid=3286) super().__init__(
(EngineCore_DP0 pid=3286) File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 102, in __init__
(EngineCore_DP0 pid=3286) self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=3286) File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/executor/abstract.py", line 101, in __init__
(EngineCore_DP0 pid=3286) self._init_executor()
(EngineCore_DP0 pid=3286) File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/executor/uniproc_executor.py", line 48, in _init_executor
(EngineCore_DP0 pid=3286) self.driver_worker.load_model()
(EngineCore_DP0 pid=3286) File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 289, in load_model
(EngineCore_DP0 pid=3286) self.model_runner.load_model(eep_scale_up=eep_scale_up)
(EngineCore_DP0 pid=3286) File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3581, in load_model
(EngineCore_DP0 pid=3286) self.model = model_loader.load_model(
(EngineCore_DP0 pid=3286) File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/model_executor/model_loader/base_loader.py", line 55, in load_model
(EngineCore_DP0 pid=3286) self.load_weights(model, model_config)
(EngineCore_DP0 pid=3286) File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/model_executor/model_loader/default_loader.py", line 305, in load_weights
(EngineCore_DP0 pid=3286) loaded_weights = model.load_weights(self.get_all_weights(model_config, model))
(EngineCore_DP0 pid=3286) File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 640, in load_weights
(EngineCore_DP0 pid=3286) return loader.load_weights(
(EngineCore_DP0 pid=3286) File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/model_executor/model_loader/online_quantization.py", line 173, in patched_model_load_weights
(EngineCore_DP0 pid=3286) return original_load_weights(auto_weight_loader, weights, mapper=mapper)
(EngineCore_DP0 pid=3286) File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/model_executor/models/utils.py", line 335, in load_weights
(EngineCore_DP0 pid=3286) autoloaded_weights = set(self._load_module("", self.module, weights))
(EngineCore_DP0 pid=3286) File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/model_executor/models/utils.py", line 288, in _load_module
(EngineCore_DP0 pid=3286) yield from self._load_module(
(EngineCore_DP0 pid=3286) File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/model_executor/models/utils.py", line 261, in _load_module
(EngineCore_DP0 pid=3286) loaded_params = module_load_weights(weights)
(EngineCore_DP0 pid=3286) File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 497, in load_weights
(EngineCore_DP0 pid=3286) weight_loader(param, loaded_weight, shard_id)
(EngineCore_DP0 pid=3286) File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/model_executor/layers/linear.py", line 1238, in weight_loader
(EngineCore_DP0 pid=3286) param_data.copy_(loaded_weight)
(EngineCore_DP0 pid=3286) File "/home/elicer/ao/torchao/utils.py", line 634, in _dispatch__torch_function__
(EngineCore_DP0 pid=3286) return func(*args, **kwargs)
(EngineCore_DP0 pid=3286) File "/home/elicer/ao/torchao/utils.py", line 650, in _dispatch__torch_dispatch__
(EngineCore_DP0 pid=3286) return cls._ATEN_OP_TABLE[cls][func](func, types, args, kwargs)
(EngineCore_DP0 pid=3286) File "/home/elicer/ao/torchao/utils.py", line 413, in wrapper
(EngineCore_DP0 pid=3286) return _func(f, types, args, kwargs)
(EngineCore_DP0 pid=3286) File "/home/elicer/ao/torchao/utils.py", line 544, in _
(EngineCore_DP0 pid=3286) if _same_metadata(self, src):
(EngineCore_DP0 pid=3286) File "/home/elicer/ao/torchao/utils.py", line 503, in _same_metadata
(EngineCore_DP0 pid=3286) _tensor_shape_match = all(
(EngineCore_DP0 pid=3286) File "/home/elicer/ao/torchao/utils.py", line 504, in <genexpr>
(EngineCore_DP0 pid=3286) getattr(self, t_name).shape == getattr(src, t_name).shape
(EngineCore_DP0 pid=3286) AttributeError: 'Tensor' object has no attribute 'qdata'
Loading pt checkpoint shards: 0% Completed | 0/1 [00:01<?, ?it/s]
(EngineCore_DP0 pid=3286)
[rank0]:[W108 18:05:53.558061877 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
Traceback (most recent call last):
File "/home/elicer/ao/.venv/bin/vllm", line 10, in <module>
sys.exit(main())
File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/entrypoints/cli/main.py", line 73, in main
args.dispatch_function(args)
File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/entrypoints/cli/benchmark/throughput.py", line 21, in cmd
main(args)
File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/benchmarks/throughput.py", line 730, in main
elapsed_time, request_outputs = run_vllm(
File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/benchmarks/throughput.py", line 51, in run_vllm
llm = LLM(**dataclasses.asdict(engine_args))
File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 351, in __init__
self.llm_engine = LLMEngine.from_engine_args(
File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/engine/llm_engine.py", line 183, in from_engine_args
return cls(
File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/engine/llm_engine.py", line 109, in __init__
self.engine_core = EngineCoreClient.make_client(
File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 93, in make_client
return SyncMPClient(vllm_config, executor_class, log_stats)
File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 648, in __init__
super().__init__(
File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 477, in __init__
with launch_core_engines(vllm_config, executor_class, log_stats) as (
File "/usr/local/lib/python3.10/contextlib.py", line 142, in __exit__
next(self.gen)
File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/engine/utils.py", line 903, in launch_core_engines
wait_for_engine_startup(
File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/engine/utils.py", line 960, in wait_for_engine_startup
raise RuntimeError(
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
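Root cause, as far as this traceback shows: vLLM's weight loader calls param_data.copy_(loaded_weight), where param_data is the torchao quantized tensor subclass installed by step='prepare_for_loading', but loaded_weight arrives from the checkpoint as a plain torch.Tensor. torchao's _same_metadata then reads subclass-only tensor attributes (qdata) from both sides, and the plain tensor has none. A minimal sketch of the failing comparison, with the kind of type guard that would make it fail cleanly instead of raising (not the actual torchao code; attribute names are read off the traceback):

import torch

def same_metadata_sketch(self: torch.Tensor, src: torch.Tensor) -> bool:
    # `self` is the torchao quantized subclass on the vLLM parameter; `src`
    # is the incoming checkpoint weight. When `src` is a plain torch.Tensor,
    # getattr(src, "qdata") raises the AttributeError seen above.
    if type(self) is not type(src):  # guard: plain tensors fail cleanly
        return False
    # Compare the shapes of each inner tensor attribute (e.g. "qdata")
    return all(
        getattr(self, name).shape == getattr(src, name).shape
        for name in getattr(self, "tensor_data_names", ())
    )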
benchmarking vllm decode performance with --num_prompts 32 --input_len 32 --output_len 512 --max_model_len 544
Skipping import of cpp extensions due to incompatible torch version 2.9.0+cu128 for torchao version 0.15.0 Please see https://github.com/pytorch/ao/issues/2919 for more info
The tokenizer you are loading from 'benchmarks/data/quantized_model/meta-llama/Llama-3.2-1B-smoothquant_int8/' has an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
When dataset path is not set, it will default to random dataset
INFO 01-08 18:06:06 [datasets.py:612] Sampling input_len from [31, 31] and output_len from [512, 512]
INFO 01-08 18:06:06 [utils.py:253] non-default args: {'tokenizer': 'benchmarks/data/quantized_model/meta-llama/Llama-3.2-1B-smoothquant_int8/', 'dtype': 'bfloat16', 'max_model_len': 544, 'enable_lora': None, 'reasoning_parser_plugin': '', 'model': 'benchmarks/data/quantized_model/meta-llama/Llama-3.2-1B-smoothquant_int8/'}
INFO 01-08 18:06:06 [model.py:514] Resolved architecture: LlamaForCausalLM
INFO 01-08 18:06:06 [model.py:1661] Using max model len 544
INFO 01-08 18:06:07 [scheduler.py:230] Chunked prefill is enabled with max_num_batched_tokens=8192.
[... same tokenizer, torchao config-version, TF32, and cpp-extension warnings as in the prefill run above ...]
(EngineCore_DP0 pid=3790) INFO 01-08 18:06:17 [core.py:93] Initializing a V1 LLM engine (v0.13.0) with the same config as the prefill run, except max_seq_len=544
(EngineCore_DP0 pid=3790) INFO 01-08 18:06:18 [gpu_model_runner.py:3562] Starting to load model benchmarks/data/quantized_model/meta-llama/Llama-3.2-1B-smoothquant_int8/...
(EngineCore_DP0 pid=3790) ERROR 01-08 18:06:21 [core.py:866] EngineCore failed to start.
[... traceback identical to the prefill run, ending in ...]
(EngineCore_DP0 pid=3790) ERROR 01-08 18:06:21 [core.py:866] AttributeError: 'Tensor' object has no attribute 'qdata'
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
Library Versions:
================================================================================
torch.__version__: 2.9.0+cu128
torch.cuda.get_device_name(): NVIDIA A100 80GB PCIe MIG 2g.20gb
torchao.__version__: 0.15.0
vllm.__version__: 0.13.0
Quantization Recipe Results:
================================================================================
+------------------+--------------+--------------+--------------+--------------+-----------+----------+-----------+-----------+
| Recipe | Checkpoint | Wikitext | Winogrande | Winogrande | Prefill | Decode | Speedup | Speedup |
| | (GB) | Perplexity | Acc | Stderr | toks/s | toks/s | Prefill | Decode |
+==================+==============+==============+==============+==============+===========+==========+===========+===========+
| smoothquant_int8 | 2.49 | | 0.63 | 0.0485 | | | | |
+------------------+--------------+--------------+--------------+--------------+-----------+----------+-----------+-----------+