[TorchAO] SmoothQuant benchmark error in vLLM
./benchmarks/quantization/measure_accuracy_and_performance.sh smoothquant_int8 meta-llama/Llama-3.2-1B

Skipping import of cpp extensions due to incompatible torch version 2.9.0+cu128 for torchao version 0.15.0 Please see https://github.com/pytorch/ao/issues/2919 for more info
torch.__version__='2.9.0+cu128'
torch.cuda.get_device_name()='NVIDIA A100 80GB PCIe MIG 2g.20gb'
torchao.__version__='0.15.0'
vllm.__version__='0.13.0'
processing quant_recipe smoothquant_int8
Skipping import of cpp extensions due to incompatible torch version 2.9.0+cu128 for torchao version 0.15.0 Please see https://github.com/pytorch/ao/issues/2919 for more info
Running model_id='meta-llama/Llama-3.2-1B' with quant_recipe_name='smoothquant_int8'
Quantizing model with config: SmoothQuantConfig(base_config=Int8DynamicActivationInt8WeightConfig(layout=PlainLayout(), act_mapping_type=<MappingType.SYMMETRIC: 1>, weight_only_decode=False, granularity=PerRow(dim=-1), set_inductor_config=True, version=2), step='prepare_for_loading', alpha=0.5)
[2026-01-08 18:04:08] INFO modeling.py:987: We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).
/home/elicer/ao/.venv/lib/python3.10/site-packages/torch/__init__.py:1617: UserWarning: Please use the new API settings to control TF32 behavior, such as torch.backends.cudnn.conv.fp32_precision = 'tf32' or torch.backends.cuda.matmul.fp32_precision = 'ieee'. Old settings, e.g, torch.backends.cuda.matmul.allow_tf32 = True, torch.backends.cudnn.allow_tf32 = True, allowTF32CuDNN() and allowTF32CuBLAS() will be deprecated after Pytorch 2.9. Please see https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /pytorch/aten/src/ATen/Context.cpp:80.)
  _C._set_float32_matmul_precision(precision)
LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 2048)
    (layers): ModuleList(
      (0-15): 16 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (up_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (down_proj): Linear(in_features=8192, out_features=2048, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((2048,), eps=1e-05)
    (rotary_emb): LlamaRotaryEmbedding()
  )
  (lm_head): Linear(in_features=2048, out_features=128256, bias=False)
)
saved model_id='meta-llama/Llama-3.2-1B', quant_recipe_name='smoothquant_int8' to model_output_dir='benchmarks/data/quantized_model/meta-llama/Llama-3.2-1B-smoothquant_int8/'
checkpoint size: 2.488941689 GB
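For reference, the quantization step logged above roughly corresponds to a flow like the sketch below. The import path for SmoothQuantConfig and the prepare/convert step sequence are assumptions inferred from the config repr in the log, not taken from the benchmark script itself:

```python
# Sketch only: mirrors the SmoothQuantConfig printed in the log above.
# Assumed: SmoothQuantConfig import path and the prepare -> convert flow.
import torch
from transformers import AutoModelForCausalLM
from torchao.quantization import Int8DynamicActivationInt8WeightConfig, PerRow, quantize_
from torchao.prototype.smoothquant import SmoothQuantConfig  # assumed import path

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B", torch_dtype=torch.bfloat16, device_map="cuda"
)

# Same base recipe as the log: per-row symmetric int8 weights + dynamic int8 activations.
base = Int8DynamicActivationInt8WeightConfig(granularity=PerRow(), version=2)

# Insert SmoothQuant observers, calibrate, then convert in place.
quantize_(model, SmoothQuantConfig(base_config=base, step="prepare", alpha=0.5))
# ... run a few calibration forward passes here ...
quantize_(model, SmoothQuantConfig(base_config=base, step="convert", alpha=0.5))
```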
Skipping import of cpp extensions due to incompatible torch version 2.9.0+cu128 for torchao version 0.15.0 Please see https://github.com/pytorch/ao/issues/2919 for more info
[2026-01-08 18:04:30] WARNING __main__.py:369: --limit SHOULD ONLY BE USED FOR TESTING.REAL METRICS SHOULD NOT BE COMPUTED USING LIMIT.
[2026-01-08 18:04:30] INFO __main__.py:465: Selected Tasks: ['winogrande']
[2026-01-08 18:04:30] INFO evaluator.py:202: Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
[2026-01-08 18:04:30] INFO evaluator.py:240: Initializing hf model, with arguments: {'pretrained': 'benchmarks/data/quantized_model/meta-llama/Llama-3.2-1B-smoothquant_int8/'}
[2026-01-08 18:04:30] INFO huggingface.py:158: Using device 'cuda:0'
The tokenizer you are loading from 'benchmarks/data/quantized_model/meta-llama/Llama-3.2-1B-smoothquant_int8/' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
/home/elicer/ao/torchao/core/config.py:253: UserWarning: Stored version is not the same as current default version of the config: stored_version=2, current_default_version=1, please check the deprecation warning
  warnings.warn(
[2026-01-08 18:04:31] INFO huggingface.py:420: Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda:0'}
/home/elicer/ao/.venv/lib/python3.10/site-packages/transformers/quantizers/auto.py:239: UserWarning: You passed `quantization_config` or equivalent parameters to `from_pretrained` but the model you're loading already has a `quantization_config` attribute. The `quantization_config` from the model will be used.
  warnings.warn(warning_msg)
The module name (originally ) is not a valid Python identifier. Please rename the original module to avoid import issues.
Generating train split: 100%|██████████| 40398/40398 [00:00<00:00, 1741218.89 examples/s]
Generating test split: 100%|██████████| 1767/1767 [00:00<00:00, 598508.86 examples/s]
Generating validation split: 100%|██████████| 1267/1267 [00:00<00:00, 453622.12 examples/s]
[2026-01-08 18:04:44] INFO __init__.py:695: Selected tasks:
[2026-01-08 18:04:44] INFO __init__.py:686: Task: winogrande (winogrande/default.yaml)
[2026-01-08 18:04:44] INFO task.py:434: Building contexts for winogrande on rank 0...
100%|██████████| 100/100 [00:00<00:00, 51482.80it/s]
[2026-01-08 18:04:44] INFO evaluator.py:574: Running loglikelihood requests
Running loglikelihood requests: 100%|██████████| 200/200 [00:03<00:00, 54.01it/s]
[2026-01-08 18:04:49] INFO evaluation_tracker.py:209: Saving results aggregated
hf (pretrained=benchmarks/data/quantized_model/meta-llama/Llama-3.2-1B-smoothquant_int8/), gen_kwargs: (None), limit: 100.0, num_fewshot: None, batch_size: 1
|  Tasks   |Version|Filter|n-shot|Metric|   |Value|   |Stderr|
|----------|------:|------|-----:|------|---|----:|---|-----:|
|winogrande|      1|none  |     0|acc   |↑  | 0.63|±  |0.0485|
benchmarking vllm prefill performance with --num_prompts 8 --input_len 1024 --output_len 32 --max_model_len 1056
Skipping import of cpp extensions due to incompatible torch version 2.9.0+cu128 for torchao version 0.15.0 Please see https://github.com/pytorch/ao/issues/2919 for more info
The tokenizer you are loading from 'benchmarks/data/quantized_model/meta-llama/Llama-3.2-1B-smoothquant_int8/' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
When dataset path is not set, it will default to random dataset
INFO 01-08 18:05:02 [datasets.py:612] Sampling input_len from [1023, 1023] and output_len from [32, 32]
INFO 01-08 18:05:02 [utils.py:253] non-default args: {'tokenizer': 'benchmarks/data/quantized_model/meta-llama/Llama-3.2-1B-smoothquant_int8/', 'dtype': 'bfloat16', 'max_model_len': 1056, 'enable_lora': None, 'reasoning_parser_plugin': '', 'model': 'benchmarks/data/quantized_model/meta-llama/Llama-3.2-1B-smoothquant_int8/'}
INFO 01-08 18:05:12 [model.py:514] Resolved architecture: LlamaForCausalLM
INFO 01-08 18:05:12 [model.py:1661] Using max model len 1056
INFO 01-08 18:05:13 [scheduler.py:230] Chunked prefill is enabled with max_num_batched_tokens=8192.
/home/elicer/ao/torchao/core/config.py:253: UserWarning: Stored version is not the same as current default version of the config: stored_version=2, current_default_version=1, please check the deprecation warning
  warnings.warn(
The tokenizer you are loading from 'benchmarks/data/quantized_model/meta-llama/Llama-3.2-1B-smoothquant_int8/' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
Skipping import of cpp extensions due to incompatible torch version 2.9.0+cu128 for torchao version 0.15.0 Please see https://github.com/pytorch/ao/issues/2919 for more info
(EngineCore_DP0 pid=3286) INFO 01-08 18:05:23 [core.py:93] Initializing a V1 LLM engine (v0.13.0) with config: model='benchmarks/data/quantized_model/meta-llama/Llama-3.2-1B-smoothquant_int8/', speculative_config=None, tokenizer='benchmarks/data/quantized_model/meta-llama/Llama-3.2-1B-smoothquant_int8/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=1056, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=torchao, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False), seed=0, served_model_name=benchmarks/data/quantized_model/meta-llama/Llama-3.2-1B-smoothquant_int8/, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False}, 'local_cache_dir': None}
(EngineCore_DP0 pid=3286) INFO 01-08 18:05:23 [parallel_state.py:1203] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.0.2.100:33409 backend=nccl
(EngineCore_DP0 pid=3286) INFO 01-08 18:05:23 [parallel_state.py:1411] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=3286) INFO 01-08 18:05:25 [gpu_model_runner.py:3562] Starting to load model benchmarks/data/quantized_model/meta-llama/Llama-3.2-1B-smoothquant_int8/...
(EngineCore_DP0 pid=3286) /home/elicer/ao/.venv/lib/python3.10/site-packages/torch/__init__.py:1617: UserWarning: Please use the new API settings to control TF32 behavior, such as torch.backends.cudnn.conv.fp32_precision = 'tf32' or torch.backends.cuda.matmul.fp32_precision = 'ieee'. Old settings, e.g, torch.backends.cuda.matmul.allow_tf32 = True, torch.backends.cudnn.allow_tf32 = True, allowTF32CuDNN() and allowTF32CuBLAS() will be deprecated after Pytorch 2.9. Please see https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /pytorch/aten/src/ATen/Context.cpp:80.)
(EngineCore_DP0 pid=3286)   _C._set_float32_matmul_precision(precision)
(EngineCore_DP0 pid=3286) INFO 01-08 18:05:50 [cuda.py:351] Using FLASH_ATTN attention backend out of potential backends: ('FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION')
(EngineCore_DP0 pid=3286) /home/elicer/ao/torchao/core/config.py:253: UserWarning: Stored version is not the same as current default version of the config: stored_version=2, current_default_version=1, please check the deprecation warning
(EngineCore_DP0 pid=3286)   warnings.warn(
Loading pt checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866] EngineCore failed to start.
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866] Traceback (most recent call last):
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866]   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 857, in run_engine_core
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866]     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866]   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 637, in __init__
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866]     super().__init__(
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866]   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 102, in __init__
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866]     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866]   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/executor/abstract.py", line 101, in __init__
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866]     self._init_executor()
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866]   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/executor/uniproc_executor.py", line 48, in _init_executor
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866]     self.driver_worker.load_model()
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866]   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 289, in load_model
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866]     self.model_runner.load_model(eep_scale_up=eep_scale_up)
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866]   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3581, in load_model
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866]     self.model = model_loader.load_model(
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866]   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/model_executor/model_loader/base_loader.py", line 55, in load_model
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866]     self.load_weights(model, model_config)
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866]   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/model_executor/model_loader/default_loader.py", line 305, in load_weights
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866]     loaded_weights = model.load_weights(self.get_all_weights(model_config, model))
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866]   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 640, in load_weights
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866]     return loader.load_weights(
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866]   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/model_executor/model_loader/online_quantization.py", line 173, in patched_model_load_weights
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866]     return original_load_weights(auto_weight_loader, weights, mapper=mapper)
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866]   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/model_executor/models/utils.py", line 335, in load_weights
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866]     autoloaded_weights = set(self._load_module("", self.module, weights))
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866]   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/model_executor/models/utils.py", line 288, in _load_module
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866]     yield from self._load_module(
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866]   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/model_executor/models/utils.py", line 261, in _load_module
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866]     loaded_params = module_load_weights(weights)
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866]   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 497, in load_weights
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866]     weight_loader(param, loaded_weight, shard_id)
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866]   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/model_executor/layers/linear.py", line 1238, in weight_loader
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866]     param_data.copy_(loaded_weight)
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866]   File "/home/elicer/ao/torchao/utils.py", line 634, in _dispatch__torch_function__
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866]     return func(*args, **kwargs)
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866]   File "/home/elicer/ao/torchao/utils.py", line 650, in _dispatch__torch_dispatch__
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866]     return cls._ATEN_OP_TABLE[cls][func](func, types, args, kwargs)
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866]   File "/home/elicer/ao/torchao/utils.py", line 413, in wrapper
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866]     return _func(f, types, args, kwargs)
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866]   File "/home/elicer/ao/torchao/utils.py", line 544, in _
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866]     if _same_metadata(self, src):
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866]   File "/home/elicer/ao/torchao/utils.py", line 503, in _same_metadata
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866]     _tensor_shape_match = all(
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866]   File "/home/elicer/ao/torchao/utils.py", line 504, in <genexpr>
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866]     getattr(self, t_name).shape == getattr(src, t_name).shape
(EngineCore_DP0 pid=3286) ERROR 01-08 18:05:53 [core.py:866] AttributeError: 'Tensor' object has no attribute 'qdata'
(EngineCore_DP0 pid=3286) Process EngineCore_DP0:
(EngineCore_DP0 pid=3286) Traceback (most recent call last):
(EngineCore_DP0 pid=3286)   File "/usr/local/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=3286)     self.run()
(EngineCore_DP0 pid=3286)   File "/usr/local/lib/python3.10/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=3286)     self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=3286)   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 870, in run_engine_core
(EngineCore_DP0 pid=3286)     raise e
(EngineCore_DP0 pid=3286)   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 857, in run_engine_core
(EngineCore_DP0 pid=3286)     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=3286)   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 637, in __init__
(EngineCore_DP0 pid=3286)     super().__init__(
(EngineCore_DP0 pid=3286)   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 102, in __init__
(EngineCore_DP0 pid=3286)     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=3286)   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/executor/abstract.py", line 101, in __init__
(EngineCore_DP0 pid=3286)     self._init_executor()
(EngineCore_DP0 pid=3286)   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/executor/uniproc_executor.py", line 48, in _init_executor
(EngineCore_DP0 pid=3286)     self.driver_worker.load_model()
(EngineCore_DP0 pid=3286)   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 289, in load_model
(EngineCore_DP0 pid=3286)     self.model_runner.load_model(eep_scale_up=eep_scale_up)
(EngineCore_DP0 pid=3286)   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3581, in load_model
(EngineCore_DP0 pid=3286)     self.model = model_loader.load_model(
(EngineCore_DP0 pid=3286)   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/model_executor/model_loader/base_loader.py", line 55, in load_model
(EngineCore_DP0 pid=3286)     self.load_weights(model, model_config)
(EngineCore_DP0 pid=3286)   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/model_executor/model_loader/default_loader.py", line 305, in load_weights
(EngineCore_DP0 pid=3286)     loaded_weights = model.load_weights(self.get_all_weights(model_config, model))
(EngineCore_DP0 pid=3286)   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 640, in load_weights
(EngineCore_DP0 pid=3286)     return loader.load_weights(
(EngineCore_DP0 pid=3286)   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/model_executor/model_loader/online_quantization.py", line 173, in patched_model_load_weights
(EngineCore_DP0 pid=3286)     return original_load_weights(auto_weight_loader, weights, mapper=mapper)
(EngineCore_DP0 pid=3286)   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/model_executor/models/utils.py", line 335, in load_weights
(EngineCore_DP0 pid=3286)     autoloaded_weights = set(self._load_module("", self.module, weights))
(EngineCore_DP0 pid=3286)   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/model_executor/models/utils.py", line 288, in _load_module
(EngineCore_DP0 pid=3286)     yield from self._load_module(
(EngineCore_DP0 pid=3286)   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/model_executor/models/utils.py", line 261, in _load_module
(EngineCore_DP0 pid=3286)     loaded_params = module_load_weights(weights)
(EngineCore_DP0 pid=3286)   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 497, in load_weights
(EngineCore_DP0 pid=3286)     weight_loader(param, loaded_weight, shard_id)
(EngineCore_DP0 pid=3286)   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/model_executor/layers/linear.py", line 1238, in weight_loader
(EngineCore_DP0 pid=3286)     param_data.copy_(loaded_weight)
(EngineCore_DP0 pid=3286)   File "/home/elicer/ao/torchao/utils.py", line 634, in _dispatch__torch_function__
(EngineCore_DP0 pid=3286)     return func(*args, **kwargs)
(EngineCore_DP0 pid=3286)   File "/home/elicer/ao/torchao/utils.py", line 650, in _dispatch__torch_dispatch__
(EngineCore_DP0 pid=3286)     return cls._ATEN_OP_TABLE[cls][func](func, types, args, kwargs)
(EngineCore_DP0 pid=3286)   File "/home/elicer/ao/torchao/utils.py", line 413, in wrapper
(EngineCore_DP0 pid=3286)     return _func(f, types, args, kwargs)
(EngineCore_DP0 pid=3286)   File "/home/elicer/ao/torchao/utils.py", line 544, in _
(EngineCore_DP0 pid=3286)     if _same_metadata(self, src):
(EngineCore_DP0 pid=3286)   File "/home/elicer/ao/torchao/utils.py", line 503, in _same_metadata
(EngineCore_DP0 pid=3286)     _tensor_shape_match = all(
(EngineCore_DP0 pid=3286)   File "/home/elicer/ao/torchao/utils.py", line 504, in <genexpr>
(EngineCore_DP0 pid=3286)     getattr(self, t_name).shape == getattr(src, t_name).shape
(EngineCore_DP0 pid=3286) AttributeError: 'Tensor' object has no attribute 'qdata'
Loading pt checkpoint shards: 0% Completed | 0/1 [00:01<?, ?it/s]
(EngineCore_DP0 pid=3286)
[rank0]:[W108 18:05:53.558061877 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
Traceback (most recent call last):
  File "/home/elicer/ao/.venv/bin/vllm", line 10, in <module>
    sys.exit(main())
  File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/entrypoints/cli/main.py", line 73, in main
    args.dispatch_function(args)
  File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/entrypoints/cli/benchmark/throughput.py", line 21, in cmd
    main(args)
  File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/benchmarks/throughput.py", line 730, in main
    elapsed_time, request_outputs = run_vllm(
  File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/benchmarks/throughput.py", line 51, in run_vllm
    llm = LLM(**dataclasses.asdict(engine_args))
  File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 351, in __init__
    self.llm_engine = LLMEngine.from_engine_args(
  File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/engine/llm_engine.py", line 183, in from_engine_args
    return cls(
  File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/engine/llm_engine.py", line 109, in __init__
    self.engine_core = EngineCoreClient.make_client(
  File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 93, in make_client
    return SyncMPClient(vllm_config, executor_class, log_stats)
  File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 648, in __init__
    super().__init__(
  File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 477, in __init__
    with launch_core_engines(vllm_config, executor_class, log_stats) as (
  File "/usr/local/lib/python3.10/contextlib.py", line 142, in __exit__
    next(self.gen)
  File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/engine/utils.py", line 903, in launch_core_engines
    wait_for_engine_startup(
  File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/engine/utils.py", line 960, in wait_for_engine_startup
    raise RuntimeError(
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
benchmarking vllm decode performance with --num_prompts 32 --input_len 32 --output_len 512 --max_model_len 544
Skipping import of cpp extensions due to incompatible torch version 2.9.0+cu128 for torchao version 0.15.0 Please see https://github.com/pytorch/ao/issues/2919 for more info
The tokenizer you are loading from 'benchmarks/data/quantized_model/meta-llama/Llama-3.2-1B-smoothquant_int8/' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
When dataset path is not set, it will default to random dataset
INFO 01-08 18:06:06 [datasets.py:612] Sampling input_len from [31, 31] and output_len from [512, 512]
INFO 01-08 18:06:06 [utils.py:253] non-default args: {'tokenizer': 'benchmarks/data/quantized_model/meta-llama/Llama-3.2-1B-smoothquant_int8/', 'dtype': 'bfloat16', 'max_model_len': 544, 'enable_lora': None, 'reasoning_parser_plugin': '', 'model': 'benchmarks/data/quantized_model/meta-llama/Llama-3.2-1B-smoothquant_int8/'}
INFO 01-08 18:06:06 [model.py:514] Resolved architecture: LlamaForCausalLM
INFO 01-08 18:06:06 [model.py:1661] Using max model len 544
INFO 01-08 18:06:07 [scheduler.py:230] Chunked prefill is enabled with max_num_batched_tokens=8192.
/home/elicer/ao/torchao/core/config.py:253: UserWarning: Stored version is not the same as current default version of the config: stored_version=2, current_default_version=1, please check the deprecation warning
  warnings.warn(
The tokenizer you are loading from 'benchmarks/data/quantized_model/meta-llama/Llama-3.2-1B-smoothquant_int8/' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
Skipping import of cpp extensions due to incompatible torch version 2.9.0+cu128 for torchao version 0.15.0 Please see https://github.com/pytorch/ao/issues/2919 for more info
(EngineCore_DP0 pid=3790) INFO 01-08 18:06:17 [core.py:93] Initializing a V1 LLM engine (v0.13.0) with config: model='benchmarks/data/quantized_model/meta-llama/Llama-3.2-1B-smoothquant_int8/', speculative_config=None, tokenizer='benchmarks/data/quantized_model/meta-llama/Llama-3.2-1B-smoothquant_int8/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=544, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=torchao, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False), seed=0, served_model_name=benchmarks/data/quantized_model/meta-llama/Llama-3.2-1B-smoothquant_int8/, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False}, 'local_cache_dir': None}
(EngineCore_DP0 pid=3790) INFO 01-08 18:06:18 [parallel_state.py:1203] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.0.2.100:36661 backend=nccl
(EngineCore_DP0 pid=3790) INFO 01-08 18:06:18 [parallel_state.py:1411] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=3790) INFO 01-08 18:06:18 [gpu_model_runner.py:3562] Starting to load model benchmarks/data/quantized_model/meta-llama/Llama-3.2-1B-smoothquant_int8/...
(EngineCore_DP0 pid=3790) /home/elicer/ao/.venv/lib/python3.10/site-packages/torch/__init__.py:1617: UserWarning: Please use the new API settings to control TF32 behavior, such as torch.backends.cudnn.conv.fp32_precision = 'tf32' or torch.backends.cuda.matmul.fp32_precision = 'ieee'. Old settings, e.g, torch.backends.cuda.matmul.allow_tf32 = True, torch.backends.cudnn.allow_tf32 = True, allowTF32CuDNN() and allowTF32CuBLAS() will be deprecated after Pytorch 2.9. Please see https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /pytorch/aten/src/ATen/Context.cpp:80.)
(EngineCore_DP0 pid=3790)   _C._set_float32_matmul_precision(precision)
(EngineCore_DP0 pid=3790) INFO 01-08 18:06:19 [cuda.py:351] Using FLASH_ATTN attention backend out of potential backends: ('FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION')
(EngineCore_DP0 pid=3790) /home/elicer/ao/torchao/core/config.py:253: UserWarning: Stored version is not the same as current default version of the config: stored_version=2, current_default_version=1, please check the deprecation warning
(EngineCore_DP0 pid=3790)   warnings.warn(
Loading pt checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
(EngineCore_DP0 pid=3790) ERROR 01-08 18:06:21 [core.py:866] EngineCore failed to start.
(EngineCore_DP0 pid=3790) ERROR 01-08 18:06:21 [core.py:866] Traceback (most recent call last):
(EngineCore_DP0 pid=3790) ERROR 01-08 18:06:21 [core.py:866]   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 857, in run_engine_core
(EngineCore_DP0 pid=3790) ERROR 01-08 18:06:21 [core.py:866]     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=3790) ERROR 01-08 18:06:21 [core.py:866]   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 637, in __init__
(EngineCore_DP0 pid=3790) ERROR 01-08 18:06:21 [core.py:866]     super().__init__(
(EngineCore_DP0 pid=3790) ERROR 01-08 18:06:21 [core.py:866]   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 102, in __init__
(EngineCore_DP0 pid=3790) ERROR 01-08 18:06:21 [core.py:866]     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=3790) ERROR 01-08 18:06:21 [core.py:866]   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/executor/abstract.py", line 101, in __init__
(EngineCore_DP0 pid=3790) ERROR 01-08 18:06:21 [core.py:866]     self._init_executor()
(EngineCore_DP0 pid=3790) ERROR 01-08 18:06:21 [core.py:866]   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/executor/uniproc_executor.py", line 48, in _init_executor
(EngineCore_DP0 pid=3790) ERROR 01-08 18:06:21 [core.py:866]     self.driver_worker.load_model()
(EngineCore_DP0 pid=3790) ERROR 01-08 18:06:21 [core.py:866]   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 289, in load_model
(EngineCore_DP0 pid=3790) ERROR 01-08 18:06:21 [core.py:866]     self.model_runner.load_model(eep_scale_up=eep_scale_up)
(EngineCore_DP0 pid=3790) ERROR 01-08 18:06:21 [core.py:866]   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3581, in load_model
(EngineCore_DP0 pid=3790) ERROR 01-08 18:06:21 [core.py:866]     self.model = model_loader.load_model(
(EngineCore_DP0 pid=3790) ERROR 01-08 18:06:21 [core.py:866]   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/model_executor/model_loader/base_loader.py", line 55, in load_model
(EngineCore_DP0 pid=3790) ERROR 01-08 18:06:21 [core.py:866]     self.load_weights(model, model_config)
(EngineCore_DP0 pid=3790) ERROR 01-08 18:06:21 [core.py:866]   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/model_executor/model_loader/default_loader.py", line 305, in load_weights
(EngineCore_DP0 pid=3790) ERROR 01-08 18:06:21 [core.py:866]     loaded_weights = model.load_weights(self.get_all_weights(model_config, model))
(EngineCore_DP0 pid=3790) ERROR 01-08 18:06:21 [core.py:866]   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 640, in load_weights
(EngineCore_DP0 pid=3790) ERROR 01-08 18:06:21 [core.py:866]     return loader.load_weights(
(EngineCore_DP0 pid=3790) ERROR 01-08 18:06:21 [core.py:866]   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/model_executor/model_loader/online_quantization.py", line 173, in patched_model_load_weights
(EngineCore_DP0 pid=3790) ERROR 01-08 18:06:21 [core.py:866]     return original_load_weights(auto_weight_loader, weights, mapper=mapper)
(EngineCore_DP0 pid=3790) ERROR 01-08 18:06:21 [core.py:866]   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/model_executor/models/utils.py", line 335, in load_weights
(EngineCore_DP0 pid=3790) ERROR 01-08 18:06:21 [core.py:866]     autoloaded_weights = set(self._load_module("", self.module, weights))
(EngineCore_DP0 pid=3790) ERROR 01-08 18:06:21 [core.py:866]   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/model_executor/models/utils.py", line 288, in _load_module
(EngineCore_DP0 pid=3790) ERROR 01-08 18:06:21 [core.py:866]     yield from self._load_module(
(EngineCore_DP0 pid=3790) ERROR 01-08 18:06:21 [core.py:866]   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/model_executor/models/utils.py", line 261, in _load_module
(EngineCore_DP0 pid=3790) ERROR 01-08 18:06:21 [core.py:866]     loaded_params = module_load_weights(weights)
(EngineCore_DP0 pid=3790) ERROR 01-08 18:06:21 [core.py:866]   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 497, in load_weights
(EngineCore_DP0 pid=3790) ERROR 01-08 18:06:21 [core.py:866]     weight_loader(param, loaded_weight, shard_id)
(EngineCore_DP0 pid=3790) ERROR 01-08 18:06:21 [core.py:866]   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/model_executor/layers/linear.py", line 1238, in weight_loader
(EngineCore_DP0 pid=3790) ERROR 01-08 18:06:21 [core.py:866]     param_data.copy_(loaded_weight)
(EngineCore_DP0 pid=3790) ERROR 01-08 18:06:21 [core.py:866]   File "/home/elicer/ao/torchao/utils.py", line 634, in _dispatch__torch_function__
(EngineCore_DP0 pid=3790) ERROR 01-08 18:06:21 [core.py:866]     return func(*args, **kwargs)
(EngineCore_DP0 pid=3790) ERROR 01-08 18:06:21 [core.py:866]   File "/home/elicer/ao/torchao/utils.py", line 650, in _dispatch__torch_dispatch__
(EngineCore_DP0 pid=3790) ERROR 01-08 18:06:21 [core.py:866]     return cls._ATEN_OP_TABLE[cls][func](func, types, args, kwargs)
(EngineCore_DP0 pid=3790) ERROR 01-08 18:06:21 [core.py:866]   File "/home/elicer/ao/torchao/utils.py", line 413, in wrapper
(EngineCore_DP0 pid=3790) ERROR 01-08 18:06:21 [core.py:866]     return _func(f, types, args, kwargs)
(EngineCore_DP0 pid=3790) ERROR 01-08 18:06:21 [core.py:866]   File "/home/elicer/ao/torchao/utils.py", line 544, in _
(EngineCore_DP0 pid=3790) ERROR 01-08 18:06:21 [core.py:866]     if _same_metadata(self, src):
(EngineCore_DP0 pid=3790) ERROR 01-08 18:06:21 [core.py:866]   File "/home/elicer/ao/torchao/utils.py", line 503, in _same_metadata
(EngineCore_DP0 pid=3790) ERROR 01-08 18:06:21 [core.py:866]     _tensor_shape_match = all(
(EngineCore_DP0 pid=3790) ERROR 01-08 18:06:21 [core.py:866]   File "/home/elicer/ao/torchao/utils.py", line 504, in <genexpr>
(EngineCore_DP0 pid=3790) ERROR 01-08 18:06:21 [core.py:866]     getattr(self, t_name).shape == getattr(src, t_name).shape
(EngineCore_DP0 pid=3790) ERROR 01-08 18:06:21 [core.py:866] AttributeError: 'Tensor' object has no attribute 'qdata'
(EngineCore_DP0 pid=3790) Process EngineCore_DP0:
(EngineCore_DP0 pid=3790) Traceback (most recent call last):
(EngineCore_DP0 pid=3790)   File "/usr/local/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=3790)     self.run()
(EngineCore_DP0 pid=3790)   File "/usr/local/lib/python3.10/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=3790)     self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=3790)   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 870, in run_engine_core
(EngineCore_DP0 pid=3790)     raise e
(EngineCore_DP0 pid=3790)   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 857, in run_engine_core
(EngineCore_DP0 pid=3790)     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=3790)   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 637, in __init__
(EngineCore_DP0 pid=3790)     super().__init__(
(EngineCore_DP0 pid=3790)   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 102, in __init__
(EngineCore_DP0 pid=3790)     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=3790)   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/executor/abstract.py", line 101, in __init__
(EngineCore_DP0 pid=3790)     self._init_executor()
(EngineCore_DP0 pid=3790)   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/executor/uniproc_executor.py", line 48, in _init_executor
(EngineCore_DP0 pid=3790)     self.driver_worker.load_model()
(EngineCore_DP0 pid=3790)   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 289, in load_model
(EngineCore_DP0 pid=3790)     self.model_runner.load_model(eep_scale_up=eep_scale_up)
(EngineCore_DP0 pid=3790)   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3581, in load_model
(EngineCore_DP0 pid=3790)     self.model = model_loader.load_model(
(EngineCore_DP0 pid=3790)   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/model_executor/model_loader/base_loader.py", line 55, in load_model
(EngineCore_DP0 pid=3790)     self.load_weights(model, model_config)
(EngineCore_DP0 pid=3790)   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/model_executor/model_loader/default_loader.py", line 305, in load_weights
(EngineCore_DP0 pid=3790)     loaded_weights = model.load_weights(self.get_all_weights(model_config, model))
(EngineCore_DP0 pid=3790)   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 640, in load_weights
(EngineCore_DP0 pid=3790)     return loader.load_weights(
(EngineCore_DP0 pid=3790)   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/model_executor/model_loader/online_quantization.py", line 173, in patched_model_load_weights
(EngineCore_DP0 pid=3790)     return original_load_weights(auto_weight_loader, weights, mapper=mapper)
(EngineCore_DP0 pid=3790)   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/model_executor/models/utils.py", line 335, in load_weights
(EngineCore_DP0 pid=3790)     autoloaded_weights = set(self._load_module("", self.module, weights))
(EngineCore_DP0 pid=3790)   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/model_executor/models/utils.py", line 288, in _load_module
(EngineCore_DP0 pid=3790)     yield from self._load_module(
(EngineCore_DP0 pid=3790)   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/model_executor/models/utils.py", line 261, in _load_module
(EngineCore_DP0 pid=3790)     loaded_params = module_load_weights(weights)
(EngineCore_DP0 pid=3790)   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 497, in load_weights
(EngineCore_DP0 pid=3790)     weight_loader(param, loaded_weight, shard_id)
(EngineCore_DP0 pid=3790)   File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/model_executor/layers/linear.py", line 1238, in weight_loader
(EngineCore_DP0 pid=3790)     param_data.copy_(loaded_weight)
(EngineCore_DP0 pid=3790)   File "/home/elicer/ao/torchao/utils.py", line 634, in _dispatch__torch_function__
(EngineCore_DP0 pid=3790)     return func(*args, **kwargs)
(EngineCore_DP0 pid=3790)   File "/home/elicer/ao/torchao/utils.py", line 650, in _dispatch__torch_dispatch__
(EngineCore_DP0 pid=3790)     return cls._ATEN_OP_TABLE[cls][func](func, types, args, kwargs)
(EngineCore_DP0 pid=3790)   File "/home/elicer/ao/torchao/utils.py", line 413, in wrapper
(EngineCore_DP0 pid=3790)     return _func(f, types, args, kwargs)
(EngineCore_DP0 pid=3790)   File "/home/elicer/ao/torchao/utils.py", line 544, in _
(EngineCore_DP0 pid=3790)     if _same_metadata(self, src):
(EngineCore_DP0 pid=3790)   File "/home/elicer/ao/torchao/utils.py", line 503, in _same_metadata
(EngineCore_DP0 pid=3790)     _tensor_shape_match = all(
(EngineCore_DP0 pid=3790)   File "/home/elicer/ao/torchao/utils.py", line 504, in <genexpr>
(EngineCore_DP0 pid=3790)     getattr(self, t_name).shape == getattr(src, t_name).shape
(EngineCore_DP0 pid=3790) AttributeError: 'Tensor' object has no attribute 'qdata'
Loading pt checkpoint shards: 0% Completed | 0/1 [00:01<?, ?it/s]
(EngineCore_DP0 pid=3790)
[rank0]:[W108 18:06:22.467192896 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
Traceback (most recent call last):
  File "/home/elicer/ao/.venv/bin/vllm", line 10, in <module>
    sys.exit(main())
  File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/entrypoints/cli/main.py", line 73, in main
    args.dispatch_function(args)
  File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/entrypoints/cli/benchmark/throughput.py", line 21, in cmd
    main(args)
  File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/benchmarks/throughput.py", line 730, in main
    elapsed_time, request_outputs = run_vllm(
  File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/benchmarks/throughput.py", line 51, in run_vllm
    llm = LLM(**dataclasses.asdict(engine_args))
  File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 351, in __init__
    self.llm_engine = LLMEngine.from_engine_args(
  File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/engine/llm_engine.py", line 183, in from_engine_args
    return cls(
  File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/engine/llm_engine.py", line 109, in __init__
    self.engine_core = EngineCoreClient.make_client(
  File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 93, in make_client
    return SyncMPClient(vllm_config, executor_class, log_stats)
  File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 648, in __init__
    super().__init__(
  File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 477, in __init__
    with launch_core_engines(vllm_config, executor_class, log_stats) as (
  File "/usr/local/lib/python3.10/contextlib.py", line 142, in __exit__
    next(self.gen)
  File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/engine/utils.py", line 903, in launch_core_engines
    wait_for_engine_startup(
  File "/home/elicer/ao/.venv/lib/python3.10/site-packages/vllm/v1/engine/utils.py", line 960, in wait_for_engine_startup
    raise RuntimeError(
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
Library Versions:
================================================================================
torch.__version__: 2.9.0+cu128
torch.cuda.get_device_name(): NVIDIA A100 80GB PCIe MIG 2g.20gb
torchao.__version__: 0.15.0
vllm.__version__: 0.13.0

Quantization Recipe Results:
================================================================================
+------------------+--------------+--------------+--------------+--------------+-----------+----------+-----------+-----------+
| Recipe           | Checkpoint   | Wikitext     | Winogrande   | Winogrande   | Prefill   | Decode   | Speedup   | Speedup   |
|                  | (GB)         | Perplexity   | Acc          | Stderr       | toks/s    | toks/s   | Prefill   | Decode    |
+==================+==============+==============+==============+==============+===========+==========+===========+===========+
| smoothquant_int8 | 2.49         |              | 0.63         | 0.0485       |           |          |           |           |
+------------------+--------------+--------------+--------------+--------------+-----------+----------+-----------+-----------+
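Both the prefill and decode runs fail identically: vLLM's weight_loader calls param_data.copy_(loaded_weight), torchao intercepts the copy on its tensor subclass and runs a _same_metadata check that reads component tensors such as qdata from both sides of the copy; here one side is a plain torch.Tensor, so the getattr raises. A minimal self-contained sketch of that mismatch (QuantizedWeight below is a hypothetical stand-in, not torchao's real subclass):

```python
import torch

# Hypothetical stand-in for a torchao tensor subclass that stores its int8
# payload in `qdata` plus a `scale`; not the real torchao implementation.
class QuantizedWeight:
    tensor_data_names = ["qdata", "scale"]

    def __init__(self, qdata: torch.Tensor, scale: torch.Tensor):
        self.qdata = qdata
        self.scale = scale

def same_metadata(dst, src) -> bool:
    # Mirrors the shape comparison in torchao/utils.py::_same_metadata (sketch):
    # both sides are assumed to expose the same component tensors.
    return all(
        getattr(dst, name).shape == getattr(src, name).shape
        for name in dst.tensor_data_names
    )

dst = QuantizedWeight(torch.zeros(8, 8, dtype=torch.int8), torch.ones(8))
src = torch.zeros(8, 8, dtype=torch.bfloat16)  # plain tensor, as handed over by the loader

same_metadata(dst, src)  # AttributeError: 'Tensor' object has no attribute 'qdata'
```

The sketch suggests the failure is a type mismatch between what the checkpoint loader hands over and what the torchao subclass expects during copy_, rather than a shape problem.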