@yiliu30
Created January 21, 2026 03:20
# Alternative model kept as a commented-out option.
# model_path="/dataset/auto-round/qwen_moe/"
model_path="/storage/yiliu7/meta-llama/Llama-3.1-8B-Instruct"

# Alternative tasks kept as commented-out options; the last assignment wins.
# taskname=gsm8k
# taskname=longbench_hotpotqa
# taskname=longbench2_govt_single
taskname=longbench

timestamp=$(date +%Y%m%d_%H%M%S)
MAX_MODEL_LEN=40960
max_length=${MAX_MODEL_LEN}
max_gen_toks=2048
EVAL_LOG_NAME="eval_${taskname}_${timestamp}"
mkdir -p benchmark_logs
# VLLM_ATTENTION_BACKEND=TORCH_SDPA \
VLLM_KERNEL_OVERRIDE_BATCH_INVARIANT=1 \
VLLM_ENABLE_V1_MULTIPROCESSING=0 \
VLLM_ALLREDUCE_USE_SYMM_MEM=0 NCCL_NVLS_ENABLE=0 \
HF_ALLOW_CODE_EVAL=1 \
lm_eval --model vllm \
--tasks $taskname \
--model_args pretrained=${model_path},trust_remote_code=True,dtype=bfloat16,max_model_len=${max_length},tensor_parallel_size=4,gpu_memory_utilization=0.75,enable_prefix_caching=False \
--confirm_run_unsafe_code \
--seed 42 \
--batch_size 128 \
--apply_chat_template \
--gen_kwargs '{"temperature":0.0}' \
--output_path "benchmark_logs/$EVAL_LOG_NAME" \
2>&1 | tee "benchmark_logs/${EVAL_LOG_NAME}.log"
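The `--gen_kwargs` value must survive shell quoting as valid JSON. A quick way to sanity-check the quoted string before a long run (a standalone check, not part of lm_eval; `python3 -m json.tool` stands in for the harness's own parser):

```shell
# Confirm the --gen_kwargs string is valid JSON once shell quoting resolves.
gen_kwargs='{"temperature":0.0}'
echo "$gen_kwargs" | python3 -m json.tool
```

If the quoting is broken, `json.tool` exits non-zero instead of pretty-printing the object.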

yiliu30 commented Jan 21, 2026:

# vllm (pretrained=/storage/yiliu7/meta-llama/Llama-3.1-8B-Instruct,trust_remote_code=True,dtype=bfloat16,max_model_len=40960,tensor_parallel_size=2,gpu_memory_utilization=0.75,enable_prefix_caching=False), gen_kwargs: ({'temperature': 0.0}), limit: None, num_fewshot: None, batch_size: 128
# |              Tasks               |Version|Filter|n-shot|       Metric       |   |Value |   |Stderr|
# |----------------------------------|------:|------|-----:|--------------------|---|-----:|---|-----:|
# | - Code Completion                |      0|none  |      |score               |↑  |0.1807|±  |0.0029|
# |  - longbench_lcc                 |      5|none  |     0|code_sim_score      |↑  |0.1885|±  |0.0036|
# |  - longbench_repobench-p         |      5|none  |     0|code_sim_score      |↑  |0.1729|±  |0.0047|
# | - Few-shot Learning              |      0|none  |      |score               |↑  |0.4591|±  |0.0101|
# |  - longbench_lsht                |      5|none  |     0|classification_score|↑  |0.0050|±  |0.0050|
# |  - longbench_samsum              |      5|none  |     0|rouge_score         |↑  |0.2244|±  |0.0141|
# |  - longbench_trec                |      5|none  |     0|classification_score|↑  |0.6900|±  |0.0328|
# |  - longbench_triviaqa            |      5|none  |     0|qa_f1_score         |↑  |0.9172|±  |0.0179|
# | - Multi-Document QA              |      0|none  |      |score               |↑  |0.4250|±  |0.0139|
# |  - longbench_2wikimqa            |      5|none  |     0|qa_f1_score         |↑  |0.4966|±  |0.0327|
# |  - longbench_dureader            |      5|none  |     0|rouge_zh_score      |↑  |0.3045|±  |0.0132|
# |  - longbench_hotpotqa            |      5|none  |     0|qa_f1_score         |↑  |0.5808|±  |0.0309|
# |  - longbench_musique             |      5|none  |     0|qa_f1_score         |↑  |0.3181|±  |0.0300|
# | - Single-Document QA             |      0|none  |      |score               |↑  |0.4780|±  |0.0125|
# |  - longbench_multifieldqa_en     |      5|none  |     0|qa_f1_score         |↑  |0.5649|±  |0.0277|
# |  - longbench_multifieldqa_zh     |      5|none  |     0|qa_f1_zh_score      |↑  |0.6089|±  |0.0256|
# |  - longbench_narrativeqa         |      5|none  |     0|qa_f1_score         |↑  |0.2836|±  |0.0220|
# |  - longbench_qasper              |      5|none  |     0|qa_f1_score         |↑  |0.4545|±  |0.0253|
# | - Summarization                  |      0|none  |      |score               |↑  |0.2322|±  |0.0034|
# |  - longbench_gov_report          |      5|none  |     0|rouge_score         |↑  |0.2713|±  |0.0100|
# |  - longbench_multi_news          |      5|none  |     0|rouge_score         |↑  |0.2656|±  |0.0051|
# |  - longbench_qmsum               |      5|none  |     0|rouge_score         |↑  |0.2543|±  |0.0052|
# |  - longbench_vcsum               |      5|none  |     0|rouge_zh_score      |↑  |0.1378|±  |0.0058|
# | - Synthetic Tasks                |      0|none  |      |score               |↑  |0.6863|±  |0.0082|
# |  - longbench_passage_count       |      5|none  |     0|count_score         |↑  |0.1008|±  |0.0213|
# |  - longbench_passage_retrieval_en|      5|none  |     0|retrieval_score     |↑  |1.0000|±  |0.0000|
# |  - longbench_passage_retrieval_zh|      5|none  |     0|retrieval_zh_score  |↑  |0.9579|±  |0.0123|

# |       Groups        |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
# |---------------------|------:|------|------|------|---|-----:|---|-----:|
# | - Code Completion   |      0|none  |      |score |↑  |0.1807|±  |0.0029|
# | - Few-shot Learning |      0|none  |      |score |↑  |0.4591|±  |0.0101|
# | - Multi-Document QA |      0|none  |      |score |↑  |0.4250|±  |0.0139|
# | - Single-Document QA|      0|none  |      |score |↑  |0.4780|±  |0.0125|
# | - Summarization     |      0|none  |      |score |↑  |0.2322|±  |0.0034|
# | - Synthetic Tasks   |      0|none  |      |score |↑  |0.6863|±  |0.0082|
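
The six group scores above can be collapsed into a single headline number with a plain macro-average (an illustrative aggregation; lm_eval does not report this figure itself):

```shell
# Macro-average the six group scores from the summary table above.
printf '%s\n' 0.1807 0.4591 0.4250 0.4780 0.2322 0.6863 \
  | awk '{s += $1} END {printf "%.4f\n", s/NR}'
# → 0.4102
```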
