baberabb/results.md

## results.md

MathVista (100 samples)

llama (reported): 51.5

hf-multimodal (pretrained=meta-llama/Llama-3.2-11B-Vision-Instruct), gen_kwargs: (None), limit: 50.0, num_fewshot: None, batch_size: 8


|  Tasks  |Version|    Filter    |n-shot|Metric|   |Value|   |Stderr|
|---------|------:|--------------|-----:|------|---|----:|---|-----:|
|mathvista|      1|extract_answer|     0|acc   |↑  | 0.46|±  |0.0712|


hf-multimodal (pretrained=llava-hf/llava-onevision-qwen2-7b-ov-chat-hf), gen_kwargs: (None), limit: 50.0, num_fewshot: None, batch_size: 8


|  Tasks  |Version|    Filter    |n-shot|Metric|   |Value|   |Stderr|
|---------|------:|--------------|-----:|------|---|----:|---|-----:|
|mathvista|      1|extract_answer|     0|acc   |↑  | 0.62|±  |0.0693|
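
The Stderr values are consistent with the sample standard error of per-example 0/1 scores, which simplifies to sqrt(p(1 - p)/(n - 1)). This is a minimal consistency check, not the harness's own code. It assumes that formula and takes n = 50 from `limit: 50.0`. The helper name `binary_stderr` is hypothetical. With those assumptions it reproduces both MathVista entries:

```python
import math


def binary_stderr(accuracy: float, n: int) -> float:
    """Standard error of a mean over n per-example 0/1 scores.

    Uses the sample standard deviation (n - 1 in the denominator), which
    for 0/1 scores simplifies to sqrt(p * (1 - p) / (n - 1)).
    """
    if n < 2:
        raise ValueError("need at least two examples")
    return math.sqrt(accuracy * (1.0 - accuracy) / (n - 1))


# MathVista runs above: limit 50.0, so n = 50 examples per model.
for model, acc in [("Llama-3.2-11B-Vision-Instruct", 0.46),
                   ("llava-onevision-qwen2-7b-ov-chat-hf", 0.62)]:
    print(f"{model}: {acc:.2f} ± {binary_stderr(acc, 50):.4f}")
```

With only 50 examples, a 95% interval is roughly ±1.96 × 0.07 ≈ ±0.14. That is worth keeping in mind when comparing the two MathVista runs.
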


ChartQA

hf-multimodal (pretrained=llava-hf/llava-onevision-qwen2-7b-ov-chat-hf,attn_implementation=flash_attention_2), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 8


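| Tasks |Version|Filter|n-shot|     Metric      |   |Value |   |Stderr|
|-------|------:|------|-----:|-----------------|---|-----:|---|-----:|
|chartqa|      0|none  |     0|anywhere_accuracy|↑  |0.7136|±  |0.0090|
|       |       |none  |     0|exact_match      |↑  |0.5912|±  |0.0098|
|       |       |none  |     0|relaxed_accuracy |↑  |0.6712|±  |0.0094|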


llama (reported): 83.4

hf-multimodal (pretrained=meta-llama/Llama-3.2-11B-Vision-Instruct), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 8


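|    Tasks    |Version|Filter|n-shot|     Metric      |   |Value |   |Stderr|
|-------------|------:|------|-----:|-----------------|---|-----:|---|-----:|
|chartqa_llama|      0|none  |     0|anywhere_accuracy|↑  |0.7704|±  |0.0084|
|             |       |none  |     0|exact_match      |↑  |0.0776|±  |0.0054|
|             |       |none  |     0|relaxed_accuracy |↑  |0.7132|±  |0.0090|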
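
ChartQA's relaxed accuracy, as defined by the benchmark, accepts a numeric answer within 5% of the gold value and otherwise requires an exact match. The sketch below illustrates that rule only. It is not the harness's implementation. The helper names, the trailing-`%` handling, and the case-insensitive string comparison are all assumptions:

```python
from typing import List, Optional


def _to_float(text: str) -> Optional[float]:
    """Parse a numeric answer, ignoring a trailing '%' (an assumption)."""
    try:
        return float(text.strip().rstrip("%"))
    except ValueError:
        return None


def relaxed_match(target: str, prediction: str, tolerance: float = 0.05) -> bool:
    """Numeric answers: within `tolerance` relative error. Otherwise: exact match."""
    t, p = _to_float(target), _to_float(prediction)
    if t is not None and p is not None and t != 0:
        return abs(p - t) / abs(t) <= tolerance
    # Non-numeric (or zero) targets fall back to case-insensitive string equality.
    return prediction.strip().lower() == target.strip().lower()


def relaxed_accuracy(targets: List[str], predictions: List[str]) -> float:
    """Fraction of examples that pass relaxed_match."""
    scores = [relaxed_match(t, p) for t, p in zip(targets, predictions)]
    return sum(scores) / len(scores)


print(relaxed_accuracy(["42", "45", "Yes"], ["43", "50", "yes"]))
```

Here "43" passes against "42" (about 2.4% off), "50" fails against "45" (about 11% off), and "yes" matches "Yes". The tolerance is why relaxed_accuracy can sit well above exact_match when a model returns nearly right numbers.
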


ai2d

llama (reported): 91.1

hf-multimodal (pretrained=meta-llama/Llama-3.2-11B-Vision-Instruct), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 8


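|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|ai2d |      0|flexible-extract|     0|exact_match|↑  |0.7529|±  |0.0078|
|     |       |strict-match    |     0|exact_match|↑  |0.5058|±  |0.0090|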
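
Both ai2d rows use the same metric, exact_match. The Filter column names the post-processing applied to the model's generation before it is scored. The example below shows how a lenient extractor can score higher than a strict one on the same outputs. It assumes lettered answer options A to D. The regexes, function names, and sample responses are invented for illustration and are not taken from lm-evaluation-harness's ai2d filters:

```python
import re
from typing import List, Optional

# Hypothetical patterns, not the harness's actual ai2d filters.
STRICT = re.compile(r"^\s*\(?([A-D])\)?\.?\s*$")  # reply must be a bare option letter
FLEXIBLE = re.compile(r"\b([A-D])\b")             # any standalone capital A-D


def strict_extract(response: str) -> Optional[str]:
    m = STRICT.match(response)
    return m.group(1) if m else None


def flexible_extract(response: str) -> Optional[str]:
    # Take the last standalone letter so a leading article "A" is
    # usually overridden by the final answer.
    matches = FLEXIBLE.findall(response)
    return matches[-1] if matches else None


def exact_match(extracted: List[Optional[str]], targets: List[str]) -> float:
    return sum(e == t for e, t in zip(extracted, targets)) / len(targets)


responses = ["B", "The answer is B.", "A bird eats the fish, so the answer is (C)."]
targets = ["B", "B", "C"]

strict = [strict_extract(r) for r in responses]
flexible = [flexible_extract(r) for r in responses]
print("strict-match:", exact_match(strict, targets))
print("flexible-extract:", exact_match(flexible, targets))
```

Only the bare "B" survives the strict pattern. The flexible extractor recovers the letter from all three replies, so the same generations score 1/3 and 1.0 respectively.
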