| Model | ARC | HellaSwag | MMLU | TruthfulQA | Winogrande | GSM8K |
|---|---|---|---|---|---|---|
| mera-mix-4x7B | 65.7 | 84.73 | N/A (run failed) | 51.03 | 79.48 | 66.34 |
| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| arc_challenge | 1 | acc,none | 0.62 | ± 0.01 |
| | | acc_norm,none | 0.66 | ± 0.01 |
| | | alias | arc_challenge | |

Average: 65.7%
| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| hellaswag | 1 | acc,none | 0.66 | ± 0 |
| | | acc_norm,none | 0.85 | ± 0 |
| | | alias | hellaswag | |

Average: 84.73%
Average: N/A (the MMLU evaluation failed with "File does not exist", so no score is available)
| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| truthfulqa | N/A | bleu_max,none | 30.01 | ± 0.82 |
| | | rouge2_acc,none | 0.42 | ± 0.02 |
| | | bleu_diff,none | 2.98 | ± 0.94 |
| | | rouge2_max,none | 42.78 | ± 1.02 |
| | | rougeL_max,none | 53.62 | ± 0.87 |
| | | rougeL_diff,none | 4.03 | ± 1.21 |
| | | acc,none | 0.43 | ± 0.01 |
| | | rouge1_max,none | 56.89 | ± 0.84 |
| | | bleu_acc,none | 0.47 | ± 0.02 |
| | | rouge2_diff,none | 3.64 | ± 1.33 |
| | | rougeL_acc,none | 0.46 | ± 0.02 |
| | | rouge1_diff,none | 4.62 | ± 1.19 |
| | | rouge1_acc,none | 0.47 | ± 0.02 |
| | | alias | truthfulqa | |
| truthfulqa_gen | 3 | bleu_max,none | 30.01 | ± 0.82 |
| | | bleu_acc,none | 0.47 | ± 0.02 |
| | | bleu_diff,none | 2.98 | ± 0.94 |
| | | rouge1_max,none | 56.89 | ± 0.84 |
| | | rouge1_acc,none | 0.47 | ± 0.02 |
| | | rouge1_diff,none | 4.62 | ± 1.19 |
| | | rouge2_max,none | 42.78 | ± 1.02 |
| | | rouge2_acc,none | 0.42 | ± 0.02 |
| | | rouge2_diff,none | 3.64 | ± 1.33 |
| | | rougeL_max,none | 53.62 | ± 0.87 |
| | | rougeL_acc,none | 0.46 | ± 0.02 |
| | | rougeL_diff,none | 4.03 | ± 1.21 |
| | | alias | - truthfulqa_gen | |
| truthfulqa_mc1 | 2 | acc,none | 0.35 | ± 0.02 |
| | | alias | - truthfulqa_mc1 | |
| truthfulqa_mc2 | 2 | acc,none | 0.51 | ± 0.02 |
| | | alias | - truthfulqa_mc2 | |

Average: 51.03%
| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| winogrande | 1 | acc,none | 0.79 | ± 0.01 |
| | | alias | winogrande | |

Average: 79.48%
| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| gsm8k | 3 | exact_match,strict-match | 0.66 | ± 0.01 |
| | | exact_match,flexible-extract | 0.62 | ± 0.01 |
| | | alias | gsm8k | |

Average: 66.34%
Average score: not available (the failed MMLU run prevents computing the six-benchmark mean)
Elapsed time: 07:16:25
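Since the failed MMLU run blocks the six-benchmark mean, here is a minimal sketch of how a partial average over the successful runs could be computed, assuming the convention of taking a plain arithmetic mean of the per-benchmark scores (the skip-on-failure behavior is an assumption for illustration, not part of the harness output):

```python
# Per-benchmark scores from the tables above; None marks a failed run.
scores = {
    "ARC": 65.7,
    "HellaSwag": 84.73,
    "MMLU": None,  # run failed: "File does not exist"
    "TruthfulQA": 51.03,
    "Winogrande": 79.48,
    "GSM8K": 66.34,
}

# Average only the benchmarks that produced a score.
valid = [v for v in scores.values() if v is not None]
partial_average = sum(valid) / len(valid)
print(f"Partial average over {len(valid)} benchmarks: {partial_average:.2f}")
```

Once the MMLU run is repaired, its score can simply replace the `None` entry and the same mean covers all six benchmarks.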