| Model | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
|---|---|---|---|---|---|
| dolphin-2.8-mistral-7b-v02 | 38.99 | 72.22 | 51.96 | 40.41 | 50.9 |

| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| agieval_aqua_rat | 0 | acc | 21.65 | ± 2.59 |
| | | acc_norm | 20.47 | ± 2.54 |
| agieval_logiqa_en | 0 | acc | 35.79 | ± 1.88 |

| Model | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
|---|---|---|---|---|---|
| Hermes-2-Pro-Mistral-7B | 44.54 | 71.2 | 59.12 | 41.9 | 54.19 |

| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| agieval_aqua_rat | 0 | acc | 23.23 | ± 2.65 |
| | | acc_norm | 22.83 | ± 2.64 |
| agieval_logiqa_en | 0 | acc | 38.40 | ± 1.91 |

| Model | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
|---|---|---|---|---|---|
| Einstein-v4-7B | 37.83 | 67.52 | 55.56 | 38.78 | 49.92 |

| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| agieval_aqua_rat | 0 | acc | 23.62 | ± 2.67 |
| | | acc_norm | 22.83 | ± 2.64 |
| agieval_logiqa_en | 0 | acc | 37.33 | ± 1.90 |

| Model | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
|---|---|---|---|---|---|
| AlphaMonarch-daser | 45.48 | 76.95 | 78.46 | 50.21 | 62.77 |

| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| agieval_aqua_rat | 0 | acc | 28.35 | ± 2.83 |
| | | acc_norm | 26.38 | ± 2.77 |
| agieval_logiqa_en | 0 | acc | 38.71 | ± 1.91 |

| Model | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
|---|---|---|---|---|---|
| FrankenMonarch-7B | 45.1 | 75.53 | 73.86 | 46.79 | 60.32 |

| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| agieval_aqua_rat | 0 | acc | 25.59 | ± 2.74 |
| | | acc_norm | 25.98 | ± 2.76 |
| agieval_logiqa_en | 0 | acc | 39.02 | ± 1.91 |

| Model | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
|---|---|---|---|---|---|
| UltraMerge-7B | 44.36 | 77.15 | 78.47 | 49.35 | 62.33 |

| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| agieval_aqua_rat | 0 | acc | 27.56 | ± 2.81 |
| | | acc_norm | 23.23 | ± 2.65 |
| agieval_logiqa_en | 0 | acc | 39.48 | ± 1.92 |

| Model | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
|---|---|---|---|---|---|
| Beyonder-4x7B-v3 | 45.85 | 76.67 | 74.98 | 50.12 | 61.91 |

| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| agieval_aqua_rat | 0 | acc | 26.38 | ± 2.77 |
| | | acc_norm | 24.02 | ± 2.69 |
| agieval_logiqa_en | 0 | acc | 39.48 | ± 1.92 |

| Model | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
|---|---|---|---|---|---|
| Kunoichi-DPO-v2-7B | 44.79 | 75.05 | 65.68 | 47.65 | 58.29 |

| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| agieval_aqua_rat | 0 | acc | 26.38 | ± 2.77 |
| | | acc_norm | 24.02 | ± 2.69 |
| agieval_logiqa_en | 0 | acc | 38.71 | ± 1.91 |

| Model | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
|---|---|---|---|---|---|
| CodeNinja-1.0-OpenChat-7B | 39.98 | 71.77 | 48.73 | 40.92 | 50.35 |

| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| agieval_aqua_rat | 0 | acc | 27.17 | ± 2.80 |
| | | acc_norm | 26.38 | ± 2.77 |
| agieval_logiqa_en | 0 | acc | 38.10 | ± 1.90 |

| Model | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
|---|---|---|---|---|---|
| AlphaMonarch-dora | 45.42 | 76.93 | 78.48 | 50.18 | 62.75 |

| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| agieval_aqua_rat | 0 | acc | 28.35 | ± 2.83 |
| | | acc_norm | 26.38 | ± 2.77 |
| agieval_logiqa_en | 0 | acc | 38.71 | ± 1.91 |
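
The per-task rows above (task name, task version, `acc`/`acc_norm` with standard errors) follow the output layout of EleutherAI's lm-evaluation-harness, which is commonly used to produce this benchmark suite. Below is a minimal, hedged sketch of how such numbers can be reproduced with the harness's Python API (lm-eval v0.4+); the checkpoint name is a placeholder, and the exact metric key names vary by harness version.

```python
# Minimal sketch: evaluate a model on the AGIEval subtasks shown above
# using lm-evaluation-harness (pip install lm-eval). Assumptions:
# the checkpoint path is a placeholder, and zero-shot settings are used.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # HuggingFace transformers backend
    model_args="pretrained=mlabonne/AlphaMonarch-7B",  # placeholder checkpoint
    tasks=["agieval_aqua_rat", "agieval_logiqa_en"],   # tasks from the tables
    num_fewshot=0,   # zero-shot (an assumption; match your own protocol)
    batch_size=8,
)

# Each task entry holds point estimates and stderrs, which correspond to
# the Value and Stderr columns in the tables above.
for task, metrics in results["results"].items():
    print(task, metrics)
```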