| Model | AGIEval | GPT4All | TruthfulQA | Bigbench |
|---|---|---|---|---|
| Llama-3.2-3B | 25.76 | Error: File does not exist | 39.22 | 34.61 |
| Task | Version | Metric | Value | | Stderr |
|---|---|---|---|---|---|
| agieval_aqua_rat | 0 | acc | 20.87 | ± | 2.55 |
| | | acc_norm | 23.23 | ± | 2.65 |
| agieval_logiqa_en | 0 | acc | 23.96 | ± | 1.67 |
```python
#!/usr/bin/env python3
"""
Refactored Q&A Dataset Generation Script
========================================

Features:
- Separate configuration for generator vs. judge (API keys, endpoints, and models).
- Environment-variable and CLI-driven configuration.
- Consistent use of pathlib for file paths.
- Modular logging with debug mode.
"""

import os
import requests
import random
import logging
import re
import time
import json

import matplotlib
matplotlib.use('Agg')  # Set the backend to 'Agg' before importing pyplot
import matplotlib.pyplot as plt
```
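The feature list above mentions keeping generator and judge configuration separate (API keys, endpoints, and models). A minimal sketch of how that split could look; the `GENERATOR_*`/`JUDGE_*` environment-variable names and the default values here are assumptions, not taken from the script:

```python
import os
from dataclasses import dataclass

@dataclass
class EndpointConfig:
    """One endpoint's settings: key, base URL, and model name."""
    api_key: str
    base_url: str
    model: str

def load_config(role: str) -> EndpointConfig:
    """Read one role's settings from ROLE_API_KEY / ROLE_BASE_URL / ROLE_MODEL.

    The prefix convention and defaults are illustrative assumptions.
    """
    prefix = role.upper()
    return EndpointConfig(
        api_key=os.environ.get(f"{prefix}_API_KEY", ""),
        base_url=os.environ.get(f"{prefix}_BASE_URL", "http://localhost:8000/v1"),
        model=os.environ.get(f"{prefix}_MODEL", "local-model"),
    )

# Generator and judge are configured independently, so they can point
# at different endpoints or models.
generator = load_config("generator")
judge = load_config("judge")
```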
```bash
#!/bin/bash

# Functions
install_basic_packages() {
    echo "Installing basic packages..."
    apt update -y && apt install -y screen nano git git-lfs speedometer htop libaio-dev || {
        echo "Failed to install basic packages" >&2
        exit 1
    }
}
```
| Model | AGIEval | GPT4All | TruthfulQA | Bigbench |
|---|---|---|---|---|
| Llama-3.2-3B-DPO | 27.06 | Error: File does not exist | 58.93 | 34.96 |
| Task | Version | Metric | Value | | Stderr |
|---|---|---|---|---|---|
| agieval_aqua_rat | 0 | acc | 18.90 | ± | 2.46 |
| | | acc_norm | 20.87 | ± | 2.55 |
| agieval_logiqa_en | 0 | acc | 26.11 | ± | 1.72 |
| Model | AGIEval | GPT4All | TruthfulQA | Bigbench |
|---|---|---|---|---|
| Llama3-8B-function-calling-uncensored-dareties | 39.15 | Error: File does not exist | 54.99 | 42.52 |
| Task | Version | Metric | Value | | Stderr |
|---|---|---|---|---|---|
| agieval_aqua_rat | 0 | acc | 24.41 | ± | 2.70 |
| | | acc_norm | 23.23 | ± | 2.65 |
| agieval_logiqa_en | 0 | acc | 34.56 | ± | 1.87 |
| Model | AGIEval | GPT4All | TruthfulQA | Bigbench |
|---|---|---|---|---|
| Llama3-8B-function-calling-dpo-slerp | 39.52 | Error: File does not exist | 56.01 | 42.8 |
| Task | Version | Metric | Value | | Stderr |
|---|---|---|---|---|---|
| agieval_aqua_rat | 0 | acc | 25.98 | ± | 2.76 |
| | | acc_norm | 23.62 | ± | 2.67 |
| agieval_logiqa_en | 0 | acc | 38.25 | ± | 1.91 |
| Model | AGIEval | GPT4All | TruthfulQA | Bigbench |
|---|---|---|---|---|
| Hermes-3-Llama-3.1-8B | 41.51 | Error: File does not exist | 58.61 | 43.08 |
| Task | Version | Metric | Value | | Stderr |
|---|---|---|---|---|---|
| agieval_aqua_rat | 0 | acc | 26.38 | ± | 2.77 |
| | | acc_norm | 25.20 | ± | 2.73 |
| agieval_logiqa_en | 0 | acc | 39.02 | ± | 1.91 |
| Model | AGIEval | GPT4All | TruthfulQA | Bigbench |
|---|---|---|---|---|
| Llama3-8B-DPO | 41.87 | Error: File does not exist | 71.38 | 44.5 |
| Task | Version | Metric | Value | | Stderr |
|---|---|---|---|---|---|
| agieval_aqua_rat | 0 | acc | 21.65 | ± | 2.59 |
| | | acc_norm | 20.47 | ± | 2.54 |
| agieval_logiqa_en | 0 | acc | 40.71 | ± | 1.93 |
| Model | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
|---|---|---|---|---|---|
| Phi-3-mini-4k-instruct | 44.44 | 71.88 | 57.77 | 41.9 | 54 |
| Task | Version | Metric | Value | | Stderr |
|---|---|---|---|---|---|
| agieval_aqua_rat | 0 | acc | 29.13 | ± | 2.86 |
| | | acc_norm | 28.74 | ± | 2.85 |
| agieval_logiqa_en | 0 | acc | 42.86 | ± | 1.94 |
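The Average column in the Phi-3-mini-4k-instruct table is the arithmetic mean of its four benchmark scores; checking it with the values from the table itself:

```python
# Benchmark scores copied from the Phi-3-mini-4k-instruct summary row above.
scores = {"AGIEval": 44.44, "GPT4All": 71.88, "TruthfulQA": 57.77, "Bigbench": 41.90}

# Average = mean of the four scores: (44.44 + 71.88 + 57.77 + 41.90) / 4 = 53.9975.
average = sum(scores.values()) / len(scores)
print(f"Average: {average:.1f}")  # prints "Average: 54.0"
```

This matches the 54 reported in the table's Average column.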