Created
October 14, 2025 12:16
-
-
Save faabian/39d0577c1f8a462ccfb4fd67644ea0e6 to your computer and use it in GitHub Desktop.
Replication of "Winning Gold at IMO 2025 with a Model-Agnostic Verification-and-Refinement Pipeline"
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| """ | |
| Quick and dirty replication of | |
| "Winning Gold at IMO 2025 with a Model-Agnostic Verification-and-Refinement Pipeline" (https://arxiv.org/abs/2507.15855) | |
| using LangGraph. | |
| Export GOOGLE_API_KEY to run. | |
| Change the model, question and constants directly in the code (no CLI). | |
| """ | |
| import asyncio | |
| from typing import TypedDict | |
| # from langchain_openai import ChatOpenAI | |
| from langchain_google_genai import ChatGoogleGenerativeAI | |
| from langgraph.graph import END, StateGraph | |
| # This dictionary defines the structure of the data that flows through the graph. | |
| class GraphState(TypedDict): | |
| """ | |
| Represents the state of our graph. | |
| Attributes: | |
| problem_statement: The initial mathematical problem to be solved. | |
| current_solution: The latest version of the solution. | |
| bug_report: The feedback from the verifier on the current solution. | |
| iterations: The number of refinement cycles performed. | |
| verification_history: A list to track consecutive verification passes or failures. | |
| verdict: The final verdict from the verifier (e.g., "Correct", "Invalid"). | |
| """ | |
| problem_statement: str | |
| current_solution: str | |
| bug_report: str | |
| iterations: int | |
| verification_history: list[str] # "pass", "minor", "major" | |
| verdict: str | |
| SOLVER_PROMPT = """ | |
| ### Core Instructions ### | |
| * **Rigor is Paramount:** Your primary goal is to produce a complete and rigorously justified solution. Every step in your solution must be logically sound and clearly explained. A correct final answer derived from flawed or incomplete reasoning is considered a failure. | |
| * **Honesty About Completeness:** If you cannot find a complete solution, you must **not** guess or create a solution that appears correct but contains hidden flaws or justification gaps. Instead, you should present only significant partial results that you can rigorously prove. A partial result is considered significant if it represents a substantial advancement toward a full solution. Examples include: | |
| * Proving a key lemma. | |
| * Fully resolving one or more cases within a logically sound case-based proof. | |
| * Establishing a critical property of the mathematical objects in the problem. | |
| * For an optimization problem, proving an upper or lower bound without proving that this bound is achievable. | |
| * **Use TeX for All Mathematics:** All mathematical variables, expressions, and relations must be enclosed in TeX delimiters (e.g., 'Let $n$ be an integer.'). | |
| ### Output Format ### | |
| Your response MUST be structured into the following sections, in this exact order. | |
| **1. Summary** | |
| Provide a concise overview of your findings. This section must contain two parts: | |
| * **a. Verdict:** State clearly whether you have found a complete solution or a partial solution. | |
| * **For a complete solution:** State the final answer, e.g., "I have successfully solved the problem. The final answer is..." | |
| * **For a partial solution:** State the main rigorous conclusion(s) you were able to prove, e.g., "I have not found a complete solution, but I have rigorously proven that..." | |
| * **b. Method Sketch:** Present a high-level, conceptual outline of your solution. This sketch should allow an expert to understand the logical flow of your argument without reading the full detail. It should include: | |
| * A narrative of your overall strategy. | |
| * The full and precise mathematical statements of any key lemmas or major intermediate results. | |
| * If applicable, describe any key constructions or case splits that form the backbone of your argument. | |
| **2. Detailed Solution** | |
| Present the full, step-by-step mathematical proof. Each step must be logically justified and clearly explained. The level of detail should be sufficient for an expert to verify the correctness of your reasoning without needing to fill in any gaps. This section must contain ONLY the complete, rigorous proof, free of any internal commentary, alternative approaches, or failed attempts. | |
| ### Self-Correction Instruction ### | |
| Before finalizing your output, carefully review your "Method Sketch" and "Detailed Solution" to ensure they are clean, rigorous, and strictly adhere to all instructions provided above. Verify that every statement contributes directly to the final, coherent mathematical argument. | |
| """.strip() | |
| VERIFIER_PROMPT = """ | |
| You are an expert mathematician and a meticulous grader for an International Mathematical Olympiad (IMO) level exam. Your primary task is to rigorously verify the provided mathematical solution. A solution is to be judged correct **only if every step is rigorously justified.** A solution that arrives at a correct final answer through flawed reasoning, educated guesses, or with gaps in its arguments must be flagged as incorrect or incomplete. | |
| ### Instructions ### | |
| **1. Core Instructions** | |
| * Your sole task is to find and report all issues in the provided solution. You must act as a **verifier**, NOT a solver. **Do NOT attempt to correct the errors or fill the gaps you find.** | |
| * You must perform a **step-by-step** check of the entire solution. This analysis will be presented in a **Detailed Verification Log**, where you justify your assessment of each step: for correct steps, a brief justification suffices; for steps with errors or gaps, you must provide a detailed explanation. | |
| **2. How to Handle Issues in the Solution** | |
| When you identify an issue in a step, you MUST first classify it into one of the following two categories and then follow the specified procedure. | |
| * **a. Critical Error:** | |
| This is any error that breaks the logical chain of the proof. This includes both **logical fallacies** (e.g., claiming that 'A>B, C>D' implies 'A-C>B-D') and **factual errors** (e.g., a calculation error like ‘2+3=6‘). | |
| * **Procedure:** | |
| * Explain the specific error and state that it **invalidates the current line of reasoning**. | |
| * Do NOT check any further steps that rely on this error. | |
| * You MUST, however, scan the rest of the solution to identify and verify any fully independent parts. For example, if a proof is split into multiple cases, an error in one case does not prevent you from checking the other cases. | |
| * **b. Justification Gap:** | |
| This is for steps where the conclusion may be correct, but the provided argument is incomplete, hand-wavy, or lacks sufficient rigor. | |
| * **Procedure:** | |
| * Explain the gap in the justification. | |
| * State that you will **assume the step's conclusion is true** for the sake of argument. | |
| * Then, proceed to verify all subsequent steps to check if the remainder of the argument is sound. | |
| **3. Output Format** | |
| Your response MUST be structured into two main sections: a **Summary** followed by the **Detailed Verification Log**. | |
| * **a. Summary** | |
| This section MUST be at the very beginning of your response. It must contain two components: | |
| * **Final Verdict**: A single, clear sentence declaring the overall validity of the solution. For example: "The solution is correct," "The solution contains a Critical Error and is therefore invalid," or "The solution’s approach is viable but contains several Justification Gaps." | |
| * **List of Findings**: A bulleted list that summarizes **every** issue you discovered. For each finding, you must provide: | |
| * **Location:** A direct quote of the key phrase or equation where the issue occurs. | |
| * **Issue:** A brief description of the problem and its classification (**Critical Error** or **Justification Gap**). | |
| * **b. Detailed Verification Log** | |
| Following the summary, provide the full, step-by-step verification log as defined in the Core Instructions. When you refer to a specific part of the solution, **quote the relevant text** to make your reference clear before providing your detailed analysis of that part. | |
| **Example of the Required Summary Format** | |
| *This is a generic example to illustrate the required format. Your findings must be based on the actual solution provided below.* | |
| **Final Verdict:** The solution is **invalid** because it contains a Critical Error. | |
| **List of Findings:** | |
| * **Location:** "By interchanging the limit and the integral, we get..." | |
| * **Issue:** Justification Gap - The solution interchanges a limit and an integral without providing justification, such as proving uniform convergence. | |
| * **Location:** "From $A > B$ and $C > D$, it follows that $A-C > B-D$" | |
| * **Issue:** Critical Error - This step is a logical fallacy. Subtracting inequalities in this manner is not a valid mathematical operation. | |
| ====================================================================== | |
| ### Problem ### | |
| {problem} | |
| ====================================================================== | |
| ### Solution ### | |
| {solution} | |
| ====================================================================== | |
| ### Verification Task Reminder ### | |
| Your task is to act as an IMO grader. Now, generate the **summary** and the **step-by-step verification log** for the solution above. In your log, justify each correct step and explain in detail any errors or justification gaps you find, as specified in the instructions above. | |
| """.strip() | |
| # NOTE: these were missing in the paper, best effort to replicate | |
| SOLVER_IMPROVE_PROMPT = """ | |
| You have just generated a solution to a complex mathematical problem. | |
| Your task is to review your own work for any potential weaknesses, logical gaps, or areas where the explanation could be more rigorous. | |
| Take a deep breath and critically analyze your own solution. Provide a new, improved version. | |
| Even if your original solution is strong, try to refine the language, clarify the steps, or make the logic more explicit. | |
| Here is the problem statement for context: | |
| {problem} | |
| Here is the solution you generated: | |
| {solution} | |
| Now, provide an improved version of the solution, following the original output format (Summary and Detailed Solution). | |
| """.strip() | |
| VERIFIER_IMPROVE_PROMPT = """ | |
| You have just generated a bug report for the problem and solution above based on the given instructions. | |
| Your task is to review your own work for any potential mistakes or issues in the bug report. | |
| Review carefully each issue raised in the bug report and ensure that its criticism is accurate, well supported and well explained. | |
| If it turns out that the stated issue is not an actual issue, remove it from the bug report. | |
| If the explanation of the issue can be improved, update it accordingly. | |
| Here is the original bug report: | |
| {bug_report} | |
| Now, provide an improved version of the bug report, following the original output format (Summary and Detailed Verification Log). | |
| """.strip() | |
| CORRECTION_PROMPT = """ | |
| You are tasked with correcting a mathematical solution based on a bug report from an expert reviewer. | |
| Your goal is to produce a new, corrected solution that addresses every issue raised in the report. | |
| If you do not agree with a criticsm, improve the argumentation to avoid the misunderstanding. | |
| ====================================================================== | |
| ### Problem ### | |
| {problem} | |
| ====================================================================== | |
| ### Solution ### | |
| {solution} | |
| ====================================================================== | |
| ### Bug Report ### | |
| {bug_report} | |
| Now, provide a new, complete, and corrected solution that addresses the points raised in the bug report. | |
| Ensure your new solution is self-contained and does not reference the bug report. | |
| """.strip() | |
| # LLM = ChatOpenAI(model="gpt-5", temperature=0.1) | |
| # LLM = ChatGoogleGenerativeAI(model="gemini-2.5-flash", temperature=0.1) | |
| LLM = ChatGoogleGenerativeAI(model="gemini-2.5-pro", temperature=0.1) | |
| # Constants from the paper were 5 and 10, respectively. | |
| ACCEPTANCE_THRESHOLD = 5 # consecutive passes for acceptance | |
| REJECTION_THRESHOLD = 10 # consecutive failures with major issues for discarding | |
| def classify_judgement(judgement: str) -> str: | |
| """Classify the judgement as "pass", "minor", or "major".""" | |
| # TODO: robustify | |
| if "solution is correct" in judgement: | |
| return "pass" | |
| elif "critical" in judgement or "Critical" in judgement: | |
| return "major" | |
| else: | |
| return "minor" | |
| def generate_initial_solution(state: GraphState): | |
| """(Step 1) Generate the initial solution.""" | |
| print("--- GENERATING INITIAL SOLUTION ---") | |
| messages = [ | |
| {"role": "system", "content": SOLVER_PROMPT}, | |
| { | |
| "role": "user", | |
| "content": f"Here is the problem: {state['problem_statement']}", | |
| }, | |
| ] | |
| response = LLM.invoke(messages) | |
| return {"current_solution": response.content} | |
| def improve_solution(state: GraphState): | |
| """(Step 2) The model reviews and improves its own work.""" | |
| print("--- IMPROVING SOLUTION ---") | |
| problem = state["problem_statement"] | |
| solution = state["current_solution"] | |
| messages = [ | |
| {"role": "system", "content": SOLVER_PROMPT}, | |
| { | |
| "role": "user", | |
| "content": SOLVER_IMPROVE_PROMPT.format(problem=problem, solution=solution), | |
| }, | |
| ] | |
| response = LLM.invoke(messages) | |
| return {"current_solution": response.content} | |
| def verify_solution(state: GraphState): | |
| """(Step 3) Verify the solution and generate a bug report.""" | |
| print("--- VERIFYING SOLUTION ---") | |
| problem = state["problem_statement"] | |
| solution = state["current_solution"] | |
| messages = [ | |
| { | |
| "role": "user", | |
| "content": VERIFIER_PROMPT.format(problem=problem, solution=solution), | |
| } | |
| ] | |
| response = LLM.invoke(messages) | |
| verdict = classify_judgement(response.content) | |
| history = state["verification_history"] + [verdict] | |
| print(f"--- VERIFICATION RESULT: {verdict.upper()} ---") | |
| return { | |
| "bug_report": response.content, | |
| "verification_history": history, | |
| "verdict": verdict, | |
| } | |
| def improve_verification(state: GraphState): | |
| """(Step 4) The verfier reviews and improves its own work.""" | |
| print("--- IMPROVING VERIFICATION ---") | |
| problem = state["problem_statement"] | |
| solution = state["current_solution"] | |
| bug_report = state["bug_report"] | |
| messages = [ | |
| { | |
| "role": "system", | |
| "content": VERIFIER_PROMPT.format(problem=problem, solution=solution), | |
| }, | |
| { | |
| "role": "user", | |
| "content": VERIFIER_IMPROVE_PROMPT.format(bug_report=bug_report), | |
| }, | |
| ] | |
| response = LLM.invoke(messages) | |
| verdict = classify_judgement(response.content) | |
| print(f"--- RE-VERIFICATION RESULT: {verdict.upper()} ---") | |
| return { | |
| "bug_report": response.content, | |
| "verdict": verdict, | |
| } | |
| def correct_solution(state: GraphState): | |
| """(Step 5) Correct the solution based on the bug report.""" | |
| print("--- CORRECTING SOLUTION ---") | |
| problem = state["problem_statement"] | |
| solution = state["current_solution"] | |
| bug_report = state["bug_report"] | |
| messages = [ | |
| { | |
| "role": "user", | |
| "content": CORRECTION_PROMPT.format( | |
| problem=problem, solution=solution, bug_report=bug_report | |
| ), | |
| } | |
| ] | |
| response = LLM.invoke(messages) | |
| # Increment iteration counter after a correction attempt | |
| iterations = state["iterations"] + 1 | |
| return {"current_solution": response.content, "iterations": iterations} | |
| def _decide_next_step(state: GraphState) -> str: | |
| """(Conditional Edge) Decide whether to accept, reject, or correct.""" | |
| history = state["verification_history"] | |
| # Acceptance Condition: consecutive passes | |
| if len(history) >= ACCEPTANCE_THRESHOLD and all( | |
| h == "pass" for h in history[-ACCEPTANCE_THRESHOLD:] | |
| ): | |
| return "accept" | |
| # Rejection Condition: consecutive major failures | |
| if len(history) >= REJECTION_THRESHOLD and all( | |
| h == "major" for h in history[-REJECTION_THRESHOLD:] | |
| ): | |
| return "reject" | |
| # if the last verification succeeded, re-verify directly | |
| if history[-1] == "pass": | |
| return "verify" | |
| return "improve_verify" | |
| def decide_next_step(state: GraphState) -> str: | |
| result = _decide_next_step(state) | |
| match result: | |
| case "accept": | |
| print( | |
| f"--- ACCEPTANCE: Solution passed verification {ACCEPTANCE_THRESHOLD} consecutive times. ---" | |
| ) | |
| case "reject": | |
| print( | |
| f"--- REJECTION: Solution failed {REJECTION_THRESHOLD} refinement iterations. ---" | |
| ) | |
| case "verify": | |
| print("--- ROUTING TO RE-VERIFY ---") | |
| case "improve_verify": | |
| print("--- ROUTING TO VERIFICATION IMRPOVEMENT AND CORRECTION ---") | |
| return result | |
| async def pass_at_k_early_stop(app, problem_input: dict, k: int) -> GraphState | None: | |
| """ | |
| Runs the graph `k` times in parallel, stopping all runs once one succeeds. | |
| """ | |
| success_event = asyncio.Event() | |
| final_result: GraphState | None = None | |
| async def run_worker(run_id: int): | |
| nonlocal final_result | |
| try: | |
| worker_input = {**problem_input, "run_id": run_id} | |
| cfg = { | |
| "recursion_limit": 200, | |
| "max_concurrency": 100, | |
| } | |
| async for step in app.astream( | |
| worker_input, | |
| config=cfg, | |
| stream_mode="values", | |
| ): | |
| if success_event.is_set(): | |
| print(f"[{run_id}] Stopping early after another run succeeded.") | |
| return | |
| if step["verification_history"] and _decide_next_step(step) == "accept": | |
| if not success_event.is_set(): | |
| final_result = step | |
| print(f"[{run_id}] --- Setting SUCCESS event! ---") | |
| success_event.set() | |
| return | |
| except Exception as e: | |
| print(f"[{run_id}] Error: {e}") | |
| tasks = [asyncio.create_task(run_worker(i)) for i in range(k)] | |
| await asyncio.gather(*tasks) | |
| return final_result | |
| workflow = StateGraph(GraphState) | |
| # Add nodes | |
| workflow.add_node("solver", generate_initial_solution) | |
| workflow.add_node("improver", improve_solution) | |
| workflow.add_node("verifier", verify_solution) | |
| workflow.add_node("verifier_improver", improve_verification) | |
| workflow.add_node("corrector", correct_solution) | |
| # Define edges | |
| workflow.set_entry_point("solver") | |
| workflow.add_edge("solver", "improver") | |
| workflow.add_edge("improver", "verifier") | |
| workflow.add_edge("verifier_improver", "corrector") | |
| workflow.add_edge("corrector", "verifier") | |
| # Add conditional edge from the verifier | |
| workflow.add_conditional_edges( | |
| "verifier", | |
| decide_next_step, | |
| { | |
| "accept": END, | |
| "reject": END, | |
| "improve_verify": "verifier_improver", | |
| "verify": "verifier", # repeat check | |
| }, | |
| ) | |
| app = workflow.compile() | |
| if __name__ == "__main__": | |
| IMO_2025_P1 = """ | |
| A line in the plane is called sunny if it is not parallel to any of the x-axis, the y-axis, or the line $x + y = 0$. | |
| Let $n \\geq 3$ be a given integer. Determine all nonnegative integers $k$ such that there | |
| exist $n$ distinct lines in the plane satisfying both of the following: | |
| - for all positive integers $a$ and $b$ with $a + b \\leq n + 1$, the point $(a, b)$ lies on at least one of the lines; and | |
| - exactly $k$ of the $n$ lines are sunny. | |
| """.strip() | |
| DEBUG_STATEMENT = "Prove that for any positive integer $n$, the number $n^3 + 5n$ is divisible by 6." | |
| IMO_2025_P3 = """ | |
| A function $f : \\mathbb{{N}} \\to \\mathbb{{N}} is said to be \\emph{{bonza}} if | |
| $f(a)$ divides $b^a - f(b)^{{f(a)}}$ | |
| for all positive integers $a$ and $b$. | |
| Determine the smallest real constant $c$ such that $f(n) \\leq cn$ for all bonza functions $f$ and all positive integers $n$. | |
| """.strip() | |
| problem_statement = DEBUG_STATEMENT | |
| initial_state = { | |
| "problem_statement": problem_statement, | |
| "current_solution": "", | |
| "bug_report": "", | |
| "iterations": 0, | |
| "verification_history": [], | |
| "verdict": "", | |
| } | |
| print("Starting IMO Problem-Solving Pipeline...") | |
| print(f"Problem: {problem_statement}\n") | |
| # final_state = app.invoke(initial_state) # run once | |
| k = 5 | |
| final_state = asyncio.run(pass_at_k_early_stop(app, initial_state, k=k)) | |
| print("\n--- PIPELINE FINISHED ---") | |
| if final_state is not None: | |
| print(f"Final Verdict: {final_state['verdict'].upper()}") | |
| print(f"Total Iterations: {final_state['iterations']}") | |
| print(f"Verification History: {final_state['verification_history']}") | |
| print("\n--- FINAL SOLUTION ---") | |
| print(final_state["current_solution"]) | |
| else: | |
| print(f"None of {k} runs succeeded.") |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment