
DeepSeekMath‑V2: Open‑Source Math LLM That Reaches IMO Gold

Updated on December 04, 2025 · 13 min read

[Illustration: researchers at a transparent board of equations, with a verifier–meta-verifier–generator diagram illustrating DeepSeekMath-V2]

DeepSeek has released DeepSeekMath-V2, a large-scale math reasoning model that doesn’t just solve Olympiad problems: it also checks and scores its own proofs, reaching gold-level performance on IMO 2025 and CMO 2024 and scoring 118/120 on Putnam 2024.

DeepSeekMath-V2 is built on DeepSeek-V3.2-Exp-Base, tuned specifically for natural-language theorem proving and self-verifiable reasoning.

This article has two parts:

  1. Part 1 – Tech-news overview: what DeepSeekMath-V2 is, why it matters, and how it performs.
  2. Part 2 – Technical deep dive: how the verifier–meta-verifier–generator loop works and how the training pipeline is structured.

Part 1 – Tech-News Overview

What is DeepSeekMath-V2?

DeepSeekMath-V2 is a specialized large language model focused on:

  • Olympiad-style mathematics (IMO, CMO, Putnam).
  • Natural-language theorem proving (full proofs in ordinary math English).
  • Self-verification, where the model evaluates and scores its own solutions before finalizing them.

Instead of just answering “what is the final number?”, DeepSeekMath-V2 is explicitly trained to answer “is this proof correct and rigorous?” and to improve its own reasoning based on that evaluation.

Why DeepSeekMath-V2 matters

1. Open-weights model with gold-level contest performance

DeepSeekMath-V2 demonstrates:

  • IMO 2025 – 5 out of 6 problems fully solved, with a total score in the gold medal range.
  • CMO 2024 – 4 out of 6 problems fully solved, plus partial credit on another, again at gold-equivalent level.
  • Putnam 2024 – 11 of 12 problems solved completely, with only minor errors on the remaining one, for a score of 118/120, beating the top human score of 90.

These results are not based on answer-only checking; the authors had mathematical experts grade the proofs using official-style marking schemes.

2. From answer-checking to proof-checking

Most math-focused LLM training uses a simple RL reward:

Reward = 1 if final answer is correct, else 0.

This works for quantitative contests like AIME and HMMT, but fails for theorem proving, where:

  • Many problems only ask for a proof, not a numeric answer.
  • A model can reach the right answer with incorrect reasoning, and still get a full reward.

DeepSeekMath-V2 tackles this by:

  • Training a proof verifier that reads a problem and a candidate solution and:
    • Describes issues in the proof.
    • Assigns a score: 1 (rigorous), 0.5 (mostly right with minor gaps), or 0 (fundamentally flawed).
  • Using this verifier as the reward model for training the proof generator.

So the model isn’t rewarded for “right final numbers,” it’s rewarded for high-quality proofs.

3. Self-verification as a core feature

The final DeepSeekMath-V2 model is prompted not just to solve problems, but to formally evaluate its own solutions. A typical output follows this template:

## Solution

… full proof …

## Self Evaluation

Here is my evaluation of the solution:

… critique …

Based on my evaluation, the final overall score should be:

$\boxed{0 / 0.5 / 1}$

During training, the model is penalized if:

  • It claims a high score for a proof that the external verifier considers weak.
  • Its self-evaluation doesn’t match the verifier’s judgment.

This encourages the model to honestly identify and describe flaws in its own reasoning instead of bluffing.

Headline performance

DeepSeekMath-V2 is evaluated across several benchmarks:

  • In-house CNML-level benchmark

    • 91 theorem-proving problems approximating Chinese National High School Mathematics League difficulty, covering algebra, geometry, number theory, combinatorics, and inequalities.
    • DeepSeekMath-V2 achieves the highest average proof score in every category compared with GPT-5-Thinking-High and Gemini 2.5-Pro (see Figure 1 in the paper).
  • IMO-ProofBench

    • On the Basic subset, DeepSeekMath-V2 (heavy-compute setting) reaches 99.0% correct proofs.
    • On the Advanced subset, it reaches 61.9% accuracy, competitive with the strongest reported systems.
  • Real competitions

    • Gold-level performance on IMO 2025 and CMO 2024, plus near-perfect Putnam 2024 performance, as summarized in Table 1 of the paper.

All high-stakes results are confirmed by human expert graders who mark the model’s solutions as if they were contest scripts.

Licensing and availability

DeepSeekMath-V2 and its methodology are released under the Apache-2.0 license, which allows commercial use, modification, and redistribution of the model and its weights.

Part 2 – Technical Deep Dive

1. The core problem: final-answer RL hits a wall

Traditional RL for math reasoning works like this:

  1. Pre-train a large language model.
  2. Apply supervised fine-tuning on chain-of-thought math solutions.
  3. Apply reinforcement learning where the reward is based on final answer correctness.

This has led models to saturate many quantitative benchmarks. But it has two major limitations for theorem proving:

  • Correct answer $\neq$ correct reasoning. A model can still get a reward even if large parts of its proof are wrong, as long as the final number happens to be right.

  • Theorem questions often have no numeric answer. Problems that simply say “prove that …” offer no final scalar to compare against.

The DeepSeekMath-V2 paper argues that to push LLMs toward deeper reasoning, we need to verify the rigor and completeness of reasoning itself, not just the result.

2. System architecture: verifier, meta-verifier, generator

DeepSeekMath-V2 is built around three tightly coupled roles:

  1. Proof verifier – evaluates proofs and scores them.
  2. Meta-verifier – evaluates the verifier’s own analyses.
  3. Proof generator – produces solutions and self-evaluations, trained using feedback from the verifier and meta-verifier.

All three are implemented as LLMs derived from the same base architecture, but with different prompts and RL objectives.
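
To make the role split concrete, here is a minimal sketch of role-specific prompt templates. The exact prompts are given in Appendices A.1–A.3 of the paper; the template wording and the `build_prompt` helper below are illustrative assumptions, not the published prompts.

```python
# Illustrative role-specific prompt templates (assumed wording, not the
# official prompts from Appendices A.1-A.3 of the paper).
GENERATION_TEMPLATE = (
    "Solve the following problem and then evaluate your own solution.\n"
    "Write the proof under '## Solution' and the critique under "
    "'## Self Evaluation', ending with a boxed score (0, 0.5, or 1).\n\n"
    "Problem:\n{problem}\n"
)

VERIFICATION_TEMPLATE = (
    "You are grading a proof. Identify any issues, then output\n"
    "'Here is my evaluation of the solution:' followed by your analysis and\n"
    "'Based on my evaluation, the final overall score should be:' with a "
    "boxed score (0, 0.5, or 1).\n\n"
    "Problem:\n{problem}\n\nCandidate proof:\n{proof}\n"
)

META_VERIFICATION_TEMPLATE = (
    "You are auditing a grader. Check whether the issues raised in the "
    "analysis below actually exist in the proof and whether the final score "
    "follows the rubric. End with a boxed meta-score (0, 0.5, or 1).\n\n"
    "Problem:\n{problem}\n\nCandidate proof:\n{proof}\n\n"
    "Verifier analysis:\n{analysis}\n"
)


def build_prompt(role: str, **fields: str) -> str:
    """Fill the template for one of the three roles: generator, verifier, meta-verifier."""
    templates = {
        "generator": GENERATION_TEMPLATE,
        "verifier": VERIFICATION_TEMPLATE,
        "meta_verifier": META_VERIFICATION_TEMPLATE,
    }
    return templates[role].format(**fields)


if __name__ == "__main__":
    print(build_prompt("verifier", problem="Prove that ...", proof="Assume ..."))
```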

3. Training the proof verifier

3.1 Cold-start data

To train a proof verifier, DeepSeek first constructs a dataset $D_v$ of problems, candidate proofs, and expert scores:

  1. Problem collection – 17,503 proof-style problems from Art of Problem Solving (AoPS) contest archives, focusing on post-2010 olympiads and team selection problems that explicitly require proofs. This pool is called $D_p$.
  2. Candidate proof generation – an earlier DeepSeek-V3.2-Exp-Thinking model generates solutions, encouraged to iteratively refine proofs to increase length and rigor.
  3. Human scoring – math experts label each proof with a score:
    • 1 – complete and rigorous, all steps justified.
    • 0.5 – essentially correct, but with minor omissions or small errors.
    • 0 – fundamentally flawed, with serious logical gaps.

This yields training triples $(X_i, Y_i, s_i)$.
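
As a rough illustration, each cold-start example can be represented as a small record holding the problem, the candidate proof, and the expert score. The `ProofExample` dataclass below is a hypothetical schema, not DeepSeek’s actual data format.

```python
from dataclasses import dataclass

VALID_SCORES = (0.0, 0.5, 1.0)  # the paper's three-level rubric


@dataclass(frozen=True)
class ProofExample:
    """One cold-start training triple (X_i, Y_i, s_i) for the verifier."""
    problem: str          # X_i: the contest problem statement
    candidate_proof: str  # Y_i: a model-generated candidate proof
    expert_score: float   # s_i: 1 rigorous, 0.5 minor gaps, 0 fundamentally flawed

    def __post_init__(self) -> None:
        if self.expert_score not in VALID_SCORES:
            raise ValueError(f"expert_score must be one of {VALID_SCORES}")


example = ProofExample(
    problem="Prove that the sum of two even integers is even.",
    candidate_proof="Let a = 2m and b = 2n; then a + b = 2(m + n), which is even.",
    expert_score=1.0,
)
```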

3.2 RL objective for the verifier

The verifier is initialized from a DeepSeek-V3.2-Exp-SFT checkpoint (already fine-tuned on math and code reasoning) and optimized via Group Relative Policy Optimization (GRPO) with two key reward terms:

  • Format reward $R_{\text{format}}$ ensures proper structure:

    • The output must include the phrase: Here is my evaluation of the solution:
    • It must end with $\boxed{score}$ after the phrase: Based on my evaluation, the final overall score should be:
  • Score reward $R_{\text{score}}$ measures how close the predicted score $s'$ is to the expert label $s$:

$$R_{\text{score}} = 1 - |s' - s|$$

The verifier’s RL objective is to maximize the expected product:

$$R_{\text{format}} \cdot R_{\text{score}}$$

over $D_v$.
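
Below is a minimal sketch of this objective, assuming the two marker phrases and the 0/0.5/1 rubric described above; the regular expression and helper names are my own, not DeepSeek’s code.

```python
import re

EVAL_MARKER = "Here is my evaluation of the solution:"
SCORE_MARKER = "Based on my evaluation, the final overall score should be:"
BOXED_RE = re.compile(r"\\boxed\{\s*(0\.5|0|1)\s*\}")


def format_reward(verifier_output: str) -> float:
    """R_format: 1 if both required phrases appear and a boxed score follows the second, else 0."""
    if EVAL_MARKER not in verifier_output or SCORE_MARKER not in verifier_output:
        return 0.0
    tail = verifier_output.split(SCORE_MARKER)[-1]
    return 1.0 if BOXED_RE.search(tail) else 0.0


def predicted_score(verifier_output: str) -> float | None:
    """Extract the last boxed score s' in the verifier's analysis, if any."""
    matches = BOXED_RE.findall(verifier_output)
    return float(matches[-1]) if matches else None


def verifier_reward(verifier_output: str, expert_score: float) -> float:
    """Cold-start objective: R = R_format * R_score, with R_score = 1 - |s' - s|."""
    s_pred = predicted_score(verifier_output)
    if s_pred is None:
        return 0.0
    return format_reward(verifier_output) * (1.0 - abs(s_pred - expert_score))


analysis = (
    "Here is my evaluation of the solution:\nStep 3 omits the case n = 0.\n"
    "Based on my evaluation, the final overall score should be:\n$\\boxed{0.5}$"
)
print(verifier_reward(analysis, expert_score=0.5))  # -> 1.0
```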

3.3 The hallucinated-issues problem

This setup successfully trains the verifier to predict scores, but it leaves a hole:

For flawed proofs, the verifier can predict the correct numeric score while hallucinating nonexistent issues in the explanation and still receive a full reward.

Since these textual critiques will later be used to refine proofs, the team needs a way to enforce that the identified issues really exist.

4. Meta-verification: verifying the verifier

To address this, DeepSeek adds a second layer: the meta-verifier.

The meta-verifier receives:

  • The problem $X$.
  • The candidate proof $Y$.
  • The verifier’s analysis $V$ (including its score).

Its job is to check if:

  • The verifier correctly restates relevant parts of the proof.
  • The defects it points out actually exist and are analyzed correctly.
  • The final score is justified according to the rubric.

4.1 Meta-verification dataset

To train this model, the team:

  1. Runs the initial verifier on various proofs.
  2. Has experts label each verifier output $V$ with a meta-score $ms \in \{0, 0.5, 1\}$, measuring the quality and faithfulness of the analysis.
  3. Builds a dataset $D_{mv} = \{(X_i, Y_i, V_i, ms_i)\}$.

The meta-verifier is then trained with the same RL structure as the verifier, but now the target is the quality of the analysis, not the quality of the proof.

4.2 Feeding meta-feedback back into verifier training

Once the meta-verifier can reliably score analyses, its feedback is used as an extra term in the verifier’s reward:

$$R_V = R_{\text{format}} \cdot R_{\text{score}} \cdot R_{\text{meta}},$$

where $R_{\text{meta}}$ is the meta-verifier’s quality score for the verifier’s analysis.

By training the verifier with this augmented reward on both $D_v$ and $D_{mv}$, the authors obtain a model that:

  • Still predicts proof scores accurately.
  • Produces analyses whose average meta-score on a validation set increases from 0.85 to 0.96, indicating much more faithful issue identification.
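
Schematically, the augmented objective just multiplies a third factor into the cold-start reward. In the sketch below, `meta_score` stands in for the meta-verifier’s judgment of the analysis; the function is illustrative, not the released implementation.

```python
def augmented_verifier_reward(
    r_format: float,   # R_format: structural check on the analysis (0 or 1)
    r_score: float,    # R_score = 1 - |s' - s| against the expert label
    meta_score: float, # R_meta: meta-verifier's quality score for the analysis (0, 0.5, or 1)
) -> float:
    """R_V = R_format * R_score * R_meta.

    Hallucinated issues now hurt the verifier: even if the numeric score is
    right (r_score = 1), a low meta_score drags the whole product down.
    """
    return r_format * r_score * meta_score


# Right score, invented flaw (meta_score = 0): no reward under the augmented objective.
print(augmented_verifier_reward(1.0, 1.0, 0.0))  # -> 0.0
print(augmented_verifier_reward(1.0, 1.0, 1.0))  # -> 1.0
```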

5. Training the proof generator

With a robust verifier in hand, DeepSeek trains a proof generator that uses the verifier’s scores as RL rewards.

5.1 Basic generator objective

The generator is initialized from the enhanced verifier checkpoint (so it already has verification capabilities). For each problem $X$ from the AoPS pool $D_p$:

  1. The generator produces a candidate solution $Y$.
  2. The verifier scores it with $s \in \{0, 0.5, 1\}$.
  3. The generator is updated by GRPO to maximize the expected score:

$$R_Y = s.$$

This encourages the generator to produce proofs that the verifier considers rigorous and correct.
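
GRPO details are out of scope here, but its core idea, comparing each sampled proof’s reward with the group average for the same problem, fits in a few lines. The normalisation below is a generic GRPO-style sketch under that assumption, not DeepSeek’s exact training code.

```python
import statistics


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantages: standardise each sampled proof's reward within its group.

    `rewards` holds the verifier scores s for a group of proofs sampled for the
    same problem; proofs scored above the group mean get positive advantages
    (and are reinforced), the rest get negative ones.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled proofs for one problem, scored 0, 0.5, 1, 1 by the verifier.
print(group_relative_advantages([0.0, 0.5, 1.0, 1.0]))
```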

5.2 Adding self-verification

However, the authors observe that when asked to both solve and evaluate in a single forward pass, the generator tends to over-rate its own proofs, even when the external verifier easily spots mistakes.

To fix this, they explicitly train the generator to act like a verifier on its own outputs. During training:

  • The generator must produce:

    • $Y$ – a complete solution under the ## Solution section.
    • $Z$ – a detailed self-evaluation under ## Self Evaluation, ending with $\boxed{0}$, $\boxed{0.5}$, or $\boxed{1}$.
  • The external verifier is then used to:

    • Score the proof $Y \rightarrow s$.
    • Score the self-evaluation $Z$ as an analysis $\rightarrow$ meta-score $ms$.

The overall reward is:

$$R = R_{\text{format}}(Y, Z) \cdot (\alpha R_Y + \beta R_Z),$$

where:

  • $R_Y = s$ is the proof score.
  • $R_Z = R_{\text{score}}(s', s) \cdot R_{\text{meta}}(Z)$ measures how accurate and honest the self-evaluation is.
  • $\alpha = 0.76$, $\beta = 0.24$.

This reward structure encourages the generator to:

  • Produce correct proofs.
  • Accurately judge how correct they are.
  • Prefer honest acknowledgment of errors over falsely claiming correctness.
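
A minimal sketch of this combined reward follows, reusing the score and meta terms described earlier; the function signature and example values are illustrative, while $\alpha$ and $\beta$ are the weights quoted above.

```python
ALPHA, BETA = 0.76, 0.24  # weights on proof quality vs. self-evaluation quality (from the paper)


def generator_reward(
    r_format: float,     # R_format(Y, Z): 1 if the Solution and Self Evaluation sections are well formed
    proof_score: float,  # s: external verifier's score for the proof Y
    self_score: float,   # s': the score the generator assigned to its own proof
    meta_score: float,   # R_meta(Z): verifier's quality score for the self-evaluation Z
) -> float:
    """R = R_format * (alpha * R_Y + beta * R_Z), with R_Y = s and R_Z = (1 - |s' - s|) * R_meta(Z)."""
    r_y = proof_score
    r_z = (1.0 - abs(self_score - proof_score)) * meta_score
    return r_format * (ALPHA * r_y + BETA * r_z)


# Honest self-criticism: a flawed proof (s = 0) that the model also scores 0,
# with a faithful self-evaluation, still earns the beta-weighted term.
print(generator_reward(1.0, proof_score=0.0, self_score=0.0, meta_score=1.0))  # -> 0.24
# Bluffing: the same flawed proof self-scored as 1 earns nothing.
print(generator_reward(1.0, proof_score=0.0, self_score=1.0, meta_score=0.0))  # -> 0.0
```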

6. Automated Labeling with Scaled Verification

As the generator improves, its proofs become harder to judge, and human labeling becomes costly. DeepSeek therefore builds a fully automated labeling pipeline using scaled verification and meta-verification.

For each proof:

  1. Multiple verifier samples

    • Run $n$ independent verification analyses.
  2. Meta-verify analyses that report issues

    • For analyses with score 0 or 0.5, run $m$ meta-verification passes.
    • Mark an analysis as valid if the majority of meta-verifiers agree its defect findings are reasonable.
  3. Assign a proof label

    • If there are at least $k$ valid analyses with the lowest score (0 or 0.5), label the proof with that lowest score.
    • If no valid issues are found at all, label the proof as 1 (fully correct).
    • Otherwise, discard the proof or send it to humans (a step needed only in earlier training iterations).

By the final two iterations, this pipeline completely replaces human annotation, and spot checks show strong agreement with expert labels.
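
The decision logic of this pipeline can be summarised in a short sketch. The thresholds $n$, $m$, and $k$ and the `verify`/`meta_verify` callables below are placeholders; the paper’s exact values and implementation may differ.

```python
from collections import Counter
from typing import Callable

# Placeholder thresholds; the exact n, m, k used by DeepSeek are not all spelled out here.
N_VERIFICATIONS = 8  # verifier analyses per proof (n)
M_META_CHECKS = 4    # meta-verification passes per issue-reporting analysis (m)
K_VALID_NEEDED = 2   # valid analyses required to accept a low score (k)


def label_proof(
    verify: Callable[[], tuple[float, str]],  # one verifier run -> (score, analysis text)
    meta_verify: Callable[[str], bool],       # True if the analysis' reported defects look real
) -> float | None:
    """Assign an automatic label, or None if the proof should be discarded or sent to humans."""
    analyses = [verify() for _ in range(N_VERIFICATIONS)]

    # Keep only issue-reporting analyses (score 0 or 0.5) whose defects survive
    # a majority vote of meta-verification passes.
    valid_low_scores = []
    for score, text in analyses:
        if score < 1.0:
            votes = [meta_verify(text) for _ in range(M_META_CHECKS)]
            if Counter(votes)[True] > M_META_CHECKS / 2:
                valid_low_scores.append(score)

    if not valid_low_scores:
        return 1.0                                   # no valid issues found: fully correct
    lowest = min(valid_low_scores)
    if valid_low_scores.count(lowest) >= K_VALID_NEEDED:
        return lowest                                # enough agreement on the most severe score
    return None                                      # ambiguous: discard or escalate to humans
```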

7. Inference Strategies and Results

The authors evaluate DeepSeekMath-V2 under three main inference strategies.

7.1 One-Shot Generation

  • For the in-house CNML-level problems, the model generates 8 proof samples per problem.
  • Each sample is evaluated with 8 verifier runs, and majority vote determines correctness.
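
A minimal sketch of this protocol follows, with placeholder `generate` and `verify` callables standing in for model calls; it assumes only the 8 × 8 sampling and the majority vote described above.

```python
from collections import Counter
from typing import Callable

N_SAMPLES = 8        # proof samples generated per problem
N_VERIFIER_RUNS = 8  # independent verification analyses per sample


def majority_score(problem: str, proof: str,
                   verify: Callable[[str, str], float]) -> float:
    """Score one proof sample by majority vote over independent verifier runs."""
    scores = [verify(problem, proof) for _ in range(N_VERIFIER_RUNS)]
    return Counter(scores).most_common(1)[0][0]


def one_shot_scores(problem: str,
                    generate: Callable[[str], str],
                    verify: Callable[[str, str], float]) -> list[float]:
    """Majority-vote score for each sampled proof on one problem."""
    return [majority_score(problem, generate(problem), verify)
            for _ in range(N_SAMPLES)]
```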

DeepSeekMath-V2 consistently outperforms GPT-5-Thinking-High and Gemini 2.5-Pro across all problem categories in this setting (Figure 1).

7.2 Sequential Self-Refinement

Harder problems often require more context than a single pass allows. To explore this, the authors test sequential refinement with self-verification on the IMO Shortlist 2024 set:

  1. Generate an initial solution + self-evaluation.
  2. Feed the entire transcript (problem, solution, self-evaluation) back into the model with a refinement prompt (Appendix A.4), asking it to fix issues.
  3. Repeat for up to 8 iterations, or until the model scores itself as fully correct.

They run 32 independent refinement threads per problem and evaluate using majority vote over 32 verifier runs. Two metrics:

  • Pass@1 – average score of the final proof in each thread.
  • Best@32 – best proof per problem, ranked by the model’s own self-scores.

Figure 2 shows:

  • Pass@1 improves significantly as the maximum number of iterations increases from 1 (no refinement) to 8.
  • Best@32 is consistently higher than Pass@1, showing that the model’s self-scores are useful for selecting high-quality proofs.
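
For completeness, a single refinement thread from this protocol might look like the loop below. The `generate` and `refine` callables are placeholders for model calls with the generation and refinement prompts (Appendix A.4); the stopping rule and iteration counting are a simplified reading of the setup above.

```python
from typing import Callable

MAX_ITERATIONS = 8  # refinement budget per thread, matching the setup above


def refinement_thread(
    problem: str,
    generate: Callable[[str], tuple[str, str, float]],          # problem -> (proof, self_eval, self_score)
    refine: Callable[[str, str, str], tuple[str, str, float]],  # (problem, proof, self_eval) -> new triple
) -> tuple[str, float]:
    """Run one sequential self-refinement thread; return the final proof and its self-score."""
    proof, self_eval, self_score = generate(problem)
    for _ in range(MAX_ITERATIONS - 1):      # remaining refinement rounds after the first attempt
        if self_score >= 1.0:                # the model rates its own proof as fully correct: stop
            break
        # Feed the whole transcript (problem, proof, self-evaluation) back with a refinement prompt.
        proof, self_eval, self_score = refine(problem, proof, self_eval)
    return proof, self_score


def best_of_threads(results: list[tuple[str, float]]) -> str:
    """Best@N selection: pick the proof with the highest self-score across independent threads."""
    return max(results, key=lambda item: item[1])[0]
```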

7.3 High-Compute Search (Contest Mode)

For real contest benchmarks like IMO 2025, CMO 2024, and Putnam 2024, DeepSeek uses a heavy-compute search and refinement loop:

  1. Initialize candidate pool

    • Generate 64 initial proofs per problem.
    • For each proof, generate 64 verification analyses.
    • Keep the 64 proofs with the highest average verification score as the candidate set.
  2. Iterative refinement (up to 16 iterations)

    • For each proof in the candidate pool, randomly select 8 of its verification analyses (favoring those that report issues).
    • Feed the proof plus these analyses back to the generator with the refinement prompt to produce a new proof.
    • Re-score all new proofs (again with 64 verification analyses each) and update the candidate pool.
  3. Stopping criterion

    • Stop early if a proof passes all 64 verification attempts (no issues reported), which indicates high confidence in correctness.

A single model, the final proof generator, is used for both generation and verification in this loop.
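
The heavy-compute loop is essentially a scored candidate pool that is repeatedly rewritten under verifier feedback. The sketch below mirrors the numbers quoted above (64 proofs, 64 analyses per proof, up to 16 rounds, 8 feedback analyses) with placeholder callables for the single model that plays both roles; it is a simplified reconstruction, not the released search code.

```python
import random
from typing import Callable

POOL_SIZE = 64    # candidate proofs kept per problem
N_ANALYSES = 64   # verification analyses per proof
N_FEEDBACK = 8    # analyses sampled as refinement feedback
MAX_ROUNDS = 16   # refinement iterations


def contest_mode_search(
    problem: str,
    generate: Callable[[str], str],                   # problem -> candidate proof
    verify: Callable[[str, str], tuple[float, str]],  # (problem, proof) -> (score, analysis)
    refine: Callable[[str, str, list[str]], str],     # (problem, proof, feedback) -> new proof
) -> str:
    """Sketch of the heavy-compute search-and-refine loop used in contest mode."""

    def evaluate(proof: str) -> dict:
        analyses = [verify(problem, proof) for _ in range(N_ANALYSES)]
        scores = [s for s, _ in analyses]
        return {
            "proof": proof,
            "avg": sum(scores) / len(scores),
            "clean": all(s == 1.0 for s in scores),               # no analysis reports an issue
            "issues": [text for s, text in analyses if s < 1.0],  # issue-reporting analyses
        }

    # 1. Initial candidate pool, ranked by average verification score.
    pool = sorted((evaluate(generate(problem)) for _ in range(POOL_SIZE)),
                  key=lambda c: c["avg"], reverse=True)[:POOL_SIZE]

    for _ in range(MAX_ROUNDS):
        if pool[0]["clean"]:
            return pool[0]["proof"]  # stop early: every analysis found the proof issue-free
        # 2. Refine each candidate using a sample of its issue-reporting analyses.
        new_candidates = []
        for cand in pool:
            k = min(N_FEEDBACK, len(cand["issues"]))
            feedback = random.sample(cand["issues"], k) if k else []
            new_candidates.append(evaluate(refine(problem, cand["proof"], feedback)))
        # 3. Re-rank the combined pool and keep the best candidates.
        pool = sorted(pool + new_candidates, key=lambda c: c["avg"], reverse=True)[:POOL_SIZE]

    return pool[0]["proof"]
```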

This strategy is what yields the strong competition results summarized in Table 1:

  • Gold-level scores on IMO 2025 and CMO 2024.
  • 118/120 on Putnam 2024, surpassing the top human competitor.

8. Relation to Formal Theorem Proving

DeepSeekMath-V2 operates in natural language, which means:

  • Proofs are written like AoPS posts or contest writeups, understandable by humans.
  • There is no built-in formal guarantee of correctness like in Lean or Isabelle.

However, the model is complementary to formal systems:

  • Natural-language proofs can serve as high-level sketches for formal provers.
  • DeepSeek’s own DeepSeek-Prover-V2 uses LLM-based informal reasoning to guide formal proof search and achieves strong results on formal benchmarks.

The authors explicitly argue that improving informal theorem proving with models like DeepSeekMath-V2 should significantly boost the effectiveness of formal theorem proving systems.


9. Limitations and Open Challenges

The paper is clear about what DeepSeekMath-V2 does not yet solve:

  • Informal, not formal – “self-verifiable” means “no issues found by the verifier/meta-verifier,” not a guaranteed formal proof. Subtle mistakes may still slip through.
  • Compute-heavy – the strongest results rely on many proof samples and verification passes per problem; running such loops requires substantial compute.
  • Domain coverage – most training and evaluation is on contest-style problems. Behavior on broad research-level mathematics remains an open question.
  • Imperfect self-evaluation – while the model’s self-scores correlate well with verifier scores, they are not perfect, especially on the hardest problems.
  • Safety considerations – powerful math models can be applied to dual-use domains (e.g., cryptanalysis, system design). Responsible deployment is essential, particularly given the open-weight release.

10. How to Experiment with DeepSeekMath-V2

If you have access to suitable compute, here’s a high-level roadmap to try DeepSeekMath-V2 yourself:

  1. Download the model

  2. Set up an inference stack

  3. Use role-specific prompts

    • Generation: follow the “Proof Generation Prompt” (Appendix A.1) and have the model output both ## Solution and ## Self Evaluation sections (a small parsing sketch follows this roadmap).
    • Verification: follow the “Proof Verification Prompt” (Appendix A.2), provide a problem and a solution, and ask for an evaluation and score.
    • Meta-verification: use the “Meta-Verification Prompt” (Appendix A.3) to have the model judge another evaluation.
  4. Implement your own refinement loop

    • Generate several candidate solutions per problem.
    • Let the model verify each; pick top-scoring proofs and refine them using the verification feedback.
    • Repeat a few iterations to see how proof quality improves.
  5. Fine-tune for your application

    • Because of the Apache-2.0 license, you can fine-tune the model on:
      • In-house exercise sets (for ed-tech tools).
      • Specialized domains (e.g., optimization, control, discrete math).
      • Research-level problem collections.
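
Because the generation prompt asks for a ## Solution and a ## Self Evaluation section ending in a boxed score, a small parser helps when wiring up steps 3 and 4. The helper below is a hypothetical utility that assumes only the section markers and score format described in this article.

```python
import re

BOXED_RE = re.compile(r"\\boxed\{\s*(0\.5|0|1)\s*\}")


def split_generation(output: str) -> tuple[str, str, float | None]:
    """Split a model generation into (proof, self-evaluation, boxed self-score)."""
    _, _, rest = output.partition("## Solution")
    proof, _, self_eval = rest.partition("## Self Evaluation")
    match = BOXED_RE.search(self_eval)
    self_score = float(match.group(1)) if match else None
    return proof.strip(), self_eval.strip(), self_score


sample = (
    "## Solution\nAssume n is even ...\n"
    "## Self Evaluation\nHere is my evaluation of the solution:\n"
    "The parity argument is complete.\n"
    "Based on my evaluation, the final overall score should be:\n\\boxed{1}"
)
print(split_generation(sample)[2])  # -> 1.0
```

From there, the verification and refinement steps can reuse a loop like the one sketched in Section 7.2.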

If you’d like to build the skills needed to work with models like DeepSeekMath-V2 professionally, you can explore AI and data-driven programs at Code Labs Academy.

You can explore the project and resources here:

  • GitHub repo (code + paper): https://github.com/deepseek-ai/DeepSeek-Math-V2/tree/main
  • Hugging Face model card + weights: https://huggingface.co/deepseek-ai/DeepSeek-Math-V2

Conclusion

DeepSeekMath-V2 is not just another big math model; it’s a blueprint for training LLMs that can:

  • Generate detailed mathematical proofs.
  • Critically evaluate and score their own reasoning.
  • Use those evaluations to iteratively refine their solutions.

By combining proof verification, meta-verification, and self-verification, the DeepSeek team shows that LLMs can develop meaningful self-evaluation abilities on complex reasoning tasks and reach gold-level performance on some of the hardest math competitions in the world.

It’s an important step toward AI systems that don’t just answer questions, but can audit, debug, and trust-check their own reasoning, a capability that will be crucial far beyond competition mathematics.
