DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Source

@misc{deepseek-math,
  author = {Zhihong Shao and Peiyi Wang and Qihao Zhu and Runxin Xu and Junxiao Song and Mingchuan Zhang and Y.K. Li and Y. Wu and Daya Guo},
  title = {DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models},
  journal = {CoRR},
  volume = {abs/2402.03300},
  year = {2024},
  url = {https://arxiv.org/abs/2402.03300},
}

GRPO

The lineage of RL methods behind this paper is: TRPO (Trust Region Policy Optimization) -> PPO (Proximal Policy Optimization) -> GRPO (Group Relative Policy Optimization). This section starts from the policy gradient and then introduces the details of PPO and GRPO.


The basic idea of policy gradient is:

If an action or a sequence of actions leads to good results (high reward), increase the probability / frequency of taking those actions in the future, and vice versa.

Given a trajectory $\tau$, the objective function of the policy gradient is \(\mathcal{J}(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=1}^{|\tau|} r_t \right],\) where $r_t$ is the reward at time step $t$. Its gradient is \(\nabla_\theta \mathcal{J}(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=1}^{|\tau|} \nabla_\theta \log \pi_\theta(a_t|s_t) \, G_t \right],\) where $G_t = \sum_{t'=t}^{|\tau|} r_{t'}$ is the return from time step $t$. To reduce the variance of the gradient estimate, an advantage function $A_t(s_t, a_t) = Q(s_t, a_t) - V(s_t)$ replaces the return $G_t$ as an estimate of how much better an action is than the average, which gives the REINFORCE with baseline algorithm: \(\nabla_\theta \mathcal{J}(\theta) = \sum_{t=1}^{|\tau|} \mathbb{E}_t \left[ \nabla_\theta \log \pi_\theta(a_t|s_t) \, A_t(s_t, a_t) \right].\) However, it is unclear how large the update step should be, since a change in parameter space does not directly reflect the change in policy space (A Natural Policy Gradient, Sham Kakade, 2001).
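
A minimal PyTorch sketch of a REINFORCE-with-baseline update on a single trajectory. The `policy` and `value_net` modules are hypothetical stand-ins (any discrete-action policy head and scalar value head would do); the point is how the log-probability, return, and baseline combine into the gradient above.

```python
import torch

def reinforce_with_baseline_loss(policy, value_net, states, actions, rewards, gamma=0.99):
    """Single-trajectory REINFORCE-with-baseline loss (sketch).

    states:  (T, state_dim) float tensor
    actions: (T,) long tensor of action indices
    rewards: (T,) float tensor of per-step rewards
    """
    T = rewards.shape[0]
    # Returns G_t = sum_{t' >= t} gamma^{t'-t} r_{t'}
    returns = torch.zeros(T, device=rewards.device)
    running = 0.0
    for t in reversed(range(T)):
        running = rewards[t] + gamma * running
        returns[t] = running

    logits = policy(states)                                           # (T, num_actions)
    log_probs = torch.log_softmax(logits, dim=-1)
    log_pi_a = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)   # log pi(a_t|s_t)

    values = value_net(states).squeeze(-1)          # baseline V(s_t)
    advantages = (returns - values).detach()        # A_t ~ G_t - V(s_t)

    policy_loss = -(log_pi_a * advantages).mean()   # ascend J(theta) by descending its negation
    value_loss = torch.nn.functional.mse_loss(values, returns)
    return policy_loss + 0.5 * value_loss
```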


Since finding a good step size is difficult, we can instead require that the new policy not deviate too much from the old policy (so that importance sampling based on the old policy remains valid). This is the idea of Trust Region Policy Optimization (TRPO) (Trust Region Policy Optimization, John Schulman et al., 2015). The objective is: \(\mathcal{J}_{\text{TRPO}}(\theta|\theta_{\text{old}}) = \mathbb{E}_t \left[ \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)} A_t \right],\) which reframes the problem as searching for a new $\theta$ near the old one. To define the trust region, a constraint on the KL divergence between the old and new policies is added: \(\text{subject to } \mathbb{E}_t \left[ D_{\text{KL}}(\pi_{\theta_{\text{old}}}(\cdot|s_t) \,\|\, \pi_\theta(\cdot|s_t)) \right] \leq \delta.\)
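
A minimal sketch of the TRPO quantities for a categorical policy: the importance-sampled surrogate and the mean KL used as the trust-region constraint. The conjugate-gradient step and backtracking line search that TRPO actually uses to solve the constrained problem are deliberately elided; all tensor names here are illustrative.

```python
import torch

def trpo_surrogate_and_kl(new_logits, old_logits, actions, advantages):
    """Return the surrogate J(theta | theta_old) and the mean KL(pi_old || pi_new)."""
    new_logp = torch.log_softmax(new_logits, dim=-1)
    old_logp = torch.log_softmax(old_logits, dim=-1).detach()

    # Importance ratio pi_theta(a|s) / pi_theta_old(a|s) for the taken actions
    ratio = torch.exp(new_logp.gather(1, actions.unsqueeze(1)).squeeze(1)
                      - old_logp.gather(1, actions.unsqueeze(1)).squeeze(1))
    surrogate = (ratio * advantages).mean()

    # KL(pi_old || pi_new) averaged over the visited states
    kl = (old_logp.exp() * (old_logp - new_logp)).sum(dim=-1).mean()
    return surrogate, kl

# TRPO maximizes `surrogate` subject to kl <= delta (e.g. delta = 0.01),
# typically via a conjugate-gradient step followed by a line search.
```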


The problem with TRPO is that the constrained optimization is expensive to compute. To simplify it, Proximal Policy Optimization (PPO) (Proximal Policy Optimization Algorithms, John Schulman et al., 2017) was proposed. The idea is to replace the hard KL constraint with either a clipped surrogate objective or a KL penalty term; the clipped surrogate is the more commonly used variant. PPO optimizes the surrogate objective (using the notation of the DeepSeekMath paper): \(\mathcal{J}_{\text{PPO}}(\theta) = \mathbb{E}_{q\sim P(Q),\, o\sim\pi_{\theta_{\text{old}}}(O|q)} \frac{1}{|o|} \sum_{t=1}^{|o|} \left[ \min \left( \frac{\pi_\theta(o_t|q,o_{<t})}{\pi_{\theta_{\text{old}}}(o_t|q,o_{<t})} A_t,\ \text{clip}\left(\frac{\pi_\theta(o_t|q,o_{<t})}{\pi_{\theta_{\text{old}}}(o_t|q,o_{<t})}, 1 - \epsilon, 1 + \epsilon\right) A_t \right) \right],\) where $A_t$ is estimated with GAE from the rewards and a learned value function, and $q$ and $o$ are the question (input) and response (output) from the dataset. In PPO, the value function is trained alongside the policy model. To mitigate over-optimization of the reward model (the policy may exploit imperfections of the RM), the actual reward is the reward-model score minus a weighted KL penalty between the current policy and a reference policy (another pretrained model used as a baseline): \(r_t = r_{\text{RM}}(q, o_{\leq t}) - \beta \log \frac{\pi_\theta(o_t|q,o_{<t})}{\pi_{\text{ref}}(o_t|q,o_{<t})}.\)
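
A minimal PyTorch sketch of the per-token clipped PPO objective and the KL-shaped reward described above. Tensor names are illustrative, the GAE computation is omitted, and advantages are assumed to be precomputed and detached.

```python
import torch

def ppo_clipped_loss(logp_new, logp_old, advantages, mask, eps=0.2):
    """Token-level PPO clipped surrogate.

    logp_new, logp_old: (B, T) log-probs of the sampled tokens under the
                        current and old (behaviour) policies
    advantages:         (B, T) GAE advantages, precomputed and detached
    mask:               (B, T) 1 for response tokens, 0 for padding
    """
    ratio = torch.exp(logp_new - logp_old.detach())
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    per_token = torch.min(unclipped, clipped)
    # Maximize the objective -> minimize its negation, averaged over valid tokens
    return -(per_token * mask).sum() / mask.sum()

def kl_shaped_reward(rm_reward_t, logp_policy_t, logp_ref_t, beta=0.05):
    """r_t = r_RM - beta * log(pi_theta / pi_ref), applied per token."""
    return rm_reward_t - beta * (logp_policy_t - logp_ref_t)
```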


The problem with PPO is that many models are involved (policy, value function, reward model, reference policy), which makes training very expensive. To address this, Group Relative Policy Optimization (GRPO) is proposed. The idea is to drop the value function and use a group-relative baseline instead: for each question, a group of responses is sampled, each response is scored by the reward model, and each reward is normalized by the group statistics: \(\hat{A}_{i,t} = \frac{r_i-\mu_r}{\sigma_r},\) where $\mu_r$ and $\sigma_r$ are the mean and standard deviation of the rewards across the sampled responses for that question, and every token of response $i$ receives the same advantage $\hat{A}_{i,t}$ under outcome supervision. GRPO also moves the KL penalty against the reference policy out of the reward and adds it directly to the loss as a regularization term.
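
A minimal PyTorch sketch of the group-relative advantage and an outcome-supervision GRPO loss term, assuming the sampling loop and reward-model calls happen elsewhere. The per-token KL regularizer uses the unbiased estimator from the paper; token averaging is simplified to one global mean over valid tokens.

```python
import torch

def grpo_advantages(group_rewards, eps=1e-6):
    """Normalize one question's group of scalar rewards: A_i = (r_i - mean) / std.

    group_rewards: (G,) rewards for G sampled responses to the same question.
    """
    mean, std = group_rewards.mean(), group_rewards.std()
    return (group_rewards - mean) / (std + eps)

def grpo_loss(logp_new, logp_old, logp_ref, advantages, mask, eps=0.2, beta=0.04):
    """Clipped surrogate with group-relative advantages plus a direct KL(pi_theta || pi_ref) term.

    logp_new/logp_old/logp_ref: (G, T) per-token log-probs under the current,
                                old, and reference policies
    advantages:                 (G,) per-response advantages, broadcast over tokens
    mask:                       (G, T) 1 for response tokens, 0 for padding
    """
    adv = advantages.unsqueeze(1)                        # (G, 1) -> broadcasts to (G, T)
    ratio = torch.exp(logp_new - logp_old.detach())
    surrogate = torch.min(ratio * adv,
                          torch.clamp(ratio, 1 - eps, 1 + eps) * adv)
    # Unbiased per-token KL estimate: pi_ref/pi_theta - log(pi_ref/pi_theta) - 1
    log_ratio_ref = logp_ref.detach() - logp_new
    kl = torch.exp(log_ratio_ref) - log_ratio_ref - 1
    per_token = surrogate - beta * kl
    return -(per_token * mask).sum() / mask.sum()
```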