DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Source

@misc{deepseek-math,
  author = {Zhihong Shao and Peiyi Wang and Qihao Zhu and Runxin Xu and Junxiao Song and Mingchuan Zhang and Y.K. Li and Y. Wu and Daya Guo},
  title = {DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models},
  journal = {CoRR},
  volume = {abs/2402.03300},
  year = {2024},
  url = {https://arxiv.org/abs/2402.03300},
}

GRPO

The lineage of RL methods behind this paper is: TRPO (Trust Region Policy Optimization) -> PPO (Proximal Policy Optimization) -> GRPO (Group Relative Policy Optimization). This section starts from the policy gradient and then introduces the details of PPO and GRPO.


The basic idea of policy gradient is:

If an action or a sequence of actions leads to good results (high reward), increase the probability / frequency of taking those actions in the future, and vice versa.

Given a trajectory $\tau$, the objective function of the policy gradient is \(\mathcal{J}(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=1}^{|\tau|} r_t \right],\) where $r_t$ is the reward at time step $t$. Its gradient is \(\nabla_\theta \mathcal{J}(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=1}^{|\tau|} \nabla_\theta \log \pi_\theta(a_t|s_t) \, G_t \right],\) where $G_t = \sum_{t'=t}^{|\tau|} r_{t'}$ is the return from time step $t$. To reduce the variance of the gradient estimate, an advantage function $A_t(s_t, a_t) = Q(s_t, a_t) - V(s_t)$ replaces the return $G_t$ as an estimate of how much better an action is than the average, which gives the REINFORCE with baseline algorithm: \(\nabla_\theta \mathcal{J}(\theta) = \sum_{t=1}^{|\tau|} \mathbb{E}_t \left[ \nabla_\theta \log \pi_\theta(a_t|s_t) \, A_t(s_t, a_t) \right].\) However, it is unclear how large the update step should be, since a change in parameter space does not directly reflect the change in policy space (A Natural Policy Gradient, Sham Kakade, 2001).
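
A minimal PyTorch sketch of a REINFORCE-with-baseline update on a single trajectory. The `policy` and `value_net` modules are hypothetical stand-ins (any discrete-action policy head and scalar value head would do); the point is how the log-probability, return, and baseline combine into the gradient above.

```python
import torch

def reinforce_with_baseline_loss(policy, value_net, states, actions, rewards, gamma=0.99):
    """Single-trajectory REINFORCE-with-baseline loss (sketch).

    states:  (T, state_dim) float tensor
    actions: (T,) long tensor of action indices
    rewards: (T,) float tensor of per-step rewards
    """
    T = rewards.shape[0]
    # Returns G_t = sum_{t' >= t} gamma^{t'-t} r_{t'}
    returns = torch.zeros(T, device=rewards.device)
    running = 0.0
    for t in reversed(range(T)):
        running = rewards[t] + gamma * running
        returns[t] = running

    logits = policy(states)                                           # (T, num_actions)
    log_probs = torch.log_softmax(logits, dim=-1)
    log_pi_a = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)   # log pi(a_t|s_t)

    values = value_net(states).squeeze(-1)          # baseline V(s_t)
    advantages = (returns - values).detach()        # A_t ~ G_t - V(s_t)

    policy_loss = -(log_pi_a * advantages).mean()   # ascend J(theta) by descending its negation
    value_loss = torch.nn.functional.mse_loss(values, returns)
    return policy_loss + 0.5 * value_loss
```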


Since finding a good step size is difficult, we can instead require that the new policy not deviate too much from the old policy (so that importance sampling based on the old policy remains valid). This is the idea of Trust Region Policy Optimization (TRPO) (Trust Region Policy Optimization, John Schulman et al., 2015). The objective is: \(\mathcal{J}_{\text{TRPO}}(\theta|\theta_{\text{old}}) = \mathbb{E}_t \left[ \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)} A_t \right],\) which reframes the problem as searching for a new $\theta$ near the old one. To define the trust region, a constraint on the KL divergence between the old and new policies is added: \(\text{subject to } \mathbb{E}_t \left[ D_{\text{KL}}(\pi_{\theta_{\text{old}}}(\cdot|s_t) \,\|\, \pi_\theta(\cdot|s_t)) \right] \leq \delta.\)
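
A minimal sketch of the TRPO quantities for a categorical policy: the importance-sampled surrogate and the mean KL used as the trust-region constraint. The conjugate-gradient step and backtracking line search that TRPO actually uses to solve the constrained problem are deliberately elided; all tensor names here are illustrative.

```python
import torch

def trpo_surrogate_and_kl(new_logits, old_logits, actions, advantages):
    """Return the surrogate J(theta | theta_old) and the mean KL(pi_old || pi_new)."""
    new_logp = torch.log_softmax(new_logits, dim=-1)
    old_logp = torch.log_softmax(old_logits, dim=-1).detach()

    # Importance ratio pi_theta(a|s) / pi_theta_old(a|s) for the taken actions
    ratio = torch.exp(new_logp.gather(1, actions.unsqueeze(1)).squeeze(1)
                      - old_logp.gather(1, actions.unsqueeze(1)).squeeze(1))
    surrogate = (ratio * advantages).mean()

    # KL(pi_old || pi_new) averaged over the visited states
    kl = (old_logp.exp() * (old_logp - new_logp)).sum(dim=-1).mean()
    return surrogate, kl

# TRPO maximizes `surrogate` subject to kl <= delta (e.g. delta = 0.01),
# typically via a conjugate-gradient step followed by a line search.
```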


The problem with TRPO is that the constrained optimization is expensive to compute. To simplify it, Proximal Policy Optimization (PPO) (Proximal Policy Optimization Algorithms, John Schulman et al., 2017) was proposed. The idea is to replace the hard KL constraint with either a clipped surrogate objective or a KL penalty term; the clipped surrogate is the more commonly used variant. PPO optimizes the surrogate objective (using the notation of the DeepSeekMath paper): \(\mathcal{J}_{\text{PPO}}(\theta) = \mathbb{E}_{q\sim P(Q),\, o\sim\pi_{\theta_{\text{old}}}(O|q)} \frac{1}{|o|} \sum_{t=1}^{|o|} \left[ \min \left( \frac{\pi_\theta(o_t|q,o_{<t})}{\pi_{\theta_{\text{old}}}(o_t|q,o_{<t})} A_t,\ \text{clip}\left(\frac{\pi_\theta(o_t|q,o_{<t})}{\pi_{\theta_{\text{old}}}(o_t|q,o_{<t})}, 1 - \epsilon, 1 + \epsilon\right) A_t \right) \right],\) where $A_t$ is estimated with GAE from the rewards and a learned value function, and $q$ and $o$ are the question (input) and response (output) from the dataset. In PPO, the value function is trained alongside the policy model. To mitigate over-optimization of the reward model (the policy may exploit imperfections of the RM), the actual reward is the reward-model score minus a weighted KL penalty between the current policy and a reference policy (another pretrained model used as a baseline): \(r_t = r_{\text{RM}}(q, o_{\leq t}) - \beta \log \frac{\pi_\theta(o_t|q,o_{<t})}{\pi_{\text{ref}}(o_t|q,o_{<t})}.\)
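
A minimal PyTorch sketch of the per-token clipped PPO objective and the KL-shaped reward described above. Tensor names are illustrative, the GAE computation is omitted, and advantages are assumed to be precomputed and detached.

```python
import torch

def ppo_clipped_loss(logp_new, logp_old, advantages, mask, eps=0.2):
    """Token-level PPO clipped surrogate.

    logp_new, logp_old: (B, T) log-probs of the sampled tokens under the
                        current and old (behaviour) policies
    advantages:         (B, T) GAE advantages, precomputed and detached
    mask:               (B, T) 1 for response tokens, 0 for padding
    """
    ratio = torch.exp(logp_new - logp_old.detach())
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    per_token = torch.min(unclipped, clipped)
    # Maximize the objective -> minimize its negation, averaged over valid tokens
    return -(per_token * mask).sum() / mask.sum()

def kl_shaped_reward(rm_reward_t, logp_policy_t, logp_ref_t, beta=0.05):
    """r_t = r_RM - beta * log(pi_theta / pi_ref), applied per token."""
    return rm_reward_t - beta * (logp_policy_t - logp_ref_t)
```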


The problem with PPO is that many models are involved (policy, value function, reward model, reference policy), which makes training very expensive. To address this, Group Relative Policy Optimization (GRPO) is proposed. The idea is to drop the value function and use a group-relative baseline instead: for each question, a group of responses is sampled, each response is scored by the reward model, and each reward is normalized by the group statistics: \(\hat{A}_{i,t} = \frac{r_i-\mu_r}{\sigma_r},\) where $\mu_r$ and $\sigma_r$ are the mean and standard deviation of the rewards across the sampled responses for that question, and every token of response $i$ receives the same advantage $\hat{A}_{i,t}$ under outcome supervision. GRPO also moves the KL penalty against the reference policy out of the reward and adds it directly to the loss as a regularization term.
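
A minimal PyTorch sketch of the group-relative advantage and an outcome-supervision GRPO loss term, assuming the sampling loop and reward-model calls happen elsewhere. The per-token KL regularizer uses the unbiased estimator from the paper; token averaging is simplified to one global mean over valid tokens.

```python
import torch

def grpo_advantages(group_rewards, eps=1e-6):
    """Normalize one question's group of scalar rewards: A_i = (r_i - mean) / std.

    group_rewards: (G,) rewards for G sampled responses to the same question.
    """
    mean, std = group_rewards.mean(), group_rewards.std()
    return (group_rewards - mean) / (std + eps)

def grpo_loss(logp_new, logp_old, logp_ref, advantages, mask, eps=0.2, beta=0.04):
    """Clipped surrogate with group-relative advantages plus a direct KL(pi_theta || pi_ref) term.

    logp_new/logp_old/logp_ref: (G, T) per-token log-probs under the current,
                                old, and reference policies
    advantages:                 (G,) per-response advantages, broadcast over tokens
    mask:                       (G, T) 1 for response tokens, 0 for padding
    """
    adv = advantages.unsqueeze(1)                        # (G, 1) -> broadcasts to (G, T)
    ratio = torch.exp(logp_new - logp_old.detach())
    surrogate = torch.min(ratio * adv,
                          torch.clamp(ratio, 1 - eps, 1 + eps) * adv)
    # Unbiased per-token KL estimate: pi_ref/pi_theta - log(pi_ref/pi_theta) - 1
    log_ratio_ref = logp_ref.detach() - logp_new
    kl = torch.exp(log_ratio_ref) - log_ratio_ref - 1
    per_token = surrogate - beta * kl
    return -(per_token * mask).sum() / mask.sum()
```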