Source
@inproceedings{fpo,
  author    = {David McAllister and Songwei Ge and Brent Yi and Chung Min Kim and Ethan Weber and Hongsuk Choi and Haiwen Feng and Angjoo Kanazawa},
  title     = {Flow Matching Policy Gradients},
  booktitle = {ICLR},
  year      = {2026},
  url       = {https://arxiv.org/abs/2507.21053},
}
| (Berkeley) | arXiv |
TL;DR
FPO brings flow matching into the policy gradient framework: it replaces PPO's exact likelihood ratio with a proxy computed from the conditional flow matching loss, enabling on-policy RL for flow-based policies without exact likelihoods.

Flash Reading
- Abstract: Flow Policy Optimization (FPO) is a simple on-policy reinforcement learning algorithm that brings flow matching into the policy gradient framework. It casts policy optimization as maximizing an advantage-weighted ratio, as in PPO, but without requiring exact likelihood computation.
- Introduction: FPO reframes policy optimization as maximizing an advantage-weighted ratio computed from the conditional flow matching (CFM) objective. Instead of computing intractable likelihoods, it uses the flow matching loss as a surrogate in the policy gradient, which aligns the objective with increasing the evidence lower bound of high-reward actions. FPO treats the sampling process as a black box, rather than reframing the denoising process as an MDP, which would tie training to a specific sampling method.
- Flow Matching Policy Gradients: (more details: GRPO paper) Policy gradient methods use collected data to increase the likelihood of actions that lead to high rewards: \(\max_\theta \mathbb{E}_{a_t \sim \pi_\theta} \left[ \log \pi_\theta(a_t|o_t) \hat{A}_t \right],\) given the observation \(o_t\), action \(a_t\), and advantage \(\hat{A}_t\). This objective is only valid locally, and large updates can lead to policy collapse. PPO addresses this with a clipped surrogate objective that constructs a trust region, but it requires computing the likelihood ratio \(r(\theta)=\pi_\theta(a_t|o_t)/\pi_{\theta_{\text{old}}}(a_t|o_t)\). FPO replaces \(r(\theta)\) with the proxy \(\hat{r}^{\text{FPO}}(\theta) = \exp( \hat{\mathcal{L}}_{\text{CFM},\theta_{\text{old}}} - \hat{\mathcal{L}}_{\text{CFM},\theta} ),\) where, for a given observation \(o_t\) and action \(a_t\), \(\hat{\mathcal{L}}_{\text{CFM},\theta}\) is a Monte Carlo estimate of the per-sample CFM loss: \(\hat{\mathcal{L}}_{\text{CFM},\theta} = \frac{1}{N_\text{mc}} \sum_{i=1}^{N_\text{mc}} \| \hat{v}_\theta(a^{\tau_i}_t, \tau_i | o_t) - (a_t - \epsilon_i) \|^2_2,\) where \(\tau_i\) is the flow timestep, \(\epsilon_i \sim \mathcal{N}(0, I)\), and \(a^{\tau_i}_t = (1-\tau_i)\,\epsilon_i + \tau_i\, a_t\) is the noised action along the linear interpolation path (whose velocity target is \(a_t - \epsilon_i\)).
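A minimal NumPy sketch of the two pieces above: the Monte Carlo CFM loss estimate and the PPO-style clipped surrogate built from the FPO ratio proxy. The function names and the toy setup are mine, not from the paper; in practice the same \((\tau_i, \epsilon_i)\) draws should be reused when evaluating the old and new losses so the two estimates are compared on matched samples.

```python
import numpy as np

rng = np.random.default_rng(0)

def cfm_loss(v_pred_fn, action, obs, n_mc=8, rng=rng):
    """Monte Carlo estimate of the per-sample CFM loss.

    Uses the linear interpolation a^tau = (1 - tau) * eps + tau * a,
    whose velocity target is (a - eps).
    """
    losses = []
    for _ in range(n_mc):
        tau = rng.uniform()                       # flow timestep tau_i
        eps = rng.standard_normal(action.shape)   # noise eps_i ~ N(0, I)
        a_tau = (1 - tau) * eps + tau * action    # noised action
        v = v_pred_fn(a_tau, tau, obs)            # predicted velocity
        losses.append(np.sum((v - (action - eps)) ** 2))
    return np.mean(losses)

def fpo_objective(loss_new, loss_old, advantage, clip=0.2):
    """Clipped surrogate with the FPO ratio proxy r = exp(L_old - L_new)."""
    ratio = np.exp(loss_old - loss_new)
    return np.minimum(ratio * advantage,
                      np.clip(ratio, 1 - clip, 1 + clip) * advantage)
```

When the new and old losses coincide the ratio is 1 and the surrogate equals the advantage, mirroring PPO's behavior at the start of each update epoch; clipping then bounds how far a single update can move the loss on any sample.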