Source
@misc{simplevla-rl,
  author  = {Li, Haozhan and Zuo, Yuxin and Yu, Jiale and Zhang, Yuhao and Yang, Zhaohui and Zhang, Kaiyan and Zhu, Xuekai and Zhang, Yuchen and Chen, Tianxing and Cui, Ganqu and others},
  title   = {SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning},
  journal = {arXiv preprint arXiv:2509.09674},
  year    = {2025}
}
Affiliation: Tsinghua, Shanghai AI Lab | Venue: arXiv
TL;DR
…

Flash Reading
- Abstract: VLAs have achieved strong results in robotic manipulation via large-scale pretraining and supervised fine-tuning (SFT). Two challenges remain: the high cost of collecting large-scale human-operated data for SFT, and limited generalization to tasks under distribution shift. Since RL has been shown to enhance step-by-step reasoning in Large Reasoning Models (LRMs), this work uses RL to improve long-horizon, step-by-step action planning in VLAs, proposing SimpleVLA-RL, which builds on veRL and introduces VLA-specific trajectory sampling, scalable parallelization, multi-environment rendering, and optimized loss computation. A novel phenomenon, "pushcut", is identified during RL training.
- Introduction: Previous works typically follow a two-stage recipe: large-scale pretraining on multimodal data, followed by SFT on high-quality human-operated data. This work uses RL to address the resulting data scarcity and poor generalization.
- Preliminaries: In LLMs, at step $t$, the state $s_t$ consists of the input prompt $x_\text{prompt}$ and the previously generated tokens $y_{<t}$. An action selects the next token $y_t$ from the vocabulary $\mathcal{V}$ according to the policy $\pi_\theta(y_t \mid s_t)$, a distribution over tokens. A reward from a rule-based function or a learned reward model is given after the whole response is generated. For VLAs, the state typically includes the visual observation, proprioceptive information, and the language instruction. Actions are control commands for the robot, such as end-effector poses or joint angles, produced by an action decoder. The reward comes from task completion plus optional progress rewards. Group Relative Policy Optimization (GRPO) is a PPO-style RL algorithm that replaces the value function with group-relative advantages: for each input, multiple responses are sampled and scored by the reward model or rule, and each response's advantage is its reward normalized against the group's mean (see the first sketch after this list).
- SimpleVLA-RL: Inspired by DeepSeek-R1, SimpleVLA-RL extends the rule-based online RL framework to VLAs. For each input, multiple trajectories are sampled and scored with a simple outcome reward (success / failure). From these rewards and the action-token probabilities, the GRPO loss is computed to update the model. Regarding trajectory generation: current VLAs produce actions via (1) an action-token distribution, (2) diffusion-based denoising in a latent space, or (3) deterministic regression with MLPs; the first is the most suitable for RL sampling, so this work adopts it. Unlike LLMs, where a rollout proceeds purely autoregressively, VLA rollouts require continuous interaction with the environment. As in DeepSeek-R1, the reward is only given at the end of the episode. Previous works have shown that encouraging exploration is important, and the same holds for VLA RL: VLA models tend to generate repetitive action patterns, leading to poor exploration, which is especially harmful on low-success-rate tasks. Three techniques mitigate this (see the second sketch after this list): (1) dynamic sampling during rollout (exclude groups whose rollouts all succeed or all fail, so that advantages are non-zero), (2) widening the clip range in the GRPO loss (a larger trust region), and (3) raising the sampling temperature during rollout.
- Discussion (Pushcut Phenomenon): All SFT demonstrations follow a grasp-move-place strategy. After RL training, however, the model discovers a new strategy, "pushcut", in which it pushes the object directly to the target location. This highlights a fundamental distinction between SFT and RL.
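
First sketch: a minimal illustration of the group-relative advantage computation that GRPO uses in place of a learned value function (this is my own sketch, not the paper's or veRL's code; the group size and reward values are made up for the example).

```python
# Minimal sketch of GRPO-style group-relative advantages.
# For one prompt/task, sample G responses, score each with a scalar reward,
# and normalize rewards within the group; no value network is needed.
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """rewards: array of shape (G,), one scalar reward per sampled response/trajectory."""
    rewards = np.asarray(rewards, dtype=np.float64)
    mean, std = rewards.mean(), rewards.std()
    # Every token (or action) of response i shares the advantage (r_i - mean) / std.
    return (rewards - mean) / (std + eps)

# Example: binary outcome rewards (success=1, failure=0) for G=4 sampled trajectories.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # approx. [ 1, -1, -1,  1]
```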
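Second sketch: the three exploration tricks from the SimpleVLA-RL bullet, layered on top of the group-relative advantages above. This is a hedged illustration rather than the released SimpleVLA-RL code; `keep_group`, `clipped_surrogate`, and the specific `clip_low`/`clip_high`/temperature values are illustrative assumptions.

```python
# Sketch of the exploration-oriented tricks (not the released SimpleVLA-RL / veRL code).
import numpy as np

def keep_group(rewards):
    # (1) Dynamic sampling: with binary 0/1 outcome rewards, drop groups whose rollouts
    # all succeed or all fail, since their group-relative advantages are identically zero.
    return 0.0 < np.mean(rewards) < 1.0

def clipped_surrogate(logp_new, logp_old, advantages, clip_low=0.2, clip_high=0.28):
    # (2) Widened / asymmetric clip range: a larger upper bound enlarges the trust region
    # for actions whose probability the update wants to increase (values are illustrative).
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_low, 1.0 + clip_high) * advantages
    return -np.mean(np.minimum(unclipped, clipped))

# (3) A higher sampling temperature during rollout flattens the action-token
# distribution and yields more diverse trajectories for the same instruction.

# Example with G = 4 rollouts for one task instruction (per-response log-probs for brevity;
# in practice these would be per-action-token):
rewards = np.array([1.0, 0.0, 1.0, 1.0])
if keep_group(rewards):
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    loss = clipped_surrogate(np.array([-1.0, -2.0, -1.2, -0.9]),
                             np.array([-1.1, -1.9, -1.3, -1.0]), adv)
```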