Source
@misc{Black_2024_pizero,
title={π0: A Vision-Language-Action Flow Model for General Robot Control},
author={Kevin Black and Noah Brown and Danny Driess and Adnan Esmail and Michael Equi and Chelsea Finn and Niccolo Fusai and Lachy Groom and Karol Hausman and Brian Ichter and Szymon Jakubczak and Tim Jones and Liyiming Ke and Sergey Levine and Adrian Li-Bell and Mohith Mothukuri and Suraj Nair and Karl Pertsch and Lucy Xiaoyang Shi and James Tanner and Quan Vuong and Anna Walling and Haohuan Wang and Ury Zhilinsky},
year={2024},
eprint={2410.24164},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2410.24164},
}
(Physical Intelligence) | arXiv
TL;DR
…

Flash Reading
- Abstract: Generalist robot policy learning through VLA models.
- Introduction: Versatility in robot control tasks, as an extension of LLMs and VLMs. The training of LLMs and VLMs relies on internet-scale data plus fine-tuning on curated datasets for desired behaviors. In NLP and CV, general-purpose foundation models tend to outperform specialized models, which may apply to robotics as well. There are several challenges: (i) conducting large-scale training, (ii) choosing the right model architecture, and (iii) designing the right training recipe. In this work, cross-embodiment training [1] is used, where data from many robot types is combined into the same model. The continuous action distribution is modeled using an action-chunking architecture [2] with flow matching (a variant of diffusion) [3], which enables the model to control the robot at up to 50 Hz. To combine flow matching with VLMs, a novel action expert is used.
- Related Work: Prior work includes RT-2, OpenVLA, TinyVLA, etc., which employ autoregressive discretization to represent actions; this work instead uses flow matching to produce actions. Outside of robot control, many models have combined LLMs with diffusion [4]. There are also many previous works on large-scale robot learning, from self-supervised or autonomous data collection to large-scale datasets for robot control; this work further explores more dexterous tasks by training on a much larger dataset [1], which also enables robots to execute very long-horizon tasks.
- Overview: The pre-training uses a mixture of a new dataset and the OXE dataset. For post-training, both efficient small-scale data and high-quality large-scale data are explored. The VLM backbone is PaliGemma [5]. The action output layer uses flow matching to generate continuous action distributions.
- The Model: As the standard recipe, for the VLM, image encoders embed the robot’s image observations into the same embedding space as language tokens. For robotics-specific IOs, conditional flow matching is used to model the continuous action space, which provides high precision and multimodal modelling capability. The architecture is inspired by Transfusion [6] (training a single transformer using multiple objectives with tokens of both continuous and discrete outputs). The design of using separate sets of weights for image/text inputs and robotics-specific IOs is similar to a mixture of experts. The target is to model the data distribution $p(\bm{A}_t|\bm{o}_t)$, where $\bm{A}_t$ is a sequence of actions and $\bm{o}_t$ is an observation consisting of multiple RGB images, a language command, and the robot’s proprioceptive state (such as joint angles). For training, at robot time step $t$ and flow step $\tau\in[0,1]$, noisy actions are sampled as $\bm{A}^\tau_t = \tau\bm{A}_t + (1-\tau)\bm{\epsilon}$ with $\bm{\epsilon}\sim\mathcal{N}(\bm{0},\bm{I})$, and the loss is \(L^\tau(\theta)=\mathbb{E}_{p(\bm{A}_t|\bm{o}_t),q(\bm{A}^\tau_t|\bm{A}_t)} \| \bm{v}_\theta(\bm{A}^\tau_t,\bm{o}_t) - \bm{u}(\bm{A}^\tau_t|\bm{A}_t) \|^2,\) where $\bm{u}(\bm{A}^\tau_t|\bm{A}_t)=\bm{\epsilon}-\bm{A}_t$ is the denoising vector field. During inference, forward Euler integration is used, \(\bm{A}_t^{\tau+\delta}=\bm{A}_t^\tau + \delta\, \bm{v}_\theta(\bm{A}_t^\tau,\bm{o}_t),\) where $\delta$ is the step size. The backbone VLM is PaliGemma (3B) and the action expert (300M parameters) is initialized from scratch.
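The training objective and Euler sampler above can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the MLP `v_theta`, the flattened observation vector, and all dimensions are placeholder assumptions (the real model is a 3B PaliGemma backbone plus a 300M action expert), and the target field here is written to point from noise to data so that the forward-Euler update moves toward the data (sign conventions for the field differ across write-ups).

```python
import torch

torch.manual_seed(0)

# Hypothetical stand-in for the VLM backbone + action expert: a small MLP
# velocity network over flattened action chunks. Dimensions are illustrative.
horizon, act_dim, obs_dim = 50, 7, 16
v_theta = torch.nn.Sequential(
    torch.nn.Linear(horizon * act_dim + obs_dim + 1, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, horizon * act_dim),
)

def flow_matching_loss(actions, obs):
    """Conditional flow-matching loss on a batch of action chunks.

    Uses the linear path A^tau = tau * A + (1 - tau) * eps, so the target
    field A - eps points from noise (tau = 0) toward data (tau = 1).
    """
    b = actions.shape[0]
    a = actions.reshape(b, -1)               # flatten the action chunk
    tau = torch.rand(b, 1)                   # random flow step in [0, 1]
    eps = torch.randn_like(a)                # Gaussian noise endpoint
    a_tau = tau * a + (1 - tau) * eps        # noisy actions
    pred = v_theta(torch.cat([a_tau, obs, tau], dim=-1))
    return ((pred - (a - eps)) ** 2).mean()

@torch.no_grad()
def sample_actions(obs, steps=10):
    """Forward Euler: A^{tau+delta} = A^tau + delta * v_theta(A^tau, o)."""
    b, delta = obs.shape[0], 1.0 / steps
    a = torch.randn(b, horizon * act_dim)    # start from pure noise at tau = 0
    for i in range(steps):
        tau = torch.full((b, 1), i * delta)
        a = a + delta * v_theta(torch.cat([a, obs, tau], dim=-1))
    return a.reshape(b, horizon, act_dim)
```

A trained `v_theta` would produce an action chunk in `steps` forward passes, which is what makes high-frequency (50 Hz) control feasible compared with token-by-token autoregressive decoding.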
References
- [1] Open X-Embodiment: Robotic Learning Datasets and RT-X Models, ICRA 2024. arXiv.
- [2] Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware, 2023. arXiv.
- [3] Flow Matching for Generative Modeling, ICLR 2023. arXiv.
- [4] High-Resolution Image Synthesis with Latent Diffusion Models, CVPR 2022. arXiv.
- [5] PaliGemma: A versatile 3B VLM for transfer, 2024. arXiv.
- [6] Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model, 2024. arXiv.
Extension
The first generation of VLA models, such as RT-2 and OpenVLA, reuses the VLM architecture as-is and casts control as a query-answering problem, typically by discretizing actions into tokens that the VLM emits autoregressively.
Because robot actions are continuous, the second generation, such as π0, instead models the action space with continuous distributions, for example via high-capacity multimodal generative models (such as diffusion models).
However, current VLA models simply imitate the training data without further optimization for task performance; reinforcement learning techniques could potentially address this and improve robustness.
Diffusion models
- Ref 1: Denoising Diffusion Probabilistic Models (DDPM), 2020, arXiv.
- Ref 2: Denoising Diffusion Implicit Models (DDIM), 2020, arXiv.
- Ref 3: Score-Based Generative Modeling through Stochastic Differential Equations (Denoising Score Matching), 2021, arXiv.
- Ref 4: Flow Matching for Generative Modeling, 2022, arXiv.
DDPM
The basic idea is to corrupt an image with Gaussian noise (pixel-wise) step by step, and learn to reverse this process to generate new images from pure noise. Given a clean image $\bm{x}_0$, subsequent images are corrupted by a Markov chain, namely the forward process, \(q(\bm{x}_{t+1} | \bm{x}_t) = \mathcal{N}(\bm{x}_t, \beta\bm{I}),\) where $\beta$ is a hyperparameter that controls the noise level. Repeating this process, at any step $t$, $q(\bm{x}_t|\bm{x}_0)=\mathcal{N}(\bm{x}_0, t\beta\bm{I})$, whose variance grows without bound, so the chain never converges to a standard normal. To ensure the process converges, the forward process is redefined, with $\beta_t\in(0,1)$, \(q(\bm{x}_{t+1} | \bm{x}_t) = \mathcal{N}(\sqrt{1-\beta_t}\cdot\bm{x}_t, \beta_t\bm{I}).\) Then, \(q(\bm{x}_t|\bm{x}_0)=\mathcal{N}(\sqrt{\bar{\alpha}_t}\bm{x}_0, (1-\bar{\alpha}_t)\bm{I}), \quad \bar{\alpha}_t = \prod_{i=1}^t(1-\beta_i).\) For the reverse process, we want to learn it through a neural network, $p_\theta(\bm{x}_{t-1} | \bm{x}_t)$ (conditioned on $t$), via likelihood maximization. For a total of $T$ steps, \(q(\bm{x}_{1:T}|\bm{x}_0) = \prod_{t=1}^T q(\bm{x}_t|\bm{x}_{t-1}), \quad p_\theta(\bm{x}_{0:T}) = p(\bm{x}_T)\prod_{t=1}^T p_\theta(\bm{x}_{t-1}|\bm{x}_t),\) where $p(\bm{x}_T)=\mathcal{N}(\bm{0},\bm{I})$ is a fixed prior. To maximize the likelihood via the negative log-likelihood (with the help of Jensen’s inequality), \(\begin{align*} \mathcal{L}(\theta) &= -\log p_\theta(\bm{x}_{0}) = -\log \int p_\theta(\bm{x}_{0:T})d\bm{x}_{1:T} \\ &= -\log \int q(\bm{x}_{1:T}|\bm{x}_{0}) \frac{p_\theta(\bm{x}_{0:T})}{q(\bm{x}_{1:T}|\bm{x}_{0})} d\bm{x}_{1:T} \\ &= -\log \mathbb{E}_{q(\bm{x}_{1:T}|\bm{x}_{0})} \left[ \frac{p_\theta(\bm{x}_{0:T})}{q(\bm{x}_{1:T}|\bm{x}_{0})} \right] \\ & \le -\mathbb{E}_{q(\bm{x}_{1:T}|\bm{x}_{0})} \left[ \log \frac{p_\theta(\bm{x}_{0:T})}{q(\bm{x}_{1:T}|\bm{x}_{0})} \right]. \end{align*}\)
This is the evidence lower bound (ELBO) for the model. Expanding it (and dropping the reconstruction and prior terms), \(\text{ELBO} \approx \mathbb{E}_{q(\bm{x}_{1:T}|\bm{x}_{0})} \left[ \sum_{t>1}D_{\text{KL}}(q(\bm{x}_{t-1}|\bm{x}_{t},\bm{x}_0)||p_\theta(\bm{x}_{t-1}|\bm{x}_{t})) \right],\) where the $q$ term is the true posterior given the ground-truth data $\bm{x}_0$. It can be proven that this true posterior is Gaussian, which means we can use a Gaussian as the estimated distribution $p_\theta$ (the reverse of a Gaussian Markov chain with small steps is also Gaussian). The neural network should learn to predict the mean and variance of this Gaussian distribution. To further simplify the training, we learn only the mean and fix the variance, i.e., $p_\theta(\bm{x}_{t-1}|\bm{x}_{t})=\mathcal{N}(\bm{\mu}_\theta,\sigma_t^2\bm{I})$, which gives the loss function, \(\mathcal{L}(\theta) = \mathbb{E}_{q(\bm{x}_{1:T}|\bm{x}_{0})} \left[ \sum_{t>1} \frac{1}{2\sigma_t^2} \| \tilde{\bm{\mu}}_{t} - \bm{\mu}_\theta(\bm{x}_t, t) \|^2 \right],\) where $\tilde{\bm{\mu}}_t$ is the mean of the true posterior. Its closed form is ($\alpha_t=1-\beta_t$) \(\tilde{\bm{\mu}}_{t} = \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1-\bar{\alpha}_{t}} \bm{x}_0 + \frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_{t}} \bm{x}_t.\) With the reparameterization trick, \(\bm{x}_t = \sqrt{\bar{\alpha}_t}\bm{x}_0 + \sqrt{1-\bar{\alpha}_t}\, \bm{\epsilon}, \quad \bm{\epsilon} \sim \mathcal{N}(\bm{0}, \bm{I}),\) we get \(\tilde{\bm{\mu}}_{t} = \frac{1}{\sqrt{\alpha_t}} \left( \bm{x}_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}} \bm{\epsilon} \right).\) Writing $\bm{\mu}_\theta$ the same way in terms of a noise prediction $\bm{\epsilon}_\theta$ gives \(\mathcal{L}(\theta) = \mathbb{E}_{q(\bm{x}_{1:T}|\bm{x}_{0})} \left[ \sum_{t>1} \frac{\beta_t^2}{2\sigma_t^2\alpha_t(1-\bar{\alpha}_t)} \| \bm{\epsilon} - \bm{\epsilon}_\theta(\bm{x}_t, t) \|^2 \right].\)
Example code for training (the noising uses the closed form $\bm{x}_t = \sqrt{\bar{\alpha}_t}\bm{x}_0 + \sqrt{1-\bar{\alpha}_t}\bm{\epsilon}$, and the network is trained with the ε-prediction objective above):

```python
import torch
import torch.nn.functional as F
import deepinv
from torchvision import datasets, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"

transform = transforms.Compose([
    transforms.Resize((32, 32)),
    transforms.ToTensor(),
])
train_loader = torch.utils.data.DataLoader(
    datasets.MNIST('./data', train=True, download=True, transform=transform),
    batch_size=64, shuffle=True,
)

model = deepinv.models.DiffUNet(
    in_channels=1, out_channels=1, pretrained=None
).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Linear beta schedule and the cumulative products for closed-form noising.
beta_start, beta_end, beta_steps = 1e-4, 0.02, 1000
betas = torch.linspace(beta_start, beta_end, beta_steps, device=device)
alphas = 1 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)
sqrt_alphas_cumprod = torch.sqrt(alphas_cumprod)
sqrt_one_minus_alphas_cumprod = torch.sqrt(1 - alphas_cumprod)

for epoch in range(100):
    model.train()
    for imgs, _ in train_loader:
        imgs = imgs.to(device)
        noise = torch.randn_like(imgs)
        # Sample a random timestep per image and noise it in one shot.
        t = torch.randint(0, beta_steps, (imgs.size(0),), device=device)
        imgs_noisy = (sqrt_alphas_cumprod[t, None, None, None] * imgs
                      + sqrt_one_minus_alphas_cumprod[t, None, None, None] * noise)
        optimizer.zero_grad()
        # The network predicts the added noise from the noisy image and t.
        loss = F.mse_loss(model(imgs_noisy, t, type_t='timestep'), noise)
        loss.backward()
        optimizer.step()
torch.save(model.state_dict(), 'model.pth')
```
Example code for inference (ancestral DDPM sampling, using the reverse-mean formula derived above):

```python
import torch
import deepinv

device = "cuda" if torch.cuda.is_available() else "cpu"
model = deepinv.models.DiffUNet(
    in_channels=1, out_channels=1, pretrained=None
).to(device)
model.load_state_dict(torch.load('model.pth', map_location=device))
model.eval()

# The schedule must match the one used for training.
beta_start, beta_end, beta_steps = 1e-4, 0.02, 1000
betas = torch.linspace(beta_start, beta_end, beta_steps, device=device)
alphas = 1 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)

n_samples = 32
with torch.inference_mode():  # sampling
    x = torch.randn(n_samples, 1, 32, 32, device=device)
    for step in reversed(range(beta_steps)):
        # All samples share the same timestep, so scalar schedule values suffice.
        t = torch.full((n_samples,), step, dtype=torch.long, device=device)
        pred_noise = model(x, t, type_t='timestep')
        beta = betas[step]
        alpha = alphas[step]
        alpha_cumprod = alphas_cumprod[step]
        # Add fresh noise at every step except the last (step = 0).
        noise = torch.randn_like(x) if step > 0 else torch.zeros_like(x)
        x = (1 / torch.sqrt(alpha)) * (
            x - (beta / torch.sqrt(1 - alpha_cumprod)) * pred_noise
        ) + torch.sqrt(beta) * noise
```
DDIM
In DDPM, the density function of the reverse process $p_\theta(\bm{x}_{t-1}|\bm{x}_{t})$ is defined explicitly. In DDIM, a mapping from $\bm{x}_t$ to $\bm{x}_{t-1}$ implicitly defines the distribution of the reverse process, \(\bm{x}_{t-1} = \sqrt{\bar{\alpha}_{t-1}} \hat{\bm{x}}_0 + \sqrt{1 - \bar{\alpha}_{t-1}} \bm{\epsilon}_\theta(\bm{x}_t,t),\) where $\hat{\bm{x}}_0 = \left(\bm{x}_t - \sqrt{1-\bar{\alpha}_t}\,\bm{\epsilon}_\theta(\bm{x}_t,t)\right)/\sqrt{\bar{\alpha}_t}$ is the clean image predicted by inverting the closed-form noising. With this technique, sampling is faster (steps can be skipped) and the reverse process can be made deterministic.
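The update above can be sketched as a single function (the helper name and arguments are illustrative; this is the deterministic $\eta=0$ variant, where consecutive calls may skip many DDPM timesteps):

```python
import torch

def ddim_step(x_t, eps_pred, abar_t, abar_prev):
    """One deterministic DDIM update (the eta = 0 case).

    x_t: current noisy sample; eps_pred: the model's noise prediction;
    abar_t, abar_prev: cumulative alpha-bar at the current and target
    timesteps (the target may be many DDPM steps earlier).
    """
    # Invert the closed-form noising to get the predicted clean image.
    x0_hat = (x_t - torch.sqrt(1 - abar_t) * eps_pred) / torch.sqrt(abar_t)
    # Re-noise deterministically to the earlier (less noisy) level.
    return torch.sqrt(abar_prev) * x0_hat + torch.sqrt(1 - abar_prev) * eps_pred
```

With a perfect noise prediction, this maps a sample exactly from one noise level to another along the same path, which is why step-skipping does not break the sampler.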
Flow Matching
Instead of adding noise to images, flow matching models the forward process as a continuous linear interpolation between the clean image and the noise. Given $t\in[0,1]$, \(\bm{x}_t = (1-t) \bm{x}_0 + t \bm{\epsilon}.\) If the noise is Gaussian, this corresponds to the forward process in a diffusion model. The velocity field along the path is \(\bm{v}(\bm{x}_t,t) = \frac{d\bm{x}_t}{dt} = \bm{\epsilon} - \bm{x}_0.\) We can train a neural network to predict the velocity field $\bm{v}_\theta(\bm{x}_t,t)$, and use it to guide the reverse process by integrating backward from noise at $t=1$. For $s<t$, \(\bm{x}_{s} = \bm{x}_{t} - \int_{s}^{t} \bm{v}_\theta(\bm{x}_{u},u) \, du.\)
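A self-contained toy run of this recipe, with 1-D "images" so that convergence is visible (the data distribution, network size, and step counts are arbitrary assumptions for illustration):

```python
import torch

torch.manual_seed(0)

# Toy data: scalars drawn from N(2, 0.1^2). The network learns the velocity
# field of the linear path x_t = (1 - t) * x0 + t * eps.
v_theta = torch.nn.Sequential(
    torch.nn.Linear(2, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1)
)
opt = torch.optim.Adam(v_theta.parameters(), lr=1e-3)

for _ in range(2000):
    x0 = 2.0 + 0.1 * torch.randn(256, 1)   # clean samples
    eps = torch.randn_like(x0)             # noise endpoint
    t = torch.rand(256, 1)
    xt = (1 - t) * x0 + t * eps            # point on the path
    target = eps - x0                      # velocity d(x_t)/dt
    loss = ((v_theta(torch.cat([xt, t], dim=-1)) - target) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Reverse the flow: start from noise at t = 1 and Euler-step x_s = x_t - delta*v.
steps = 50
with torch.no_grad():
    x = torch.randn(1000, 1)
    for i in reversed(range(steps)):
        t = torch.full((1000, 1), (i + 1) / steps)
        x = x - (1.0 / steps) * v_theta(torch.cat([x, t], dim=-1))
# x.mean() should now be close to 2.0, the mean of the data distribution.
```

The same pattern, with action chunks in place of scalars and an observation-conditioned network, is essentially what the π0 action expert does.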