π0 A Vision-Language-Action Flow Model for General Robot Control

Source

@misc{Black_2024_pizero,
    title={π0: A Vision-Language-Action Flow Model for General Robot Control},
    author={Kevin Black and Noah Brown and Danny Driess and Adnan Esmail and Michael Equi and Chelsea Finn and Niccolo Fusai and Lachy Groom and Karol Hausman and Brian Ichter and Szymon Jakubczak and Tim Jones and Liyiming Ke and Sergey Levine and Adrian Li-Bell and Mohith Mothukuri and Suraj Nair and Karl Pertsch and Lucy Xiaoyang Shi and James Tanner and Quan Vuong and Anna Walling and Haohuan Wang and Ury Zhilinsky},
    year={2024},
    eprint={2410.24164},
    archivePrefix={arXiv},
    primaryClass={cs.LG},
    url={https://arxiv.org/abs/2410.24164},
}

(Physical Intelligence) | arXiv

TL;DR

The first generation of VLA models (e.g., RT-2, OpenVLA) reuses the VLM architecture directly, casting control as a language-style prediction problem: continuous robot actions are discretized into tokens, which the model decodes autoregressively as action sequences.

Since robot actions are inherently continuous, the second generation, including π0, instead models the action space with continuous distributions, using expressive generative heads such as diffusion or flow-matching models.

However, current VLA models simply imitate their training data and are not further optimized for task performance. Reinforcement learning techniques could potentially address this and improve robustness.
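
As an illustration of the first-generation recipe, the sketch below shows RT-2-style action discretization; the bin count, normalization range, and helper names are illustrative assumptions, not any paper's exact choices.

import numpy as np

# Hypothetical discretization: map each continuous action dimension
# (normalized to [-1, 1]) onto 256 uniform bins, so an action chunk
# becomes a short token sequence that a VLM can decode autoregressively.
NUM_BINS = 256

def actions_to_tokens(actions: np.ndarray) -> np.ndarray:
    # actions: (horizon, dim) array with values in [-1, 1]
    clipped = np.clip(actions, -1.0, 1.0)
    return np.round((clipped + 1.0) / 2.0 * (NUM_BINS - 1)).astype(np.int64).flatten()

def tokens_to_actions(tokens: np.ndarray, dim: int) -> np.ndarray:
    # Inverse map: token ids back to continuous actions
    return tokens.reshape(-1, dim).astype(np.float64) / (NUM_BINS - 1) * 2.0 - 1.0

actions = np.random.uniform(-1, 1, size=(8, 7))  # 8 steps of a 7-DoF action
tokens = actions_to_tokens(actions)              # 56 discrete tokens
recovered = tokens_to_actions(tokens, dim=7)     # quantization error <= 1/255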

Diffusion models

DDPM

The basic idea is to corrupt an image with Gaussian noise (pixel-wise) step by step, and to learn to reverse this process so that new images can be generated from pure noise. Given a clean image $\bm{x}_0$, subsequent images are produced by a Markov chain, the forward process, \(q(\bm{x}_{t+1} | \bm{x}_t) = \mathcal{N}(\bm{x}_t, \beta),\) where $\beta$ is a hyperparameter that controls the noise level. Repeating this process gives, at any step $t$, $q(\bm{x}_t|\bm{x}_0)=\mathcal{N}(\bm{x}_0, t\beta)$, whose variance grows without bound, so the chain does not converge to a fixed noise distribution. To ensure convergence (to a standard normal), the forward process is redefined, with $\beta\in(0,1)$, \(q(\bm{x}_{t+1} | \bm{x}_t) = \mathcal{N}(\sqrt{1-\beta}\cdot\bm{x}_t, \beta).\) Then, with a (possibly step-dependent) schedule $\beta_t$, \(q(\bm{x}_t|\bm{x}_0)=\mathcal{N}(\sqrt{\bar{\alpha}_t}\bm{x}_0, 1-\bar{\alpha}_t), \quad \bar{\alpha}_t = \prod_{i=1}^t(1-\beta_i).\)

For the reverse process, we want to learn $p_\theta(\bm{x}_{t-1} | \bm{x}_t)$ (conditioned on $t$) with a neural network, via likelihood maximization. For a total of $T$ steps, \(q(\bm{x}_{1:T}|\bm{x}_0) = \prod_{t=1}^T q(\bm{x}_t|\bm{x}_{t-1}), \quad p_\theta(\bm{x}_{0:T}) = p(\bm{x}_T)\prod_{t=1}^T p_\theta(\bm{x}_{t-1}|\bm{x}_t),\) where $p(\bm{x}_T)=\mathcal{N}(\bm{0},\bm{I})$ is a fixed prior. Bounding the negative log-likelihood with the help of Jensen's inequality, \(\begin{align*} \mathcal{L}(\theta) &= -\log p_\theta(\bm{x}_{0}) = -\log \int p_\theta(\bm{x}_{0:T})d\bm{x}_{1:T} \\ &= -\log \int q(\bm{x}_{1:T}|\bm{x}_{0}) \frac{p_\theta(\bm{x}_{0:T})}{q(\bm{x}_{1:T}|\bm{x}_{0})} d\bm{x}_{1:T} \\ &= -\log \mathbb{E}_{q(\bm{x}_{1:T}|\bm{x}_{0})} \left[ \frac{p_\theta(\bm{x}_{0:T})}{q(\bm{x}_{1:T}|\bm{x}_{0})} \right] \\ & \le -\mathbb{E}_{q(\bm{x}_{1:T}|\bm{x}_{0})} \left[ \log \frac{p_\theta(\bm{x}_{0:T})}{q(\bm{x}_{1:T}|\bm{x}_{0})} \right]. \end{align*}\)

The quantity inside the final expectation (with the sign flipped) is the evidence lower bound (ELBO). The bound decomposes into a sum of KL divergences (plus reconstruction and prior terms, which are dropped here), \(\mathcal{L}(\theta) \approx \mathbb{E}_{q(\bm{x}_{1:T}|\bm{x}_{0})} \left[ \sum_{t>1}D_{\text{KL}}(q(\bm{x}_{t-1}|\bm{x}_{t},\bm{x}_0)||p_\theta(\bm{x}_{t-1}|\bm{x}_{t})) \right],\) where the $q$ term is the true posterior, available once we condition on the ground-truth data $\bm{x}_0$. It can be proven that this posterior is Gaussian, so a Gaussian is the natural choice for the estimated distribution $p_\theta$ (for small $\beta_t$, the reverse of a Gaussian Markov chain is approximately Gaussian as well). The neural network should learn to predict the mean and variance of this Gaussian.

To simplify training further, only the mean is learned and the variance is fixed, i.e., $p_\theta(\bm{x}_{t-1}|\bm{x}_{t})=\mathcal{N}(\bm{\mu}_\theta,\sigma_t^2)$, which gives the loss \(\mathcal{L}(\theta) = \mathbb{E}_{q(\bm{x}_{1:T}|\bm{x}_{0})} \left[ \sum_{t>1} \frac{1}{2\sigma_t^2} \| \tilde{\bm{\mu}}_{t} - \bm{\mu}_\theta(\bm{x}_t, t) \|^2 \right],\) where $\tilde{\bm{\mu}}_t$ is the mean of the true posterior. Its closed form is (with $\alpha_t=1-\beta_t$) \(\tilde{\bm{\mu}}_{t} = \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1-\bar{\alpha}_{t}} \bm{x}_0 + \frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_{t}} \bm{x}_t.\) With the reparameterization trick, \(\bm{x}_t = \sqrt{\bar{\alpha}_t}\bm{x}_0 + \sqrt{1-\bar{\alpha}_t}\, \bm{\epsilon}, \quad \bm{\epsilon} \sim \mathcal{N}(\bm{0}, \bm{I}),\) we get \(\tilde{\bm{\mu}}_{t} = \frac{1}{\sqrt{\alpha_t}} \left( \bm{x}_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}} \bm{\epsilon} \right).\) Writing $\bm{\mu}_\theta$ in the same form, with a network $\bm{\epsilon}_\theta$ that predicts the noise, gives \(\mathcal{L}(\theta) = \mathbb{E}_{q(\bm{x}_{1:T}|\bm{x}_{0})} \left[ \sum_{t>1} \frac{\beta_t^2}{2\sigma_t^2\alpha_t(1-\bar{\alpha}_t)} \| \bm{\epsilon} - \bm{\epsilon}_\theta(\bm{x}_t, t) \|^2 \right].\) In practice, the per-step weighting is dropped, leaving the simplified objective $\mathbb{E}\, \| \bm{\epsilon} - \bm{\epsilon}_\theta(\bm{x}_t, t) \|^2$, which is what the code below optimizes.

Example code for training:

import torch
import deepinv
from torchvision import datasets, transforms

# Scale MNIST to [-1, 1] so the data is roughly zero-mean, matching
# the standard normal that the forward process converges to.
transform = transforms.Compose([
    transforms.Resize((32, 32)),
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])
train_loader = torch.utils.data.DataLoader(
    datasets.MNIST('./data', train=True, download=True, transform=transform),
    batch_size=64, shuffle=True
)

model = deepinv.models.DiffUNet(
    in_channels=1, out_channels=1, pretrained=None
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
mse = torch.nn.MSELoss()

# Linear noise schedule and the cumulative products needed for the
# closed-form corruption q(x_t | x_0)
beta_start = 1e-4
beta_end = 0.02
beta_steps = 1000
betas = torch.linspace(beta_start, beta_end, beta_steps)
alphas = 1 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)
sqrt_alphas_cumprod = torch.sqrt(alphas_cumprod)
sqrt_one_minus_alphas_cumprod = torch.sqrt(1 - alphas_cumprod)

for epoch in range(100):
    model.train()
    for imgs, _ in train_loader:
        noise = torch.randn_like(imgs)
        # One random timestep per image in the batch
        t = torch.randint(0, beta_steps, (imgs.size(0),), device=imgs.device)

        # Corrupt in one shot: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps
        imgs_noisy = (sqrt_alphas_cumprod[t, None, None, None] * imgs
                      + sqrt_one_minus_alphas_cumprod[t, None, None, None] * noise)

        # Simplified DDPM objective: predict the injected noise
        optimizer.zero_grad()
        loss = mse(model(imgs_noisy, t, type_t='timestep'), noise)
        loss.backward()
        optimizer.step()

torch.save(model.state_dict(), 'model.pth')

Example code for inference:

import torch
import deepinv

model = deepinv.models.DiffUNet(
    in_channels=1, out_channels=1, pretrained=None
)
model.load_state_dict(torch.load('model.pth'))
model.eval()

# The noise schedule must match the one used during training
beta_start = 1e-4
beta_end = 0.02
beta_steps = 1000
betas = torch.linspace(beta_start, beta_end, beta_steps)
alphas = 1 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)

n_samples = 32
with torch.inference_mode():  # ancestral sampling from pure noise
    x = torch.randn(n_samples, 1, 32, 32)
    for t in reversed(range(beta_steps)):
        # The model expects one timestep per sample in the batch
        t_batch = torch.full((n_samples,), t, dtype=torch.long)
        pred_noise = model(x, t_batch, type_t='timestep')

        beta = betas[t]
        alpha = alphas[t]
        alpha_cumprod = alphas_cumprod[t]

        # Fresh noise at every step except the last (sigma_t^2 = beta_t)
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)

        # Posterior mean computed from the predicted noise, plus the noise term
        x = (1 / torch.sqrt(alpha)) * (x - (beta / torch.sqrt(1 - alpha_cumprod)) * pred_noise) \
            + torch.sqrt(beta) * noise

DDIM

In DDPM, the density function of the reverse process $p_\theta(\bm{x}_{t-1}|\bm{x}_{t})$ is defined explicitly. In DDIM, a mapping from $\bm{x}_t$ to $\bm{x}_{t-1}$ implicitly defines the distribution of the reverse process, \(\bm{x}_{t-1} = \sqrt{\bar{\alpha}_{t-1}} \hat{\bm{x}}_0 + \sqrt{1 - \bar{\alpha}_{t-1}} \bm{\epsilon}_\theta(\bm{x}_t,t),\) where \(\hat{\bm{x}}_0 = \frac{\bm{x}_t - \sqrt{1-\bar{\alpha}_t}\,\bm{\epsilon}_\theta(\bm{x}_t,t)}{\sqrt{\bar{\alpha}_t}}\) is the clean image predicted by the model. With this technique, sampling is faster (steps can be skipped) and the reverse process can be made deterministic.
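
A minimal DDIM sampling sketch, reusing the trained model and noise schedule from the DDPM example above (`model`, `alphas_cumprod`, and `beta_steps` are assumed in scope); keeping 50 of the 1000 steps is an arbitrary illustrative choice.

import torch

# Deterministic DDIM sampling (eta = 0) on a subsampled schedule
n_samples = 32
ddim_steps = 50  # keep only 50 of the 1000 training timesteps
timesteps = torch.linspace(0, beta_steps - 1, ddim_steps, dtype=torch.long)

with torch.inference_mode():
    x = torch.randn(n_samples, 1, 32, 32)
    for i in reversed(range(1, ddim_steps)):
        t, t_prev = timesteps[i].item(), timesteps[i - 1].item()
        t_batch = torch.full((n_samples,), t, dtype=torch.long)
        eps = model(x, t_batch, type_t='timestep')

        # Predicted clean image from the current noise estimate
        abar, abar_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
        x0_hat = (x - torch.sqrt(1 - abar) * eps) / torch.sqrt(abar)

        # Jump directly from t to t_prev along the deterministic path
        x = torch.sqrt(abar_prev) * x0_hat + torch.sqrt(1 - abar_prev) * eps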

Flow Matching

Instead of adding noise step by step, flow matching models the forward process as a continuous linear interpolation between the clean image and noise. Given $t\in[0,1]$, \(\bm{x}_t = (1-t) \bm{x}_0 + t \bm{\epsilon}.\) With Gaussian noise, this defines a Gaussian probability path analogous to the forward process of a diffusion model. The velocity field along the path is \(\bm{v}(\bm{x}_t,t) = \frac{d\bm{x}_t}{dt} = \bm{\epsilon} - \bm{x}_0.\) A neural network $\bm{v}_\theta(\bm{x}_t,t)$ is trained to regress this target, \(\mathcal{L}(\theta) = \mathbb{E}_{t,\bm{x}_0,\bm{\epsilon}} \| \bm{v}_\theta(\bm{x}_t,t) - (\bm{\epsilon} - \bm{x}_0) \|^2,\) and then used to guide the reverse process by integrating the ODE backward from noise ($t=1$) to data ($t=0$): for $s<t$, \(\bm{x}_{s} = \bm{x}_{t} - \int_{s}^{t} \bm{v}_\theta(\bm{x}_{u},u) \, du.\)
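
Since π0 generates action chunks with flow matching, here is a minimal self-contained sketch of flow-matching training and Euler sampling on toy vectors standing in for action chunks; the MLP, dimensions, and step counts are illustrative assumptions, not π0's architecture.

import torch
import torch.nn as nn

# Toy velocity network on flat vectors; the time t is appended
# as an extra input feature.
dim = 16
net = nn.Sequential(nn.Linear(dim + 1, 256), nn.SiLU(),
                    nn.Linear(256, 256), nn.SiLU(),
                    nn.Linear(256, dim))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

def v_theta(x, t):
    return net(torch.cat([x, t[:, None]], dim=-1))

# Training: regress the constant velocity (eps - x0) along the linear path
for step in range(1000):
    x0 = torch.rand(128, dim)       # toy "clean" samples
    eps = torch.randn_like(x0)
    t = torch.rand(x0.size(0))      # t ~ U[0, 1]
    xt = (1 - t[:, None]) * x0 + t[:, None] * eps
    loss = ((v_theta(xt, t) - (eps - x0)) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Sampling: integrate dx/dt = v_theta backward from t = 1 (noise) to t = 0
n_steps = 10
dt = 1.0 / n_steps
with torch.inference_mode():
    x = torch.randn(64, dim)
    for i in reversed(range(n_steps)):
        t = torch.full((x.size(0),), (i + 1) * dt)
        x = x - dt * v_theta(x, t)  # Euler step: x_{t-dt} = x_t - dt * v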