Source
@inproceedings{zhang_2023_act,
AUTHOR = {Tony Z. Zhao AND Vikash Kumar AND Sergey Levine AND Chelsea Finn},
TITLE = {{Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware}},
BOOKTITLE = {Proceedings of Robotics: Science and Systems},
YEAR = {2023},
ADDRESS = {Daegu, Republic of Korea},
MONTH = {July},
DOI = {10.15607/RSS.2023.XIX.016}
}
(Stanford, Meta) | arXiv
TL;DR
…

Flash Reading
- Abstract: A low-cost system performs end-to-end imitation learning for bimanual manipulation via Action Chunking with Transformers (ACT).
- Introduction: Fine manipulation tasks require precise, closed-loop feedback and a high degree of hand-eye coordination to adjust and re-plan in response to changes in the environment. To compensate for the limited precision and accuracy of low-cost hardware, learning is used: an end-to-end policy maps RGB camera observations directly to actions (pixel-to-action), learning the manipulation skill instead of relying on modeling the physics of the system. To obtain high-quality demonstrations, a low-cost dexterous teleoperation system is designed. Specifically, the ACT policy predicts the target joint positions for the next $k$ time steps. To improve smoothness, temporal ensembling computes a weighted average of the overlapping predictions. ACT is trained as a conditional variational autoencoder (CVAE).
- Related Work: For behavior cloning, a major problem is compounding errors, where small errors accumulate over time and lead to failure. This can be mitigated with additional on-policy interactions and corrections (DAgger) [1], or by injecting noise into the demonstrations to obtain data with corrective behaviors [2]. In this work, it is proposed to predict action chunks instead of single actions to reduce compounding errors.
- ACT: To generate demonstration data for a new task, the joint positions of the leader arm are recorded and used as actions. The observation includes the current joint positions of the follower arm and the RGB images from four cameras. The idea of action chunks comes from [3], where sequences of actions are grouped and executed as a unit. For training, given the image-free observation $\bar{o}_t$, action $a_t$, and observation with images $o_t$,
- Initialize the encoder $q_\phi(z|a_{t:t+k}, \bar{o}_t)$ and decoder $\pi_\theta(\hat{a}_{t:t+k}|z, o_t)$ networks.
- Sample a chunk of actions with the corresponding observation and define the loss function (reconstruction + regularization) as $\mathcal{L} = \|\hat{a}_{t:t+k} - a_{t:t+k}\|_1 + \beta\, D_{\mathrm{KL}}\big(q_\phi(z|a_{t:t+k}, \bar{o}_t)\,\|\,\mathcal{N}(0, I)\big)$.
- For inference, at each time step $t$, the action chunk $\hat{a}_{t:t+k}$ is generated in a single forward pass with $z=0$ (the mean of the prior, i.e., the average behavior in the demonstration data). The final action $a_t$ is obtained by temporal ensembling of the overlapping predictions.
- ACT via CVAE: To capture both general behavior and high-precision regions, a generative model, the CVAE, is used. The style variable $z$ is modeled as a diagonal Gaussian. To make training faster, the image observation is not fed to the encoder. The encoder is implemented as a BERT-like transformer; its inputs are the current joint positions and the target action sequence of length $k$, prepended with a "[CLS]" token whose output is used to predict the mean and variance of $z$. The decoder consists of a ResNet image encoder, a transformer encoder, and a transformer decoder. To preserve spatial information, a 2D sinusoidal position embedding is used [4].
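The CVAE training objective described above can be sketched numerically. The following is a minimal NumPy sketch (not the paper's implementation), assuming an L1 reconstruction term over the action chunk and a diagonal-Gaussian posterior parameterized by `mu` and `logvar`:

```python
import numpy as np

def act_loss(a_hat, a, mu, logvar, beta=10.0):
    """Sketch of the ACT objective: reconstruction + beta * KL regularization.

    a_hat, a : (k, action_dim) predicted and demonstrated action chunks
    mu, logvar : parameters of the diagonal Gaussian q(z | a, o_bar)
    """
    # L1 reconstruction over the k-step action chunk
    recon = np.mean(np.abs(a_hat - a))
    # Closed-form KL(q(z|a, o_bar) || N(0, I)) for a diagonal Gaussian
    kl = -0.5 * np.mean(np.sum(1 + logvar - mu**2 - np.exp(logvar), axis=-1))
    return recon + beta * kl
```

When the posterior matches the standard normal prior (`mu = 0`, `logvar = 0`), the KL term vanishes and only the reconstruction error remains.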
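Temporal ensembling at inference time can be illustrated with a small sketch (a hypothetical helper, not the paper's code): at step $t$, all previously predicted chunks whose horizon covers $t$ are averaged with exponential weights $w_i = \exp(-m\,i)$, where $w_0$ belongs to the oldest prediction, so older predictions weigh slightly more.

```python
import numpy as np

def temporal_ensemble(chunk_buffer, t, m=0.01):
    """Weighted average of all predicted actions for time step t.

    chunk_buffer maps the step s at which a chunk was predicted to that
    chunk of k actions a_hat[s:s+k]. Weights w_i = exp(-m * i), with
    i = 0 for the oldest covering prediction.
    """
    # gather every chunk whose horizon covers step t, oldest first
    steps = sorted(s for s in chunk_buffer if s <= t < s + len(chunk_buffer[s]))
    acts = np.stack([chunk_buffer[s][t - s] for s in steps])
    w = np.exp(-m * np.arange(len(steps)))
    w /= w.sum()
    return (w[:, None] * acts).sum(axis=0)
```

With `m = 0` all overlapping predictions are weighted equally; larger `m` shifts weight toward the oldest chunk, smoothing out late re-plans.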
References
- [1] A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning, AISTATS 2011. arXiv.
- [2] DART: Noise Injection for Robust Imitation Learning, CoRL 2017. arXiv.
- [3] Action chunking as policy compression, PsyArXiv, 2022. link.
- [4] End-to-end object detection with transformers, ECCV 2020. arXiv.