Source
@inproceedings{chi_2023_diffusion,
  author    = {Cheng Chi and Siyuan Feng and Yilun Du and Zhenjia Xu and Eric Cousineau and Benjamin CM Burchfiel and Shuran Song},
  title     = {Diffusion Policy: Visuomotor Policy Learning via Action Diffusion},
  booktitle = {Proceedings of Robotics: Science and Systems (RSS)},
  year      = {2023},
  address   = {Daegu, Republic of Korea},
  month     = {July},
  doi       = {10.15607/RSS.2023.XIX.026}
}
or
@article{chi_2024_diffusion,
  title     = {Diffusion Policy: Visuomotor Policy Learning via Action Diffusion},
  author    = {Chi, Cheng and Xu, Zhenjia and Feng, Siyuan and Cousineau, Eric and Du, Yilun and Burchfiel, Benjamin and Tedrake, Russ and Song, Shuran},
  journal   = {The International Journal of Robotics Research (IJRR)},
  volume    = {44},
  number    = {10-11},
  pages     = {1684--1704},
  year      = {2024},
  publisher = {SAGE Publications}
}
(Columbia University) | arXiv
TL;DR
…

Flash Reading
- Abstract: Diffusion Policy is a new way of generating robot actions: it represents a robot's visuomotor policy as a conditional denoising diffusion process. The policy learns the gradient of the action-distribution score function and iteratively optimizes with respect to this gradient field at inference time via stochastic Langevin dynamics steps. It gracefully handles multi-modal action distributions, scales to high-dimensional action spaces, and trains stably. Key technical contributions include receding-horizon control, visual conditioning, and a time-series diffusion transformer.

- Introduction: Policy learning from demonstrations can be formulated as a supervised regression task mapping observations to actions. However, this formulation struggles with multi-modal action distributions and high-dimensional action spaces. Prior works use either explicit policies or implicit policies. This work proposes Diffusion Policy, which models the action distribution with a conditional denoising diffusion process [1]. By learning the gradient of the action score function and performing stochastic Langevin dynamics sampling on this gradient field, Diffusion Policy can express arbitrary normalizable distributions. Diffusion models have also shown scalability to high-dimensional outputs such as images and videos, and they are stable to train.
- Diffusion Policy: Diffusion models are normally used for image generation. For action generation, two modifications are made: (1) the output is changed to represent the action space, and (2) the denoising process is conditioned on the current observation. To encourage temporal consistency and smoothness in long-horizon tasks while still reacting to unexpected changes, the policy predicts an action sequence and executes it for a fixed duration before replanning. Let the current time step be 0. Given the observation sequence $O_0 = (o_{-T_o}, \ldots, o_0)$, the goal is to predict the action sequence $A_0 = (a_0, a_1, \ldots, a_{T_p-1})$, of which the first $T_a$ steps are executed without replanning. The denoising process follows $A^{k-1}_0 = \alpha \big( A^k_0 - \gamma\, \epsilon_\theta(A^k_0, O_0, k) + \mathcal{N}(0, \sigma^2 I) \big)$; a sampling sketch is given after this list.
- Key Design Choices: (1) Noise-prediction network: both CNN-based (works well on most tasks, but biased toward smooth, slowly changing actions) and transformer-based (minGPT-style) architectures are explored. (2) Visual encoder: a ResNet-18 without pretraining, trained end-to-end with the diffusion model. (3) Noise schedule: the squared-cosine ("Square Cosine") schedule; a sketch follows the list. (4) Inference: Denoising Diffusion Implicit Models (DDIM) [2] are used to reduce the number of denoising iterations for faster inference.
- Properties of Diffusion Policy: (1) Multi-modality. (2) Position control (better than velocity control). (3) Benefits of action sequence prediction. (4) Training stability.
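
A minimal PyTorch sketch of the conditional denoising loop referenced in the Diffusion Policy bullet above. It assumes a trained noise-prediction network `eps_model` and per-step coefficients `alpha`, `gamma`, `sigma` precomputed from the noise schedule; the function and argument names, shapes, and batching are illustrative assumptions, not the authors' implementation.

```python
import torch

def sample_action_sequence(eps_model, obs_seq, K, T_p, action_dim,
                           alpha, gamma, sigma):
    """Sample an action sequence A_0 conditioned on the observation sequence O_0 by
    iterating A^{k-1} = alpha_k * (A^k - gamma_k * eps_theta(A^k, O_0, k) + N(0, sigma_k^2 I))."""
    B = obs_seq.shape[0]
    # Start from pure Gaussian noise A^K.
    A = torch.randn(B, T_p, action_dim, device=obs_seq.device)
    for k in range(K, 0, -1):
        eps = eps_model(A, obs_seq, k)  # predicted noise eps_theta(A^k, O_0, k)
        noise = torch.randn_like(A) if k > 1 else torch.zeros_like(A)  # no noise on the final step
        # Langevin-style update matching the equation above.
        A = alpha[k] * (A - gamma[k] * eps + sigma[k] * noise)
    return A  # only the first T_a actions are executed before replanning
```

At deployment this function would be called once every $T_a$ environment steps, which is the receding-horizon behaviour described above.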
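
For design choice (3), a small sketch of the squared-cosine ("Square Cosine") noise schedule as introduced with improved DDPM; the offset `s = 0.008` and the `0.999` clipping are the defaults from that line of work and are assumed here rather than taken from the Diffusion Policy paper.

```python
import math

def squared_cosine_alpha_bar(K, s=0.008):
    """Cumulative signal-retention values alpha_bar(k), k = 0..K, where
    f(k) = cos(((k/K + s) / (1 + s)) * pi / 2) ** 2 and alpha_bar(k) = f(k) / f(0)."""
    f = lambda k: math.cos(((k / K + s) / (1 + s)) * math.pi / 2) ** 2
    return [f(k) / f(0) for k in range(K + 1)]

def betas_from_alpha_bar(alpha_bar, max_beta=0.999):
    """Per-step noise variances beta_k = 1 - alpha_bar(k) / alpha_bar(k-1), clipped near k = K."""
    return [min(1.0 - alpha_bar[k] / alpha_bar[k - 1], max_beta)
            for k in range(1, len(alpha_bar))]
```

Design choice (4), DDIM, reuses the same trained noise predictor but denoises over a shorter sub-sequence of $k$ at inference time, which is where the speed-up comes from.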
References
[1] Denoising Diffusion Probabilistic Models, NeurIPS 2020. arXiv.
[2] Denoising Diffusion Implicit Models, ICLR 2021. arXiv.