Source
@misc{Black_2025_pi05,
title={π0.5: A Vision-Language-Action Model with Open-World Generalization},
author={Physical Intelligence and Kevin Black and Noah Brown and James Darpinian and Karan Dhabalia and Danny Driess and Adnan Esmail and Michael Equi and Chelsea Finn and Niccolo Fusai and Manuel Y. Galliker and Dibya Ghosh and Lachy Groom and Karol Hausman and Brian Ichter and Szymon Jakubczak and Tim Jones and Liyiming Ke and Devin LeBlanc and Sergey Levine and Adrian Li-Bell and Mohith Mothukuri and Suraj Nair and Karl Pertsch and Allen Z. Ren and Lucy Xiaoyang Shi and Laura Smith and Jost Tobias Springenberg and Kyle Stachowicz and James Tanner and Quan Vuong and Homer Walke and Anna Walling and Haohuan Wang and Lili Yu and Ury Zhilinsky},
year={2025},
eprint={2504.16054},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2504.16054},
}
(Physical Intelligence) | arXiv
TL;DR
…

Flash Reading
- Abstract: Building on $\pi_0$, $\pi_{0.5}$ uses co-training on heterogeneous tasks to enable broad, open-world generalization. Its training mixture includes data from multiple robots, semantic prediction tasks, web data, and more, and the model also leverages auxiliary signals such as object detections and semantic subtask predictions (see the mixture-sampling sketch after this list).
- Introduction: A robot system can learn from both raw sensor data and processed, higher-level information. The majority of training examples (97.6% during the first training stage) do not come from mobile manipulators performing household tasks. The model is first pre-trained on the heterogeneous mixture of training tasks, then fine-tuned on mobile-manipulation data containing both low-level action examples and high-level semantic subtask annotations. At runtime, the model first predicts the semantic subtask and then uses it to condition action generation (see the inference sketch below).
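
A minimal sketch of what a heterogeneous co-training mixture could look like. The source names and sampling weights below are illustrative assumptions, not the paper's actual datasets or proportions; the only number taken from the text is that roughly 97.6% of first-stage examples are not household mobile-manipulation data.

```python
import random

# Hypothetical first-stage co-training mixture. Names and weights are assumed
# for illustration; per the notes above, ~97.6% of examples come from sources
# other than mobile manipulators on household tasks.
MIXTURE = {
    "cross_embodiment_robot_data": 0.60,   # other robots / lab tasks (assumed)
    "web_vision_language_data":    0.25,   # captioning, VQA, detection (assumed)
    "semantic_prediction_data":    0.126,  # subtask labels, bounding boxes (assumed)
    "mobile_manipulator_data":     0.024,  # target household mobile-manipulation data
}

def sample_source(rng: random.Random) -> str:
    """Pick the data source for the next training example according to the mixture weights."""
    names = list(MIXTURE)
    weights = list(MIXTURE.values())
    return rng.choices(names, weights=weights, k=1)[0]

if __name__ == "__main__":
    rng = random.Random(0)
    counts = {name: 0 for name in MIXTURE}
    for _ in range(10_000):
        counts[sample_source(rng)] += 1
    print(counts)  # empirical counts roughly track the mixture weights
```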
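
A minimal sketch of the two-stage inference loop described in the introduction note: the model first decodes a semantic subtask in language, then generates actions conditioned on that subtask. The class and method names are placeholders, not the released π0.5 interface.

```python
from dataclasses import dataclass
from typing import List, Protocol

@dataclass
class Observation:
    images: List[bytes]     # encoded camera frames
    proprio: List[float]    # joint / base state
    prompt: str             # high-level instruction, e.g. "clean up the kitchen"

class HierarchicalPolicy(Protocol):
    """One model serves both levels: subtask prediction (text) and action generation."""

    def predict_subtask(self, obs: Observation) -> str:
        """High-level step: emit a short language subtask, e.g. 'pick up the sponge'."""
        ...

    def predict_actions(self, obs: Observation, subtask: str) -> List[List[float]]:
        """Low-level step: generate an action chunk conditioned on the subtask."""
        ...

def control_step(policy: HierarchicalPolicy, obs: Observation) -> List[List[float]]:
    # The same model is queried twice: first for the semantic subtask,
    # then for the actions that carry it out.
    subtask = policy.predict_subtask(obs)
    return policy.predict_actions(obs, subtask)
```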