Source
@misc{ji_2025_robobrain,
title={RoboBrain: A Unified Brain Model for Robotic Manipulation from Abstract to Concrete},
author={Yuheng Ji and Huajie Tan and Jiayu Shi and Xiaoshuai Hao and Yuan Zhang and Hengyuan Zhang and Pengwei Wang and Mengdi Zhao and Yao Mu and Pengju An and Xinda Xue and Qinghang Su and Huaihai Lyu and Xiaolong Zheng and Jiaming Liu and Zhongyuan Wang and Shanghang Zhang},
year={2025},
eprint={2502.21257},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2502.21257},
}
(Peking University, Chinese Academy of Sciences) | arXiv
TL;DR
…

Flash Reading
- Abstract: The use of MLLMs in robotics is limited by the lack of (i) Planning capability (decomposing complex tasks into subtasks), (ii) Affordance perception (identifying interactive objects), and (iii) Trajectory prediction (foreseeing the complete manipulation trajectory). This work introduces a dataset, ShareRobot, and a model, RoboBrain, to address these issues.
- Introduction: The use of MLLMs in robotics for long-horizon tasks is limited. The model should decompose complex tasks into subtasks, identify interactive objects, and foresee the complete trajectory. RoboBrain is based on LLaVA [1]. Robotic training data are combined with general multimodal data, including long videos and high-resolution images, which gives the model historical frame memory and high-definition image input. The model is evaluated on multiple robotic benchmarks, including RoboVQA [2] and OpenEQA [3].
- ShareRobot Dataset: It labels affordances for specific tasks, marks the full trajectory for each task, and provides subtasks for long-horizon tasks. Unlike the original Open X-Embodiment dataset, each data point in ShareRobot has low-level planning instructions for individual frames. It contains 51,403 instances, 1,027,990 question-answer pairs, 102 scenes, 12 embodiments, and 107 types of atomic tasks (such as “move”, “reach”, etc.). The work argues that a high-quality dataset (high-resolution images, accurate annotations, successful episodes, long videos) matters more than sheer scale. Labeling is done with Gemini followed by manual correction, and five templates are designed for each of the ten question types in RoboVQA. Affordances are labeled as bounding boxes; trajectories are labeled as keyframes with end-effector positions (a hypothetical record sketch follows this list).
- RoboBrain Model: The training has two phases: Phase 1 focuses on general OneVision training, and Phase 2 focuses on robotic training. RoboBrain consists of a foundation model (LLaVA = ViT + Projector + LLM) for planning, an A-LoRA module for affordance perception, and a T-LoRA module for trajectory prediction. The LLaVA-style base uses SigLIP as the vision encoder, a 2-layer MLP as the projector, and Qwen2.5-7B-Instruct as the LLM (a minimal architecture sketch follows this list). The training process is detailed below.

- Experiment: ZeRO-3 distributed training is used on 8 A800 GPUs (a minimal DeepSpeed config sketch follows below).
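
To make the ShareRobot annotation structure concrete, here is a hypothetical record sketch; the field names (`task`, `affordance_bbox`, `trajectory`, ...) and all values are illustrative assumptions, not the dataset's actual schema:

```python
# Hypothetical ShareRobot-style record, for illustration only.
# Field names and values are guesses based on the description above, not the real schema.
share_robot_record = {
    "episode_id": "oxe_000123",                   # source episode (derived from Open X-Embodiment)
    "task": "put the red cup on the shelf",       # high-level instruction
    "subtasks": [                                 # low-level planning instructions per segment
        "reach the red cup",
        "grasp the red cup",
        "move toward the shelf",
        "place the cup on the shelf",
    ],
    "affordance_bbox": [412, 188, 506, 290],      # interactive region as [x1, y1, x2, y2] in pixels
    "trajectory": [[455, 240], [430, 215], [310, 160], [265, 150]],  # keyframe end-effector positions
    "qa_pairs": [                                 # question-answer pairs generated from templates
        {"question": "What should the robot do next?", "answer": "Grasp the red cup."},
    ],
}
```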
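For the RoboBrain model bullet, a minimal PyTorch sketch of how the pieces compose, using stand-in modules for SigLIP and Qwen2.5-7B-Instruct and assumed feature dimensions (in practice the real models are loaded from pretrained checkpoints):

```python
import torch
import torch.nn as nn

# Assumed dimensions for illustration (not taken from the paper's config):
VISION_DIM = 1152   # SigLIP feature width (assumption)
LLM_DIM = 3584      # Qwen2.5-7B hidden size (assumption)

class Projector(nn.Module):
    """2-layer MLP that maps vision features into the LLM embedding space."""
    def __init__(self, in_dim: int = VISION_DIM, out_dim: int = LLM_DIM):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, out_dim), nn.GELU(), nn.Linear(out_dim, out_dim)
        )

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        return self.mlp(vision_feats)

class RoboBrainSketch(nn.Module):
    """Foundation model (vision encoder + projector + LLM) for planning; A-LoRA and
    T-LoRA adapters are attached to the same base for affordance perception and
    trajectory prediction (see the LoRA sketch in the Extension section)."""
    def __init__(self, vision_encoder: nn.Module, llm: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder     # stand-in for SigLIP
        self.projector = Projector()
        self.llm = llm                           # stand-in for Qwen2.5-7B-Instruct
        self.adapters = nn.ModuleDict()          # e.g. {"affordance": A-LoRA, "trajectory": T-LoRA}

    def forward(self, images: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        vision_tokens = self.projector(self.vision_encoder(images))   # (B, N, LLM_DIM)
        return self.llm(torch.cat([vision_tokens, text_embeds], dim=1))
```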
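For the experiment setup, a minimal DeepSpeed ZeRO-3 configuration sketch; the batch-size and precision values are placeholders, not the paper's reported settings:

```python
# Minimal DeepSpeed ZeRO-3 config sketch; values are placeholders, not the paper's settings.
ds_config = {
    "train_batch_size": 128,                  # 8 GPUs x micro-batch 2 x grad-accum 8
    "train_micro_batch_size_per_gpu": 2,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                           # partition optimizer states, gradients, and parameters
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
}
# Typically passed to deepspeed.initialize(model=model, config=ds_config, ...).
```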
References
- [1] Visual Instruction Tuning. NeurIPS 2023. arXiv.
- [2] RoboVQA: Multimodal Long-Horizon Reasoning for Robotics. ICRA 2024. arXiv.
- [3] OpenEQA: Embodied Question Answering in the Era of Foundation Models. CVPR 2024. IEEE.
Extension
- [Ref] LoRA: Low-Rank Adaptation of Large Language Models. arXiv.
The starting point: instead of updating the full weight matrix $W$, we freeze $W$ and inject trainable rank-decomposition matrices $A$ and $B$ such that the weight update is $\Delta W = BA$ (the rank $r$ of $A$ and $B$ can be chosen). The update is then added to the original weight matrix: $W' = W + \Delta W$. This reduces the number of trainable parameters and allows efficient fine-tuning of large models. The underlying assumption is that the update $\Delta W$ lies in a low-rank subspace, which is often sufficient for adapting large models to new tasks.
Another hyperparameter is the scaling factor $\alpha$: the update is scaled as $W' = W + \frac{\alpha}{r} \Delta W$, which controls the magnitude of the update and can improve training stability. Dropout can also be applied to the input of the LoRA branch to prevent overfitting. A minimal PyTorch sketch follows.
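
A minimal PyTorch sketch of a LoRA-wrapped linear layer following the conventions above (rank $r$, scaling $\alpha / r$, dropout on the LoRA-branch input); this is a generic illustration, not RoboBrain's A-LoRA / T-LoRA implementation:

```python
import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = W x + (alpha / r) * B A dropout(x), with the base weight W frozen."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0, p_drop: float = 0.05):
        super().__init__()
        self.base = base
        for param in self.base.parameters():
            param.requires_grad_(False)                                 # freeze W (and bias)
        self.scaling = alpha / r
        self.lora_A = nn.Parameter(torch.empty(r, base.in_features))    # A: r x d_in
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))   # B: d_out x r, zero init so ΔW = 0 at start
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        self.dropout = nn.Dropout(p_drop)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        delta = self.dropout(x) @ self.lora_A.T @ self.lora_B.T         # applies ΔW = BA to x
        return self.base(x) + self.scaling * delta

# Usage: wrap an existing projection and train only the low-rank parameters.
layer = LoRALinear(nn.Linear(512, 512), r=8, alpha=16.0)
out = layer(torch.randn(2, 512))
trainable = [name for name, p in layer.named_parameters() if p.requires_grad]  # ['lora_A', 'lora_B']
```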