Source
@misc{nv_2025_gr00t,
  title={GR00T N1: An Open Foundation Model for Generalist Humanoid Robots},
  author={Nvidia},
  year={2025},
  eprint={2503.14734},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2503.14734},
}
(Nvidia) | arXiv
TL;DR
…

Flash Reading
- Abstract: General-purpose robots need a versatile body and an intelligent mind. GR00T N1 is a vision-language-action (VLA) model with a dual-system architecture: a VLM serves as System 2 and a diffusion transformer serves as System 1. The whole system is trained end-to-end.
- Introduction: The System 2 reasoning module is a pre-trained VLM running at 10 Hz on an Nvidia L40 GPU. The System 1 action module is a diffusion transformer trained with action flow-matching and running at 120 Hz. Both systems are transformer-based. The heterogeneity of robotic systems leads to data islands rather than a coherent, Internet-scale dataset. This is mitigated with a data pyramid comprising real-world robot data, synthetic data, and web-scale human video data. A co-training strategy is used to learn across the data pyramid in both pre- and post-training. For actionless data, a latent-action codebook is learned, and a trained inverse dynamics model (IDM) is used to infer pseudo-actions.
- GR00T N1 Foundation Model: System 2 uses the Nvidia Eagle-2 VLM [1]. The GR00T-N1-2B model has 2.2B parameters, 1.34B of which are in the VLM. Sampling a chunk of 16 actions takes 63.9 ms on an L40 GPU in bf16. Each embodiment has its own MLP state and action encoders, as well as its own action decoder. Eagle-2 is finetuned from a SmolLM2 LLM and a SigLIP-2 image encoder. Images are processed at a resolution of 224x224, followed by pixel shuffle [2] (see the pixel-shuffle sketch after this list). Using middle-layer features from the LLM is found to yield faster inference and higher downstream policy performance.
- GR00T N1 Foundation Model (DiT module): A variant of DiT [3] is used for action modeling: a transformer conditioned on the denoising step via adaptive layer normalization, consisting of alternating cross-attention and self-attention blocks. Given a ground-truth action chunk $A_t$, a flow-matching timestep $\tau\in[0,1]$, and sampled noise $\epsilon\sim\mathcal{N}(0,I)$, the noised action chunk $A^\tau_t$ is computed as \(A^\tau_t = \tau A_t + (1-\tau) \epsilon\). The DiT model prediction $V_\theta(\phi_t, A^\tau_t, q_t)$, where $\phi_t$ is the vision-language embedding and $q_t$ is the state embedding, aims to approximate the denoising velocity field: \(V_\theta(\phi_t, A^\tau_t, q_t) \approx \frac{dA^\tau_t}{d\tau} = \epsilon - A_t\). The loss function is \(\mathcal{L}_{\text{FM}}(\theta) = \mathbb{E}_{\tau} \left[ \| V_\theta(\phi_t, A^\tau_t, q_t) - (\epsilon - A_t) \|^2 \right]\). The timestep schedule is $p(\tau)=\text{Beta}((s-\tau)/s; 1.5, 1.5)$ with $s=0.999$, as in $\pi_0$. During inference, actions are generated with K-step denoising (K=4): first sample $A^0_t\sim\mathcal{N}(0,I)$, then apply forward Euler integration: \(A^{\tau_{k+1}}_t = A^{\tau_k}_t + V_\theta(\phi_t, A^{\tau_k}_t, q_t) / K\) (a code sketch of the loss and sampler follows this list).
- GR00T N1 Foundation Model (Training Data): Human egocentric videos and neural trajectories have no direct actions available for training. Instead, latent actions are generated by training a VQ-VAE model to extract them [4]. The encoder takes the current frame $x_t$ and a future frame $x_{t+H}$ of a video, with a fixed window size $H$, and outputs the latent action $z_t$. The decoder takes $x_t$ and $z_t$ and reconstructs $x_{t+H}$. After training, the encoder is used as an IDM to infer latent actions (see the VQ-VAE sketch after this list). Neural trajectories are collected from video generation models, which serve as world models in the robotics domain after finetuning on 88 hours of in-house teleoperation data; 827 hours of video data are generated this way. For synthetic data, DexMimicGen [5] is used to generate 780,000 trajectories.
- GR00T N1 Foundation Model (Training Profile): Training has two stages: pre-training and post-training. In pre-training, the model is trained via flow-matching on the data pyramid. In post-training, the model is finetuned separately for each embodiment. The VLM backbone is frozen during both stages (see the freezing sketch after this list).
- Pre-Training Datasets: The real-world data comprises an internal dataset, the Open X-Embodiment dataset, and the AgiBot-Alpha dataset [6]. The synthetic datasets contain simulated trajectories augmented from human demonstrations, plus neural trajectories. The human video datasets include Ego4D [7], Ego-Exo4D, etc.
- Evaluation: The model is evaluated on both simulated and real-world tasks and compared against two baselines: BC-Transformer [8] and Diffusion Policy [9]. The success rate over 100 trials is reported, taking the maximum over the last 5 checkpoints (see the sketch below).
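
The pixel-shuffle step in the vision path is essentially a space-to-depth token compression. A minimal sketch, assuming a (B, H, W, C) vision-token grid; the function name and dimensions are illustrative, not from the released code:

```python
import torch

def pixel_shuffle_tokens(x: torch.Tensor, r: int = 2) -> torch.Tensor:
    """(B, H, W, C) vision-token grid -> (B, H/r, W/r, C*r*r): 4x fewer tokens for r=2."""
    b, h, w, c = x.shape
    x = x.reshape(b, h // r, r, w // r, r, c)       # split each spatial dim into blocks
    x = x.permute(0, 1, 3, 2, 4, 5)                 # gather each r x r block together
    return x.reshape(b, h // r, w // r, c * r * r)  # fold the block into the channel dim

tokens = torch.randn(1, 16, 16, 1024)               # hypothetical patch grid for a 224x224 image
print(pixel_shuffle_tokens(tokens).shape)           # torch.Size([1, 8, 8, 4096])
```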
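
The flow-matching loss and the K-step Euler sampler from the DiT bullet, as a minimal sketch. Here `model` stands in for $V_\theta$ with an assumed signature `model(phi, noised, q, tau)` (the DiT conditions on $\tau$ via adaptive layer norm); shapes and names are illustrative assumptions, and the velocity sign follows the notes above:

```python
import torch

def flow_matching_loss(model, phi, q, actions):
    """actions: (B, chunk, action_dim) ground-truth chunk A_t."""
    b = actions.shape[0]
    eps = torch.randn_like(actions)                      # epsilon ~ N(0, I)
    s = 0.999
    u = torch.distributions.Beta(1.5, 1.5).sample((b,))  # u = (s - tau) / s
    tau = (s * (1.0 - u)).view(b, 1, 1)                  # pi0-style timestep schedule
    noised = tau * actions + (1.0 - tau) * eps           # A_t^tau
    target = eps - actions                               # velocity target, as written above
    pred = model(phi, noised, q, tau)                    # V_theta(phi_t, A_t^tau, q_t)
    return ((pred - target) ** 2).mean()

@torch.no_grad()
def sample_actions(model, phi, q, chunk, action_dim, K=4):
    """K-step forward-Euler integration starting from pure noise A_t^0."""
    a = torch.randn(phi.shape[0], chunk, action_dim)
    for k in range(K):
        tau = torch.full((phi.shape[0], 1, 1), k / K)
        a = a + model(phi, a, q, tau) / K                # A^{tau_{k+1}} = A^{tau_k} + V / K
    return a
```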
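
A minimal sketch of the LAPA-style [4] latent-action VQ-VAE, assuming frames are already encoded into feature vectors; the module names, sizes, and loss weights are illustrative assumptions, not the paper's:

```python
import torch
import torch.nn as nn

class LatentActionVQVAE(nn.Module):
    """Encoder (x_t, x_{t+H}) -> quantized z_t; decoder (x_t, z_t) -> x_{t+H}."""
    def __init__(self, feat_dim=512, codebook_size=256, z_dim=32):
        super().__init__()
        self.encoder = nn.Linear(2 * feat_dim, z_dim)    # reused as the IDM after training
        self.codebook = nn.Embedding(codebook_size, z_dim)
        self.decoder = nn.Linear(feat_dim + z_dim, feat_dim)

    def forward(self, x_t, x_tH):
        z_e = self.encoder(torch.cat([x_t, x_tH], dim=-1))
        idx = torch.cdist(z_e, self.codebook.weight).argmin(dim=-1)  # nearest code
        z_q = self.codebook(idx)
        z_st = z_e + (z_q - z_e).detach()                # straight-through gradient
        recon = self.decoder(torch.cat([x_t, z_st], dim=-1))
        loss = (recon - x_tH).pow(2).mean()                      # reconstruction
        loss = loss + (z_q - z_e.detach()).pow(2).mean()         # codebook update
        loss = loss + 0.25 * (z_e - z_q.detach()).pow(2).mean()  # commitment
        return loss, z_st
```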
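
A minimal sketch of the frozen-backbone setup used in both stages; the `vlm`/`dit` submodule names and the toy modules are hypothetical stand-ins:

```python
import torch
import torch.nn as nn

# Toy stand-in for the full model: `vlm` (frozen) + `dit` action head (trained).
model = nn.ModuleDict({"vlm": nn.Linear(8, 8), "dit": nn.Linear(8, 8)})

for p in model["vlm"].parameters():          # freeze the VLM backbone
    p.requires_grad_(False)

optimizer = torch.optim.AdamW(               # optimize only the trainable parameters
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)
```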
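
A minimal sketch of the reported evaluation metric, assuming a hypothetical `run_trials(ckpt)` that returns the number of successes over the trial budget:

```python
def best_of_last_checkpoints(checkpoints, run_trials, n_last=5, trials=100):
    """Max success rate over `trials` among the last `n_last` checkpoints."""
    return max(run_trials(ckpt) / trials for ckpt in checkpoints[-n_last:])
```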
References
- [1] Eagle-2: Building Post-Training Data Strategies from Scratch for Frontier Vision-Language Models. arXiv.
- [2] Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network, CVPR 2016. arXiv.
- [3] Scalable Diffusion Models with Transformers, ICCV 2023. arXiv.
- [4] Latent Action Pretraining from Videos, ICLR 2025. arXiv.
- [5] DexMimicGen: Automated Data Generation for Bimanual Dexterous Manipulation via Imitation Learning, ICRA 2025. arXiv.
- [6] AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems. arXiv.
- [7] Ego4D: Around the World in 3,000 Hours of Egocentric Video, CVPR 2022. arXiv.
- [8] What Matters in Learning from Offline Human Demonstrations for Robot Manipulation, CoRL 2021. arXiv.
- [9] Diffusion Policy: Visuomotor Policy Learning via Action Diffusion, IJRR 2024. arXiv.