Source
@misc{shukor_2025_smol,
title={SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics},
author={Mustafa Shukor and Dana Aubakirova and Francesco Capuano and Pepijn Kooijmans and Steven Palma and Adil Zouitine and Michel Aractingi and Caroline Pascal and Martino Russi and Andres Marafioti and Simon Alibert and Matthieu Cord and Thomas Wolf and Remi Cadene},
year={2025},
eprint={2506.01844},
archivePrefix={arXiv},
primaryClass={cs.LG},
}
(Sorbonne University, Hugging Face) | arXiv
TL;DR
…
Flash Reading
- Abstract: This work presents a small (0.24B, 0.45B, 2.25B), efficient, and community-driven VLA that drastically reduces both training and inference costs. It is designed to be trained on a single GPU and deployed on consumer-grade hardware. An asynchronous inference stack that decouples perception and action prediction from action execution is proposed, enabling high control rates with chunked action generation.
- Introduction: Robotic policies face challenges in generalizing across object types, positions, environments, and tasks [1]. VLA models are designed to incorporate abstract reasoning, world knowledge, and decision-making capabilities. This work focuses on open-source, lightweight, community-driven VLA models supporting asynchronous inference (decoupling perception and action prediction from action execution).
- Related Work: VLMs are typically constructed by combining a vision encoder with an LLM. VLAs must process natural language instructions, visual observations, and proprioceptive information to generate robot actions. Early approaches (Octo, RT-1) trained transformer-based policies from scratch, while later models (RT-2, OpenVLA, π0) build on pre-trained VLMs.
- SmolVLA: Similar to π0, SmolVLA combines a pretrained VLM (SmolVLM-2, built from SigLIP and SmolLM2) with an action expert trained via flow matching. It is pretrained with imitation learning on community-collected datasets, then evaluated in both real-world and simulated environments. Linear projection layers are used in several places to match dimensions (see the composition sketch below). At inference time, asynchronous execution is introduced to enable high control rates.
  - Layer skipping [2,3] is used to reduce inference time; in practice, keeping half of the VLM layers gives a good trade-off between speed and performance (sketch below).
  - The action expert is trained with a flow matching objective (sketch below).
  - Instead of using only self-attention (SA) or only cross-attention (CA), an interleaved design is used: each block contains either a CA or an SA layer. SA layers use a causal mask to prevent attention to future tokens (sketch below).
  - The high heterogeneity of robotics data creates “data islands”, and the amount of available robot data is limited; community datasets address both issues. This work uses a subset of 481 community datasets from Hugging Face, containing 22.9K episodes and 10.6M frames.
  - An off-the-shelf VLM (Qwen2.5-VL-3B-Instruct) is used to auto-generate concise task descriptions, i.e. short, action-oriented sentences summarizing the behaviors (prompt available in the paper). Camera views are standardized into top, wrist, and side perspectives.
  - Visuomotor policies output action chunks to an action queue, and the robot normally finishes the entire chunk before requesting a new one. Here, asynchronous execution is proposed: after a chunk of length $n$ is generated and $k<n$ steps have been executed (and the new observation differs from the previous one), a new chunk is requested and aggregated with the remaining $n-k$ actions of the previous chunk (sketch below).
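A minimal sketch of the generic composition described above (vision encoder → linear projector → language model), assuming PyTorch; all module names, dimensions, and interfaces are illustrative, not taken from the SmolVLA or SmolVLM-2 code.

```python
# Sketch: a VLM built from a vision encoder and an LLM, with a linear
# projection layer matching the vision feature dimension to the LLM width.
import torch
import torch.nn as nn

class TinyVLM(nn.Module):
    def __init__(self, vision_encoder: nn.Module, llm: nn.Module,
                 vision_dim: int, llm_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder              # e.g. a SigLIP-style ViT
        self.projector = nn.Linear(vision_dim, llm_dim)   # dimension matching
        self.llm = llm                                    # e.g. a SmolLM2-style decoder

    def forward(self, images: torch.Tensor, text_embeds: torch.Tensor):
        patch_feats = self.vision_encoder(images)         # (B, P, vision_dim)
        image_tokens = self.projector(patch_feats)        # (B, P, llm_dim)
        # Prepend projected image tokens to the text token embeddings; the LLM
        # here is assumed to accept input embeddings directly.
        inputs = torch.cat([image_tokens, text_embeds], dim=1)
        return self.llm(inputs)
```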
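A hedged sketch of layer skipping for faster inference: only the lower half of the VLM's transformer layers are run, and their hidden states feed the action expert. The attributes `embed`, `layers`, and `norm` are assumptions about how a decoder is organized, not the actual SmolVLM-2 API.

```python
# Sketch: truncate a transformer to its first half for feature extraction.
import torch
import torch.nn as nn

class TruncatedVLMEncoder(nn.Module):
    def __init__(self, vlm: nn.Module, keep_ratio: float = 0.5):
        super().__init__()
        self.embed = vlm.embed                       # token/patch embedding
        n_keep = int(len(vlm.layers) * keep_ratio)   # e.g. half of the layers
        self.layers = vlm.layers[:n_keep]            # upper layers are skipped
        self.norm = vlm.norm

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        h = self.embed(tokens)
        for layer in self.layers:        # only the kept (lower) layers run
            h = layer(h)
        return self.norm(h)              # features consumed by the action expert
```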
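A minimal flow matching training step for the action expert, under common conventions (linear interpolation between noise and the clean chunk, velocity target pointing from actions to noise); the exact time schedule in the paper may differ, and `action_expert` is a hypothetical module conditioned on VLM features.

```python
# Sketch: one flow matching loss computation for an action chunk.
import torch
import torch.nn.functional as F

def flow_matching_loss(action_expert, vlm_features, actions):
    """actions: (batch, chunk_len, action_dim) ground-truth action chunk."""
    batch = actions.shape[0]
    noise = torch.randn_like(actions)                     # eps ~ N(0, I)
    tau = torch.rand(batch, 1, 1, device=actions.device)  # time in [0, 1]
    # Linear interpolation between noise and the clean action chunk.
    x_tau = tau * noise + (1.0 - tau) * actions
    # Target velocity pointing from the clean actions toward the noise.
    target = noise - actions
    # The expert predicts the velocity given the noisy chunk, time, and context.
    pred = action_expert(x_tau, tau, vlm_features)
    return F.mse_loss(pred, target)
```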
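A sketch of the interleaved attention pattern: blocks alternate between cross-attention to the VLM features and causal self-attention over the action tokens. Widths, depth, and the exact alternation order are illustrative assumptions.

```python
# Sketch: alternating cross-attention (CA) and causal self-attention (SA) blocks.
import torch
import torch.nn as nn

class ActionExpertBlock(nn.Module):
    def __init__(self, dim: int, n_heads: int, use_cross_attn: bool):
        super().__init__()
        self.use_cross_attn = use_cross_attn
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        if self.use_cross_attn:
            # CA layer: queries are action tokens, keys/values are VLM features.
            attn_out, _ = self.attn(h, context, context)
        else:
            # SA layer: causal mask prevents attention to future action tokens.
            T = x.shape[1]
            causal = torch.triu(
                torch.full((T, T), float("-inf"), device=x.device), diagonal=1)
            attn_out, _ = self.attn(h, h, h, attn_mask=causal)
        x = x + attn_out
        return x + self.mlp(self.norm2(x))

# Alternate CA and SA blocks, e.g. CA, SA, CA, SA, ...
blocks = nn.ModuleList([
    ActionExpertBlock(dim=512, n_heads=8, use_cross_attn=(i % 2 == 0))
    for i in range(8)])
```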
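A sketch of the asynchronous execution loop: actions are consumed from a queue while a new chunk is requested in a background thread once only $n-k$ actions remain; the overlap is merged by simple averaging here, which is just one possible aggregation rule. `policy.predict_chunk`, `robot.get_observation`, and `robot.send_action` are hypothetical interfaces.

```python
# Sketch: decouple chunk prediction from action execution with a worker thread.
import threading
from collections import deque

class AsyncChunkExecutor:
    def __init__(self, policy, robot, chunk_len: int, trigger_remaining: int):
        self.policy, self.robot = policy, robot
        self.chunk_len = chunk_len                   # n
        self.trigger_remaining = trigger_remaining   # n - k
        self.queue = deque()                         # actions awaiting execution
        self.lock = threading.Lock()
        self.pending = None                          # in-flight inference thread

    def _request_chunk(self, obs):
        new_chunk = self.policy.predict_chunk(obs)   # length n
        with self.lock:
            remaining = list(self.queue)
            self.queue.clear()
            # Aggregate: average the overlap, then append the new chunk's tail.
            for i, new_action in enumerate(new_chunk):
                if i < len(remaining):
                    self.queue.append(0.5 * (remaining[i] + new_action))
                else:
                    self.queue.append(new_action)
        self.pending = None

    def step(self):
        obs = self.robot.get_observation()
        with self.lock:
            low = len(self.queue) <= self.trigger_remaining
        if low and self.pending is None:
            # Fire off inference without blocking the control loop.
            self.pending = threading.Thread(target=self._request_chunk, args=(obs,))
            self.pending.start()
        with self.lock:
            if self.queue:
                self.robot.send_action(self.queue.popleft())
```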