VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model

Source

@misc{Wang_2025_vlaadapter,
    title={{VLA}-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model},
    author={Yihao Wang and Pengxiang Ding and Lingxiao Li and Can Cui and Zirui Ge and Xinyang Tong and Wenxuan Song and Han Zhao and Wei Zhao and Pengxu Hou and Siteng Huang and Yifan Tang and Wenhui Wang and Ru Zhang and Jianyi Liu and Donglin Wang},
    year={2025},
    eprint={2509.09372},
    archivePrefix={arXiv},
    primaryClass={cs.RO},
    url={https://arxiv.org/abs/2509.09372},
}

(Beijing University of Posts and Telecommunications, Westlake University)

arXiv

TL;DR

… General concept

Flash Reading

Abstract: VLAs incur massive training costs. VLA-Adapter offers a new paradigm that reduces the reliance on large-scale VLMs and extensive pre-training. By identifying what is essential for bridging the perception and action spaces, the authors introduce a lightweight Policy module with Bridge Attention. The method achieves high performance with only 0.5B parameters and no robotic-data pre-training. It also enables training a powerful VLA model in just 8 hours on a single consumer-grade GPU.
Related Work: PERCEPTION TO ACTION SPACE: Early work (e.g., OpenVLA) directly maps visual observations to actions discretized as tokens. More recent work uses continuous action spaces and can be categorized by the type of perceptual feature passed to the policy. (1) VLM features, taken from the final layer or from intermediate layers (e.g., $\pi_0$). (2) Additional queries [1]. Rather than using raw features from the VLM, some works introduce additional learnable queries to bridge the VLM and the Policy.
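To make the "actions discretized as tokens" idea concrete, here is a minimal sketch of uniform binning. The action range of [-1, 1], the 256 bins, and the function names are assumptions chosen for illustration, not details taken from these notes or from any specific codebase.

```python
import numpy as np

def discretize_actions(actions, low=-1.0, high=1.0, n_bins=256):
    """Map continuous actions in [low, high] to integer bin ids in [0, n_bins - 1]."""
    clipped = np.clip(actions, low, high)
    # Scale to [0, 1], then to bin indices; the upper edge is folded into the last bin.
    scaled = (clipped - low) / (high - low)
    return np.minimum((scaled * n_bins).astype(np.int64), n_bins - 1)

def undiscretize_actions(tokens, low=-1.0, high=1.0, n_bins=256):
    """Map bin ids back to the centre of each bin."""
    width = (high - low) / n_bins
    return low + (tokens + 0.5) * width

# Hypothetical 5-dimensional action, already normalized to [-1, 1].
action = np.array([-1.0, -0.25, 0.0, 0.5, 1.0])
tokens = discretize_actions(action)
recovered = undiscretize_actions(tokens)
```

The round trip loses at most half a bin width per dimension, which is the quantization error that continuous-action policy heads avoid.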
Methodology: DINOv2 and CLIP are used to extract vision embeddings, and the language instruction is tokenized. The backbone is the Prismatic VLM built on different language models (Qwen2.5-0.5B, Llama2-7B), or OpenVLA (7B), which is pre-trained on robotics data. This work explores two questions: (1) Which layers in the VLM provide the most effective features for the policy head? (2) Are raw features or action queries the better conditioning signal? The results can be seen in the figure below. To make use of both raw features and action queries, Bridge Attention is designed.
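As a rough illustration of conditioning a policy on both sources, the sketch below lets an action latent cross-attend to raw VLM features and to action-query features, then adds both results back to the latent. It is not the paper's implementation: the single-head attention without learned projections, the scalar `gate` passed through `tanh`, the additive combination, and all shapes and names are assumptions made for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context):
    """Single-head scaled dot-product attention; learned projections are omitted."""
    d = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d)
    return softmax(scores) @ context

def bridge_attention_sketch(action_latent, raw_feats, query_feats, gate=0.0):
    """Combine two conditioning streams (hypothetical layout, for illustration).

    action_latent: (T, d) latent for T future action steps
    raw_feats:     (N_raw, d) raw features from one VLM layer
    query_feats:   (N_q, d) features read out at the action-query positions
    gate:          assumed scalar controlling how much raw-feature signal enters
    """
    raw_out = cross_attention(action_latent, raw_feats)
    query_out = cross_attention(action_latent, query_feats)
    # Residual update: tanh(gate) scales the raw-feature branch; gate=0 disables it.
    return action_latent + np.tanh(gate) * raw_out + query_out

rng = np.random.default_rng(0)
latent = rng.normal(size=(8, 16))     # 8 action steps, width 16
raw = rng.normal(size=(64, 16))       # 64 raw feature tokens
queries = rng.normal(size=(4, 16))    # 4 action-query tokens
out = bridge_attention_sketch(latent, raw, queries, gate=0.5)
out_no_raw = bridge_attention_sketch(latent, raw, queries, gate=0.0)
```

With `gate=0.0` the output depends only on the action-query branch, so the gate gives a simple way to compare the two conditioning signals in isolation.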

Feature extraction effect

References

[1] Fine-tuning vision-language-action models: Optimizing speed and success, 2025. arXiv.