Source
@misc{kim_2024_openvla,
title={{OpenVLA}: An Open-Source Vision-Language-Action Model},
author={Moo Jin Kim and Karl Pertsch and Siddharth Karamcheti and Ted Xiao and Ashwin Balakrishna and Suraj Nair and Rafael Rafailov and Ethan Foster and Grace Lam and Pannag Sanketi and Quan Vuong and Thomas Kollar and Benjamin Burchfiel and Russ Tedrake and Dorsa Sadigh and Sergey Levine and Percy Liang and Chelsea Finn},
year={2024},
eprint={2406.09246},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2406.09246},
}
Affiliations: Stanford, UC Berkeley, Toyota Research Institute · Venue: arXiv
TL;DR
…
Flash Reading
- Abstract: OpenVLA is an open VLA with an efficient method of fine-tuning for new tasks. It is a 7B-parameter model that builds on Llama 2 combined with a visual encoder fusing pretrained features from DINOv2 and SigLIP.
- Introduction: Learned policies are typically not generalizable beyond their training data, whereas foundation vision and language models such as CLIP, SigLIP, and Llama 2 generalize well thanks to their large-scale pretraining. VLA models, such as RT-2 [1], use this ability of VLMs to generate control actions. The problems are that (i) most such models are closed, and (ii) there are no established practices for deploying and adapting VLAs to new robots, environments, and tasks. OpenVLA consists of a pretrained visually-conditioned language backbone fine-tuned on 970k robot manipulation trajectories from the Open X-Embodiment dataset [2]. Efficient fine-tuning strategies for VLAs are also explored: the model can be fine-tuned via low-rank adaptation (LoRA) [4] and with model quantization [5]. Additionally, compared with from-scratch imitation learning using diffusion policies [3], fine-tuned OpenVLA grounds language to behavior better in multi-task settings with multiple objects.

- Related Work: (1) VLMs are trained on Internet-scale data and generate natural language from input images and language prompts. A key factor in their development is bridging pretrained vision encoders with pretrained language models. Early work explored cross-attention between vision and language features; newer work uses a "patch-as-token" approach, in which patch features from pretrained vision transformers are projected into the language embedding space and treated as tokens. This work uses the Prismatic VLMs [6] for multi-resolution visual features. (2) Generalist robot policies come from training on large, diverse, multi-task datasets, e.g., Octo [7]. Prior works typically compose pretrained models with additional components initialized from scratch; OpenVLA instead adopts an end-to-end approach, fine-tuning a VLM to generate robot actions by treating actions as tokens in the language model vocabulary. (3) VLA models refer to using VLMs directly for robot action generation.
- The OpenVLA Model:
- VLMs: Consist of three parts: a visual encoder (image to patch embeddings), a projector (patch embeddings into the LLM's input embedding space), and an LLM backbone.
- Training: Fine-tune a pretrained Prismatic-7B VLM backbone. Continuous robot actions are mapped to the discrete tokens used by the LLM: each dimension of the action vector is discretized into 256 bins whose range is set by the 1st and 99th quantiles of that dimension in the training data (rather than the min and max, for robustness to outliers); see the sketch after this list.
- Data curation: Based on the Open X-Embodiment dataset [2]. To ensure a coherent input-output space, training data is restricted to manipulation tasks with at least one third-person camera and single-arm end-effector control. To increase diversity, the data mixture weights follow Octo [7].
- Model design: Initial experiments for the design decisions were run on the BridgeData V2 dataset [8], covering the choice of VLM backbone, image resolution (224x224 here), whether to fine-tune the vision encoder (enabled here), number of training epochs (27 here), etc.
- Experiments: Focus on out-of-the-box performance and generalization to unseen (OOD) tasks and environments.
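A minimal sketch of the quantile-binned action tokenization described above, assuming 7-dimensional end-effector actions; function and variable names are illustrative and not taken from the OpenVLA codebase.

```python
import numpy as np

def compute_bin_edges(actions: np.ndarray, n_bins: int = 256) -> np.ndarray:
    """Per-dimension bin edges spanning the 1st..99th quantile of the training actions."""
    lo = np.quantile(actions, 0.01, axis=0)   # robust lower bound per dimension
    hi = np.quantile(actions, 0.99, axis=0)   # robust upper bound per dimension
    # n_bins uniform bins between lo and hi for each action dimension
    return np.linspace(lo, hi, n_bins + 1, axis=-1)  # shape: (dim, n_bins + 1)

def discretize(action: np.ndarray, edges: np.ndarray) -> np.ndarray:
    """Map a continuous action vector to integer bin indices in [0, n_bins - 1]."""
    ids = np.array([np.digitize(a, e) - 1 for a, e in zip(action, edges)])
    return np.clip(ids, 0, edges.shape[-1] - 2)  # clamp outliers into the valid bins

def undiscretize(ids: np.ndarray, edges: np.ndarray) -> np.ndarray:
    """Recover a continuous action as the center of each selected bin."""
    centers = (edges[:, :-1] + edges[:, 1:]) / 2
    return centers[np.arange(len(ids)), ids]

# Toy usage: random 7-DoF training actions, one action round-tripped through the bins.
train_actions = np.random.randn(10_000, 7)
edges = compute_bin_edges(train_actions)
token_ids = discretize(train_actions[0], edges)   # these indices become LLM action tokens
recovered = undiscretize(token_ids, edges)        # decoded back for robot control
```

The 1st/99th-quantile range means a handful of extreme outlier actions cannot stretch the bins and destroy resolution for the bulk of the data; out-of-range actions are simply clamped to the first or last bin.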
References
- [1] RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control, CoRL 2023. arXiv.
- [2] Open X-Embodiment: Robotic Learning Datasets and RT-X Models, ICRA 2024. arXiv.
- [3] Diffusion Policy: Visuomotor Policy Learning via Action Diffusion, RSS 2023. arXiv.
- [4] LoRA: Low-Rank Adaptation of Large Language Models, ICLR 2022. arXiv.
- [5] QLoRA: Efficient Finetuning of Quantized LLMs, NeurIPS 2023. arXiv.
- [6] Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models, ICML 2024. arXiv.
- [7] Octo: An Open-Source Generalist Robot Policy, RSS 2024. arXiv.
- [8] BridgeData V2: A Dataset for Robot Learning at Scale, CoRL 2023. arXiv.
Extra:
- Towards Embodied Agentic AI: Review and Classification of LLM- and VLM-Driven Robot Autonomy and Interaction. arXiv.
Extension
- CLIP, from OpenAI, is a model that jointly learns image and text embeddings in a shared latent space, enabling zero-shot image classification, retrieval, and other vision–language tasks without task-specific training.
- DINOv2, from Meta AI, is a family of self-supervised Vision Transformer (ViT) backbones trained at scale to produce general-purpose visual features that transfer well to both image-level and dense tasks (classification, retrieval, segmentation, depth/normal estimation) without task-specific finetuning.
- SigLIP, from Google/DeepMind, is a CLIP-style image–text model that replaces the usual softmax-normalized contrastive loss with a pairwise sigmoid loss, removing the need for a global softmax normalization over all in-batch pairs. This improves efficiency, works better with small batches, and scales cleanly (a minimal loss sketch follows the comparison table below).
| Feature | CLIP (2021) | SigLIP (2023) | DINOv2 (2023) |
|---|---|---|---|
| Target | Learn joint image–text embeddings for zero-shot transfer | Improve CLIP’s contrastive training with sigmoid loss for better scaling & efficiency | Learn universal visual-only features that transfer across tasks |
| Supervision Type | Weakly supervised (image–text pairs) | Weakly supervised (image–text pairs) | Self-supervised (no paired text) |
| Training Data | 400M noisy web image–text pairs | Large-scale web image–text pairs (billions, proprietary) | Curated LVD-142M (142M high-quality images) |
| Architecture | Dual encoder: ViT/ResNet + Transformer text encoder | Same as CLIP | ViT only |
| Loss Function | Symmetric InfoNCE (softmax over in-batch pairs) | Pairwise sigmoid loss (binary classification on matching vs non-matching pairs) | Self-distillation with teacher–student ViTs |
| Negatives Handling | In-batch negatives only (batch-size sensitive) | No explicit in-batch negatives; works well with small or large batches | Not applicable (no contrastive text–image matching) |
| Main Strengths | Zero-shot classification and retrieval on many datasets | Better small-batch performance, easier scaling, CLIP-like capabilities | Strong image-only features; great for detection, segmentation, etc. |
| Limitations | Needs large batches for strong performance; can underperform under domain shift | Same text–image domain dependency as CLIP; requires large-scale image–text pairs | No multimodal (text) capability, so no zero-shot classification from text prompts; transfer limited to visual tasks |
| Best Use Cases | Zero-shot classification, retrieval, multimodal applications | Same as CLIP | Pure vision tasks needing strong general-purpose image features |
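To make the Loss Function row concrete, here is a minimal PyTorch-style sketch contrasting CLIP's symmetric InfoNCE with SigLIP's pairwise sigmoid loss, assuming already L2-normalized image and text embeddings of matching pairs; the temperature and bias values are placeholders, not the published hyperparameters.

```python
import torch
import torch.nn.functional as F

def clip_infonce_loss(img_emb, txt_emb, temperature=0.07):
    """CLIP: softmax cross-entropy over all in-batch pairs (batch-size sensitive)."""
    logits = img_emb @ txt_emb.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(img_emb.size(0))               # i-th image matches i-th text
    loss_i2t = F.cross_entropy(logits, targets)           # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)       # text -> image direction
    return (loss_i2t + loss_t2i) / 2

def siglip_sigmoid_loss(img_emb, txt_emb, temperature=10.0, bias=-10.0):
    """SigLIP: every (image, text) pair is an independent binary classification,
    so no global softmax normalization over the batch is required."""
    logits = img_emb @ txt_emb.t() * temperature + bias    # (B, B) pairwise logits
    labels = 2 * torch.eye(logits.size(0)) - 1             # +1 on diagonal, -1 elsewhere
    # log-sigmoid of (label * logit), summed over pairs and averaged over the batch
    return -F.logsigmoid(labels * logits).sum(dim=-1).mean()

# Toy usage with random normalized embeddings.
img = F.normalize(torch.randn(8, 512), dim=-1)
txt = F.normalize(torch.randn(8, 512), dim=-1)
print(clip_infonce_loss(img, txt).item(), siglip_sigmoid_loss(img, txt).item())
```

Because the sigmoid variant treats each pair independently, its value does not depend on normalizing over every other pair in the (possibly device-sharded) batch, which is the source of SigLIP's small-batch robustness noted in the table.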