OpenVLA: An Open-Source Vision-Language-Action Model

Source

@misc{kim_2024_openvla,
    title={{OpenVLA}: An Open-Source Vision-Language-Action Model}, 
    author={Moo Jin Kim and Karl Pertsch and Siddharth Karamcheti and Ted Xiao and Ashwin Balakrishna and Suraj Nair and Rafael Rafailov and Ethan Foster and Grace Lam and Pannag Sanketi and Quan Vuong and Thomas Kollar and Benjamin Burchfiel and Russ Tedrake and Dorsa Sadigh and Sergey Levine and Percy Liang and Chelsea Finn},
    year={2024},
    eprint={2406.09246},
    archivePrefix={arXiv},
    primaryClass={cs.RO},
    url={https://arxiv.org/abs/2406.09246}, 
}
(Stanford, UC Berkeley, Toyota Research Institute), arXiv:2406.09246

TL;DR

Flash Reading

References

Extra

Extension

| Feature | CLIP (2021) | SigLIP (2023) | DINOv2 (2023) |
| --- | --- | --- | --- |
| Target | Learn joint image–text embeddings for zero-shot transfer | Improve CLIP's contrastive training with a sigmoid loss for better scaling and efficiency | Learn universal visual-only features that transfer across tasks |
| Supervision Type | Weakly supervised (image–text pairs) | Weakly supervised (image–text pairs) | Self-supervised (no paired text) |
| Training Data | 400M noisy web image–text pairs | Large-scale web image–text pairs (billions, proprietary) | Curated LVD-142M (142M high-quality images) |
| Architecture | Dual encoder: ViT/ResNet image encoder + Transformer text encoder | Same as CLIP | ViT only |
| Loss Function (see the sketch below) | Symmetric InfoNCE (softmax over in-batch pairs) | Pairwise sigmoid loss (binary classification on matching vs. non-matching pairs) | Self-distillation with teacher–student ViTs |
| Negatives Handling | In-batch negatives only (batch-size sensitive) | No explicit in-batch negatives; works well with small or large batches | Not applicable (no contrastive text–image matching) |
| Main Strengths | Zero-shot classification and retrieval on many datasets | Better small-batch performance, easier scaling, CLIP-like capabilities | Strong image-only features; great for detection, segmentation, etc. |
| Limitations | Needs large batches for strong performance; underperforms under domain shift | Same text–image domain dependency as CLIP; requires large-scale image–text pairs | No multimodal capability; zero-shot only for visual tasks |
| Best Use Cases | Zero-shot classification, retrieval, multimodal applications | Same as CLIP | Pure vision tasks needing strong general-purpose image features |
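
Since the loss function is the main difference between CLIP and SigLIP in the table above, here is a minimal PyTorch sketch of the two objectives. This is not the papers' training code: the tensor names, the fixed temperature values, and the scalar bias are illustrative assumptions (in SigLIP, temperature and bias are learnable parameters). DINOv2's self-distillation is omitted because it requires a full teacher–student training loop.

```python
# Minimal sketch contrasting CLIP's symmetric InfoNCE loss with SigLIP's
# pairwise sigmoid loss. Assumes `img_emb` and `txt_emb` are L2-normalized
# [batch, dim] embeddings from the image and text encoders.
import torch
import torch.nn.functional as F


def clip_infonce_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE: softmax over all in-batch pairs (batch-size sensitive)."""
    logits = img_emb @ txt_emb.t() / temperature            # [B, B] similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)             # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)         # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)


def siglip_sigmoid_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    """Pairwise sigmoid loss: every (i, j) pair is an independent binary decision,
    so there is no softmax over the batch and small batches behave well.
    t and b are learnable in SigLIP; shown here as constants for brevity."""
    logits = img_emb @ txt_emb.t() * t + b                   # [B, B] scaled similarities
    labels = 2.0 * torch.eye(logits.size(0), device=logits.device) - 1.0  # +1 diag, -1 off-diag
    return -F.logsigmoid(labels * logits).sum() / logits.size(0)


if __name__ == "__main__":
    B, D = 8, 512
    img = F.normalize(torch.randn(B, D), dim=-1)
    txt = F.normalize(torch.randn(B, D), dim=-1)
    print(clip_infonce_loss(img, txt).item(), siglip_sigmoid_loss(img, txt).item())
```

The practical consequence shown in the table follows directly from the two functions: the InfoNCE softmax normalizes each row over the whole batch, so its negatives (and gradient signal) depend on batch size, while the sigmoid loss scores each pair independently, which is why SigLIP degrades less at small batch sizes.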