Foundation models have made great advances in robotics, enabling the creation of vision-language-action (VLA) models that generalize to objects, scenes, and tasks beyond their training data. However, the adoption of these models has been limited due to their closed nature and the lack of best practices for deploying and adapting them to new environments.
To address these challenges, researchers from Stanford University, UC Berkeley, Toyota Research Institute, Google DeepMind, and other labs have introduced OpenVLA, an open-source VLA model trained on a diverse collection of real-world robot demonstrations.
According to the researchers, OpenVLA outperforms similar models on robotics tasks. It can also be easily fine-tuned for multi-task environments involving multiple objects, and it is designed to take advantage of optimization techniques that let it run on consumer-grade GPUs and be fine-tuned at very low cost.
With foundation models becoming a cornerstone of robotics, OpenVLA can make these models more accessible and customizable to a broader range of companies and research labs.
Vision-language-action models for robotics
Classic learned policies for robotic manipulation struggle to generalize beyond their training data. They are not robust to scene distractors or unseen objects, and they struggle to execute task instructions that are slightly different from what they have been trained on.
Large language models (LLMs) and vision-language models (VLMs) are capable of these types of generalization thanks to the world knowledge they capture from internet-scale pretraining datasets. Research labs have recently started using LLMs and VLMs as building blocks for training robotic policies.
One popular technique is to use pre-trained LLMs and VLMs as components in modular systems for task planning and execution. Another direction is training vision-language-action models (VLAs) from the ground up to directly generate robot control actions. Examples of VLAs include RT-2 and RT-2-X, which have set a new standard for generalist robot policies.
However, current VLAs face two key challenges. First, they are closed, offering little visibility into their architecture, training procedures, and data mixture. Second, there is a lack of best practices for deploying and adapting VLAs to new robots, environments, and tasks.
“We argue that to develop a rich foundation for future research and development, robotics needs open-source, generalist VLAs that support effective fine-tuning and adaptation, akin to the existing ecosystem around open-source language models,” the researchers write.
OpenVLA
OpenVLA is a 7B-parameter open-source VLA built on top of the Prismatic-7B vision-language model. It consists of a two-part visual encoder that extracts features from input images and a Llama-2 7B model to process language instructions.
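To make the architecture description concrete, here is a minimal sketch of how a two-part visual encoder can feed a language model: features from two pretrained vision encoders (such as DINOv2 and SigLIP in Prismatic-style models) are concatenated and projected into the LLM's token-embedding space. The class and layer names below are illustrative stand-ins, not the authors' code.

```python
import torch
import torch.nn as nn

class ToyVLABackbone(nn.Module):
    """Illustrative only: a two-part visual encoder whose fused features
    are projected into the language model's embedding space."""

    def __init__(self, vis_dim_a=1024, vis_dim_b=1152, llm_dim=4096):
        super().__init__()
        # Stand-ins for two pretrained vision encoders; in a real VLA these
        # would be loaded backbones, not randomly initialized linear layers.
        self.encoder_a = nn.Linear(3 * 224 * 224, vis_dim_a)
        self.encoder_b = nn.Linear(3 * 224 * 224, vis_dim_b)
        # Projector maps concatenated visual features into LLM token space.
        self.projector = nn.Linear(vis_dim_a + vis_dim_b, llm_dim)

    def forward(self, image: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        flat = image.flatten(1)  # (B, 3*224*224)
        visual = torch.cat([self.encoder_a(flat), self.encoder_b(flat)], dim=-1)
        visual_tokens = self.projector(visual).unsqueeze(1)  # (B, 1, llm_dim)
        # Prepend projected visual tokens to the instruction embeddings;
        # the combined sequence would then be passed to the LLM (omitted).
        return torch.cat([visual_tokens, text_embeds], dim=1)
```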
To create OpenVLA, the researchers fine-tuned the Prismatic model on a large dataset of 970,000 robot manipulation trajectories from the Open-X Embodiment dataset, which spans a wide range of robot embodiments, tasks, and scenes. They also configured the model to output special tokens that can be mapped to robot actions.
OpenVLA receives a natural language instruction such as “wipe the table” along with an input image captured with a camera. The model reasons over the instruction and the visual input and decides which sequence of action tokens will enable the robot to accomplish the desired task.
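The special action tokens the model emits have to be converted back into continuous robot commands. A common scheme in RT-2-style VLAs discretizes each action dimension into a fixed number of bins; the sketch below shows how predicted token IDs could be mapped back to continuous values. The bin count, action bounds, and token offset here are assumptions for illustration, not the published configuration.

```python
import numpy as np

N_BINS = 256  # assumed number of discretization bins per action dimension

def detokenize_action(action_token_ids, low, high, token_offset=0):
    """Map a sequence of discrete action tokens back to continuous values.

    action_token_ids: one token per action dimension (e.g., 7 for a 6-DoF arm + gripper)
    low, high: per-dimension action bounds used during tokenization (assumed known)
    token_offset: ID of the first action token in the vocabulary (illustrative)
    """
    bins = np.asarray(action_token_ids) - token_offset   # bin indices 0 .. N_BINS-1
    centers = (bins + 0.5) / N_BINS                       # normalize to (0, 1)
    return low + centers * (high - low)                   # un-normalize per dimension

# Example: a 7-dimensional action (end-effector deltas plus gripper)
low = np.array([-0.05] * 6 + [0.0])
high = np.array([0.05] * 6 + [1.0])
action = detokenize_action([12, 200, 128, 64, 90, 255, 3], low, high)
```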
According to the researchers, OpenVLA outperforms the 55B-parameter RT-2-X model, the prior state-of-the-art VLA, on the WidowX and Google Robot embodiments.
The researchers also experimented with efficient fine-tuning strategies for VLAs on seven manipulation tasks, ranging from object pick-and-place to cleaning a table. Fine-tuned OpenVLA policies outperform other fine-tuned pre-trained policies. Fine-tuning also improves OpenVLA's performance on instructions that require mapping language to multi-task behaviors involving multiple objects.
“Notably, most prior works achieve strong performance only in either narrow single-instruction or diverse multi-instruction tasks, resulting in widely varying success rates,” the researchers write. “OpenVLA is the only approach that achieves at least 50% success rate across all tested tasks, suggesting that it can be a strong default option for imitation learning tasks, particularly if they involve a diverse set of language instructions.”
The researchers also make OpenVLA more accessible and compute-efficient through optimization techniques. With low-rank adaptation (LoRA), they fine-tuned OpenVLA on a new task within 10-15 hours on a single A100 GPU, an 8x reduction in compute compared to full fine-tuning. With model quantization, they were able to reduce the size of OpenVLA models and run them on consumer-grade GPUs without a significant drop in performance.
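As a rough sketch of this kind of workflow, the snippet below loads a VLA checkpoint with 4-bit quantization and attaches LoRA adapters using Hugging Face transformers and peft. The model identifier, LoRA rank, and target modules are assumptions chosen for illustration, not the authors' published setup.

```python
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

MODEL_ID = "openvla/openvla-7b"  # assumed Hugging Face identifier for the released checkpoint

# 4-bit quantization so the 7B model fits on a consumer-grade GPU.
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForVision2Seq.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

# Attach low-rank adapters; rank and target modules are illustrative choices.
lora_config = LoraConfig(r=32, lora_alpha=16, target_modules="all-linear", lora_dropout=0.05)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA weights remain trainable
```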
Open-sourcing OpenVLA
The researchers have open-sourced all models, deployment and fine-tuning notebooks, and the OpenVLA codebase for training VLAs at scale, “with the hope that these resources enable future work exploring and adapting VLAs for robotics,” they write. The library supports model fine-tuning on individual GPUs and training billion-parameter VLAs on multi-node GPU clusters. It is also compatible with modern optimization and parallelization techniques.
In the future, the researchers plan to improve OpenVLA by extending it to support multiple image inputs, proprioceptive inputs, and observation history. They also suggest that using VLMs pre-trained on interleaved image and text data may facilitate such flexible-input VLA fine-tuning.