
OpenVLA is an open-source generalist robotics model

Foundation models have driven great advances in robotics, enabling vision-language-action (VLA) models that generalize to objects, scenes, and tasks beyond their training data. However, adoption of these models has been limited by their closed nature and the lack of best practices for deploying and adapting them to new environments.

To address these challenges, researchers from Stanford University, UC Berkeley, Toyota Research Institute, Google DeepMind, and other labs have introduced OpenVLA, an open-source VLA model trained on a diverse collection of real-world robot demonstrations.

According to the researchers, OpenVLA outperforms other similar models on robotics tasks. It can also be easily fine-tuned for generalization in multi-task environments involving multiple objects, and it is designed to take advantage of optimization techniques that let it run on consumer-grade GPUs and be fine-tuned at very low cost.

With foundation models becoming a cornerstone of robotics, OpenVLA can make these models more accessible and customizable to a broader range of companies and research labs.



Vision-language-action models for robotics

Classic learned policies for robotic manipulation struggle to generalize beyond their training data. They are not robust to scene distractors or unseen objects, and they struggle to execute task instructions that are slightly different from what they have been trained on.

Large language models (LLMs) and vision-language models (VLMs) are capable of this kind of generalization thanks to the world knowledge they capture from their internet-scale pre-training datasets. Research labs have recently started using LLMs and VLMs as building blocks for training robotic policies.

One popular technique is to use pre-trained LLMs and VLMs as components in modular systems for task planning and execution. Another direction is training vision-language-action models (VLAs) from the ground up to directly generate robot control actions. Examples of VLAs include RT-2 and RT-2-X, which have set a new standard for generalist robot policies.

However, current VLAs have two key challenges. First, they are closed and there is little visibility into their architecture, training procedures, and data mixture. And second, there is a lack of best practices for deploying and adapting VLAs to new robots, environments, and tasks.

“We argue that to develop a rich foundation for future research and development, robotics needs open-source, generalist VLAs that support effective fine-tuning and adaptation, akin to the existing ecosystem around open-source language models,” the researchers write.

OpenVLA

OpenVLA is a 7B-parameter open-source VLA built on top of the Prismatic-7B vision-language model. It consists of a two-part visual encoder that extracts features from input images and a Llama-2 7B model to process language instructions.
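The high-level data flow can be illustrated with a short sketch: features from the two visual streams are concatenated, projected into the language model's embedding space, and prepended to the embedded instruction tokens. The dimensions and modules below are placeholder assumptions, not OpenVLA's actual values.

```python
# Illustrative sketch only: the approximate data flow of a two-part visual encoder
# feeding a language model. All dimensions and modules are placeholder assumptions.
import torch
import torch.nn as nn

BATCH, PATCHES = 1, 256
DIM_A, DIM_B, LLM_DIM = 1024, 1152, 4096  # assumed feature/embedding sizes

# Stand-ins for patch features produced by the two vision backbones
feats_a = torch.randn(BATCH, PATCHES, DIM_A)
feats_b = torch.randn(BATCH, PATCHES, DIM_B)

# Fuse the two streams and project them into the language model's embedding space
projector = nn.Linear(DIM_A + DIM_B, LLM_DIM)
visual_tokens = projector(torch.cat([feats_a, feats_b], dim=-1))

# Prepend the visual tokens to the embedded instruction tokens before the LLM runs
instruction_embeds = torch.randn(BATCH, 12, LLM_DIM)  # placeholder for embedded text
llm_input = torch.cat([visual_tokens, instruction_embeds], dim=1)
print(llm_input.shape)  # torch.Size([1, 268, 4096])
```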

To create OpenVLA, the researchers fine-tuned the Prismatic model on a large dataset of 970,000 robot manipulation trajectories from the Open-X Embodiment dataset, which spans a wide range of robot embodiments, tasks, and scenes. They also configured the model to output special tokens that can be mapped to robot actions.
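Mapping special tokens to robot actions amounts to a simple de-tokenization step: each predicted token selects a bin along one action dimension, and the bin center becomes the continuous command. The bin count, action range, and seven-dimensional action layout below are illustrative assumptions rather than OpenVLA's exact configuration.

```python
# Hypothetical sketch of turning discrete action tokens back into continuous commands.
# NUM_BINS, the action range, and the 7-D action layout are illustrative assumptions.
import numpy as np

NUM_BINS = 256                       # assumed bins per action dimension
ACTION_LOW, ACTION_HIGH = -1.0, 1.0  # assumed normalized action range

def detokenize_actions(action_token_ids):
    """Map one token id per action dimension (e.g. x, y, z, roll, pitch, yaw, gripper)
    to the center of the corresponding discretization bin."""
    edges = np.linspace(ACTION_LOW, ACTION_HIGH, NUM_BINS + 1)
    centers = (edges[:-1] + edges[1:]) / 2.0
    return centers[np.asarray(action_token_ids)]

# Example: a 7-D action decoded from seven predicted token ids
print(detokenize_actions([128, 40, 200, 127, 127, 127, 255]))
```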

OpenVLA architecture (source: GitHub)

OpenVLA receives a natural language instruction such as “wipe the table” along with an input image captured with a camera. The model reasons over the instruction and the visual input and decides which sequence of action tokens will enable the robot to accomplish the desired task.
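In practice, querying the released checkpoint looks roughly like the sketch below. It assumes the Hugging Face-style interface the project publishes; the model ID, prompt template, and the predict_action helper are taken as assumptions here and should be verified against the OpenVLA repository.

```python
# Hedged sketch of running inference with the released checkpoint. The model ID,
# prompt template, and predict_action helper are assumptions based on the project's
# published examples; verify them against the OpenVLA repository before use.
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

MODEL_ID = "openvla/openvla-7b"  # assumed Hugging Face model ID

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to("cuda:0")

image = Image.open("camera_frame.png")  # current camera observation
prompt = "In: What action should the robot take to wipe the table?\nOut:"

inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
print(action)  # continuous action vector to pass to the robot controller
```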

According to the researchers, OpenVLA outperforms the 55B-parameter RT-2-X model, the prior state-of-the-art VLA, on the WidowX and Google Robot embodiments.

The researchers also experimented with efficient fine-tuning strategies for VLAs on seven manipulation tasks, ranging from object pick-and-place to cleaning a table. Fine-tuned OpenVLA policies outperform other fine-tuned pre-trained policies. Fine-tuning also improves OpenVLA's performance on instructions that require mapping language to multi-task behaviors involving multiple objects.

“Notably, most prior works achieve strong performance only in either narrow single-instruction or diverse multi-instruction tasks, resulting in widely varying success rates,” the researchers write. “OpenVLA is the only approach that achieves at least 50% success rate across all tested tasks, suggesting that it can be a strong default option for imitation learning tasks, particularly if they involve a diverse set of language instructions.”

The researchers also make OpenVLA more accessible and compute-efficient through optimization techniques. With low-rank adaptation (LoRA), they fine-tuned OpenVLA on a new task within 10-15 hours on a single A100 GPU, an 8x reduction in compute compared to full fine-tuning. With model quantization, they were able to reduce the size of OpenVLA models and run them on consumer-grade GPUs without a significant drop in performance.
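As a rough illustration of what those optimizations look like in code, the sketch below loads a model with 4-bit quantization via bitsandbytes and attaches LoRA adapters with Hugging Face PEFT. The model ID, LoRA rank, and target modules are illustrative choices; OpenVLA's own fine-tuning scripts in its repository are the authoritative reference.

```python
# Hedged sketch: 4-bit loading plus LoRA adapters, in the spirit of the optimizations
# described above. Model ID, rank, and target modules are illustrative assumptions.
import torch
from transformers import AutoModelForVision2Seq, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",              # assumed Hugging Face model ID
    quantization_config=bnb_config,
    trust_remote_code=True,
)

lora_config = LoraConfig(
    r=32,                              # low-rank dimension (illustrative)
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules="all-linear",       # adapt every linear layer; adjust as needed
)

model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()     # only a small fraction of weights are trainable
```

A full fine-tuning run would then wrap this adapted model in a standard training loop over the new robot demonstration data, updating only the LoRA parameters.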

Open-sourcing OpenVLA

The researchers have open-sourced all models, deployment and fine-tuning notebooks, and the OpenVLA codebase for training VLAs at scale, “with the hope that these resources enable future work exploring and adapting VLAs for robotics,” they write. The library supports model fine-tuning on individual GPUs and training billion-parameter VLAs on multi-node GPU clusters. It is also compatible with modern optimization and parallelization techniques.

In the future, the researchers plan to improve OpenVLA by adjusting it to support multiple image and proprioceptive inputs as well as observation history. They also suggest that using VLMs pre-trained on interleaved image and text data may facilitate such flexible-input VLA fine-tuning.
