In recent years, robotics has advanced significantly through the integration of larger models and large-scale datasets. However, challenges remain in applying these models to 3D spatial interactions and in managing data collection costs. To address these issues, we propose a multimodal robotic manipulation model, RoboMM, together with a comprehensive dataset, RoboData. RoboMM enhances 3D perception through camera parameters and occupancy supervision. Building on OpenFlamingo, it incorporates a Modality-Isolation-Mask and multimodal decoder blocks, improving modality fusion and fine-grained perception. RoboData provides a complete evaluation system by integrating several well-known datasets, achieving the first fusion of multi-view images, camera parameters, depth maps, and actions; its spatial alignment facilitates comprehensive learning from diverse robotic datasets. Equipped with RoboData and the unified physical space, RoboMM is the first generalist policy that enables simultaneous evaluation across all tasks within multiple datasets, rather than focusing on a limited selection of data or tasks. Its design significantly improves robotic manipulation performance, raising the average sequence length on the CALVIN benchmark from 1.7 to 3.3 and ensuring cross-embodiment capability, achieving state-of-the-art results across multiple datasets.
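Among the components mentioned above, the Modality-Isolation-Mask governs how tokens from different modalities interact during fusion. The sketch below shows one common way such a mask can be built, allowing attention only between tokens of the same modality; the exact masking rule used in RoboMM is an assumption here, and the function `modality_isolation_mask` is purely illustrative.

```python
# Hedged sketch of a modality-isolation attention mask built with NumPy.
# The specific rule RoboMM uses is an assumption; this only illustrates the
# general mechanism of isolating modalities during attention-based fusion.
import numpy as np


def modality_isolation_mask(modality_ids):
    """Return a boolean (N, N) mask where True means attention is allowed.

    modality_ids: length-N sequence assigning each token to a modality,
    e.g. 0 = text, 1 = image, 2 = action (illustrative labels).
    """
    ids = np.asarray(modality_ids)
    return ids[:, None] == ids[None, :]


if __name__ == "__main__":
    # Three text tokens followed by two image tokens.
    mask = modality_isolation_mask([0, 0, 0, 1, 1])
    print(mask.astype(int))
    # [[1 1 1 0 0]
    #  [1 1 1 0 0]
    #  [1 1 1 0 0]
    #  [0 0 0 1 1]
    #  [0 0 0 1 1]]
```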
The figure above shows the architecture of RoboMM, which comprises four components: a Vision Encoder Block that extracts multi-view features; an Adapter Block that leverages occupancy supervision to unify features and enhance spatial perception; a Feature Fusion Block, based on LLMs, that merges text and visual information; and Multimodal Decoder Blocks that enhance fine-grained perception and understanding through multimodal outputs. A hedged sketch of how these blocks could fit together follows.
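The PyTorch sketch below wires the four blocks together end to end. All module sizes, layer choices, and the toy transformer standing in for the OpenFlamingo-based fusion backbone are illustrative assumptions rather than the released implementation; the occupancy head only indicates where the auxiliary 3D supervision would attach, and a single action head stands in for the full set of multimodal decoders.

```python
# Minimal, assumption-laden sketch of the RoboMM block structure in PyTorch.
import torch
import torch.nn as nn


class VisionEncoderBlock(nn.Module):
    """Extracts per-view features from multi-view RGB images (toy CNN)."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )

    def forward(self, views):  # views: (B, V, 3, H, W)
        b, v = views.shape[:2]
        feats = self.backbone(views.flatten(0, 1)).flatten(1)  # (B*V, D)
        return feats.view(b, v, -1)                            # (B, V, D)


class AdapterBlock(nn.Module):
    """Unifies view features and predicts a coarse occupancy grid for 3D supervision."""
    def __init__(self, feat_dim=256, grid=8):
        super().__init__()
        self.fuse = nn.Linear(feat_dim, feat_dim)
        self.occ_head = nn.Linear(feat_dim, grid ** 3)  # occupancy logits (auxiliary)

    def forward(self, view_feats):                  # (B, V, D)
        fused = self.fuse(view_feats.mean(dim=1))   # pool over views -> (B, D)
        occ_logits = self.occ_head(fused)           # supervised by occupancy labels
        return fused, occ_logits


class FeatureFusionBlock(nn.Module):
    """Stand-in for the LLM-based fusion of language and visual tokens."""
    def __init__(self, feat_dim=256, n_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(feat_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, text_tokens, vis_token):      # (B, T, D), (B, D)
        tokens = torch.cat([text_tokens, vis_token.unsqueeze(1)], dim=1)
        return self.encoder(tokens)


class MultimodalDecoder(nn.Module):
    """Decodes fused tokens into an action (6-DoF delta pose + gripper assumed)."""
    def __init__(self, feat_dim=256, action_dim=7):
        super().__init__()
        self.head = nn.Linear(feat_dim, action_dim)

    def forward(self, fused_tokens):
        return self.head(fused_tokens[:, -1])       # decode from the last token


if __name__ == "__main__":
    B, V, T, D = 2, 3, 8, 256
    views = torch.randn(B, V, 3, 128, 128)
    text = torch.randn(B, T, D)                     # placeholder language embeddings
    vis = VisionEncoderBlock(D)(views)
    fused, occ = AdapterBlock(D)(vis)
    tokens = FeatureFusionBlock(D)(text, fused)
    action = MultimodalDecoder(D)(tokens)
    print(action.shape, occ.shape)                  # (2, 7) and (2, 512)
```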
To address the lack of proper spatial alignment across datasets, we curate well-known datasets from the industry, including CALVIN, MetaWorld, LIBERO, Robomimic, RoboCasa, ManiSkill2, RoboCAS, RLBench, and Colosseum, into a comprehensive dataset we call RoboData. This dataset aims to provide the industry with a complete and fair evaluation system, comprising 70,000 episodes and 7 million samples, and it covers a diverse range of tasks, including placing, picking, turning, and stacking.
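To make the cross-dataset alignment concrete, the sketch below shows one plausible per-sample schema and the camera-to-world transform that places every view in a shared physical space. The `RoboDataSample` class, its field names, and the shapes are hypothetical and are not the released data format.

```python
# Hedged sketch of a spatially aligned RoboData-style sample. Field names and
# shapes are illustrative assumptions; the key idea is that every source
# dataset is mapped into one unified physical space (shared world frame,
# consistent units) before training and evaluation.
from dataclasses import dataclass
import numpy as np


@dataclass
class RoboDataSample:
    images: np.ndarray       # (V, H, W, 3)  multi-view RGB frames
    depths: np.ndarray       # (V, H, W)     per-view depth maps in meters
    intrinsics: np.ndarray   # (V, 3, 3)     camera intrinsic matrices
    extrinsics: np.ndarray   # (V, 4, 4)     camera-to-world transforms
    instruction: str         # natural-language task description
    action: np.ndarray       # (7,)          e.g. 6-DoF delta pose + gripper
    source: str              # originating dataset, e.g. "CALVIN", "LIBERO"


def to_world_frame(points_cam: np.ndarray, extrinsic: np.ndarray) -> np.ndarray:
    """Map (N, 3) camera-frame points into the shared world frame; this is the
    alignment step that lets heterogeneous datasets be trained together."""
    homo = np.concatenate([points_cam, np.ones_like(points_cam[:, :1])], axis=1)
    return (extrinsic @ homo.T).T[:, :3]


if __name__ == "__main__":
    # Tiny usage example with random placeholder data for a single view.
    pts = np.random.rand(5, 3)
    ext = np.eye(4)
    print(to_world_frame(pts, ext).shape)  # (5, 3)
```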
The results summarized in the figure above show that RoboMM performs exceptionally well across the evaluated datasets, achieving state-of-the-art results, in particular on the LIBERO and RoboCasa datasets. Notably, on the CALVIN dataset, RoboMM attains a success rate (SR) of 91.0%, which is competitive with MDT and HULC++ and significantly exceeds the models in the second tier. On the MetaWorld dataset, RoboMM achieves a success rate of 78.6%, placing it within the top tier of models. Furthermore, RoboMM consistently outperforms other models on LIBERO, RoboCasa, and Robomimic, demonstrating its robustness and adaptability across a variety of tasks and environments. These findings not only underscore RoboMM's versatility but also highlight its effectiveness in handling tasks with varying levels of complexity, indicating its potential for real-world applications.
The results of the ablation experiments are summarized in the table above, which demonstrates a clear trend of performance improvement as different modules are added.
@misc{yan2024robomm,
    title={RoboMM: All-in-One Multimodal Large Model for Robotic Manipulation},
    author={Feng Yan and Fanfan Liu and Liming Zheng and Yufeng Zhong and Yiyang Huang and Zechao Guan and Chengjian Feng and Lin Ma},
    year={2024},
    eprint={2412.07215},
    archivePrefix={arXiv},
    primaryClass={cs.RO},
    url={https://arxiv.org/abs/2412.07215}
}