Multimodal Fusion with Vision-Language-Action Models for Robotic Manipulation: A Systematic Review

Muhayy Ud Din (a), Waseem Akram (a), Lyes Saad Saoud (a), Jan Rosell (b), Irfan Hussain* (a)

(a) Khalifa University Center for Autonomous Robotic Systems (KUCARS), Khalifa University, United Arab Emirates

(b) Institute of Industrial and Control Engineering (IOC), Universitat Politècnica de Catalunya, Spain

Accepted in Information Fusion Journal

A comprehensive resource page for Vision-Language-Action models, datasets, and evaluation tools in robotic manipulation research. This page accompanies the paper "Multimodal Fusion with Vision-Language-Action Models for Robotic Manipulation: A Systematic Review" published in Information Fusion Journal, and provides a living catalog of research resources. We aim to keep this collection up to date as new VLA models, datasets, and simulation tools emerge. Contributions and pull requests to our GitHub repository adding recently published work or tooling are most welcome!

Abstract

Vision Language Action (VLA) models represent a new frontier in robotics by unifying perception, reasoning, and control within a single multimodal learning framework. By jointly leveraging visual, linguistic, and motor modalities, they enable instruction-driven manipulation, cross-embodiment generalization, and scalable autonomy. This systematic review synthesizes the state of the art in VLA research with an emphasis on architectures, algorithms, and applications relevant to robotic manipulation. We examine 102 models, 26 foundational datasets, and 12 simulation platforms, categorizing them according to their fusion strategies and integration mechanisms. Foundational datasets are evaluated using a novel criterion based on task complexity, modality richness, and dataset scale, allowing a comparative analysis of their suitability for generalist policy learning. We further introduce a structured taxonomy of fusion hierarchies and encoder-decoder families, together with a two-dimensional dataset characterization framework and a meta-analytic benchmarking protocol that quantitatively link design variables to empirical performance across benchmarks. Our analysis shows that hierarchical and late fusion architectures yield the highest manipulation success and generalization, confirming the benefit of multi-level cross-modal integration. Diffusion-based decoders demonstrate superior cross-domain transfer and robustness compared to autoregressive heads. Dataset analysis highlights a persistent lack of benchmarks that combine high-complexity, multimodal, and long-horizon tasks, while existing simulators offer limited multimodal synchronization and real-to-sim consistency. To address these gaps, we propose the VLA Fusion Evaluation Benchmark to quantify fusion efficiency and alignment.
Drawing on both academic and industrial advances, the review outlines future research directions in adaptive and modular fusion architectures, computational resource optimization, and the deployment of interpretable, resource-efficient robotic systems. This work provides both a conceptual foundation and a quantitative roadmap for advancing embodied intelligence through multimodal information fusion across robotic domains.

Citation

@article{UDDIN2026104062,
  title   = {Multimodal fusion with vision-language-action models for robotic manipulation: A systematic review},
  author  = {Muhayy {Ud Din} and Waseem Akram and Lyes {Saad Saoud} and Jan Rosell and Irfan Hussain},
  journal = {Information Fusion},
  volume  = {129},
  year    = {2026},
  issn    = {1566-2535},
  doi     = {10.1016/j.inffus.2025.104062},
}
VLA Applications Overview

Vision-Language-Action models have found diverse applications across robotics domains, from manipulation and navigation to human-robot interaction and autonomous systems. The following figure illustrates the broad spectrum of VLA applications in real-world scenarios:

Dataset Benchmarking Code

Benchmarking VLA datasets by task complexity and modality richness: each bubble represents a VLA dataset, positioned according to its normalized task-complexity score (x-axis) and its modality-richness score (y-axis). The bubble area is proportional to the dataset scale, i.e., the number of annotated episodes or interactions.

Dataset Benchmarking Visualization

View Code
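As an illustration of the two-dimensional dataset characterization, the scoring behind such a bubble chart can be sketched in pure Python. The datasets and all numeric values below are invented placeholders, not the scores reported in the paper:

```python
# Hypothetical sketch of the two-dimensional dataset characterization:
# each dataset gets a task-complexity score, a modality-richness score,
# and a scale (episode count). All values here are placeholders.

def normalize(values):
    """Min-max normalize a list of raw scores to the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]

raw = {
    # name: (raw task-complexity rating, number of modalities, episodes)
    "Dataset A": (2, 2, 5_000),
    "Dataset B": (5, 3, 60_000),
    "Dataset C": (9, 5, 1_000_000),
}

names = list(raw)
complexity = normalize([raw[n][0] for n in names])
richness = normalize([raw[n][1] for n in names])

for name, x, y in zip(names, complexity, richness):
    episodes = raw[name][2]  # would set the bubble area in the plot
    print(f"{name}: complexity={x:.2f}, richness={y:.2f}, scale={episodes}")
```

With min-max normalization, the least and most complex datasets land at 0 and 1 respectively, so axis positions stay comparable as new datasets are added to the catalog.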

VLA Models Evaluation & Visualization

This repository includes a comprehensive analysis and visualization suite for evaluating Vision-Language-Action (VLA) models. The analysis covers multiple aspects of VLA model performance, architecture components, and theoretical foundations through detailed visualizations and statistical analysis.

View Code

Representative Visualizations

Forest Plot Analysis


Regression analysis showing that diffusion-based decoders and hierarchical fusion strategies provide the strongest positive impact on manipulation success, while symbolic/MLP controllers show degraded performance under real-world conditions.

Encoder Analysis


Performance comparison across vision and language encoders. SigLIP and DINO vision encoders achieve the highest success rates, while mid-scale instruction-tuned language models (T5, LLaMA, Qwen) provide the best balance between task success and generalization.

Domain Analysis

Domain Component Analysis

Cross-domain performance analysis across humanoid, manipulation, and navigation tasks. Diffusion decoders consistently achieve higher success and generalization, demonstrating superior robustness for temporally coherent, cross-modal control.

VLA-FEB Score Distribution


Composite scores from the VLA-FEB framework, evaluating fusion efficiency, generalization, real-to-sim transfer, and cross-modal alignment. Hierarchical and diffusion-based models achieve the highest performance across all evaluation dimensions.
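For illustration, a composite score over several sub-dimensions can be sketched as a weighted mean. The equal-weight scheme and the sub-score values below are assumptions for demonstration only, not the actual VLA-FEB aggregation formula:

```python
# Hedged sketch: aggregate four sub-scores (fusion efficiency,
# generalization, real-to-sim transfer, cross-modal alignment) into a
# single composite. The weighting is an illustrative assumption, not
# the VLA-FEB formula from the paper.

def composite_score(subscores, weights=None):
    """Weighted mean of sub-scores; equal weights by default."""
    keys = sorted(subscores)
    if weights is None:
        weights = {k: 1.0 for k in keys}
    total_w = sum(weights[k] for k in keys)
    return sum(subscores[k] * weights[k] for k in keys) / total_w

# Invented sub-scores for a hypothetical model.
model = {
    "fusion_efficiency": 0.82,
    "generalization": 0.74,
    "real_to_sim": 0.61,
    "cross_modal_alignment": 0.79,
}
print(f"composite: {composite_score(model):.3f}")
```

Passing an explicit `weights` dict lets one emphasize, say, real-to-sim transfer over the other dimensions without changing the aggregation code.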

Scale Analysis


Analysis of the impact of model scale versus fusion depth on success rates. Results show that deeper fusion and hierarchical architectures consistently outperform pure parameter scaling, indicating that architectural design matters more than model size.

VLA Fusion Theory


Quantitative visualization of fusion dynamics: (a) Entropy reduction showing progressive uncertainty reduction, (b) Cross-modal attention efficiency across fusion types, (c) Fusion energy correlation with task success, demonstrating hierarchical fusion superiority.
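The entropy-reduction idea in panel (a) can be illustrated with a generic Shannon-entropy computation over attention weights: a sharper (better aligned) cross-modal attention distribution carries lower entropy. This is a conceptual sketch of the general quantity, not the paper's exact metric or data:

```python
import math

# Illustrative sketch: Shannon entropy as a proxy for uncertainty in a
# cross-modal attention distribution. A diffuse distribution (before
# effective fusion) has higher entropy than a peaked one (after fusion).
# Both distributions below are invented for illustration.

def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

uniform = [0.25, 0.25, 0.25, 0.25]  # diffuse attention over 4 tokens
peaked = [0.85, 0.05, 0.05, 0.05]   # attention concentrated on one token

print(f"diffuse attention:  {entropy(uniform):.3f} nats")
print(f"peaked attention:   {entropy(peaked):.3f} nats")
```

The uniform case attains the maximum entropy ln(4) ≈ 1.386 nats, so any concentration of attention mass strictly reduces the value, matching the progressive uncertainty reduction the figure depicts.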

Quick Start

1. Create Virtual Environment (Recommended)

python3 -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

2. Install Dependencies

pip install -r requirements.txt

3. Run Analysis

python final_plots.py

VLA Models

VLA Models Trend

The top row presents major VLA models introduced each year, alongside their associated institutions. The bottom row displays key datasets used to train and evaluate VLA models, grouped by release year. The figure highlights the increasing scale and diversity of datasets and institutional involvement, with contributions from academic labs (e.g., CMU, CNRS, UC, Peking University) and industrial labs (e.g., Google, NVIDIA, Microsoft). The timeline underscores the rapid pace of advances in VLA research.

Below is the list of the VLAs reviewed in the paper:

- 2022 · Cliport: What and where pathways for robotic manipulation
- 2022 · Rt-1: Robotics transformer for real‑world control at scale
- 2022 · A Generalist Agent
- 2022 · VIMA: General Robot Manipulation with Multimodal Prompts
- 2022 · PERCEIVER-ACTOR: A Multi-Task Transformer for Robotic Manipulation
- 2022 · Do As I Can, Not As I Say: Grounding Language in Robotic Affordances
- 2023 · RoboAgent: Generalist Robot Agent with Semantic and Temporal Understanding
- 2023 · Robotic Task Generalization via Hindsight Trajectory Sketches
- 2023 · Learning fine‑grained bimanual manipulation with low‑cost hardware
- 2023 · Rt-2: Vision‑language‑action models transfer web knowledge to robotic control
- 2023 · Voxposer: Composable 3D value maps for robotic manipulation with language models
- 2023 · Diffusion Policy: Visuomotor policy learning via action diffusion
- 2024 · CLIP‑RT: Learning Language‑Conditioned Robotic Policies with Natural Language Supervision
- 2024 · Octo: An open‑source generalist robot policy
- 2024 · Towards testing and evaluating vision‑language manipulation: An empirical study
- 2024 · NaVILA: Legged robot vision‑language‑action model for navigation
- 2024 · RoboNurse‑VLA: Real‑time voice‑to‑action pipeline for surgical instrument handover
- 2024 · Mobility VLA: Multimodal instruction navigation with topological mapping
- 2024 · ReVLA: Domain adaptation adapters for robotic foundation models
- 2024 · Uni‑NaVid: Video‑based VLA unifying embodied navigation tasks
- 2024 · RDT‑1B: 1.2B‑parameter diffusion foundation model for manipulation
- 2024 · RoboMamba: Mamba‑based unified VLA with linear‑time inference
- 2024 · Chain‑of‑Affordance: Sequential affordance reasoning for spatial planning
- 2024 · Edge VLA: Self-Adapting Large Visual-Language Models to Edge Devices across Visual Modalities
- 2024 · OpenVLA: LORA‑fine‑tuned open‑source VLA with high‑success transfer
- 2024 · CogACT: Componentized diffusion action transformer for VLA
- 2024 · ShowUI‑2B: GUI/web navigation via screenshot grounding and token selection
- 2024 · HiRT: Hierarchical planning/control separation for VLA
- 2024 · Pi‑0: General robot control flow model for open‑world tasks
- 2024 · A3VLM: Articulation‑aware affordance grounding from RGB video
- 2024 · SVLR: Modular "segment‑to‑action" pipeline using visual prompt retrieval
- 2024 · Bi‑VLA: Dual‑arm instruction‑to‑action planner for recipe demonstrations
- 2024 · QUAR‑VLA: Quadruped‑specific VLA with adaptive gait mapping
- 2024 · 3D‑VLA: Integrating 3D generative diffusion heads for world reconstruction
- 2024 · RoboMM: MIM‑based multimodal decoder unifying 3D perception and language
- 2024 · FLaRe: Large-Scale RL Fine-Tuning for Adaptive Robotic Policies
- 2024 · GRAPE: Preference‑Guided Policy Adaptation via Feedback
- 2024 · Diffusion Transformer Policy: Robust Multimodal Action Sampling
- 2024 · Diffusion‑VLA: Diffusion‑Based Policy for Generalizable Manipulation
- 2025 · FAST: Frequency‑space action tokenization for faster inference
- 2025 · OpenVLA‑OFT: Optimized fine‑tuning of OpenVLA with parallel decoding
- 2025 · CoVLA: Autonomous driving VLA trained on annotated scene data
- 2025 · ORION: Holistic end‑to‑end driving VLA with semantic trajectory control
- 2025 · UAV‑VLA: Zero‑shot aerial mission VLA combining satellite/UAV imagery
- 2025 · Combat VLA: Ultra‑fast tactical reasoning in 3D environments
- 2025 · HybridVLA: Ensemble decoding combining diffusion and autoregressive policies
- 2025 · NORA: Low‑overhead VLA with integrated visual reasoning and FAST decoding
- 2025 · SpatialVLA: 3D spatial encoding and adaptive action discretization
- 2025 · MoLe‑VLA: Selective layer activation for faster inference
- 2025 · JARVIS‑VLA: Open‑world instruction following in 3D games with keyboard/mouse
- 2025 · UP‑VLA: Unified understanding and prediction model for embodied agents
- 2025 · Shake‑VLA: Modular bimanual VLA for cocktail‑mixing tasks
- 2025 · MORE: Scalable mixture‑of‑experts RL for VLA models
- 2025 · DexGraspVLA: Diffusion‑based dexterous grasping framework
- 2025 · DexVLA: Cross‑embodiment diffusion expert for rapid adaptation
- 2025 · Humanoid‑VLA: Hierarchical full‑body humanoid control VLA
- 2025 · ObjectVLA: End‑to‑end open‑world object manipulation
- 2025 · Gemini Robotics: Bringing AI into the Physical World
- 2025 · ECoT: Robotic Control via Embodied Chain‑of‑Thought Reasoning
- 2025 · OTTER: A Vision‑Language‑Action Model with Text‑Aware Visual Feature Extraction
- 2025 · π‑0.5: A VLA Model with Open‑World Generalization
- 2025 · OneTwoVLA: A Unified Model with Adaptive Reasoning
- 2025 · Helix: A Vision-Language-Action Model for Generalist Humanoid Control
- 2025 · SmolVLA: A Vision‑Language‑Action Model for Affordable and Efficient Robotics
- 2025 · EF‑VLA: Vision‑Language‑Action Early Fusion with Causal Transformers
- 2025 · PD‑VLA: Accelerating vision‑language‑action inference via parallel decoding
- 2025 · LeVERB: Humanoid Whole‑Body Control via Latent Verb Generation
- 2025 · TLA: Tactile‑Language‑Action Model for High‑Precision Contact Tasks
- 2025 · Interleave‑VLA: Enhancing VLM‑LLM interleaved instruction processing
- 2025 · iRe‑VLA: Iterative reinforcement and supervised fine‑tuning for robust VLA
- 2025 · TraceVLA: Visual trace prompting for spatio‑temporal manipulation cues
- 2025 · OpenDrive VLA: End‑to‑End Driving with Semantic Scene Alignment
- 2025 · V‑JEPA 2: Dual‑Stream Video JEPA for Predictive Robotic Planning
- 2025 · Knowledge Insulating VLA: Insulation Layers for Modular VLA Training
- 2025 · GR00T N1: Diffusion Foundation Model for Humanoid Control
- 2025 · AgiBot World Colosseo: Unified Embodied Dataset Platform
- 2025 · Hi Robot: Hierarchical Planning and Control for Complex Environments
- 2025 · EnerVerse: World‑Model LLM for Long‑Horizon Manipulation
- 2025 · Beyond Sight: Sensor Fusion via Language-Grounded Attention
- 2025 · GeoManip: Geometric Constraint Encoding for Robust Manipulation
- 2025 · Universal Actions: Standardizing Action Dictionaries for Transfer
- 2025 · RoboHorizon: Multi-View Environment Modeling with LLM Planning
- 2025 · SAM2Act: Segmentation‑Augmented Memory for Object‑Centric Manipulation
- 2025 · VLA‑Cache: Token Caching for Efficient VLA Inference
- 2025 · Forethought VLA: Latent Alignment for Foresight‑Driven Policies
- 2025 · HAMSTER: Hierarchical Skill Decomposition for Multi‑Step Manipulation
- 2025 · TempoRep VLA: Successor Representation for Temporal Planning
- 2025 · ConRFT: Consistency Regularized Fine‑Tuning with Reinforcement
- 2025 · RoboBERT: Unified Multimodal Transformer for Manipulation
- 2025 · GEVRM: Generative Video Modeling for Goal‑Oriented Planning
- 2025 · SoFar: Successor‑Feature Orientation Representations
- 2025 · ARM4R: Auto‑Regressive 4D Transition Modeling for Trajectories
- 2025 · Magma: Foundation Multimodal Agent Model for Control
- 2025 · An Atomic Skill Library: Modular Skill Composition for Robotics
- 2025 · RoboBrain: Knowledge‑Grounded Policy Brain for Multimodal Tasks
- 2025 · SafeVLA: Safety‑Aware Vision‑Language‑Action Policies
- 2025 · CognitiveDrone: Embodied Reasoning VLA for UAV Planning
- 2025 · VLAS: Voice‑Driven Vision‑Language‑Action Control
- 2025 · ChatVLA: Conversational VLA for Interactive Control
- 2025 · RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics

Datasets

Comprehensive collection of datasets used for training and evaluating VLA models:

- 2018 · EmbodiedQA: Embodied Question Answering
- 2018 · R2R: Vision‑and‑Language Navigation: Interpreting Visually‑Grounded Navigation Instructions in Real Environments
- 2019 · Vision‑and‑Dialog Navigation
- 2020 · ALFRED
- 2020 · RLBench: The Robot Learning Benchmark & Learning Environment
- 2021 · TEACh: Task‑driven Embodied Agents that Chat
- 2022 · DialFRED: Dialogue‑Enabled Agents for Embodied Instruction Following
- 2022 · Ego4D: Around the World in 3,000 Hours of Egocentric Video
- 2022 · CALVIN: A Benchmark for Language‑Conditioned Long‑Horizon Robot Manipulation Tasks
- 2023 · BridgeData V2: A Dataset for Robot Learning at Scale
- 2023 · LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning
- 2023 · Robo360: A 3D Omnispective Multi‑Modal Robotic Manipulation Dataset
- 2024 · DROID: A Large‑Scale In‑The‑Wild Robot Manipulation Dataset
- 2024 · CoVLA: Comprehensive Vision‑Language‑Action Dataset for Autonomous Driving
- 2024 · RoboMM: All‑in‑One Multimodal Large Model for Robotic Manipulation
- 2024 · All Robots in One: A New Standard and Unified Dataset for Versatile, General‑Purpose Embodied Agents
- 2025 · Open X-Embodiment: Robotic Learning Datasets and RT‑X Models
- 2025 · RoboSpatial: Teaching Spatial Understanding via Vision‑Language Models for Robotics
- 2025 · TLA: Tactile‑Language‑Action Model for Contact‑Rich Manipulation
- 2025 · Kaiwu: A Multimodal Manipulation Dataset and Framework for Robotic Perception and Interaction
- 2025 · PLAICraft: Large‑Scale Time‑Aligned Vision‑Speech‑Action Dataset for Embodied AI
- 2025 · AgiBot World Colosseo: A Large‑Scale Manipulation Dataset for Intelligent Embodied Systems
- 2025 · REASSEMBLE: A Multimodal Dataset for Contact‑Rich Robotic Assembly and Disassembly
- 2025 · RoboCerebra: A Large‑Scale Benchmark for Long‑Horizon Robotic Manipulation Evaluation
- 2025 · IRef‑VLA: A Benchmark for Interactive Referential Grounding with Imperfect Language in 3D Scenes
- 2025 · Interleave‑VLA: Enhancing Robot Manipulation with Interleaved Image‑Text Instructions
- 2025 · RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics

Simulators

Widely adopted simulation platforms for generating VLA data: