
MaskWAM: Unifying Mask Prompting and Prediction for World-Action Models

Robos News Newsroom

Editorial Desk


Published June 12, 2026 · Category: Robotics

Overview

arXiv:2606.13515v1 (cross-listed)

Abstract: World Action Models (WAMs) present a promising paradigm for robotic control via video prediction. However, current WAMs suffer from fundamental spatial bottlenecks: standard text inputs introduce referential ambiguity in cluttered scenes, while unstructured RGB predictions lack semantic grounding and remain biased by task-irrelevant backgrounds.

To overcome these limitations, we introduce MaskWAM, an object-centric world-action model. By jointly integrating masks as both explicit inputs and predictions via a unified Mixture of Transformers (MoT), MaskWAM unlocks robust policy generalization. This design provides two key benefits: (1) predicting future masks yields object-centric semantic supervision that suppresses visual noise, significantly enhancing even standard text-conditioned WAMs; and (2) coupling this predictive supervision with first-frame visual prompts, such as target object masks, establishes a precise spatial anchor that substantially reduces language ambiguity.

Crucially, as WAMs are inherently vision-driven architectures, direct mask conditioning yields substantially stronger guidance than text alone, establishing a precise and robust paradigm for manipulating unseen objects. Evaluations on LIBERO, RoboTwin, and real-world tasks demonstrate that MaskWAM significantly outperforms baselines in both language-clear and language-ambiguous tasks.
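To make the "referential ambiguity" point concrete, here is a toy sketch in plain Python. It is not the MaskWAM implementation and is not taken from the paper; the scene, the object names, and the helpers `match_by_text`, `match_by_mask`, and `box` are all hypothetical. It only shows why a text label can match several objects in a cluttered scene, while a first-frame mask over the target picks out exactly one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneObject:
    name: str          # hypothetical unique identifier
    label: str         # what a text instruction would call it
    pixels: frozenset  # set of (row, col) coordinates the object covers


def box(r0, c0, r1, c1):
    """Pixel set for a rectangle with half-open bounds [r0, r1) x [c0, c1)."""
    return frozenset((r, c) for r in range(r0, r1) for c in range(c0, c1))


def match_by_text(objects, query_label):
    """Return every object whose label equals the text query."""
    return [o for o in objects if o.label == query_label]


def match_by_mask(objects, prompt_mask):
    """Return the object with the highest IoU against the prompt mask."""
    def iou(a, b):
        union = len(a | b)
        return len(a & b) / union if union else 0.0
    return max(objects, key=lambda o: iou(o.pixels, prompt_mask))


# Assumed toy scene: two visually identical cups and one plate.
scene = [
    SceneObject("cup_left", "red cup", box(2, 1, 5, 4)),
    SceneObject("cup_right", "red cup", box(2, 8, 5, 11)),
    SceneObject("plate", "plate", box(7, 4, 9, 9)),
]

# The text query is ambiguous: both cups match.
text_hits = match_by_text(scene, "red cup")
print([o.name for o in text_hits])  # ['cup_left', 'cup_right']

# A rough first-frame mask drawn over the right cup selects a single target.
prompt = box(2, 8, 5, 10)
mask_hit = match_by_mask(scene, prompt)
print(mask_hit.name)  # cup_right
```

The mask does not need to be pixel-perfect in this sketch: a partial overlap is enough to separate the two cups, which is the sense in which a mask can act as a spatial anchor where the text alone cannot.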

Source

Originally published at arxiv.org.

Source: https://arxiv.org/abs/2606.13515


