MIND-V: Hierarchical World Model for Long-Horizon Robotic Manipulation with RL-based Physical Alignment

Robos News Newsroom

Editorial Desk

Published June 10, 2026 · Category: Robotics

Overview

Abstract (arXiv:2512.06628v3, replacement): Scalable embodied intelligence is constrained by the scarcity of diverse, long-horizon robotic manipulation data. Existing video world models in this domain are limited to synthesizing short clips of simple actions and often rely on manually defined trajectories. To this end, we introduce MIND-V, a cognitive hierarchical world model designed to synthesize physically plausible and logically coherent videos of long-horizon robotic manipulation. Inspired by cognitive science, MIND-V bridges high-level reasoning with pixel-level synthesis through three core components: a Semantic Reasoning Hub (SRH) that leverages a pre-trained vision-language model for task planning; a Behavioral Semantic Bridge (BSB) that translates abstract instructions into domain-invariant representations; and a Motor Video Generator (MVG) for conditional video rendering. MIND-V employs Staged Visual Future Rollouts, a test-time optimization strategy to enhance long-horizon robustness. To enforce adherence to physical laws, we introduce a GRPO reinforcement learning post-training phase guided by a novel Physical Foresight Coherence (PFC) reward. PFC leverages the V-JEPA2 world model as a physics referee to penalize implausible dynamics in the latent feature space. Experiments confirm MIND-V's state-of-the-art performance in long-horizon simulation and its significant value for policy learning, introducing a scalable and fully autonomous framework for embodied data synthesis.
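
For readers unfamiliar with the post-training step, the following is a minimal sketch of how a latent-space coherence score could feed a GRPO-style update. It is not the authors' implementation: the abstract does not give the PFC formula, so cosine similarity between predicted and observed latent vectors is used here purely as a stand-in, and the names coherence_reward and group_relative_advantages are hypothetical. The only part taken from GRPO as commonly described is the normalization of each sample's reward against the mean and standard deviation of its own group.

```python
import math
from statistics import mean, pstdev


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def coherence_reward(predicted_latents, observed_latents):
    """Stand-in for a physics-coherence reward (assumption, not the paper's PFC).

    Averages, over time steps, the cosine similarity between the latent a
    world model predicted and the latent of the frame actually generated.
    """
    sims = [cosine_similarity(p, o)
            for p, o in zip(predicted_latents, observed_latents)]
    return mean(sims)


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group.

    Uses the population standard deviation; eps avoids division by zero
    when every sample in the group received the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

In this sketch, several candidate videos generated for the same prompt would each receive a coherence score, and the group-relative advantages would then weight the policy update so that above-average candidates are reinforced and below-average ones are discouraged.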

Source

Originally published at arxiv.org.

Source: https://arxiv.org/abs/2512.06628
