Robotics

CAST: Counterfactual Labels Improve Instruction Following in Vision-Language-Action Models

Robos News Newsroom

Editorial Desk

2026-06-10 · 2 min read


Overview

Abstract (arXiv:2508.13446v2): Generalist robots should be able to understand and follow user instructions. Despite providing a powerful architecture for mapping open-vocabulary language instructions to robot actions, current vision-language-action (VLA) models struggle to follow fine-grained commands. One cause for this is a lack of semantic diversity and language grounding in existing robot datasets and, specifically, a lack of fine-grained task diversity for similar observations. To address this, we present a novel method to augment existing robot datasets by leveraging vision-language models to create counterfactual labels. By augmenting existing datasets with these labels, we increase the diversity and granularity of language grounding for robot datasets, ultimately improving the language-following capabilities of VLAs. We evaluate the resulting model's ability to follow language instructions, ranging from simple object-centric commands to complex referential tasks, by conducting vision-language navigation experiments in 3 different indoor and outdoor environments. Our experiments show that counterfactual relabeling (without additional data collection) significantly improves instruction-following in VLA policies, outperforming state-of-the-art methods and doubling the success rate compared to VLAs trained on unaugmented data. We also evaluate our method for manipulation VLAs and find a similar gain in performance on tasks with distractors.
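To make the idea of counterfactual relabeling concrete, here is a minimal Python sketch. It is an illustration only and not the authors' implementation: the abstract does not say how the vision-language model is prompted or how actions for the new instructions are obtained. The `Sample` type, the `augment_with_counterfactuals` function, and the `toy_proposer` stand-in for a vision-language model are all hypothetical names introduced here. The sketch shows just the data-side effect the abstract describes: the same observation ends up paired with several distinct instructions, with no new data collected.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Sample:
    observation: str  # stand-in for an image or short image sequence
    instruction: str  # language label
    action: str       # stand-in for an action chunk or trajectory


# Hypothetical interface (an assumption, not from the paper): given an
# observation, return alternative (instruction, action) pairs that would
# also be plausible in that scene.
Proposer = Callable[[str], List[Tuple[str, str]]]


def augment_with_counterfactuals(
    dataset: List[Sample], propose: Proposer
) -> List[Sample]:
    """Return the original samples plus counterfactual relabelings."""
    augmented = list(dataset)
    for sample in dataset:
        seen = {sample.instruction}  # skip labels the sample already has
        for instruction, action in propose(sample.observation):
            if instruction in seen:
                continue
            seen.add(instruction)
            augmented.append(Sample(sample.observation, instruction, action))
    return augmented


def toy_proposer(observation: str) -> List[Tuple[str, str]]:
    # Fixed answers standing in for a vision-language model call.
    return [
        ("go forward", "forward"),
        ("turn left at the door", "left"),
        ("move toward the chair", "right"),
    ]


dataset = [Sample("hallway_frame_0", "go forward", "forward")]
augmented = augment_with_counterfactuals(dataset, toy_proposer)

for s in augmented:
    print(s.observation, "|", s.instruction, "|", s.action)
```

In this toy run the single hallway frame gains two extra instruction-action pairs, while the duplicate "go forward" proposal is skipped. A real pipeline would replace `toy_proposer` with a model query and would need some way to check that each proposed action is feasible in the scene.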

Source

Originally published at arxiv.org.

Source: https://arxiv.org/abs/2508.13446


