Robotics

CLAP: Contrastive Latent Action Pretraining for Learning Vision-Language-Action Models from Human Videos

Robos News Newsroom

Editorial Desk

2026-06-16 · 2 min read


Overview

arXiv:2601.04061v2

Abstract: Generalist Vision-Language-Action (VLA) models remain constrained by the scarcity of robotic data relative to the abundance of human video demonstrations. Existing Latent Action Models attempt to use video data but often suffer from visual entanglement, encoding noise rather than manipulation skills. To address this limitation, we propose Contrastive Latent Action Pretraining (CLAP), a framework that first uses Act-VAE to learn an executable action-token vocabulary from robot trajectories and then aligns human visual transitions with this vocabulary through contrastive learning. This alignment maps unlabeled human videos into a physically grounded latent action space rather than reconstructing appearance.

Building on the aligned tokens, we train CLAP-NTP as an autoregressive VLA using robot demonstrations and pseudo-labeled human videos, preserving instruction following and object generalization. For deployment and target-domain adaptation, we further introduce a post-training strategy that combines CLAP-RF, a Rectified Flow action head for low-latency continuous action chunk prediction, with Knowledge Matching regularization to preserve pretrained semantic knowledge during fine-tuning. Extensive experiments show that CLAP achieves strong performance against competitive baselines while enabling effective skill transfer from human videos to robotic execution.
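For readers unfamiliar with contrastive alignment, the idea of pulling a visual-transition embedding toward its matching entry in an action-token vocabulary can be illustrated with a generic InfoNCE-style loss. The sketch below is not taken from the paper: the abstract does not state CLAP's actual objective, encoder, or how positive pairs are formed. The function names (`info_nce`, `pseudo_label`), the cosine similarity, the temperature value, and the assumption that some transitions come with a known matching action code are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(transitions, action_codes, targets, temperature=0.1):
    """Mean InfoNCE loss over a batch (generic form, assumed for illustration).

    transitions:  embeddings of visual transitions (hypothetical encoder output)
    action_codes: embeddings of the action-token vocabulary (hypothetical codebook)
    targets:      index of the matching action code for each transition
    """
    total = 0.0
    for z, target in zip(transitions, targets):
        logits = [cosine(z, c) / temperature for c in action_codes]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of the matching code under a softmax.
        total += log_norm - logits[target]
    return total / len(transitions)


def pseudo_label(transition, action_codes):
    """Assign the most similar action code to an unlabeled transition."""
    sims = [cosine(transition, c) for c in action_codes]
    return max(range(len(sims)), key=sims.__getitem__)
```

Under these assumptions, the loss is small when each transition embedding lies closest to its matching action code and large otherwise, and `pseudo_label` shows how an aligned space could be used to tag unlabeled video transitions with action tokens. A trained system would learn the encoder by minimizing such a loss with gradient descent, which this sketch omits.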

Source

Originally published at arxiv.org.

Source: https://arxiv.org/abs/2601.04061


