
Finetuning Vision-Language-Action Models Requires Fewer Layers Than You Think

Robos News Newsroom

Editorial Desk

2026-06-19 · 2 min read


Published June 19, 2026 · Category: Robotics

Overview

Abstract (arXiv:2606.20246v1): Vision-Language-Action (VLA) models pre-trained on massive video-robot datasets have revolutionized robotic manipulation, yet their multi-billion-parameter architectures impose prohibitive computational burdens during downstream fine-tuning and real-time inference. In this work, we reveal a highly non-trivial architectural characteristic of these continuous control foundation policies (e.g., pi_0, GR00T-N1.5): despite being trained on diverse physical trajectories, they exhibit severe layer-wise representational redundancy.

To exploit this, we introduce a structural compression pipeline that is entirely training-free. Existing methods must load the full-scale model in order to learn optimized token reductions or dynamic layer selectors; our pipeline avoids that step. Instead, it uses a single forward pass with Centered Kernel Alignment (CKA) to identify redundant layer features, then removes twin layers to permanently compress the model depth by up to 50% across both the VLM backbone and the continuous control policy head.

Downstream fine-tuning of this streamlined architecture yields a dual acceleration benefit: a 40-50% reduction in training time and up to 30% faster real-time inference, while matching or exceeding the performance of the full-scale base model. We comprehensively validate our method across three simulation benchmarks (LIBERO, RoboCasa, SimplerEnv) and 10 diverse real-world manipulation tasks across 4 unique robotic embodiments. These results show that advanced VLAs require significantly fewer layers than previously assumed, offering a highly compute-efficient paradigm for scalable robot learning.
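The abstract does not spell out how Centered Kernel Alignment is used to flag redundant layers, so the following is only a minimal sketch of the general idea, not the authors' pipeline. It assumes the standard linear form of CKA, applies it to synthetic activation matrices rather than a real VLA, and marks a layer as redundant when its output is nearly identical (under CKA) to the last layer that was kept. The function names `linear_cka` and `find_redundant_layers` and the 0.98 threshold are hypothetical choices made for illustration.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two activation matrices of shape (n_samples, dim)."""
    # Center each feature so the score ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


def find_redundant_layers(activations, threshold=0.98):
    """Return indices of layers whose output matches the last kept layer.

    `activations[i]` is the (n_samples, dim) output of layer i, collected in
    one forward pass. The threshold is an illustrative assumption.
    """
    redundant = []
    last_kept = 0
    for i in range(1, len(activations)):
        if linear_cka(activations[last_kept], activations[i]) >= threshold:
            redundant.append(i)  # layer i adds almost nothing new
        else:
            last_kept = i  # layer i changes the representation; keep it
    return redundant


# Synthetic stand-in for per-layer activations (assumption: no real model).
rng = np.random.default_rng(0)
base = rng.standard_normal((256, 32))
rotation, _ = np.linalg.qr(rng.standard_normal((32, 32)))
fresh = rng.standard_normal((256, 32))

acts = [
    base,             # layer 0
    base @ rotation,  # layer 1: rotated copy of layer 0, so CKA is ~1
    fresh,            # layer 2: unrelated features, so CKA is low
    2.0 * fresh,      # layer 3: rescaled copy of layer 2, so CKA is ~1
]

redundant = find_redundant_layers(acts)
print("redundant layer indices:", redundant)
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling of the features, which is why the rotated and rescaled layers in the toy example count as "twins" of their predecessors. In a real model the activations would come from hooks on each transformer block during a single forward pass over a calibration batch, and the flagged blocks would then be dropped before fine-tuning.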

Source

Originally published at arxiv.org.

Source: https://arxiv.org/abs/2606.20246

