Visual-Language-Guided Task Planning for Horticultural Robots
Overview
Abstract (arXiv:2601.11906v2): Crop monitoring is essential for precision agriculture, but current systems lack high-level reasoning. We introduce a novel, modular framework that uses a Vision Language Model (VLM) to guide robotic task planning by actively querying heterogeneous data sources, including enriched RGB camera feeds and 2D semantic occupancy maps, interleaved with robotic action primitives. We contribute a comprehensive benchmark for short- and long-horizon crop monitoring tasks in monoculture and polyculture environments. Our results show that while zero-shot VLMs perform robustly for short-horizon tasks (achieving 87% success, comparable to human experts), success drops significantly to under 10% for complex long-horizon, multi-target tasks. Despite this decline, task completion rates remain above 76% under noiseless conditions. Critically, the system degrades when relying on noisy semantic maps, demonstrating a key limitation in current VLM context grounding for sustained robotic operations. This work offers a deployable framework and critical insights into VLM capabilities and shortcomings for complex agricultural robotics.
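As a rough illustration of what "querying data sources interleaved with action primitives" could look like, the following Python sketch runs a loop in which a planner alternates between reading a data source and triggering a robot primitive. This is not the authors' implementation: `PlannerLoop`, `scripted_policy`, the source and primitive names, and the returned strings are all hypothetical, and a fixed script stands in for the VLM so the example runs without any model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Every name here is hypothetical and chosen for illustration only.
Step = Tuple[str, str]  # (kind, name); kind is "query", "act", or "done"


@dataclass
class PlannerLoop:
    policy: Callable[[str, List[str]], Step]   # stand-in for the VLM
    sources: Dict[str, Callable[[], str]]      # e.g. camera feed, semantic map
    primitives: Dict[str, Callable[[], str]]   # robot action primitives
    max_steps: int = 10
    history: List[str] = field(default_factory=list)

    def run(self, task: str) -> List[str]:
        """Alternate between data queries and actions until the policy stops."""
        for _ in range(self.max_steps):
            kind, name = self.policy(task, self.history)
            if kind == "done":
                break
            table = self.sources if kind == "query" else self.primitives
            result = table[name]()
            # Feed each result back so the next decision can depend on it.
            self.history.append(f"{kind}:{name} -> {result}")
        return self.history


def scripted_policy(task: str, history: List[str]) -> Step:
    """Fixed script replacing a real VLM call (assumption for this sketch)."""
    script = [("query", "semantic_map"), ("act", "move_to_row"),
              ("query", "rgb_camera"), ("done", "")]
    return script[min(len(history), len(script) - 1)]


loop = PlannerLoop(
    policy=scripted_policy,
    sources={"semantic_map": lambda: "tomato row at (2, 5)",
             "rgb_camera": lambda: "ripe fruit visible"},
    primitives={"move_to_row": lambda: "arrived"},
)
trace = loop.run("inspect the tomato row")
for line in trace:
    print(line)
```

In a setup like this, a noisy semantic map would correspond to a source callable that returns misleading text, which is one way to picture how errors in context could propagate through later planning steps.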
Source
Originally published at arxiv.org.
Source: https://arxiv.org/abs/2601.11906