HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System

📰 ArXiv cs.AI

arXiv:2604.14125v1 (announce type: cross)

Abstract: While end-to-end Vision-Language-Action (VLA) models offer a promising paradigm for robotic manipulation, fine-tuning them on narrow control data often compromises the profound reasoning capabilities inherited from their base Vision-Language Models (VLMs). To resolve this fundamental trade-off, we propose HiVLA, a visual-grounded-centric hierarchical framework that explicitly decouples high-level semantic planning from low-level motor control. In

Published 16 Apr 2026