Learning Vision-Language-Action World Models for Autonomous Driving
📰 ArXiv cs.AI
arXiv:2604.09059v1 Announce Type: cross
Abstract: Vision-Language-Action (VLA) models have recently achieved notable progress in end-to-end autonomous driving by integrating perception, reasoning, and control within a unified multimodal framework. However, they often lack explicit modeling of temporal dynamics and global world consistency, which limits their foresight and safety. In contrast, world models can simulate plausible future scenes but generally struggle to reason about or evaluate the …