
DIAL: Decoupling Intent and Action via Latent World Modeling for End-to-End VLA

Yi Chen1   Yuying Ge2   Hui Zhou2   Mingyu Ding3   Yixiao Ge2   Xihui Liu1

1The University of Hong Kong    2XPENG Robotics    3University of North Carolina at Chapel Hill

Contact: yyge13@gmail.com  &  xihuiliu@eee.hku.hk

The development of Vision-Language-Action (VLA) models has been significantly accelerated by pre-trained Vision-Language Models (VLMs). These powerful foundation models excel at jointly reasoning over visual inputs and language instructions, a capability that in principle should make them ideal brains for robotic manipulation. Yet most existing end-to-end VLAs treat the VLM as little more than a multimodal encoder, directly mapping its features to low-level motor commands. This paradigm underutilizes the VLM’s potential for high-level decision making and introduces training instability that frequently degrades its rich semantic representations.

We introduce DIAL (Decoupling Intent and Action via Latent World Modeling), a framework that bridges high-level reasoning and low-level execution through a differentiable latent intent bottleneck. Rather than mapping VLM features directly to actions, we ask the VLM to predict the future visual state in its own native ViT feature space. These features carry rich semantic structure rather than raw pixel-level appearance, so the prediction naturally encodes the VLM’s intent: what should change in the scene as a result of the robot’s action. This latent intent is the sole channel connecting the VLM to a lightweight motor policy, structurally encouraging the policy to ground every action in the VLM’s foresight rather than learning shortcuts around it. Because both systems share the same frozen ViT, each can be pre-trained independently and then unified end-to-end with no representation gap to bridge.

DIAL establishes a new state of the art on the RoboCasa GR1 Tabletop benchmark, achieving superior performance with 10× fewer demonstrations than prior methods. By leveraging heterogeneous human demonstrations, it learns physically grounded manipulation priors and exhibits robust zero-shot generalization to unseen objects and novel configurations during real-world deployment on our humanoid robot.

Why is this hard?

Translating a VLM’s high-level understanding into precise motor control remains one of the central challenges in embodied AI. Existing approaches fall into two paradigms, each with a fundamental limitation.

Hierarchical planners prompt VLMs to generate text or code plans for a separate low-level controller. While modular, this incurs inter-system latency and introduces a non-differentiable interface that blocks action gradients from refining the VLM’s physical understanding, making it difficult for the two systems to closely collaborate on complex tasks.

End-to-end VLAs take the opposite approach: they directly predict actions from the VLM’s representations. This is fully differentiable and allows tighter integration between perception and control, but it often reduces the VLM to a passive encoder. Even methods that add auxiliary foresight objectives suffer from a critical weakness—without a strict structural bottleneck, the policy can learn shortcuts that bypass true intent grounding.

Comparison of VLA architectures
Existing VLA paradigms and their limitations. Hierarchical planners (left) suffer from non-differentiable gaps. End-to-end VLAs (middle) often treat foresight as optional context. DIAL (right) enforces a structural bottleneck through latent intent prediction.

How does it work?

DIAL implements this idea as a dual-system architecture, as illustrated below. System-2 (top, the “Brain”) is the VLM performing latent world modeling; System-1 (bottom, the “Cerebellum”) is a lightweight inverse dynamics policy that turns the predicted intent into motor commands.

Dual-system architecture of DIAL. System-2 synthesizes latent foresight from language and visual input. System-1 fuses current observation with predicted intent for action generation via flow matching. Training switches from decoupled warmup to end-to-end optimization.
System-2

Latent World Modeling

Learnable query tokens are appended to the VLM’s LLM sequence. Their output representations are projected through an MLP head to synthesize the latent intent, trained via MSE alignment with ground-truth ViT features of the future observation.
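The query-token mechanism above can be sketched in a few lines of PyTorch. This is an illustrative reduction, not the released implementation: the dimensions, the `LatentWorldModel` name, and the bare `llm` callable are all assumptions, with the LLM backbone treated as a black box that maps a token sequence to hidden states.

```python
import torch
import torch.nn as nn

class LatentWorldModel(nn.Module):
    """Sketch of System-2's foresight head (illustrative shapes/names)."""

    def __init__(self, llm_dim=2048, vit_dim=1152, num_queries=64):
        super().__init__()
        # Learnable query tokens appended to the LLM input sequence
        self.queries = nn.Parameter(torch.randn(num_queries, llm_dim) * 0.02)
        # MLP head projecting query outputs into the frozen ViT feature space
        self.head = nn.Sequential(
            nn.Linear(llm_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, vit_dim)
        )

    def forward(self, llm, multimodal_tokens):
        # Append queries so they can attend to the vision + language context
        q = self.queries.unsqueeze(0).expand(multimodal_tokens.size(0), -1, -1)
        hidden = llm(torch.cat([multimodal_tokens, q], dim=1))
        # Only the query positions carry the predicted latent intent
        return self.head(hidden[:, -q.size(1):])

def world_model_loss(predicted_intent, future_vit_feats):
    # MSE alignment with ground-truth ViT features of the future observation
    return nn.functional.mse_loss(predicted_intent, future_vit_feats)
```

Because the prediction target lives in the VLM's own frozen ViT space, no extra alignment network is needed between the two systems.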

System-1

Latent Inverse Dynamics

A self-attention module fuses current ViT features with the predicted latent intent. A DiT-based decoder then generates action chunks via flow matching, conditioned on this fused signal and the robot’s proprioceptive state.
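A minimal sketch of this policy is below, with the DiT decoder collapsed into a small MLP for brevity; module sizes, the `InverseDynamicsPolicy` name, and the single-layer fusion block are assumptions. The flow-matching objective follows the standard linear-interpolation form: the network regresses the constant velocity from noise to the expert action chunk.

```python
import torch
import torch.nn as nn

class InverseDynamicsPolicy(nn.Module):
    """Sketch of System-1 (illustrative sizes; DiT reduced to an MLP)."""

    def __init__(self, feat_dim=16, act_dim=7, chunk=8, proprio_dim=14):
        super().__init__()
        # Self-attention fuses current ViT features with the predicted intent
        self.fuse = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=4, batch_first=True)
        # Stand-in for the DiT flow-matching decoder: predicts a velocity field
        self.decoder = nn.Sequential(
            nn.Linear(feat_dim + proprio_dim + chunk * act_dim + 1, 128),
            nn.GELU(),
            nn.Linear(128, chunk * act_dim),
        )
        self.chunk, self.act_dim = chunk, act_dim

    def velocity(self, current_feats, intent, proprio, noisy_actions, t):
        fused = self.fuse(torch.cat([current_feats, intent], dim=1)).mean(dim=1)
        x = torch.cat([fused, proprio,
                       noisy_actions.flatten(1), t.unsqueeze(1)], dim=1)
        return self.decoder(x).view(-1, self.chunk, self.act_dim)

def flow_matching_loss(policy, feats, intent, proprio, actions):
    # Linear path between noise and the expert action chunk; the network
    # regresses the constant velocity (actions - noise) along that path.
    noise = torch.randn_like(actions)
    t = torch.rand(actions.size(0))
    x_t = (1 - t).view(-1, 1, 1) * noise + t.view(-1, 1, 1) * actions
    v = policy.velocity(feats, intent, proprio, x_t, t)
    return nn.functional.mse_loss(v, actions - noise)
```

At inference, actions are produced by integrating the learned velocity field from a noise sample, as is standard for flow-matching policies.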

Training

Decoupled Warmup → End-to-End

Stage 1: System-2 learns foresight via the world modeling loss; System-1 learns control conditioned on ground-truth future features. Stage 2: System-1 switches to System-2’s predicted intent, enabling action-aware gradients to flow back into the VLM.
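The key mechanics of the stage switch can be captured in a short sketch, assuming a combined objective of the world-modeling MSE plus an action loss; the function name and signature are illustrative. What changes between stages is only the conditioning signal handed to System-1, and whether gradients from the action loss can reach System-2.

```python
import torch

def dial_losses(predicted_intent, future_feats, action_loss_fn, stage):
    """Illustrative two-stage objective: decoupled warmup -> end-to-end."""
    # Foresight objective: always align predictions with future ViT features
    world_loss = torch.nn.functional.mse_loss(predicted_intent, future_feats)
    if stage == 1:
        # Warmup: System-1 is conditioned on ground-truth future features,
        # so the two systems train independently (no gradient coupling).
        intent_for_policy = future_feats
    else:
        # End-to-end: System-1 consumes the live prediction; leaving it
        # attached lets action-aware gradients flow back into the VLM.
        intent_for_policy = predicted_intent
    return world_loss + action_loss_fn(intent_for_policy)
```

Since ground-truth future features carry no gradient, Stage 1 is implemented simply by swapping the conditioning tensor rather than by freezing parameters.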

Simulation benchmark results

We evaluate DIAL on the RoboCasa GR1 Tabletop benchmark, which spans 24 tasks across two categories: Pick & Place (18 tasks) and Articulated Tasks (6 tasks involving cabinets, drawers, and microwaves). With full training data (1,000 trajectories per task), DIAL achieves 70.2% average success rate, substantially outperforming the strongest prior method FLARE (55.0%) and advanced VLA architectures such as GR00T-N1.6 (47.6%).

Full-data benchmark results

We show some example rollouts from the simulation benchmark below. DIAL handles both object rearrangement and articulated fixture manipulation.

Perhaps more striking is DIAL’s sample efficiency. Under a strict few-shot setting with only 100 trajectories per task (10% of the full data), DIAL reaches 58.3%—surpassing FLARE’s 55.0% achieved with 10× more data. Systematic ablations reveal three essential design choices behind this efficiency:

World modeling matters. Without explicit world modeling, even a fine-tuned VLA baseline (GR00T-Qwen2.5-FT) reaches only 30.6%. Adding predictive foresight objectives lifts all variants well above this ceiling, confirming that grounding the VLM in future physical states is a prerequisite for strong performance.

The bottleneck is critical. Simply adding world modeling is not enough if the coupling is loose. Variants that concatenate predicted future tokens as auxiliary context (SEER-style, 49.6%) or use foresight only as a training-time regularizer (FLARE-style, 51.9%) both plateau below 52%. DIAL's structural bottleneck, where System-1 must derive actions by bridging current and predicted future states, reaches 58.3%.

Shared latent space is key. Replacing the VLM’s native ViT features with DINO-v2 as the prediction target drops performance from 58.3% to 47.2%. Even though DINO features carry strong geometric priors, the cross-manifold misalignment between reasoning and control undermines intent transfer. Both systems must operate in the same feature space.

Few-shot results and ablation study
Few-shot performance and ablation results. DIAL with 10% data outperforms FLARE with full data. Each ablation isolates a critical design choice.

Scaling with human demonstrations

DIAL’s decoupled design naturally accommodates heterogeneous data sources, including human demonstrations from a completely different embodiment. To evaluate how well these priors transfer, we construct three categories of OOD tasks by varying the simulation assets: unseen object types, unseen source-target container combinations, and unseen visual appearances. By incorporating cross-embodiment human demonstrations from the EgoDex basic_pick_place subset, DIAL boosts the average OOD success rate from 46.2% to 51.2%, with consistent gains across all three categories. In-distribution Pick & Place also improves from 56.0% to 60.8%.

Since the EgoDex subset contains only rearrangement interactions, articulated tasks see no improvement, but this domain-specific limitation further confirms that the gains stem from genuine cross-embodiment transfer rather than a general regularization effect.

Impact of human demonstrations
Impact of EgoDex human demonstrations. Cross-embodiment data improves in-distribution Pick & Place and all three OOD generalization categories.

Real-world generalization

We validate DIAL on the IRON-R01-1.11 humanoid robot with two tasks designed to match representative EgoDex subsets for cross-embodiment learning: Pick & Place (analogous to EgoDex basic_pick_place) and Pouring (analogous to EgoDex pour). All models are first pre-trained on a mixture of human and robot demonstrations, then fine-tuned on the task-specific robot trajectories. Under this protocol, DIAL achieves 77.5% in-distribution success and 58.3% out-of-distribution success.

Real-world task design and data sources
Real-world task design. Human EgoDex demonstrations (left) provide cross-embodiment priors for Pick & Place and Pouring tasks executed by the IRON-R01-1.11 humanoid (right).

Two factors prove critical for real-world performance. First, decoupled warmup is essential for training stability and for maintaining robust in-distribution and OOD performance on physical hardware. Second, human data pre-training is equally vital; without it, the OOD success rate drops sharply from 58.3% to just 26.7%. Together, these results confirm that DIAL's training recipe is not merely a simulation convenience but a genuine requirement for robust real-world deployment.

In-Distribution Performance

In-distribution results

Out-of-Distribution Performance

We test three types of generalization: combinatorial (multiple familiar objects present simultaneously, requiring language-grounded disambiguation), distractor robustness (unseen objects introduced as visual clutter), and instance-level transfer (novel object instances with unseen geometries or appearances in the pouring task). DIAL handles all three with no task-specific adaptation.

OOD results

What does DIAL actually learn?

To understand what System-2 is doing internally, we visualize its latent representations by mapping the first three PCA components to RGB channels. The figure below shows four quantities: the current observation’s latent features, the ground-truth future features, DIAL’s predicted foresight, and a cosine distance heatmap highlighting where the model anticipates the most change.

The predicted foresight closely mirrors the ground-truth future in task-relevant regions while diverging from the current observation precisely where manipulation is expected. The cosine distance map (rightmost column, warmer colors = more anticipated change) confirms that System-2 is not merely reconstructing the current scene—it is actively anticipating meaningful state transitions, generating a coherent “visual roadmap” for System-1 to follow.

Latent foresight visualization
PCA visualization of latent features. Predicted foresight aligns with the ground-truth future in task-relevant regions. The cosine distance heatmap (right) reveals where the model anticipates the greatest change.
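The visualization described above is straightforward to reproduce; the sketch below, under the assumption of per-patch latent features, maps the top three principal components to RGB and computes the per-patch cosine distance between current and predicted features. Function names and shapes are illustrative.

```python
import numpy as np

def pca_rgb(feats):
    """Map the first three PCA components of patch features to RGB in [0, 1].

    feats: (num_patches, dim) array of latent features for one image.
    """
    centered = feats - feats.mean(axis=0)
    # Principal directions via SVD of the centered feature matrix
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    proj = centered @ vt[:3].T                     # (num_patches, 3)
    lo, hi = proj.min(axis=0), proj.max(axis=0)
    return (proj - lo) / (hi - lo + 1e-8)          # normalize each channel

def cosine_distance_map(current, predicted):
    """Per-patch cosine distance: warmer values = more anticipated change."""
    c = current / (np.linalg.norm(current, axis=1, keepdims=True) + 1e-8)
    p = predicted / (np.linalg.norm(predicted, axis=1, keepdims=True) + 1e-8)
    return 1.0 - (c * p).sum(axis=1)
```

Reshaping the outputs back to the ViT patch grid yields the RGB maps and the heatmap shown in the figure.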

Where do we go from here?

We demonstrated that decoupling a VLM’s cognitive intent from a policy’s physical actions via latent world modeling yields a highly data-efficient and generalizable VLA. DIAL achieves state-of-the-art results in simulation and seamlessly transfers to real-world humanoid manipulation. Its two-stage training paradigm allows action-aware gradients to refine the VLM in a controlled manner, avoiding the representation collapse that commonly plagues naive end-to-end VLA training.

Several directions can extend this framework. On the architecture side, scaling up System-1’s capacity, compressing latent tokens for efficiency, and fine-tuning the ViT encoder end-to-end are natural next steps. On the data side, DIAL’s predictive objective is well-suited to leverage massive, action-free human videos, since latent world modeling requires only visual observations and no action labels.

Looking ahead, DIAL’s decoupled design opens a modular path for robotic intelligence. Because System-1 and System-2 communicate solely through a shared latent interface, motor experts trained for a specific robot can be paired with increasingly capable VLMs as they become available, without retraining the entire pipeline. More broadly, integrating latent world modeling into VLM pre-training itself could produce foundation models with a native understanding of physical dynamics, further closing the gap between semantic reasoning and embodied control.