RL Post-Training

ROVE

Unlocking Human Interventions for Humanoid Manipulation via Reinforcement Learning

¹XPENG Robotics, ²Fudan University, ³The Chinese University of Hong Kong, ⁴Shanghai Jiao Tong University
*Equal contribution   Corresponding author
arXiv:2606.17011
Abstract

Learning from humanoid experience under imperfect interventions

ROVE learns from mixed-quality rollout, adaptation, and recovery trajectories on real humanoid manipulation tasks.

Human interventions provide crucial corrective signals for post-training Vision-Language-Action (VLA) models. However, enabling seamless humanoid interventions is a formidable systems challenge due to complex whole-body kinematics and dexterous-hand control. Consequently, the collected intervention trajectories are often suboptimal, and methods that rely on human interventions as expert supervision can absorb hesitant, inefficient, or even erroneous behaviors.

To address both the systems and algorithmic challenges, we propose ROVE, a reinforcement learning framework for humanoid VLA post-training with imperfect human interventions. First, ROVE introduces a human-in-the-loop pipeline capable of collecting deployment and intervention data for humanoid manipulation. Second, it uses Optimistic Value Estimation (OVE) to prioritize high-value behaviors from mixed-quality trajectories. To make value estimation more robust, we incorporate cross-embodiment human experience videos, which provide rich supervision for long-tailed failure and recovery modes. The resulting critic yields informative advantage signals, steering the VLA actor to focus on high-value behaviors rather than indiscriminately imitating all actions. On challenging real-world contact-rich and fine-grained humanoid manipulation tasks, ROVE outperforms experience-learning baselines and consistently improves across multiple rollout-intervention iterations.

Method

ROVE: an RL framework for humanoid VLA post-training

Human-in-the-loop collection → stage-aware labels → optimistic critic → advantage-conditioned VLA policy extraction.

Overview of ROVE (an RL framework with Optimistic Value Estimation for humanoid VLA post-training).

Human-in-the-loop data collection. Each episode starts with an autonomous VLA rollout; when failure is imminent, a motion-capture operator takes over with whole-body and dexterous-hand teleoperation. Episodes are decomposed into rollout, adaptation, and recovery stages, and value labels place a conservative failure boundary at the end of the adaptation stage, so that hesitant takeover motion is not mislabeled as recovery.
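As a rough illustration of the stage-aware labeling described above, the sketch below splits one episode at the end of the adaptation stage. It is only one possible reading of the text: the `Segment` layout, the `split_at_failure_boundary` helper, and the "failure"/"success" outcome strings are all hypothetical names introduced here, not part of ROVE.

```python
from dataclasses import dataclass
from typing import List

# Stage names follow the text; the data layout is a hypothetical illustration.
ROLLOUT, ADAPTATION, RECOVERY = "rollout", "adaptation", "recovery"


@dataclass
class Segment:
    start: int    # first timestep of the segment (inclusive)
    end: int      # one past the last timestep (exclusive)
    outcome: str  # "failure" or "success" (assumed label vocabulary)


def split_at_failure_boundary(stages: List[str], final_success: bool) -> List[Segment]:
    """Split an episode at the end of the adaptation stage.

    Assumption: everything up to and including adaptation counts as leading
    to failure, so hesitant takeover motion is not credited as recovery.
    The recovery stage inherits the episode's final outcome.
    """
    final = "success" if final_success else "failure"
    # Index of the first recovery step, or None if no recovery stage exists.
    boundary = next((i for i, s in enumerate(stages) if s == RECOVERY), None)
    if boundary is None:
        # No recovery stage: keep the episode as a single segment.
        return [Segment(0, len(stages), final)]
    return [
        Segment(0, boundary, "failure"),          # rollout + adaptation
        Segment(boundary, len(stages), final),    # recovery
    ]


stages = [ROLLOUT] * 3 + [ADAPTATION] * 2 + [RECOVERY] * 4
segments = split_at_failure_boundary(stages, final_success=True)
```

Here the two adaptation steps fall on the failure side of the boundary, which is the conservative choice the paragraph describes.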

Optimistic value estimation for heterogeneous experience. The critic learns from heterogeneous experience: autonomous rollouts, human intervention trajectories, and cross-embodiment human experience videos. It is pretrained with Monte-Carlo regression on large-scale robot and egocentric human demonstrations, then fine-tuned with OVE, which combines an H-step TD bootstrap with expectile regression to estimate an in-distribution optimistic statistic, favoring the better recoveries observed in the data without querying out-of-distribution actions. Learning a state-value function (rather than a Q-function) lets the critic absorb human videos whose action spaces are not directly comparable to the robot's.
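The two ingredients named above, an H-step TD bootstrap and expectile regression, can be sketched in a few lines. This is a minimal sketch using the standard textbook forms of both, not the exact objective of ROVE, which this page does not spell out. The function names, the scalar reward convention, and the example numbers are assumptions.

```python
from typing import Sequence


def h_step_target(rewards: Sequence[float], values: Sequence[float],
                  t: int, horizon: int, gamma: float) -> float:
    """H-step TD target: discounted rewards plus a bootstrapped value.

    `values` has one more entry than `rewards`; values[-1] is the value of
    the terminal state (assumed 0.0 here). The horizon is truncated at the
    end of the episode.
    """
    h = min(horizon, len(rewards) - t)
    ret = sum(gamma ** k * rewards[t + k] for k in range(h))
    return ret + gamma ** h * values[t + h]


def expectile_loss(value: float, target: float, tau: float) -> float:
    """Asymmetric squared error |tau - 1(u < 0)| * u^2 with u = target - value.

    With tau > 0.5, under-estimating the target is penalized more than
    over-estimating it, which pushes the value toward the better outcomes
    present in the data (an optimistic, in-distribution statistic).
    """
    u = target - value
    weight = tau if u > 0 else 1.0 - tau
    return weight * u * u


rewards = [0.0, 0.0, 1.0]            # sparse success reward (assumed)
values = [0.2, 0.4, 0.7, 0.0]        # critic outputs, terminal value 0
target = h_step_target(rewards, values, t=0, horizon=2, gamma=1.0)
loss_under = expectile_loss(0.5, 1.0, tau=0.9)  # prediction below target
loss_over = expectile_loss(0.5, 0.0, tau=0.9)   # prediction above target
```

Because the loss only involves state values and observed rewards, it never evaluates an action that is absent from the data, which matches the motivation for learning V rather than Q.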

Advantage-conditioned policy extraction. The critic assigns advantage labels to action chunks, and the VLA actor is fine-tuned with advantage conditioning to emphasize high-value behaviors instead of uniformly imitating all collected actions.
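To make the advantage-labeling step concrete, the sketch below scores each action chunk with a chunk-level advantage computed from a state-value critic and converts it to a discrete label. The binary "positive"/"negative" vocabulary, the zero threshold, and the function names are assumptions for illustration; the page states only that the critic assigns advantage labels and that the actor is conditioned on them.

```python
from typing import List, Sequence, Tuple


def chunk_advantage(rewards: Sequence[float], values: Sequence[float],
                    t: int, chunk_len: int, gamma: float) -> float:
    """Advantage of the action chunk starting at t.

    A = sum_k gamma^k r_{t+k} + gamma^h V(s_{t+h}) - V(s_t),
    where h is the chunk length truncated at the end of the episode and
    `values` has one more entry than `rewards`.
    """
    h = min(chunk_len, len(rewards) - t)
    ret = sum(gamma ** k * rewards[t + k] for k in range(h))
    return ret + gamma ** h * values[t + h] - values[t]


def label_chunks(rewards: Sequence[float], values: Sequence[float],
                 chunk_len: int, gamma: float,
                 threshold: float = 0.0) -> List[Tuple[int, str]]:
    """Assign a discrete advantage label to each non-overlapping chunk."""
    labels = []
    for t in range(0, len(rewards), chunk_len):
        adv = chunk_advantage(rewards, values, t, chunk_len, gamma)
        labels.append((t, "positive" if adv > threshold else "negative"))
    return labels


rewards = [0.0, 0.0, 0.0, 0.0]
values = [0.5, 0.6, 0.7, 0.4, 0.3]   # value rises, then drops after a slip
labels = label_chunks(rewards, values, chunk_len=2, gamma=1.0)
```

In this toy episode the first chunk raises the estimated value and is labeled positive, while the second lowers it and is labeled negative. A conditioned actor would then be trained on both, with the label supplied as an extra input, and queried with the positive label at deployment.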

Results

Learning stronger VLA policies from experience

Turning imperfect rollouts and interventions into policy gains

Put the bread into the toaster (fine-grained manipulation).
Erase the whiteboard (contact-rich manipulation).

Experiments are conducted on two real-world humanoid manipulation tasks: Erase the whiteboard, a contact-rich task, and Put the bread into the toaster, a fine-grained task.

Policy improvement results on two real-world humanoid manipulation tasks. ROVE (Ours) outperforms SFT in the demonstration-only setting, achieves the best average performance among experience-learning methods, and consistently improves across multiple iterations of rollout and intervention data.

Baseline comparison. Trained only from teleoperated demonstrations, ROVE already improves over SFT on both tasks (left panels). We compare ROVE with experience-learning baselines, including HG-DAgger, Filtered BC, and RECAP. The figure above shows that ROVE achieves the best average success rate across tasks. A notable result is that HG-DAgger performs poorly, even below the base demonstration-only policy on one task, and its learned policy often exhibits hesitant behavior. This pattern reflects the suboptimality of directly imitating intervention data. Compared with RECAP, the remaining gap reflects the combined effect of critic quality and advantage assignment for post-adaptation intervention segments. Compared with Filtered BC, the gap suggests that RL-style policy learning provides additional gains beyond BC-style data filtering, particularly in how negative samples are incorporated during policy optimization.

Iterative improvement. We collect a comparable amount of real-world rollout and intervention data in each iteration, then update the value function and policy from the previous iteration. As shown in the figure above, ROVE consistently improves across three iterations on both tasks. This demonstrates a closed-loop improvement process: better policies collect more informative experience, and the value function provides increasingly useful advantage signals for subsequent policy updates.

Value estimation

Human experience improves value estimation. Adding human experience helps the critic assign lower values to incomplete erasing states and better reflect true task progress.

Human experience. Human experience videos make the critic less over-optimistic on incomplete states. Compared with the critic trained without human experience, the resulting value curve assigns lower values to partial erasing and better follows true task progress.

OVE provides sharper value estimates than Monte-Carlo estimates, producing clearer negative-advantage regions during failure and recovery.

Optimistic value estimation. OVE combines optimistic temporal-difference learning with expectile regression, producing sharper value estimates than Monte-Carlo regression on mixed success and failure data. It assigns lower values during failure, recovers value as the robot re-erases the board, and yields clearer negative-advantage regions for policy extraction.
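A small numerical example may help show why an expectile differs from a Monte-Carlo average on mixed success and failure data. The sketch below computes the tau-expectile of a set of returns observed from similar states by fixed-point iteration. The `expectile` helper and the sample returns are illustrative assumptions, not values from the experiments.

```python
from typing import Sequence


def expectile(samples: Sequence[float], tau: float, iters: int = 100) -> float:
    """tau-expectile of `samples` via iteratively reweighted averaging.

    The expectile e satisfies sum_i w_i * (x_i - e) = 0 with
    w_i = tau if x_i > e else 1 - tau. tau = 0.5 recovers the mean.
    """
    e = sum(samples) / len(samples)  # start from the Monte-Carlo mean
    for _ in range(iters):
        weights = [tau if x > e else 1.0 - tau for x in samples]
        e = sum(w * x for w, x in zip(weights, samples)) / sum(weights)
    return e


# Returns observed from similar states: three failures, one successful recovery.
returns = [0.0, 0.0, 0.0, 1.0]

mc_estimate = expectile(returns, tau=0.5)          # plain average, 0.25
optimistic_estimate = expectile(returns, tau=0.9)  # about 0.75
```

The optimistic estimate moves toward the successful outcome but cannot exceed the best return actually observed, which is one way to read "in-distribution optimistic statistic".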

Recovery from Failure

Retry and correction at deployment

Put bread into toaster: the slice misses the slot; the policy backs off and retries until insertion succeeds.
Erase whiteboard: after incomplete erasing, the policy re-engages and clears the remaining marks.

At deployment, ROVE policies exhibit explicit failure-recovery behavior: retrying after a misaligned insertion or returning to missed regions on the whiteboard. These behaviors are rarely observed in demonstration-only policies.

One More Thing...

Screw installation

An extremely challenging task demanding millimeter-level precision

One more result: ROVE completes screw installation, a high-precision humanoid assembly task.

Beyond the two core evaluation tasks, we additionally test ROVE on a screw installation scenario. The same learn-from-experience recipe extends to a high-precision assembly task, where the policy must coordinate dexterous hand motion, contact, and alignment to complete the task.

Citation

BibTeX

If you find ROVE useful, please cite our arXiv paper.

@article{xiao2026rove,
  title={ROVE: Unlocking Human Interventions for Humanoid Manipulation via Reinforcement Learning},
  author={Xiao, Wei and Tang, Weiliang and Ge, Yuying and Zhou, Hui and Mu, Yao and Zhang, Li and Ge, Yixiao},
  journal={arXiv preprint arXiv:2606.17011},
  year={2026}
}