Cross-Embodiment Pool
Spanning human activity and heterogeneous robots with dexterous hands and grippers.
What Cross-Embodiment Data Teaches an Embodied Foundation Model
heterogeneous robots and humans
From Heterogeneous Experience to Robot Execution
In embodied AI, it is a widely held belief that cross-embodiment data—a diverse mixture encompassing both heterogeneous robot data and egocentric human videos—enhances robot models. Yet, this consensus remains an opaque black box: we know it works, but we rarely understand exactly what this broad spectrum of data is teaching the robot.
To demystify this, we introduce Fe0, an embodied foundation model. Rather than scaling data to chase benchmarks, we designed a rigorous controlled experiment. We deliberately constrained our in-domain teleoperation data to the minimum: just 3 hours of trajectories featuring only a single, atomic pick-and-place primitive, where 95% of the data involves moving a single object to a single receptacle from a highly restricted vocabulary. By establishing this extreme baseline, we guarantee that any zero-shot generalization Fe0 exhibits beyond in-domain grasping is unequivocally powered by cross-embodiment learning.
But how do we measure this transfer scientifically? Instead of ad-hoc testing, we constructed a systematic L1–L5 capability framework—spanning from basic visual robustness and semantic grounding, up to complex relational reasoning, compositional planning, and novel action execution. We evaluate the model through both rigorous open-loop trajectory scoring and extensive closed-loop real-world trials on the IRON-R01-1.11 robot to unpack the true anatomy of cross-embodiment data.
We reveal a clear capability boundary: while heterogeneous data effectively teaches the model visual semantics and basic task logic, it falters at mid-level relational and sequential reasoning, and hits a hard ceiling when transferring precise physical execution. By systematically evaluating exactly where cross-embodiment data succeeds and where it fails, we illuminate what this heterogeneous experience actually teaches an embodied foundation model, and what it takes to push beyond current limits.
The Extreme Asymmetry Between Pool and Anchor
The cross-embodiment pool is deliberately diverse, spanning human activity and heterogeneous robots with dexterous hands and grippers to cover broad scenes, tasks, and action primitives. The target-embodiment teleoperation anchor, however, is intentionally minimal. It is restricted to the IRON-R01-1.11 humanoid robot and its action vocabulary is dominated by a single primitive: pick-and-place.
The scale disparity between the two is massive. The cross-embodiment pool is over three hundred times larger in terms of episodes, nearly a thousand times larger in frame count, and over fifteen hundred times larger in total duration compared to the target anchor. This extreme asymmetry makes the transfer analysis interpretable. When a trial moves beyond the anchor’s restricted object set, receptacle set, task logic, or action primitive, it directly tests whether knowledge from the broader pool can be transferred on the humanoid body.
Spanning human activity and heterogeneous robots with dexterous hands and grippers.
Intentionally minimal homogeneous teleoperation data.
Zooming into the target-embodiment anchor makes its constraints concrete, and reveals just how narrow the in-domain objects and containers really are.
Extreme simplicity in scale and action primitives. The dataset contains merely ~3 hours of demonstrations with an average duration of just 6.1 seconds per episode. Crucially, the physical action space is strictly confined to a single, highly uniform primitive: basic pick-and-place. Every demonstration strictly follows a rigid approach → grasp → lift → transport → release kinematic sequence.
Minimal task complexity and semantic diversity. Furthermore, the task complexity and semantic diversity are heavily restricted. An overwhelming 95% of the trajectories represent the most atomic manipulation scenario: transferring a single object to a single receptacle. The remaining 5% of the data involves placing two objects onto different tiers of a single shelf, and is strictly limited to only three object categories (apple, banana, and bread). Across the entire dataset, the visual and semantic scope is tightly bounded to 20 common household items and 6 fixed target locations (such as boxes, plates, or handing over to a person) as shown below, uniformly driven by highly templated language instructions.
From What to Look at, to How to Act
To make cross-embodiment transfer measurable, Fe0 is evaluated through a Multi-Axis Decomposition Framework rather than a single aggregate metric. The five axes are not arbitrary capability buckets—each one is designed to evaluate a specific type of reusable knowledge that the heterogeneous data pool can plausibly supply, mapping cleanly from perception to action: visual robustness and semantic grounding at the input side, relational reasoning and compositional planning in the middle, and novel action prediction at the output side.
For every layer we collected a matched set of teleoperated trajectories on the IRON-R01-1.11 to serve as ground truth. The model is then evaluated open-loop: along each recorded trajectory we query the model frame by frame and score it by the MSE between its predicted actions and the corresponding teleoperated actions. Unlike a closed-loop trial on the real robot, this protocol scores every frame from the ground-truth state rather than from where the model would have ended up—the success of preceding steps is never enforced. As a result, scores on long-horizon, multi-step layers (most visibly L4 Compositional Planning) may diverge from real-robot outcomes, where errors normally compound across steps.
Does the model preserve task-relevant attention when background, appearance, object pose, or distractors change?
Can a novel object name, category, or functional description be grounded as a reachable target?
Can attributes, comparisons, and spatial predicates select the intended object or placement relation?
Can familiar atomic actions be combined into multi-object or multi-step behavior while maintaining progress state?
Can the policy execute motions whose contact pattern departs from the pick-and-place template?
Where Cross-Embodiment Data Transfers, and Where It Stalls
We build Fe0 based on UniT1 and an in-house multimodal model to reveal what heterogeneous embodied experience actually teaches an embodied foundation. We measure transfer along two complementary views, both evaluated across the L1–L5 task framework. The first is open-loop: a numeric MSE score against teleoperated trajectories. The second is on the robot: closed-loop rollouts on IRON-R01-1.11. Together they give us both what the model predicts and what the model executes.
Mean across all models with different cross-embodiment data ratio.
Change from the 0-ratio baseline; lower is better.
The opening chart shows that increasing cross-embodiment data from 0% to 100% consistently lowers open-loop validation loss, reducing aggregate error by about 8% versus the zero-transfer baseline. This confirms that heterogeneous data improves generalization overall, but the averaged trend alone does not show which capabilities drive the gain. We detail the breakdown in this section.
The “Average MSE by Group” chart shows three difficulty tiers: Visual Robustness (L1) and Semantic Grounding (L2) form the low-error perception baseline; Relational Reasoning (L3) and Compositional Planning (L4) are moderately harder; and Action Generalization (L5) is the most challenging. Overall, the model handles perception well, while novel physical action execution remains its main hurdle.
The “Relative MSE Change” chart shows that added data benefits each dimension unevenly. Visual Robustness (L1), Semantic Grounding (L2), and Relational Reasoning (L3) gain the most, Compositional Planning (L4) follows the overall trend, while Action Generalization (L5) improves only slightly. This suggests that cross-embodiment data helps more with perception and reasoning than with physical action generalization.
Although the overall trend declines, Action Generalization (L5) and Relational Reasoning (L3) show non-linear patterns. L5 first worsens with limited cross-embodiment data before improving at higher volumes, while L3 briefly regresses mid-way before recovering at full data. This suggests that some capabilities depend not only on data quantity, but also on sufficient scale and source diversity.
Numbers tell us what the model predicts; a real robot tells us what it executes. We ran closed-loop trials on IRON-R01-1.11 across the same L1–L5 axes, then placed the model trained with 100% cross-embodiment data head-to-head against the 0% baseline on identical scenarios. The video grid at the top of this page shows the successful rollouts; the matrix below summarizes how each axis holds up on a real robot.
Handles every visual-robustness dimension, reliably isolating the target among distractors.
Cannot identify targets among distractors, failing at the most fundamental step of "seeing the target"—it tends to grab whatever object is convenient.
Grounds novel semantic targets—unseen objects and receptacles—with ease.
Fails to recognize unseen objects and receptacles like peppers, tape measures, or electronic scales.
Reliably executes novel action-object pairings and superlative selections, with only precise attribute and spatial-relation binding remaining unstable.
Fails superlative selections and novel action-object combos (e.g. handing items over), and struggles severely with attribute and spatial relations.
Achieves successful multi-step execution across multiple dimensions, and overcomes the teleoperation bias of stopping once a container is populated. Multi-step plans remain fragile.
On multi-object or categorical placement it freezes completely or grasps erratically, unable to overcome the teleoperation bias—it stops prematurely at populated containers.
Executes challenging pose generalization and restoration, and pushing, but cannot handle planar constrained motion.
Incapable of anything outside the standard pick-and-place template (e.g. uprighting, pushing to close, or wiping).
The closed-loop trials on the IRON-R01-1.11 robot directly mirror the open-loop trends, translating the numeric MSE drop-offs into visible execution failures. The 0% baseline model fails completely across all levels, remaining rigidly confined to the narrow scope of its teleoperation data and unable to handle any variation in objects, scenes, or multi-step logic.
In stark contrast, the 100% cross-embodiment model achieves full autonomy in fundamental perception and semantic grounding (L1–L2). However, as task complexity increases, its performance becomes fragmented. While it can successfully select superlatives (e.g., the largest/smallest object), it struggles with precise spatial relations (L3). It successfully overcomes teleoperation bias to complete certain multi-step plans, yet these long-horizon executions remain fragile (L4). Finally, while it can execute contact-based pushing (e.g., closing a cabinet door), it completely fails at planar constrained motions (L5).
Ultimately, these real-world results confirm that while cross-embodiment data is the critical catalyst for unlocking generalized cognition, simply scaling heterogeneous trajectories under current paradigms is not enough to master higher-order logic and complex physical action. We dissect the underlying structural bottlenecks behind this capability drop-off in the following section.
From Universal Cognition to the Bottlenecks of Physical Execution
Evaluating transfer across the L1–L5 framework reveals a clear capability boundary in cross-embodiment learning. While scaling heterogeneous data effectively solves visual semantics and basic task logic, it falters at mid-level relational reasoning and hits a hard ceiling when transferring precise physical execution. This capability drop-off exposes deep structural bottlenecks: the lossy nature of multimodal representations, the challenge of overcoming behavioral biases inherent in teleoperation data, and the fundamental limits of learning physical interaction from raw trajectories. True physical generalization requires new representational and learning paradigms, not just more data.
The most immediate signal in heterogeneous data is what remains invariant across humans and diverse robots: visual features, semantic concepts, and functional affordances. By leveraging the vision-language alignment of the foundation model, we observe clear gains in Visual Robustness (L1) (e.g., ignoring distractors, adapting to new scenes) and Semantic Grounding (L2) (e.g., generalizing to unseen objects, receptacles, and implicit instructions). Through this universal visual-language bridge, cross-embodiment data successfully teaches the model "what to look at" and "what things are" before it teaches "how to act."
Transferring higher-order logic proves significantly harder than basic perception due to two distinct bottlenecks. First, a relational representation bottleneck limits Relational Reasoning (L3): while basic semantic selection transfers well, binding precise spatial relationships or multi-object attributes struggles because current multimodal representations often fail to maintain strict correspondence between objects and their relative properties. Second, a behavioral bias bottleneck limits Compositional Planning (L4): executing multi-step continuous tasks forces the model to fight an ingrained "terminate early" instinct, driven by the overwhelmingly single-step nature of the anchor data.
Cross-embodiment data clearly demonstrates intent—showing that an actor is picking, pushing, or wiping. What it does not directly specify is the physical structure that makes the action executable: contact mode, constraint geometry, force direction, etc. This helps explain why L5 action generalization shows the weakest transfer: these variables are only weakly expressed in raw trajectories and do not transfer cleanly across bodies by scale alone. The remaining gap suggests that action learning may need a new data paradigm that captures transferable physical interaction structure, rather than relying only on more heterogeneous data.
Yizhuo Li*, Xiaoyu Zhang*, Yuying Ge*†, Teng Wang, Boyu Chen, Yi Chen, Bo Liu, Feng Qiu, Yuguo Gan, Hui Zhou, Yixiao Ge
@article{li2026fe0,
author = {Yizhuo Li and Xiaoyu Zhang and Yuying Ge and Teng Wang and Boyu Chen and Yi Chen and Bo Liu and Feng Qiu and Yuguo Gan and Hui Zhou and Yixiao Ge},
title = {Inside Fe0: What Cross-Embodiment Data Teaches an Embodied Foundation Model},
journal = {XPENG Robotics Blog},
year = {2026},
note = {https://xpeng-robotics.github.io/fe0/},
}