Cross-Embodiment Transfer

Inside Fe0

What Cross-Embodiment Data Teaches an Embodied Foundation Model

heterogeneous robots and humans

XPENG Robotics / Multimodal Team Contributors
Introduction

Opening the Cross-Embodiment Box

From Heterogeneous Experience to Robot Execution

Real-Robot Trials Multi-Level Generalization across Objects and Tasks

In embodied AI, it is a widely held belief that cross-embodiment data—a diverse mixture encompassing both heterogeneous robot data and egocentric human videos—enhances robot models. Yet, this consensus remains an opaque black box: we know it works, but we rarely understand exactly what this broad spectrum of data is teaching the robot.

To demystify this, we introduce Fe0, an embodied foundation model. Rather than scaling data to chase benchmarks, we designed a rigorous controlled experiment. We deliberately constrained our in-domain teleoperation data to the minimum: just 3 hours of trajectories featuring only a single, atomic pick-and-place primitive, where 95% of the data involves moving a single object to a single receptacle from a highly restricted vocabulary. By establishing this extreme baseline, we guarantee that any zero-shot generalization Fe0 exhibits beyond in-domain grasping is unequivocally powered by cross-embodiment learning.

But how do we measure this transfer scientifically? Instead of ad-hoc testing, we constructed a systematic L1–L5 capability framework—spanning from basic visual robustness and semantic grounding, up to complex relational reasoning, compositional planning, and novel action execution. We evaluate the model through both rigorous open-loop trajectory scoring and extensive closed-loop real-world trials on the IRON-R01-1.11 robot to unpack the true anatomy of cross-embodiment data.

We reveal a clear capability boundary: while heterogeneous data effectively teaches the model visual semantics and basic task logic, it falters at mid-level relational and sequential reasoning, and hits a hard ceiling when transferring precise physical execution. By systematically evaluating exactly where cross-embodiment data succeeds and where it fails, we illuminate what this heterogeneous experience actually teaches an embodied foundation model, and what it takes to push beyond current limits.

Data

A Glance at the Data

The Extreme Asymmetry Between Pool and Anchor

The cross-embodiment pool is deliberately diverse, spanning human activity and heterogeneous robots with dexterous hands and grippers to cover broad scenes, tasks, and action primitives. The target-embodiment teleoperation anchor, however, is intentionally minimal. It is restricted to the IRON-R01-1.11 humanoid robot and its action vocabulary is dominated by a single primitive: pick-and-place.

The scale disparity between the two is massive. The cross-embodiment pool is over three hundred times larger in terms of episodes, nearly a thousand times larger in frame count, and over fifteen hundred times larger in total duration compared to the target anchor. This extreme asymmetry makes the transfer analysis interpretable. When a trial moves beyond the anchor’s restricted object set, receptacle set, task logic, or action primitive, it directly tests whether knowledge from the broader pool can be transferred on the humanoid body.

Pool

Cross-Embodiment Pool

Spanning human activity and heterogeneous robots with dexterous hands and grippers.

Source Scope
human activityheterogeneous robotsdexterous handsgrippers
Action Shape
multiple embodimentsvaried scenesbroad tasksdiverse action primitives
Anchor

IRON Teleoperation

Intentionally minimal homogeneous teleoperation data.

Source Scope
IRON-R01-1.11
Action Shape
pick-and-place1/2-obj→1-container
Episodes345×
Frames909×
Hours1,587×

Zooming into the target-embodiment anchor makes its constraints concrete, and reveals just how narrow the in-domain objects and containers really are.

Extreme simplicity in scale and action primitives. The dataset contains merely ~3 hours of demonstrations with an average duration of just 6.1 seconds per episode. Crucially, the physical action space is strictly confined to a single, highly uniform primitive: basic pick-and-place. Every demonstration strictly follows a rigid approach → grasp → lift → transport → release kinematic sequence.

Minimal task complexity and semantic diversity. Furthermore, the task complexity and semantic diversity are heavily restricted. An overwhelming 95% of the trajectories represent the most atomic manipulation scenario: transferring a single object to a single receptacle. The remaining 5% of the data involves placing two objects onto different tiers of a single shelf, and is strictly limited to only three object categories (apple, banana, and bread). Across the entire dataset, the visual and semantic scope is tightly bounded to 20 common household items and 6 fixed target locations (such as boxes, plates, or handing over to a person) as shown below, uniformly driven by highly templated language instructions.

Objects
breadbananabowlapplewhiteboard erasermarkerplastic bottlespoonmouseblockcokeclothcuptissuesscrewdrivertennis ballpaper cupplush toyfruitegg
Containers
boxshelfplatebasketpersonbowl
In-domain object reference frame
In-domain container reference frame
Evaluation

Five Axes of Generalization

From What to Look at, to How to Act

To make cross-embodiment transfer measurable, Fe0 is evaluated through a Multi-Axis Decomposition Framework rather than a single aggregate metric. The five axes are not arbitrary capability buckets—each one is designed to evaluate a specific type of reusable knowledge that the heterogeneous data pool can plausibly supply, mapping cleanly from perception to action: visual robustness and semantic grounding at the input side, relational reasoning and compositional planning in the middle, and novel action prediction at the output side.

For every layer we collected a matched set of teleoperated trajectories on the IRON-R01-1.11 to serve as ground truth. The model is then evaluated open-loop: along each recorded trajectory we query the model frame by frame and score it by the MSE between its predicted actions and the corresponding teleoperated actions. Unlike a closed-loop trial on the real robot, this protocol scores every frame from the ground-truth state rather than from where the model would have ended up—the success of preceding steps is never enforced. As a result, scores on long-horizon, multi-step layers (most visibly L4 Compositional Planning) may diverge from real-robot outcomes, where errors normally compound across steps.

L1What to look at

Visual Robustness

Does the model preserve task-relevant attention when background, appearance, object pose, or distractors change?

L2What things are

Semantic Grounding

Can a novel object name, category, or functional description be grounded as a reachable target?

L3How to relate

Relational Reasoning

Can attributes, comparisons, and spatial predicates select the intended object or placement relation?

L4How to plan

Compositional Planning

Can familiar atomic actions be combined into multi-object or multi-step behavior while maintaining progress state?

L5How to act

Action Generalization

Can the policy execute motions whose contact pattern departs from the pick-and-place template?

L1

Visual Robustness

L1-01Scene Transfer L1-02Appearance Generalization L1-03Distractor Robustness
L2

Semantic Grounding

L2-01Unseen Object Transfer L2-02Unseen Receptacle Transfer L2-03Implicit Reasoning
L3

Relational Reasoning

L3-01Unseen Action-Object Pair L3-02Superlative Selection L3-03Attribute Binding L3-04Spatial Relations
L4

Compositional Planning

L4-01Semantic Classification L4-02Compositional & Pairing Logic L4-03Same-class Multi-Instance L4-04Many-to-one Aggregation L4-053D Multi-tier Categorical Allocation
L5

Action Generalization

L5-01Pose Generalization L5-02Pose Restoration L5-03Contact-based Push L5-04Planar Constrained Motion
Results

Open-Loop Numbers, Real-Robot Behavior

Where Cross-Embodiment Data Transfers, and Where It Stalls

We build Fe0 based on UniT1 and an in-house multimodal model to reveal what heterogeneous embodied experience actually teaches an embodied foundation. We measure transfer along two complementary views, both evaluated across the L1–L5 task framework. The first is open-loop: a numeric MSE score against teleoperated trajectories. The second is on the robot: closed-loop rollouts on IRON-R01-1.11. Together they give us both what the model predicts and what the model executes.

Open-Loop: MSE Decomposed Across L1–L5

Average MSE by Group

Mean across all models with different cross-embodiment data ratio.

0 .02 .04 .06 Validation MSE L1 L2 L3 L4 L5 Evaluation group average

Relative MSE Change

Change from the 0-ratio baseline; lower is better.

+5 0 -5 -10 -15 Change from 0-ratio (%) 0 12.5% 25% 50% 100% ALL -7.9% baseline Cross-embodiment data ratio
I The Big Picture: More Data Brings Overall Gains

The opening chart shows that increasing cross-embodiment data from 0% to 100% consistently lowers open-loop validation loss, reducing aggregate error by about 8% versus the zero-transfer baseline. This confirms that heterogeneous data improves generalization overall, but the averaged trend alone does not show which capabilities drive the gain. We detail the breakdown in this section.

II The Difficulty Ladder: Perception is Easy, Action is Hard

The “Average MSE by Group” chart shows three difficulty tiers: Visual Robustness (L1) and Semantic Grounding (L2) form the low-error perception baseline; Relational Reasoning (L3) and Compositional Planning (L4) are moderately harder; and Action Generalization (L5) is the most challenging. Overall, the model handles perception well, while novel physical action execution remains its main hurdle.

III The Uneven Payoff: Not All Skills Benefit Equally

The “Relative MSE Change” chart shows that added data benefits each dimension unevenly. Visual Robustness (L1), Semantic Grounding (L2), and Relational Reasoning (L3) gain the most, Compositional Planning (L4) follows the overall trend, while Action Generalization (L5) improves only slightly. This suggests that cross-embodiment data helps more with perception and reasoning than with physical action generalization.

IV The Bumpy Road: Transfer is Not a Straight Line

Although the overall trend declines, Action Generalization (L5) and Relational Reasoning (L3) show non-linear patterns. L5 first worsens with limited cross-embodiment data before improving at higher volumes, while L3 briefly regresses mid-way before recovering at full data. This suggests that some capabilities depend not only on data quantity, but also on sufficient scale and source diversity.

On a real Robot: Trials Across the Same Axes

Numbers tell us what the model predicts; a real robot tells us what it executes. We ran closed-loop trials on IRON-R01-1.11 across the same L1–L5 axes, then placed the model trained with 100% cross-embodiment data head-to-head against the 0% baseline on identical scenarios. The video grid at the top of this page shows the successful rollouts; the matrix below summarizes how each axis holds up on a real robot.

Axis 100% Cross-Embodiment Baseline (0%)
L1 Visual Robustness
Pass

Handles every visual-robustness dimension, reliably isolating the target among distractors.

Fail

Cannot identify targets among distractors, failing at the most fundamental step of "seeing the target"—it tends to grab whatever object is convenient.

L2 Semantic Grounding
Pass

Grounds novel semantic targets—unseen objects and receptacles—with ease.

Fail

Fails to recognize unseen objects and receptacles like peppers, tape measures, or electronic scales.

L3 Relational Reasoning
Pass

Reliably executes novel action-object pairings and superlative selections, with only precise attribute and spatial-relation binding remaining unstable.

Fail

Fails superlative selections and novel action-object combos (e.g. handing items over), and struggles severely with attribute and spatial relations.

L4 Compositional Planning
Partial

Achieves successful multi-step execution across multiple dimensions, and overcomes the teleoperation bias of stopping once a container is populated. Multi-step plans remain fragile.

Complete Failure

On multi-object or categorical placement it freezes completely or grasps erratically, unable to overcome the teleoperation bias—it stops prematurely at populated containers.

L5 Action Generalization
Limited

Executes challenging pose generalization and restoration, and pushing, but cannot handle planar constrained motion.

Complete Failure

Incapable of anything outside the standard pick-and-place template (e.g. uprighting, pushing to close, or wiping).

The closed-loop trials on the IRON-R01-1.11 robot directly mirror the open-loop trends, translating the numeric MSE drop-offs into visible execution failures. The 0% baseline model fails completely across all levels, remaining rigidly confined to the narrow scope of its teleoperation data and unable to handle any variation in objects, scenes, or multi-step logic.

In stark contrast, the 100% cross-embodiment model achieves full autonomy in fundamental perception and semantic grounding (L1–L2). However, as task complexity increases, its performance becomes fragmented. While it can successfully select superlatives (e.g., the largest/smallest object), it struggles with precise spatial relations (L3). It successfully overcomes teleoperation bias to complete certain multi-step plans, yet these long-horizon executions remain fragile (L4). Finally, while it can execute contact-based pushing (e.g., closing a cabinet door), it completely fails at planar constrained motions (L5).

Ultimately, these real-world results confirm that while cross-embodiment data is the critical catalyst for unlocking generalized cognition, simply scaling heterogeneous trajectories under current paradigms is not enough to master higher-order logic and complex physical action. We dissect the underlying structural bottlenecks behind this capability drop-off in the following section.

Takeaways

The Anatomy of Cross-Embodiment Transfer

From Universal Cognition to the Bottlenecks of Physical Execution

Evaluating transfer across the L1–L5 framework reveals a clear capability boundary in cross-embodiment learning. While scaling heterogeneous data effectively solves visual semantics and basic task logic, it falters at mid-level relational reasoning and hits a hard ceiling when transferring precise physical execution. This capability drop-off exposes deep structural bottlenecks: the lossy nature of multimodal representations, the challenge of overcoming behavioral biases inherent in teleoperation data, and the fundamental limits of learning physical interaction from raw trajectories. True physical generalization requires new representational and learning paradigms, not just more data.

Vision and Semantics Transfer Free: Cross-Embodiment Data First Teaches What Remains Stable

The most immediate signal in heterogeneous data is what remains invariant across humans and diverse robots: visual features, semantic concepts, and functional affordances. By leveraging the vision-language alignment of the foundation model, we observe clear gains in Visual Robustness (L1) (e.g., ignoring distractors, adapting to new scenes) and Semantic Grounding (L2) (e.g., generalizing to unseen objects, receptacles, and implicit instructions). Through this universal visual-language bridge, cross-embodiment data successfully teaches the model "what to look at" and "what things are" before it teaches "how to act."

Mid-Level Logic Falters: Relational and Sequential Tasks Expose Representational Limits

Transferring higher-order logic proves significantly harder than basic perception due to two distinct bottlenecks. First, a relational representation bottleneck limits Relational Reasoning (L3): while basic semantic selection transfers well, binding precise spatial relationships or multi-object attributes struggles because current multimodal representations often fail to maintain strict correspondence between objects and their relative properties. Second, a behavioral bias bottleneck limits Compositional Planning (L4): executing multi-step continuous tasks forces the model to fight an ingrained "terminate early" instinct, driven by the overwhelmingly single-step nature of the anchor data.

Physics Doesn't Scale Easily: Action Transfer Requires a New Learning Paradigm

Cross-embodiment data clearly demonstrates intent—showing that an actor is picking, pushing, or wiping. What it does not directly specify is the physical structure that makes the action executable: contact mode, constraint geometry, force direction, etc. This helps explain why L5 action generalization shows the weakest transfer: these variables are only weakly expressed in raw trajectories and do not transfer cleanly across bodies by scale alone. The remaining gap suggests that action learning may need a new data paradigm that captures transferable physical interaction structure, rather than relying only on more heterogeneous data.

Credits

Contributors

Yizhuo Li*, Xiaoyu Zhang*, Yuying Ge*†, Teng Wang, Boyu Chen, Yi Chen, Bo Liu, Feng Qiu, Yuguo Gan, Hui Zhou, Yixiao Ge

  • *Co-first authors.
  • Project lead.
@article{li2026fe0,
  author = {Yizhuo Li and Xiaoyu Zhang and Yuying Ge and Teng Wang and Boyu Chen and Yi Chen and Bo Liu and Feng Qiu and Yuguo Gan and Hui Zhou and Yixiao Ge},
  title = {Inside Fe0: What Cross-Embodiment Data Teaches an Embodied Foundation Model},
  journal = {XPENG Robotics Blog},
  year = {2026},
  note = {https://xpeng-robotics.github.io/fe0/},
}