AGiLe: Learning Robust Long-Horizon Manipulation via Affordance-Grounded Bidirectional Latent Planning

Abstract

The robust execution of long-horizon manipulation tasks remains a central challenge in embodied intelligence, requiring both coherent high-level planning and reliable low-level control. Existing approaches face two critical limitations: the accumulation of prediction errors in subgoal planning, which compounds deviations over time; and the planning-execution gap, where abstract high-level plans fail to be grounded in the continuous perception-action space. To address these challenges, we propose a unified framework, Affordance-Grounded Bidirectional Latent Planning (AGiLe).

AGiLe introduces a bidirectional latent planning mechanism that jointly optimizes a backward planner and a forward critic. The backward planner generates goal-directed subgoals from the final objective, while the forward critic assesses their reachability, ensuring temporal robustness through sustained consistency over long horizons. AGiLe further bridges the planning-execution gap by using affordance as structural guidance, grounding abstract subgoals into dense, pixel-level visual affordances that drive action. This enhances spatial robustness, letting the system adapt to semantic and visual distractors. Across simulation and real-world settings, AGiLe significantly outperforms strong baselines, achieving an 8.5% improvement over prior state-of-the-art methods.

Contributions

AGiLe framework. A unified approach integrating bidirectional latent planning (temporal robustness) with affordance grounding (spatial robustness) for robust long-horizon manipulation policies.
Bidirectional latent planning. A mechanism that jointly optimizes a backward planner and a forward critic, ensuring generated subgoals are both goal-directed and goal-reachable.
Affordance grounding. A mechanism that uses abstract plans as structural guidance to purify visual features, bridging high-level planning and low-level control. Extensive simulation and real-world experiments show AGiLe substantially outperforms existing methods.
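The bidirectional latent planning idea listed above can be illustrated with a toy sketch. It is not the authors' implementation: in AGiLe both the backward planner and the forward critic are learned and jointly optimized, whereas here `midpoint_planner`, `latent_distance`, and the `max_step` threshold are hypothetical stand-ins chosen only to show the data flow (subgoals generated backward from the goal, then checked forward for reachability).

```python
from typing import Callable, List

Latent = List[float]


def midpoint_planner(current: Latent, target: Latent) -> Latent:
    """Hypothetical stand-in for a learned backward planner:
    propose a subgoal halfway between the current state and the target."""
    return [(c + t) / 2.0 for c, t in zip(current, target)]


def backward_plan(
    current: Latent,
    goal: Latent,
    num_subgoals: int,
    planner: Callable[[Latent, Latent], Latent],
) -> List[Latent]:
    """Generate subgoals starting at the goal and working back toward the current state."""
    plan: List[Latent] = []
    target = goal
    for _ in range(num_subgoals):
        target = planner(current, target)  # each new subgoal is conditioned on the previous one
        plan.append(target)
    plan.reverse()  # return in execution order: nearest subgoal first
    return plan


def latent_distance(a: Latent, b: Latent) -> float:
    """Euclidean distance in latent space (an assumed reachability proxy)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def forward_critic(
    current: Latent, plan: List[Latent], goal: Latent, max_step: float
) -> List[bool]:
    """Walk the plan forward and flag each hop as reachable or not.
    A learned critic would replace this fixed distance threshold."""
    waypoints = [current] + plan + [goal]
    return [latent_distance(a, b) <= max_step for a, b in zip(waypoints, waypoints[1:])]


plan = backward_plan([0.0, 0.0], [8.0, 0.0], 2, midpoint_planner)
flags = forward_critic([0.0, 0.0], plan, [8.0, 0.0], max_step=3.0)
```

In this toy run the final hop to the goal is flagged as unreachable, which is the kind of signal a forward critic could feed back to the planner to request a denser or different set of subgoals.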

Method

Overview of AGiLe. A backward planner and forward critic produce a temporally consistent subgoal plan in latent space; affordance grounding then translates each abstract subgoal into dense pixel-level cues that drive low-level action.
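One way to picture the grounding step described in this overview is as a per-pixel heatmap that reweights visual features before they reach the low-level policy. The sketch below is an assumption made for illustration, not the AGiLe architecture: the dot-product similarity, the softmax normalization, and the names `affordance_map` and `purify` are all hypothetical.

```python
import math
from typing import List

Feature = List[float]
FeatureGrid = List[List[Feature]]  # rows x columns x channels


def affordance_map(pixel_features: FeatureGrid, subgoal: Feature) -> List[List[float]]:
    """Score every pixel by dot-product similarity with the subgoal latent,
    then normalize with a softmax over all pixels (assumed scoring rule)."""
    scores = [
        [sum(f * s for f, s in zip(feat, subgoal)) for feat in row]
        for row in pixel_features
    ]
    peak = max(max(row) for row in scores)  # subtract the max for numerical stability
    exps = [[math.exp(v - peak) for v in row] for row in scores]
    total = sum(sum(row) for row in exps)
    return [[e / total for e in row] for row in exps]


def purify(pixel_features: FeatureGrid, heatmap: List[List[float]]) -> FeatureGrid:
    """Reweight each pixel's feature vector by its affordance score,
    suppressing regions that are irrelevant to the current subgoal."""
    return [
        [[w * f for f in feat] for feat, w in zip(feat_row, heat_row)]
        for feat_row, heat_row in zip(pixel_features, heatmap)
    ]


# A 2x2 grid of 2-channel features; only the top-left pixel matches the subgoal.
features = [[[5.0, 0.0], [0.0, 5.0]], [[0.0, 0.0], [0.0, 1.0]]]
heatmap = affordance_map(features, subgoal=[1.0, 0.0])
gated = purify(features, heatmap)
```

Here the distractor pixel at the top right has a large feature magnitude but low similarity to the subgoal, so its gated features are scaled close to zero.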

Experiments

AGiLe is evaluated on the LIBERO-LONG benchmark (10 long-horizon tasks, 50 demonstrations each) and on four real-world xArm6 tasks comprising four or six sub-stages.

Real-world rollouts. AGiLe maintains stage-level consistency across long multi-step instructions, outperforming the strongest planning baseline (LBP).

Real-World Demo

AGiLe executing long-horizon tasks on a real xArm6, planning and grounding each sub-stage from a single language instruction.

BibTeX

@inproceedings{chen2026agile,
  title     = {AGiLe: Learning Robust Long-Horizon Manipulation via
               Affordance-Grounded Bidirectional Latent Planning},
  author    = {Chen, Zixuan and Feng, Xiangrong and Shi, Jieqi and
               Shao, Lin and Huo, Jing and Gao, Yang},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision
               and Pattern Recognition (CVPR)},
  year      = {2026}
}

Supported in part by the New Generation Artificial Intelligence National Science and Technology Major Project (2025ZD0122902), NSFC (62192783, 62276128, 62506153), Jiangsu Science and Technology Major Project (BG2025035), the Fundamental Research Funds for the Central Universities (KG202514), and the Collaborative Innovation Center of Novel Software Technology and Industrialization.


AGiLe combines bidirectional latent planning with affordance grounding to achieve both temporal and spatial robustness in long-horizon manipulation.