SPIRAL: Self-Evolving Action-Conditioned Video Generation via Reflective Planning Agents

Yu Yang*,1,2,3 Yue Liao*,3 Jianbiao Mei*,1,2 Baisen Wang*,4 Xuemeng Yang2 Licheng Wen2 Jiangning Zhang1,5
Xiangtai Li6 Liang Lv7 Hanlin Chen3 Botian Shi2 Yong Liu1,† Shuicheng Yan3 Gim Hee Lee3
1Zhejiang University   2Shanghai AI Laboratory   3National University of Singapore
4Chinese Academy of Sciences   5Tencent Youtu Lab   6Nanyang Technological University   7Wuhan University
*Equal contribution   †Corresponding author

Abstract

Long-Horizon Action-Conditioned Video Generation: Challenges and Solution. (a) General text-and-image-to-video (TI2V) generation is single-shot and open-loop, often causing incomplete actions and hallucinated motions. (b) We propose a closed-loop think-act-reflect framework for iterative planning, generation, and verification. (c) We introduce the ActVideoGen-Dataset and Benchmark for task-specific experiments. (d) Our closed-loop design enables self-evolution, so that video generation quality improves continually.

Long-horizon action-conditioned video generation aims to synthesize temporally coherent videos that follow complex action instructions over extended horizons. Existing single-shot video generation models typically operate in an open-loop manner, which leads to incomplete action execution, hallucinated motions, and temporal drift. To address this, we propose SPIRAL, a closed-loop framework that performs Sequential Planning and Iterative Reflection for Action-conditioned Long-horizon video generation. Specifically, a PlanAgent decomposes a high-level goal into sub-actions that condition video generation, while a CriticAgent evaluates intermediate video segments and provides corrective feedback for iterative refinement. This closed-loop design also supports self-evolution: planning and verification signals are used for GRPO-based post-training to improve the video generator's consistency and action quality over extended horizons. Moreover, we introduce ActVideoGen-Dataset and ActVideoGen-Bench for training and evaluation. Experiments across multiple TI2V backbones with self-evolution show consistent gains on ActVideoGen-Bench and VBench, demonstrating the effectiveness of SPIRAL.

Method Overview

SPIRAL Overview. (a) Closed-Loop Framework: PlanAgent decomposes abstract goals into atomic plans for action-conditioned video generation, CriticAgent evaluates videos and triggers dual-level inner/outer refinement feedback. (b) Self-Evolving via GRPO: guided by PlanAgent, VideoGenerator produces rollouts and is optimized using CriticAgent rewards.
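The self-evolving stage in (b) relies on GRPO, whose core step is to standardize each rollout's reward against the other rollouts sampled for the same input. The sketch below shows only that normalization. The function name `group_relative_advantages`, the `eps` constant, the choice of population standard deviation, and the reward values are illustrative assumptions; how CriticAgent actually scores rollouts is not described on this page.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of rollouts (GRPO-style)."""
    if not rewards:
        return []
    mean = fmean(rewards)
    std = pstdev(rewards)  # population std over the group (an assumption)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical CriticAgent rewards for four rollouts of the same sub-action.
critic_rewards = [0.2, 0.9, 0.5, 0.4]
advantages = group_relative_advantages(critic_rewards)
```

Rollouts scored above the group mean receive a positive advantage and those below receive a negative one, so no separate value network is needed to form a baseline.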

PlanAgent

Decomposes a high-level goal and visual context into ordered, object-centric action plans with explicit pre-conditions and post-conditions for each generation step.
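One way to picture such a plan is as a list of records whose post-conditions feed the pre-conditions of later steps. The following is a minimal sketch under that reading; the `ActionStep` schema, its field names, and the `first_unmet_step` helper are hypothetical and not taken from the SPIRAL implementation. As a simplification, post-conditions only add facts to the state and never remove them.

```python
from dataclasses import dataclass, field


@dataclass
class ActionStep:
    """One atomic sub-action in an ordered plan (hypothetical schema)."""
    instruction: str
    target_object: str
    preconditions: set = field(default_factory=set)
    postconditions: set = field(default_factory=set)


def first_unmet_step(plan, initial_state):
    """Return the index of the first step whose pre-conditions are not
    satisfied by the accumulated state, or None if the plan is consistent."""
    state = set(initial_state)
    for index, step in enumerate(plan):
        if not step.preconditions <= state:
            return index
        state |= step.postconditions  # facts established by this step
    return None


plan = [
    ActionStep("Remove the gas cap", "gas cap",
               {"car parked"}, {"cap removed"}),
    ActionStep("Insert the fuel nozzle", "fuel nozzle",
               {"cap removed"}, {"nozzle inserted"}),
]
```

With this structure, a plan that skips "Remove the gas cap" is flagged at the nozzle step, which mirrors the "Missing Action" failure shown for the open-loop baseline.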

VideoGenerator

Synthesizes each video segment from the current sub-action and accumulated context, enabling long-horizon generation through step-wise controllable execution.

CriticAgent

Evaluates action-video alignment, detects local failures or global drift, and returns feedback that triggers refinement, regeneration, or replanning.

SPIRAL agent flow: PlanAgent sends action plans to VideoGenerator; VideoGenerator sends video segments to CriticAgent; CriticAgent sends local refine feedback to VideoGenerator and global replan feedback to PlanAgent.
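The two feedback paths can be sketched as a small routing function. This is an assumption-laden illustration: the score names `alignment` and `drift`, their [0, 1] range, and both thresholds are hypothetical, since the page does not state how CriticAgent decides between local refinement and global replanning.

```python
from enum import Enum


class Feedback(Enum):
    ACCEPT = "accept"
    LOCAL_REFINE = "local_refine"    # routed back to VideoGenerator
    GLOBAL_REPLAN = "global_replan"  # routed back to PlanAgent


def route_feedback(alignment, drift, align_min=0.7, drift_max=0.5):
    """Map two hypothetical critic scores in [0, 1] to a feedback type.

    Global drift is checked first because regenerating a single segment
    cannot repair a plan that no longer matches the scene.
    """
    if drift > drift_max:
        return Feedback.GLOBAL_REPLAN
    if alignment < align_min:
        return Feedback.LOCAL_REFINE
    return Feedback.ACCEPT
```

For example, `route_feedback(0.4, 0.1)` selects local refinement, while any call with `drift` above the threshold selects replanning regardless of alignment.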

Full Pipeline Demo

A side-by-side view of the agentic execution process and the final long-horizon video.

Full Pipeline Process

PlanAgent, VideoGenerator, and CriticAgent collaborate through planning, generation, verification, and refinement.

Long-Horizon Video Result

The final generated video composed from verified step-wise segments.

End-to-end pipeline of SPIRAL. Given a user goal, PlanAgent decomposes the task into step-wise actions, VideoGenerator synthesizes each segment, and CriticAgent verifies alignment before final long-horizon composition.

Feedback Refinement Demo

A side-by-side view of CriticAgent-triggered local refinement and the corrected video result.

Local Refinement Trigger

CriticAgent detects an execution issue and triggers local refinement to regenerate and optimize the video segment.

Refined Video Result

The corrected local refinement result after regenerating the problematic step.

Closed-loop feedback refinement. CriticAgent detects local failures, SPIRAL refines the action instruction, regenerates a corrected segment, and continues the procedure without propagating errors.
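The detect, refine, and regenerate cycle described above can be sketched as a bounded retry loop. The three callables, the retry budget `max_retries`, and the toy stand-ins are hypothetical placeholders for the real agents, which this page does not specify at the code level.

```python
def generate_with_refinement(instruction, generate, critique, refine,
                             max_retries=2):
    """Generate one segment, retrying with a refined instruction on failure.

    generate(instruction) -> segment
    critique(instruction, segment) -> (ok, issue)
    refine(instruction, issue) -> new instruction
    All three callables are hypothetical stand-ins for the agents.
    """
    history = []
    for attempt in range(max_retries + 1):
        segment = generate(instruction)
        ok, issue = critique(instruction, segment)
        history.append((instruction, ok))
        if ok:
            return segment, history
        if attempt < max_retries:
            instruction = refine(instruction, issue)
    # Budget exhausted: the caller may escalate to global replanning.
    return None, history


# Toy stand-ins: the "video" is a string, and the critic passes only
# instructions that mention keeping the towel in hand.
def toy_generate(instruction):
    return f"<segment for: {instruction}>"


def toy_critique(instruction, segment):
    if "towel in hand" in instruction:
        return True, ""
    return False, "towel disappears mid-action"


def toy_refine(instruction, issue):
    return f"{instruction}, keeping the towel in hand"


segment, history = generate_with_refinement(
    "Dry hands with a towel", toy_generate, toy_critique, toy_refine
)
```

Only a segment that passes the critic is returned, so a failed attempt never becomes context for the next step; that is the sense in which errors are not propagated.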

Results Gallery

Baseline Comparison

Install Computer RAM

Open-Loop Baseline

Step 1: Open the Back Cover
Step 2: Insert the RAM (Physical Violation)

Closed-Loop SPIRAL (Ours)

Step 1: Open the Back Cover
Step 2: Insert the RAM

Pump Gas

Open-Loop Baseline

Step 1: Remove the Gas Cap
Step 2: Insert the Fuel Nozzle (Missing Action)

Closed-Loop SPIRAL (Ours)

Step 1: Remove the Gas Cap
Step 2: Insert the Fuel Nozzle

Wash Hands

Open-Loop Baseline

Step 1: Rinse with Water
Step 2: Dry with a Towel (Incomplete Action)

Closed-Loop SPIRAL (Ours)

Step 1: Rinse with Water
Step 2: Dry with a Towel

Wash Onion and Green Pepper

Open-Loop Baseline

Step 1: Wash the Onion
Step 2: Wash the Green Pepper (Sudden Switch)

Closed-Loop SPIRAL (Ours)

Step 1: Wash the Onion
Step 2: Wash the Green Pepper

Make Milk Tea

Open-Loop Baseline

Step 1: Pour in Hot Tea (Physical Violation)
Step 2: Pour in Milk (Sudden Switch)

Closed-Loop SPIRAL (Ours)

Step 1: Pour in Hot Tea
Step 2: Pour in Milk

Perform a Magic Trick

Open-Loop Baseline

Step 1: Show a Blank Piece of Paper
Step 2: Fold the Paper to Produce Money (Sudden Switch)

Closed-Loop SPIRAL (Ours)

Step 1: Show a Blank Piece of Paper
Step 2: Fold the Paper to Produce Money

Use Fire Extinguisher

Open-Loop Baseline

Step 1: Pull the Safety Pin (Physical Violation)
Step 2: Spray the Fire

Closed-Loop SPIRAL (Ours)

Step 1: Pull the Safety Pin
Step 2: Spray the Fire

Clean the Seat

Open-Loop Baseline

Step 1: Spray Cleaner
Step 2: Wipe with a Cloth (Physical Violation)

Closed-Loop SPIRAL (Ours)

Step 1: Spray Cleaner
Step 2: Wipe with a Cloth

Ultra Long-Horizon

Tomato Preparation & Storage

Open-Loop Baseline

Step 1: Open the Refrigerator
Step 2: Take Out Tomatoes
Step 3: Wash the Tomatoes (Physical Violation)
Step 4: Cut the Tomatoes (Physical Violation)
Step 5: Seal in a Bag (Incomplete Action)

Closed-Loop SPIRAL (Ours)

Step 1: Open the Refrigerator
Step 2: Take Out Tomatoes
Step 3: Close the Right Door
Step 4: Close the Left Door
Step 5: Place on the Cutting Board
Step 6: Wash the Tomatoes
Step 7: Cut the Tomatoes
Step 8: Seal in a Bag

Make Tomato & Cucumber Salad

Open-Loop Baseline

Step 1: Slice Tomatoes & Cucumbers (Missing Actions)
Step 2: Place Tomatoes & Cucumbers in Bowl (Incomplete Action)
Step 3: Get and Pour Salad Dressing (Physical Violation)

Closed-Loop SPIRAL (Ours)

Step 1: Slice Cucumbers
Step 2: Place Cucumbers in Plate
Step 3: Slice Tomatoes
Step 4: Get Salad Bowl
Step 5: Place Tomatoes in Bowl
Step 6: Place Cucumbers in Bowl
Step 7: Get Salad Dressing
Step 8: Pour the Dressing
Step 9: Toss with Spoon

Citation

@article{yang2026spiral,
  title   = {SPIRAL: Self-Evolving Action-Conditioned Video Generation via Reflective Planning Agents},
  author  = {Yang, Yu and Liao, Yue and Mei, Jianbiao and Wang, Baisen and Yang, Xuemeng and Wen, Licheng and Zhang, Jiangning and Li, Xiangtai and Lv, Liang and Chen, Hanlin and Shi, Botian and Liu, Yong and Yan, Shuicheng and Lee, Gim Hee},
  journal = {arXiv preprint arXiv:2603.08403},
  year    = {2026}
}