1 Columbia University, 2 Adobe Research, 3 University of Toronto, 4 University of Cambridge, 5 University College London
* Work done during an internship at Adobe Research.
ViPS is a feed-forward model that lifts static auto-rigged meshes into a plausible, editable pose manifold, by distilling motion priors from a pretrained video diffusion model.
We introduce ViPS, a universal feed-forward model that lifts static, auto-rigged meshes into a plausible and editable pose manifold. ViPS leverages the rich priors of foundational video models to automatically reveal a pose space that enables (a) manifold-constrained editing, (b) smooth pose-space interpolation, and (c) pose-guided video synthesis that uses 3D proxies as structural guidance. The pose space is queried with a single 3D mesh and its auto-rig (obtained with RigAnything) to generate a manifold of plausible poses, while invalid configurations, such as unnatural bone twisting, naturally fall outside this manifold (indicated with red dots).
We first build a dataset of rigged meshes and plausible skeleton poses by combining video priors, per-frame 3D reconstruction, skeleton extraction, and pose optimization. We then train a diffusion model conditioned on the mesh and skeleton to model the distribution of plausible poses. At inference, we sample from the model, invert poses with DDIM inversion, and apply sparse constraints through guidance.
Rigged mesh poses from a video prior. We generate single-object videos with image-to-video priors, reconstruct per-frame meshes and align them in a common world space, then extract a skeleton from the first frame and optimize node positions to match each frame using Chamfer distance and edge-length regularization.
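As a rough illustration of the pose-fitting objective described above, the following minimal sketch combines a symmetric Chamfer term with an edge-length regularizer. It is an assumption-laden sketch, not the actual ViPS implementation: the function names, the squared-distance form of both terms, the `weight` parameter, and the choice of which two point sets are compared (here called `posed_points` and `frame_points`) are all hypothetical.

```python
import numpy as np

def chamfer_distance(posed_points, frame_points):
    """Symmetric Chamfer distance between point sets of shape (N, 3) and (M, 3).

    Uses mean squared nearest-neighbour distances in both directions
    (one common convention; the source does not state which variant is used).
    """
    diff = posed_points[:, None, :] - frame_points[None, :, :]
    d2 = (diff ** 2).sum(axis=-1)  # (N, M) pairwise squared distances
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

def edge_length_penalty(nodes, rest_nodes, edges):
    """Mean squared change in skeleton edge length relative to the rest pose.

    nodes, rest_nodes: (J, 3) node positions; edges: (E, 2) integer index pairs.
    """
    i, j = edges[:, 0], edges[:, 1]
    length_now = np.linalg.norm(nodes[i] - nodes[j], axis=1)
    length_rest = np.linalg.norm(rest_nodes[i] - rest_nodes[j], axis=1)
    return ((length_now - length_rest) ** 2).mean()

def pose_fit_loss(posed_points, frame_points, nodes, rest_nodes, edges, weight=1.0):
    """Data term plus regularizer; `weight` is a hypothetical trade-off parameter."""
    data_term = chamfer_distance(posed_points, frame_points)
    return data_term + weight * edge_length_penalty(nodes, rest_nodes, edges)
```

In an actual optimization loop this loss would be minimized over the node positions for every frame with an automatic-differentiation framework; how the posed point set is derived from the node positions is left open here.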
Learning a pose space. We train a diffusion model to denoise poses conditioned on the mesh, skeleton edges, rest pose, and semantic node features. Sampling uses a standard denoising schedule.
Constrained sampling. We reproduce a pose via DDIM inversion into a noise sample, and apply sparse constraints by nudging the denoising trajectory using an energy term.
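The inversion step can be illustrated with a minimal sketch of deterministic DDIM (η = 0) run in both directions. Everything here is an assumption made for illustration: the linear beta schedule, the step count, the function names, and the `eps_model(x, k)` callable standing in for the trained noise predictor. It shows only the update algebra, not the ViPS network or its conditioning.

```python
import numpy as np

def make_alpha_bars(num_steps=50, beta_start=1e-4, beta_end=2e-2):
    """Cumulative signal levels for a linear beta schedule (an assumed schedule)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def ddim_move(x, eps, a_from, a_to):
    """Deterministic DDIM update (eta = 0) from signal level a_from to a_to."""
    x0_pred = (x - np.sqrt(1.0 - a_from) * eps) / np.sqrt(a_from)
    return np.sqrt(a_to) * x0_pred + np.sqrt(1.0 - a_to) * eps

def ddim_invert(x0, eps_model, alpha_bars):
    """Map a clean pose to its noise latent by stepping toward higher noise."""
    levels = np.concatenate([[1.0], alpha_bars])  # level 0 is the clean pose
    x = x0
    for k in range(len(alpha_bars)):
        # Noise is predicted at the current (less noisy) state.
        x = ddim_move(x, eps_model(x, k), levels[k], levels[k + 1])
    return x

def ddim_decode(x_T, eps_model, alpha_bars):
    """Map a noise latent back to a pose by stepping toward lower noise."""
    levels = np.concatenate([[1.0], alpha_bars])
    x = x_T
    for k in reversed(range(len(alpha_bars))):
        # Noise is predicted at the current (noisier) state.
        x = ddim_move(x, eps_model(x, k + 1), levels[k + 1], levels[k])
    return x
```

With a real network the round trip `ddim_decode(ddim_invert(x0, ...), ...)` reproduces the pose only approximately, because inversion evaluates the noise prediction at a slightly different point of the trajectory than decoding does. The approximation generally improves as the number of steps grows.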
All examples below are from assets unseen during training, using only the rest mesh and its auto-rig as input.
To visualize the continuity of the learned pose space, we generate pose-space traversals by interpolating in latent noise space and decoding each intermediate step. We first obtain the noisy latent x_T for the start and end poses using deterministic DDIM inversion (η = 0), then interpolate between them using variance-preserving interpolation, and decode each step with DDIM. This produces smooth, semantically meaningful transitions that remain on the learned manifold, in contrast to direct interpolation in joint space.
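The exact interpolation formula is not given here, so the following is a minimal sketch of one common variance-preserving choice: weights cos(λπ/2) and sin(λπ/2), whose squares sum to one. The function names are hypothetical, and spherical interpolation (slerp) would be another reasonable option. A plain linear blend is included only for comparison.

```python
import numpy as np

def vp_interpolate(x_start, x_end, lam):
    """Blend two noise latents with weights whose squares sum to 1.

    For independent unit-variance latents the result keeps unit variance,
    so intermediate latents stay at the scale the denoiser expects.
    """
    theta = 0.5 * np.pi * lam
    return np.cos(theta) * x_start + np.sin(theta) * x_end

def lerp(x_start, x_end, lam):
    """Plain linear blend, which shrinks the variance for 0 < lam < 1."""
    return (1.0 - lam) * x_start + lam * x_end

# Usage: five latents along the traversal, each of which would then be decoded.
rng = np.random.default_rng(0)
z_a, z_b = rng.standard_normal((2, 24, 3))
traversal = [vp_interpolate(z_a, z_b, lam) for lam in np.linspace(0.0, 1.0, 5)]
```

The design point is that a linear blend of two independent Gaussian latents at λ = 0.5 has only half the variance of either endpoint, which places it off the noise distribution the model was trained on.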
ViPS enables precise inverse kinematics by projecting user-driven joint handles (orange → green) into the discovered plausible pose space. It generates poses that remain faithful to the learned prior while approximately satisfying sparse user constraints through guided sampling, where an energy function measures constraint violation and nudges each denoising step toward lower energy.
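To make the energy term concrete, here is a minimal sketch that assumes a quadratic energy over the constrained joints and a plain gradient step. The quadratic form, the function names, and `step_size` are assumptions; the text above does not specify the energy or how strongly it is weighted at each denoising step.

```python
import numpy as np

def constraint_energy(joints, handle_ids, handle_targets):
    """Half the summed squared distance between constrained joints and targets.

    joints: (J, 3) float array; handle_ids: unique integer indices of the
    constrained joints; handle_targets: (K, 3) target positions.
    """
    diff = joints[handle_ids] - handle_targets
    return 0.5 * float((diff ** 2).sum())

def constraint_energy_grad(joints, handle_ids, handle_targets):
    """Gradient of the energy: non-zero only on the constrained joints."""
    grad = np.zeros_like(joints)
    grad[handle_ids] = joints[handle_ids] - handle_targets
    return grad

def nudge(joints, handle_ids, handle_targets, step_size=0.1):
    """One gradient step toward lower energy; unconstrained joints are untouched."""
    grad = constraint_energy_grad(joints, handle_ids, handle_targets)
    return joints - step_size * grad
```

In guided sampling, a step like `nudge` would be interleaved with each denoising update, so the prior keeps the pose plausible while the energy gradient pulls the handles toward their targets. This is why the constraints are satisfied approximately rather than exactly.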
Our pose space provides a simple interface for generating keyframes that can steer a video diffusion model. We select keyframes along a pose-space traversal (or between edits), render the corresponding mesh+skeleton proxy, and supply these as conditioning frames. This enables controllable, semantically aligned video generation: the video model is free to synthesize appearance and texture, while the pose sequence provides precise 3D control.
We introduce a high-quality 4D motion dataset with correspondence, containing 127k poses spanning 100+ species and 200+ unique individuals, built from generative video priors with VLM guidance and 4D reconstruction. The dataset will be released upon acceptance.
We compare our generated data with Puppeteer independently of the feed-forward model. Both pipelines reconstruct poses from video frames; here, our pipeline fits poses to videos generated from scratch, while Puppeteer requires videos initialized from a render of the rest-pose mesh Mα. Puppeteer can miss limb configurations due to tracking errors under self-occlusion or large motion, whereas our 4D reconstruction avoids explicit inter-frame tracking and better follows the video frames.
Our data pipeline can be integrated with different video models as motion priors. Using the same input image and the same text prompt describing a robot running, we show pose optimization results obtained by swapping our current video prior, Wan2.2-TurboDiffusion, with Kling, Runway, and SeedDance.
Prompt: "A metallic humanoid robot with articulated joints and a sleek design stands in a static, full-body shot against a plain white background. The robot begins to run, showcasing its smooth motion and mechanical precision. Each step is fluid and realistic, maintaining natural weight and timing typical of a metallic structure. The environment remains simple and light-stable, ensuring a clear and stable view of the robot's running action from a fixed tripod position. The scene highlights the robot's dynamic movement without introducing any additional elements or transitions. The video is rendered in photorealistic, 4k, high-fidelity quality."
We compare artist-authored motion clips against samples from our generated data. The quality of artist-authored motion varies considerably across data sources. Free, open-source datasets such as Objaverse-XL often contain animations whose quality depends heavily on the asset origin and animator effort, and they may offer limited motion diversity for a specific object. In contrast, commercial datasets such as Truebones Zoo can provide higher-quality animations, but they are typically smaller in scale and more expensive to acquire.
Our method is still constrained by the input auto-rig, so issues such as poor bone placement, inaccurate skinning weights, or suboptimal topology can directly limit the learned pose space. It also inherits biases from the video prior and 4D extraction, which may struggle with rare motions, unusual species, or non-biological objects. More broadly, the current framework models plausible static articulations rather than full motion dynamics, and it does not yet capture more complex deformation behaviors such as soft materials, topology changes, or secondary physical effects.