Stable Part Diffusion 4D: Multi-View RGB and Kinematic Parts Video Generation

Figure 1 – Overview of SP4D

Stable Part Diffusion 4D (SP4D) transforms a single image or video into multi-view, multi-frame sequences with temporally consistent kinematic part segmentations. Unlike methods that only generate appearance, SP4D captures both visual structure and articulated geometry, producing animation-ready 3D assets with part-aware rigging capabilities.

Abstract

We present Stable Part Diffusion 4D (SP4D), a framework for generating paired RGB and kinematic part videos from monocular inputs. Unlike conventional part segmentation methods that rely on appearance-based semantic cues, SP4D learns to produce kinematic parts — structural components aligned with object articulation and consistent across views and time.

SP4D adopts a dual-branch diffusion model that jointly synthesizes RGB frames and corresponding part segmentation maps. To simplify the architecture and flexibly support varying part counts, we introduce a spatial color encoding scheme that maps part masks to continuous RGB-like images. This encoding allows the segmentation branch to share the latent VAE of the RGB branch, while enabling part segmentation to be recovered via straightforward post-processing.
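
As a rough illustration of this idea, the sketch below encodes an integer part mask as an RGB-like image and recovers the labels by grouping similar colors. The color rule (normalized part centroid plus a constant third channel), the function names `encode_parts` and `decode_parts`, and the tolerance `tol` are assumptions made for this example; this page does not specify SP4D's actual encoding or post-processing.

```python
import numpy as np


def encode_parts(mask: np.ndarray) -> np.ndarray:
    """Map an integer part mask (H, W), 0 = background, to a float image (H, W, 3)."""
    h, w = mask.shape
    image = np.zeros((h, w, 3), dtype=np.float32)
    for part_id in np.unique(mask):
        if part_id == 0:
            continue  # background stays black
        ys, xs = np.nonzero(mask == part_id)
        # Hypothetical rule: colour = normalised (x, y) centroid of the part,
        # plus a constant third channel that separates parts from background.
        color = (xs.mean() / max(w - 1, 1), ys.mean() / max(h - 1, 1), 1.0)
        image[mask == part_id] = color
    return image


def decode_parts(image: np.ndarray, tol: float = 0.1) -> np.ndarray:
    """Recover integer labels by greedily grouping foreground pixels of similar colour."""
    h, w, _ = image.shape
    pixels = image.reshape(-1, 3)
    labels = np.zeros(len(pixels), dtype=np.int64)
    unassigned = pixels[:, 2] > 0.5  # foreground test relies on the third channel
    next_id = 1
    while unassigned.any():
        seed = pixels[np.argmax(unassigned)]  # first unassigned pixel in raster order
        close = unassigned & (np.linalg.norm(pixels - seed, axis=1) < tol)
        labels[close] = next_id
        unassigned &= ~close
        next_id += 1
    return labels.reshape(h, w)
```

Because the colors are continuous, nothing in this sketch fixes the number of parts in advance. Note that the decoded labels are numbered in raster order of first appearance, so they match the original IDs only up to a relabeling.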

A Bidirectional Diffusion Fusion (BiDiFuse) module enhances cross-branch consistency, supported by a contrastive part consistency loss to promote spatial and temporal alignment of part predictions. We demonstrate that the generated 2D part maps can be lifted to 3D to derive skeletal structures and harmonic skinning weights with few manual adjustments.
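
To make the consistency objective more concrete, here is a minimal sketch of a supervised-contrastive (InfoNCE-style) loss over part embeddings. It assumes one pooled feature vector per part per view or frame; the function name `part_contrastive_loss`, the `temperature` value, and the pooling itself are illustrative assumptions, and the exact loss used in SP4D may differ.

```python
import numpy as np


def part_contrastive_loss(features: np.ndarray, part_ids: np.ndarray,
                          temperature: float = 0.1) -> float:
    """features: (N, D) part embeddings from several views/frames; part_ids: (N,) labels.

    Embeddings that share a part ID are pulled together, all others pushed apart.
    Requires N >= 2 and at least one part that appears more than once.
    """
    z = features / np.linalg.norm(features, axis=1, keepdims=True)
    n = len(z)
    self_mask = np.eye(n, dtype=bool)
    # Cosine-similarity logits; a sample is never compared with itself.
    logits = np.where(self_mask, -np.inf, z @ z.T / temperature)
    # Numerically stable log-softmax over every other sample.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    positives = (part_ids[:, None] == part_ids[None, :]) & ~self_mask
    has_pos = positives.any(axis=1)
    # Average log-probability of the positives for each anchor that has any.
    pos_sum = np.where(positives, log_prob, 0.0).sum(axis=1)
    per_anchor = -pos_sum[has_pos] / positives.sum(axis=1)[has_pos]
    return float(per_anchor.mean())
```

The loss is small when embeddings of the same part agree across views and time, and large when they do not, which is the behavior the consistency term is meant to encourage.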

To train and evaluate SP4D, we construct KinematicParts20K, a curated dataset of over 20K rigged objects selected and processed from Objaverse-XL, each paired with multi-view RGB and part video sequences. Experiments show that SP4D generalizes well to diverse scenarios, including real-world videos, novel generated objects, and rare articulated poses, producing kinematic-aware outputs suitable for downstream animation and motion-related tasks.

Method Overview

Figure 2 – Model Architecture
SP4D builds on SV4D 2.0 and introduces a dual-branch architecture: one branch generates RGB frames, and the other produces kinematic part segmentations. A BiDiFuse module enables bidirectional information exchange between the two branches, ensuring coherence between appearance and structure. In addition, a contrastive consistency loss enforces that the same part remains stable across different views and time steps.
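
The bidirectional exchange can be pictured with a very small sketch. It assumes the two branches expose feature maps of equal shape at a given layer and that each branch adds a linear projection of the other's features; the class name `BiDiFuseSketch`, the zero initialization, and the additive form are hypothetical choices for illustration, not the actual BiDiFuse design.

```python
import numpy as np


class BiDiFuseSketch:
    """Hypothetical bidirectional fusion between RGB and part feature streams."""

    def __init__(self, channels: int):
        # Zero-initialised projections (an assumption): before any training,
        # each branch passes its own features through unchanged.
        self.part_to_rgb = np.zeros((channels, channels))
        self.rgb_to_part = np.zeros((channels, channels))

    def __call__(self, h_rgb: np.ndarray, h_part: np.ndarray):
        # Each branch receives a linear projection of the other branch's features.
        fused_rgb = h_rgb + h_part @ self.part_to_rgb
        fused_part = h_part + h_rgb @ self.rgb_to_part
        return fused_rgb, fused_part


# Usage: features shaped (tokens, channels) from the two branches.
rng = np.random.default_rng(0)
h_rgb, h_part = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
fused_rgb, fused_part = BiDiFuseSketch(channels=8)(h_rgb, h_part)
```

With zero projections the layer is an identity on both streams; as the projections are learned, appearance features can inform the part branch and vice versa.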

Results

Figure 3 – Limitations of prior methods
Conventional approaches struggle with kinematic reasoning. Appearance-driven 2D segmentation models, such as SAM2, fail to capture motion-consistent parts. State-of-the-art 3D rigging methods often rely on category-specific priors and thus generalize poorly to novel shapes. Meanwhile, existing 3D segmentation methods focus on semantic regions (like "head" or "leg") rather than physically meaningful articulated parts, which limits their usefulness for animation.

Figure 4 – Multi-view kinematic part results
Qualitative results show that SP4D produces consistent kinematic part decompositions across both synthetic and real-world videos. The generated parts remain stable across multiple views and frames, even for diverse object categories and complex motions. Input frames are marked with purple boxes.

Figure 5 – Comparison with SAM2
Compared to SAM2, SP4D produces decompositions that are more structured and better aligned with object articulation. The parts are not only clearer and more consistent across views, but also directly useful for downstream rigging and animation tasks.

Video Demonstrations

Cross-frame generation

In a fixed camera view, SP4D maintains temporal coherence of the predicted parts throughout the motion. The generated segmentations remain stable even as the object undergoes large deformations, leading to more physically meaningful and reliable decompositions across time.

Cross-view generation

SP4D can take a single input video and expand it into multiple viewpoints, generating both RGB sequences and corresponding part segmentations. The articulated structure of the object is thus preserved consistently regardless of viewpoint, which is critical for 3D reconstruction and animation pipelines.

Lifting 2D part maps to 3D

SP4D can convert a single image into a fully riggable 3D asset. Starting from a single input image, we generate multi-view RGB sequences and part segmentations, then apply Hunyuan3D 2.0 to reconstruct 3D geometry. Through vertex clustering with HDBSCAN, each mesh vertex receives a discrete part ID, creating complete 3D assets with geometry, part decomposition, and skinning weights ready for animation.

Rigging and animation

The kinematic part maps generated by SP4D can be lifted into 3D meshes and used to estimate skinning weights without requiring explicit skeleton annotations. This enables automatic rigging of the object, turning the generated assets into fully animatable 3D models. As a result, SP4D bridges the gap between raw visual generation and practical animation-ready content.
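
One common way to obtain smooth skinning weights from a part decomposition is to solve a Laplace equation on the mesh, which yields harmonic weights. The sketch below is a minimal version under stated assumptions: it uses a uniform graph Laplacian rather than cotangent weights, and it assumes a few "handle" vertices per part have already been chosen (for example, vertices near each part's center). The function name `harmonic_weights` and the handle selection are hypothetical; SP4D's actual procedure may differ.

```python
import numpy as np


def harmonic_weights(num_vertices: int, edges, handles: dict) -> np.ndarray:
    """Solve L w = 0 on free vertices with w fixed at handle vertices.

    edges:   iterable of (i, j) mesh edges
    handles: {vertex index: part index}; the weight of that part is pinned to 1 there
    Returns a (num_vertices, num_parts) matrix whose rows sum to 1.
    """
    # Uniform graph Laplacian L = D - A (cotangent weights are usual on real meshes).
    lap = np.zeros((num_vertices, num_vertices))
    for i, j in edges:
        lap[i, i] += 1.0
        lap[j, j] += 1.0
        lap[i, j] -= 1.0
        lap[j, i] -= 1.0

    num_parts = max(handles.values()) + 1
    fixed = np.array(sorted(handles))
    free = np.setdiff1d(np.arange(num_vertices), fixed)

    weights = np.zeros((num_vertices, num_parts))
    for vertex, part in handles.items():
        weights[vertex, part] = 1.0

    # Partition the system: L_ff w_f = -L_fc w_c, solved for all parts at once.
    rhs = -lap[np.ix_(free, fixed)] @ weights[fixed]
    weights[free] = np.linalg.solve(lap[np.ix_(free, free)], rhs)
    return weights


# Usage: a 5-vertex chain with one handle per part at each end.
chain_edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
w = harmonic_weights(5, chain_edges, {0: 0, 4: 1})
```

On this chain the weights interpolate linearly between the two handles, so the middle vertex is influenced equally by both parts. This sketch uses a dense solve for clarity; a real mesh would call for a sparse solver.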

BibTeX

@misc{zhang2025stablediffusion4dmultiview,
      title={Stable Part Diffusion 4D: Multi-View RGB and Kinematic Parts Video Generation}, 
      author={Hao Zhang and Chun-Han Yao and Simon Donné and Narendra Ahuja and Varun Jampani},
      year={2025},
      eprint={2509.10687},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2509.10687},
}