MIMO: Controllable Character Video Synthesis with
Spatial Decomposed Modeling

Yifang Men, Yuan Yao, Miaomiao Cui, Liefeng Bo

Institute for Intelligent Computing, Alibaba Group   

MIMO, a generalizable model for controllable video synthesis, can Mimic anyone anywhere in complex Motions with Object interactions
Given a reference image, MIMO can synthesize animatable avatars within a few minutes of inference

Abstract


Character video synthesis aims to produce realistic videos of animatable characters within lifelike scenes. As a fundamental problem in the computer vision and graphics community, it has typically been tackled by 3D methods that require multi-view captures for per-case training, which severely limits their applicability to modeling arbitrary characters within a short time. Recent 2D methods break this limitation via pre-trained diffusion models, but they struggle with pose generality and scene interaction. To this end, we propose MIMO, a novel generalizable model that can not only synthesize character videos with controllable attributes (i.e., character, motion and scene) provided by simple user inputs, but also simultaneously achieve advanced scalability to arbitrary characters, generality to novel 3D motions, and applicability to interactive real-world scenes in a unified framework. The core idea is to encode the 2D video into compact spatial codes, considering the inherent 3D nature of video occurrence. Concretely, we lift the 2D frame pixels into 3D using monocular depth estimators, and decompose the video clip into three spatial components (i.e., main human, underlying scene, and floating occlusion) in hierarchical layers based on the 3D depth. These components are further encoded into a canonical identity code, a structured motion code and a full scene code, which are utilized as control signals for the synthesis process. This spatial decomposition strategy enables flexible user control, spatial motion expression, as well as 3D-aware synthesis for scene interactions. Experimental results demonstrate the effectiveness and robustness of the proposed method.
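To make the depth-based decomposition concrete, below is a minimal sketch of how a single frame could be split into the three layers. It assumes a monocular depth map and a human segmentation mask are already available, and the heuristic used here (non-human pixels closer than the human's median depth count as floating occlusion) is an illustrative assumption rather than the exact rule used in MIMO.

# Illustrative sketch of depth-based layer decomposition (assumptions noted above).
import numpy as np

def decompose_frame(frame: np.ndarray, depth: np.ndarray, human_mask: np.ndarray):
    """Split one RGB frame into human / occlusion / scene layers.

    frame:      (H, W, 3) uint8 image
    depth:      (H, W) float array, smaller = closer to the camera
    human_mask: (H, W) bool array, True on the main human
    """
    human_depth = np.median(depth[human_mask]) if human_mask.any() else np.inf

    # Non-human pixels that sit in front of the human are treated as floating occlusion.
    occlusion_mask = (~human_mask) & (depth < human_depth)
    scene_mask = ~(human_mask | occlusion_mask)

    def masked(m):
        layer = np.zeros_like(frame)
        layer[m] = frame[m]
        return layer

    return masked(human_mask), masked(occlusion_mask), masked(scene_mask)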

Core idea


Users can either feed multiple inputs (e.g., a single image for the character, a pose sequence for the motion, and a single video or image for the scene) to specify the desired attributes separately, or provide a single driving video as input. The proposed model embeds the target attributes into the latent space to construct target codes, and encodes the driving video into spatial codes via spatially-aware decomposition, thus enabling intuitive attribute control of the synthesis by freely integrating the latent codes in a specific order.
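As a rough illustration of this free integration of latent codes, the sketch below models each controllable attribute as a code that can come either from the decomposed driving video or from a separate user input. All names here (SpatialCodes, compose, decompose, encode_image) are hypothetical and do not correspond to a released API.

# Hedged sketch of attribute-code composition; names are hypothetical.
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class SpatialCodes:
    identity: np.ndarray   # C_id  from a reference character image
    motion: np.ndarray     # C_mo  from a pose sequence
    scene: np.ndarray      # C_so  from a scene video/image

def compose(driving: SpatialCodes,
            identity: Optional[np.ndarray] = None,
            motion: Optional[np.ndarray] = None,
            scene: Optional[np.ndarray] = None) -> SpatialCodes:
    """Start from codes decomposed out of a driving video and replace any
    subset with codes built from separate user inputs."""
    return SpatialCodes(
        identity=identity if identity is not None else driving.identity,
        motion=motion if motion is not None else driving.motion,
        scene=scene if scene is not None else driving.scene,
    )

# e.g., keep the driving video's motion and scene but swap in a new character:
# target = compose(decompose(driving_video), identity=encode_image(ref_image))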

Method


An overview of the proposed framework. The video clip is decomposed into three spatial components (i.e., main human, underlying scene, and floating occlusion) in hierarchical layers based on 3D depth. The human component is further disentangled into identity and motion properties via canonical appearance transfer and structured body codes, which are encoded into the identity code $\mathcal{C}_{id}$ and the motion code $\mathcal{C}_{mo}$. The scene and occlusion components are embedded with a shared VAE encoder and re-organized into a full scene code $\mathcal{C}_{so}$. These latent codes are inserted into a diffusion-based decoder as conditions for video reconstruction.
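The PyTorch skeleton below sketches one way the three codes could condition a diffusion-based decoder. The specific injection scheme (concatenating $\mathcal{C}_{so}$ with the noisy latent, adding $\mathcal{C}_{mo}$ spatially, and cross-attending to $\mathcal{C}_{id}$), as well as the omission of timestep embeddings and temporal layers, are simplifying assumptions rather than the paper's exact architecture.

# Simplified conditioning skeleton; the wiring of the conditions is assumed.
import torch
import torch.nn as nn

class ConditionedDenoiser(nn.Module):
    def __init__(self, latent_dim=4, code_dim=64, hidden=128):
        super().__init__()
        self.in_proj = nn.Conv2d(latent_dim + code_dim, hidden, 3, padding=1)  # latent + scene code C_so
        self.motion_proj = nn.Conv2d(code_dim, hidden, 1)                      # motion code C_mo
        self.id_attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.out_proj = nn.Conv2d(hidden, latent_dim, 3, padding=1)

    def forward(self, z_t, c_id, c_mo, c_so):
        # z_t:  (B, latent_dim, H, W) noisy latent for one frame
        # c_id: (B, N, hidden)        identity tokens
        # c_mo: (B, code_dim, H, W)   spatially aligned motion code
        # c_so: (B, code_dim, H, W)   full scene code (scene + occlusion)
        h = self.in_proj(torch.cat([z_t, c_so], dim=1)) + self.motion_proj(c_mo)
        B, C, H, W = h.shape
        tokens = h.flatten(2).transpose(1, 2)            # (B, H*W, C)
        tokens, _ = self.id_attn(tokens, c_id, c_id)     # cross-attend to identity tokens
        h = tokens.transpose(1, 2).reshape(B, C, H, W)
        return self.out_proj(h)                          # predicted noise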



Results


Arbitrary Character Control

Animating human, cartoon, or personified characters from a single image



Novel 3D Motion Control

Complex motions from in-the-wild videos

Spatial 3D motions from a motion database



Interactive Scene Control

Complex real-world scenes with object interactions and occlusions



Comparisons


Compared with SOTA 2D methods




Compared with SOTA 3D methods




Demo Video






Citation


@article{men2024mimo,
  title={MIMO: Controllable Character Video Synthesis with Spatial Decomposed Modeling},
  author={Men, Yifang and Yao, Yuan and Cui, Miaomiao and Bo, Liefeng},
  journal={arXiv preprint arXiv:2409.16160},
  year={2024}
}

This project is intended solely for academic research and demonstration purposes. Thanks to Lior Yariv for the website template.