4D VISION: REPRESENT, RECONSTRUCT AND GENERATE THE DYNAMIC 3D WORLD
Abstract
This thesis focuses on 4D Vision, the study of representing, reconstructing, and generating dynamic 3D geometric content. While the computer vision community has traditionally concentrated on 2D or static 3D problems, modeling the dynamic 3D world is essential for applications such as embodied robotics, XR/AR devices, and next-generation AI systems operating in time-varying environments. The thesis addresses three fundamental questions: representation (how to store dynamic geometric data efficiently), reconstruction (how to recover 4D content from sparse observations), and generation (how to synthesize new and future 4D content). A central theme is that dynamic 3D problems are inherently high-dimensional and ill-posed, necessitating geometric inductive biases and structural priors in modern data-driven systems.

The chapters are organized by increasing non-rigidity and degrees of freedom. We begin with multi-body systems, where scenes decompose into rigid parts. EFEM introduces an EM algorithm using SE(3)-equivariant priors to segment objects without scene-level supervision. NAP proposes the first deep generative model for articulated objects, leveraging graph diffusion networks to jointly generate kinematic structures and part geometries. Moving to more complex deformations, GART models humans and animals via Gaussian splatting driven by template skeletons, introducing “latent bones” to capture local non-rigid motion. DynMF generalizes skeleton-based motion to template-free scenarios, using motion basis decomposition to reconstruct and generate general non-rigid scenes from multi-view videos. To relax multi-view supervision, MoSca presents the first full-stack system for reconstructing dynamic scenes from casual monocular videos; it leverages 2D vision foundation models and introduces Motion Scaffolds, a physics-inspired representation enabling robust monocular reconstruction without calibration. Finally, CaDeX targets implicit volumetric representations via bijective canonical mappings that guarantee cycle consistency and topology preservation, and CaDeX++ improves efficiency with local feature grids and foundation models, significantly speeding up dense tracking.

This thesis demonstrates that geometric inductive biases such as piecewise rigidity, skeletons, low-rank motion, and invertible fields enable effective solutions to fundamental 4D problems, laying foundations for future advances in dynamic scene understanding and embodied AI.
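To make two of the inductive biases above concrete, the following is a minimal, hypothetical sketch (not the thesis code) of low-rank motion basis decomposition, in the spirit of DynMF, and skeleton-driven linear blend skinning with additional latent bones, in the spirit of GART. All function names, array shapes, and the NumPy formulation are assumptions chosen for illustration only.

```python
import numpy as np

def low_rank_motion(points, coeffs, bases, t):
    """Displace N points at time step t as a weighted sum of K shared
    trajectory bases: x_i(t) = x_i + sum_k coeffs[i, k] * bases[k, t].

    points: (N, 3) rest positions
    coeffs: (N, K) per-point mixing weights (learned)
    bases:  (K, T, 3) shared motion bases over T time steps (learned)
    """
    return points + coeffs @ bases[:, t, :]  # (N, K) @ (K, 3) -> (N, 3)

def linear_blend_skinning(points, weights, bone_transforms):
    """Deform points by blending rigid per-bone transforms, the same
    mechanism that drives template skeletons and extra latent bones.

    points:          (N, 3) points in the canonical / rest pose
    weights:         (N, B) skinning weights, each row summing to 1
    bone_transforms: (B, 4, 4) rigid transform of each bone
    """
    homog = np.concatenate([points, np.ones((len(points), 1))], axis=1)  # (N, 4)
    per_bone = np.einsum('bij,nj->nbi', bone_transforms, homog)          # (N, B, 4)
    blended = np.einsum('nb,nbi->ni', weights, per_bone)                 # (N, 4)
    return blended[:, :3]
```

The point of both constructions is the same: constraining per-point motion to a small number of shared bases or bones turns an otherwise ill-posed per-point trajectory estimation problem into a far more tractable one.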
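Similarly, the cycle consistency provided by bijective canonical maps can be illustrated with a toy example: if every frame t has an invertible map f_t into a shared canonical space, then the correspondence between any two frames is obtained by composing f_s^{-1} with f_t and is consistent by construction. The rigid transforms below are a stand-in assumption for illustration; CaDeX instead learns invertible neural maps.

```python
import numpy as np

def make_frame_map(rotation, translation):
    """Return a toy invertible per-frame map and its inverse
    (a rigid transform here; rotation must be an orthogonal 3x3 matrix)."""
    fwd = lambda x: x @ rotation.T + translation
    inv = lambda x: (x - translation) @ rotation
    return fwd, inv

def correspond(x_t, map_t, map_s):
    """Map points observed in frame t to frame s through canonical space:
    x_s = f_s^{-1}(f_t(x_t))."""
    fwd_t, _ = map_t
    _, inv_s = map_s
    return inv_s(fwd_t(x_t))
```

Because every cross-frame correspondence factors through the same canonical space, chaining correspondences around any loop of frames returns the starting points exactly, which is the cycle-consistency property noted above.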