Learning to Reconstruct 3D Humans
Humans are typically the central element of the visual content we encounter. Understanding their posture, the social cues they communicate, and their interactions with the world is critical to enabling holistic scene understanding from images and videos. Recent advances in computer vision have led to very successful systems that estimate the 2D pose of humans with impressive robustness. However, our interactions with the world are fundamentally 3D, so to understand, explain and predict these interactions, it is crucial to reconstruct people in 3D. The goal of this PhD thesis is to describe our recent steps towards this goal of automatic 3D reconstruction of humans from visual data.

The main direction we explore in this thesis is increasing the level of detail with which our automatic approaches reconstruct the human body. The most common representation considers only the major body joints and represents each one with a 3D keypoint. This type of abstraction goes beyond simply detecting the joints in 2D pixel space and can provide important information about the 3D pose of the body. To further enhance the detail of the reconstruction, we consider a statistical body model, SMPL, that can capture the 3D surface of the human body. The relevant approaches estimate the parameters of this model given a single image as input and return the surface of the full body as output. Finally, to go beyond body-only representations and achieve more expressive reconstructions, we propose to extend SMPL to also include articulated hands and a deformable face. Along with this enriched model, SMPL-X, we also propose the first approach to reconstruct the 3D body, hands and face from a single image.

While much effort is dedicated to generating more detailed reconstructions of the human body, a simultaneous and crucial goal is to build automatic approaches that require as little annotated data as possible.
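To make the parametric-model idea concrete, the following toy sketch illustrates how an SMPL-style model maps low-dimensional parameters (shape coefficients and per-joint rotations) to a 3D mesh via shape blend shapes and linear blend skinning. This is a heavily simplified, hypothetical stand-in, not the actual SMPL implementation: all dimensions are shrunk, the blend shapes, joint regressor and skinning weights are random placeholders, and the kinematic tree and pose-dependent blend shapes of the real model are omitted.

```python
import numpy as np

# Toy SMPL-like model. Real SMPL: 6890 vertices, 10 shape coefficients,
# 24 joints, pose blend shapes, and rotations composed along a kinematic
# tree. Here everything is random placeholder data for illustration only.
rng = np.random.default_rng(0)
N_VERTS, N_SHAPE, N_JOINTS = 200, 10, 24

template = rng.normal(size=(N_VERTS, 3))                 # mean template mesh
shape_dirs = rng.normal(size=(N_VERTS, 3, N_SHAPE))      # shape blend shapes
joint_reg = rng.dirichlet(np.ones(N_VERTS), size=N_JOINTS)  # joint regressor
skin_w = rng.dirichlet(np.ones(N_JOINTS), size=N_VERTS)     # skinning weights

def rodrigues(aa):
    """Axis-angle vector (3,) -> rotation matrix (3, 3)."""
    angle = np.linalg.norm(aa)
    if angle < 1e-8:
        return np.eye(3)
    k = aa / angle
    K = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
    return np.eye(3) + np.sin(angle) * K + (1 - np.cos(angle)) * (K @ K)

def toy_smpl(betas, pose):
    """betas: (N_SHAPE,), pose: (N_JOINTS, 3) axis-angle -> vertices (N_VERTS, 3)."""
    v_shaped = template + shape_dirs @ betas       # identity-dependent rest mesh
    joints = joint_reg @ v_shaped                  # regress 3D joint locations
    rots = np.stack([rodrigues(p) for p in pose])  # per-joint rotation matrices
    # Linear blend skinning: rotate each vertex about every joint and mix the
    # results by the skinning weights (real SMPL first chains rotations
    # parent-to-child along the kinematic tree).
    out = np.zeros_like(v_shaped)
    for j in range(N_JOINTS):
        out += skin_w[:, j:j + 1] * ((v_shaped - joints[j]) @ rots[j].T + joints[j])
    return out

verts = toy_smpl(np.zeros(N_SHAPE), np.zeros((N_JOINTS, 3)))  # rest pose
```

With zero shape coefficients and zero pose the model returns the template mesh unchanged, which is the sanity check that makes this parameterization convenient: a regression network only has to predict the compact `(betas, pose)` vector to obtain a full surface.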
With the overwhelming success of deep learning, many of the proposed data-driven approaches make very restrictive assumptions about the availability of training data, requiring images with full 3D ground truth for training. To relax these requirements, we investigate a variety of alternative supervision signals that rely on weaker annotations. Across the proposed approaches, we have investigated supervision from: 3D keypoint annotations, multiple synchronized views, monocular video sequences, ordinal depth annotations, external 3D pose and/or shape data, 2D body silhouette annotations, and 2D keypoint annotations. These alternative forms of supervision are effective at reducing our reliance on ground-truth 3D data, eventually allowing us to reconstruct the body, hands and face in 3D using only 2D keypoint annotations from images.
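A standard way to exploit the weakest of these signals, 2D keypoint annotations, is a reprojection loss: project the predicted 3D joints into the image with a camera model and penalize their distance to the annotated 2D keypoints. The sketch below shows a minimal, confidence-weighted version of such a loss; the function names, the perspective camera, and the default focal length and image center are illustrative assumptions, not the specific formulation used in the thesis.

```python
import numpy as np

def reproject(joints3d, focal, center):
    """Perspective projection of 3D joints (N, 3) to 2D pixels (N, 2).

    Assumes joints are in camera coordinates with positive depth z.
    """
    return focal * joints3d[:, :2] / joints3d[:, 2:3] + center

def keypoint_loss(joints3d, keypoints2d, conf,
                  focal=1000.0, center=np.array([112.0, 112.0])):
    """Confidence-weighted mean squared reprojection error.

    conf (N,) downweights occluded or unlabeled joints; focal/center are
    placeholder camera intrinsics for illustration.
    """
    proj = reproject(joints3d, focal, center)
    sq_err = conf[:, None] * (proj - keypoints2d) ** 2
    return float(np.sum(sq_err) / max(conf.sum(), 1e-8))
```

Because the loss only touches the 2D projections, it can be computed on any image with 2D keypoint labels, which is exactly what lets training proceed without full 3D ground truth.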
Pavlakos, Georgios, "Learning to Reconstruct 3D Humans" (2020). Dissertations available from ProQuest. AAI28031568.