Date of Award
2019
Doctor of Philosophy (PhD)
Computer and Information Science
Many modern applications require extracting core attributes of human behavior, such as a person's attention, intent, or skill level, from visual data. This problem poses two main challenges. First, we need models that can represent visual data in terms of object-level cues. Second, we need models that can infer core behavioral attributes from those cues. We refer to these two challenges as "learning to see" and "seeing to learn," respectively. In this PhD thesis, we make progress towards addressing both challenges.
We tackle the problem of "learning to see" by developing methods that extract object-level information directly from raw visual data. These include two top-down contour detectors, DeepEdge and HfL, which can aid high-level vision tasks such as object detection. We also present two semantic object segmentation methods, Boundary Neural Fields (BNFs) and Convolutional Random Walk Networks (RWNs), which integrate low-level affinity cues into the object segmentation process. We then shift our focus to video-level understanding and present a Spatiotemporal Sampling Network (STSN), which can be used for video object detection and discriminative motion feature learning.
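To give a flavor of affinity-based refinement, the sketch below shows a generic random-walk label propagation over pixel affinities; it is a minimal illustration of the general idea, not the thesis's exact BNF or RWN formulation, and all names and the toy data are assumptions for the example.

```python
import numpy as np

def random_walk_refine(scores, affinity, alpha=0.8, iters=50):
    """Refine coarse per-pixel scores by diffusing them over pairwise affinities.

    scores:   (n,) coarse foreground scores.
    affinity: (n, n) symmetric non-negative pixel affinities.
    Pixels with high mutual affinity end up with similar refined scores,
    while alpha keeps the result anchored to the coarse input.
    """
    P = affinity / affinity.sum(axis=1, keepdims=True)  # row-stochastic transition matrix
    y = scores.copy()
    for _ in range(iters):
        y = alpha * (P @ y) + (1 - alpha) * scores  # one diffusion step with restart
    return y

# Toy 1D "image": six pixels forming two regions, {0,1,2} and {3,4,5},
# with a weak affinity link across the region boundary.
aff = np.eye(6)
for i in range(5):
    w = 0.05 if i == 2 else 1.0  # boundary between pixels 2 and 3
    aff[i, i + 1] = aff[i + 1, i] = w

coarse = np.array([1.0, 0.9, 0.2, 0.1, 0.0, 0.1])  # pixel 2 is noisy
refined = random_walk_refine(coarse, aff)
```

In this toy example, pixel 2 belongs to the high-score region but received a low coarse score; diffusion over the strong within-region affinities pulls its score up while the weak cross-boundary link keeps the two regions separated.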
We then transition to the second subproblem, "seeing to learn," for which we leverage first-person GoPro cameras that record what people see during a particular activity. We aim to infer core behavioral attributes such as a person's attention, intention, and skill level from such first-person data. To do so, we first propose the concept of action-objects: the objects that capture a person's conscious visual (watching a TV) or tactile (taking a cup) interactions. We then introduce two models, EgoNet and Visual-Spatial Network (VSN), which detect action-objects in supervised and unsupervised settings, respectively. Finally, we focus on behavior understanding in a complex basketball activity. We present a method for evaluating a player's skill level from their first-person basketball videos, as well as a model that predicts a player's future motion trajectory from a single first-person image.
Bertasius, Gediminas, "Embodied Visual Perception Models For Human Behavior Understanding" (2019). Publicly Accessible Penn Dissertations. 3344.