Date of Award
2019
Doctor of Philosophy (PhD)
Computer and Information Science
Many modern applications require extracting core attributes of human behavior, such as a person's attention, intent, or skill level, from visual data. This problem poses two main challenges. First, we need models that can represent visual data in terms of object-level cues. Second, we need models that can infer core behavioral attributes from those cues. We refer to these two challenges as "learning to see" and "seeing to learn," respectively. In this PhD thesis, we make progress towards addressing both challenges.
We tackle the problem of "learning to see" by developing methods that extract object-level information directly from raw visual data. These include two top-down contour detectors, DeepEdge and HfL, which can aid high-level vision tasks such as object detection. We also present two semantic object segmentation methods, Boundary Neural Fields (BNFs) and Convolutional Random Walk Networks (RWNs), which integrate low-level affinity cues into the object segmentation process. We then shift our focus to video-level understanding and present a Spatiotemporal Sampling Network (STSN), which can be used for video object detection and discriminative motion feature learning.
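To give a flavor of affinity-based refinement, the sketch below shows a generic random-walk label propagation over pixel affinities; it is a minimal illustration of the general idea, not the thesis's exact BNF or RWN formulation, and all names and the toy data are assumptions for the example.

```python
import numpy as np

def random_walk_refine(scores, affinity, alpha=0.8, iters=50):
    """Refine coarse per-pixel scores by diffusing them over pairwise affinities.

    scores:   (n,) coarse foreground scores.
    affinity: (n, n) symmetric non-negative pixel affinities.
    Pixels with high mutual affinity end up with similar refined scores,
    while alpha keeps the result anchored to the coarse input.
    """
    P = affinity / affinity.sum(axis=1, keepdims=True)  # row-stochastic transition matrix
    y = scores.copy()
    for _ in range(iters):
        y = alpha * (P @ y) + (1 - alpha) * scores  # one diffusion step with restart
    return y

# Toy 1D "image": six pixels forming two regions, {0,1,2} and {3,4,5},
# with a weak affinity link across the region boundary.
aff = np.eye(6)
for i in range(5):
    w = 0.05 if i == 2 else 1.0  # boundary between pixels 2 and 3
    aff[i, i + 1] = aff[i + 1, i] = w

coarse = np.array([1.0, 0.9, 0.2, 0.1, 0.0, 0.1])  # pixel 2 is noisy
refined = random_walk_refine(coarse, aff)
```

In this toy example, pixel 2 belongs to the high-score region but received a low coarse score; diffusion over the strong within-region affinities pulls its score up while the weak cross-boundary link keeps the two regions separated.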
We then transition to the second subproblem, "seeing to learn," for which we leverage first-person GoPro cameras that record what people see during a particular activity. We aim to infer core behavioral attributes such as a person's attention, intention, and skill level from such first-person data. To do so, we first propose the concept of action-objects: the objects that capture a person's conscious visual (watching a TV) or tactile (taking a cup) interactions. We then introduce two models, EgoNet and Visual-Spatial Network (VSN), which detect action-objects in supervised and unsupervised settings, respectively. Finally, we focus on behavior understanding in a complex basketball activity. We present a method for evaluating a player's skill level from their first-person basketball videos, as well as a model that predicts a player's future motion trajectory from a single first-person image.
Bertasius, Gediminas, "Embodied Visual Perception Models For Human Behavior Understanding" (2019). Publicly Accessible Penn Dissertations. 3344.