Rethinking the Inductive Bias of Optimization for Learning Under Distribution Shift
Discipline
Electrical Engineering
Computer Sciences
Abstract
Deep learning architectures and optimizers have co-evolved with the IID train/test paradigm. In deep generative modeling, sequential decision making, and language modeling, however, distribution shift is inevitable: at test time the model drives itself toward inputs unlike those seen during training. We hypothesize that state-of-the-art architectures and optimizers over-fixate on loss minimization, aggressively descending along sharp curvature directions in the loss landscape that often correspond to brittle feature learning. We present three case studies that elucidate the inductive bias of standard deep learning optimizers under shift, and we propose layerwise preconditioning as a simple correction.
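The abstract names layerwise preconditioning as the proposed correction. As a rough illustration of the general idea (not the dissertation's exact algorithm), the sketch below rescales each layer's gradient by the inverse square root of a running per-layer gradient covariance, which damps steps along sharp curvature directions; the function name, hyperparameters, and NumPy formulation are assumptions made for illustration.

```python
import numpy as np

def preconditioned_step(weights, grads, covs, lr=1e-2, eps=1e-6, beta=0.99):
    """One layerwise-preconditioned update (illustrative sketch).

    weights, grads: lists of 2-D per-layer weight/gradient matrices
    covs: list of running gradient-covariance estimates, shaped like G @ G.T
    """
    new_weights = []
    for W, G, C in zip(weights, grads, covs):
        # Update the running estimate of this layer's gradient covariance.
        C *= beta
        C += (1 - beta) * G @ G.T
        # Inverse square root of the (regularized) covariance via eigendecomposition.
        vals, vecs = np.linalg.eigh(C + eps * np.eye(C.shape[0]))
        inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T
        # Precondition the gradient layer by layer, then take a gradient step.
        new_weights.append(W - lr * inv_sqrt @ G)
    return new_weights
```

The per-layer structure is the key design choice: curvature is estimated and inverted independently for each weight matrix, keeping the cost far below that of a full-network preconditioner.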