The Hidden Geometry of Learning
Abstract
Deep learning has achieved remarkable success in recent years, yet the reasons for this success remain poorly understood. The optimization problem underlying deep learning is typically highly non-convex, and the search space consists of functions with enormous expressive power. Nevertheless, simple local search algorithms such as stochastic gradient descent routinely find solutions that generalize well to unseen data. This thesis explores the hypothesis that neural networks trained on typical data explore only a low-dimensional region of function space. This phenomenon may explain both the tractability of the optimization and the generalization of the resulting solutions. This perspective emphasizes the interplay between the data and the optimization dynamics, and it provides a unifying framework for studying different architectures and tasks.
In Chapter 2, we develop information-geometric techniques to define and analyze the manifold of models explored by the training trajectories of deep networks. We construct a training manifold by training networks with different configurations from different initializations and characterize its geometry using these tools. We find that the training process explores an effectively low-dimensional manifold, and we relate this low dimensionality to the hyper-ribbon structure found in statistical physics.
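As a rough illustration of the kind of computation involved, and not the construction used in the chapter, the following Python sketch estimates the effective dimensionality of a cloud of training checkpoints from their predictive distributions on a held-out probe set. The choice of Bhattacharyya distances and a classical-MDS-style embedding is an assumption made here for concreteness; the array `probs` of per-checkpoint class probabilities is hypothetical.

```python
# Hypothetical sketch: effective dimensionality of a set of training checkpoints,
# represented by their predicted class probabilities on a fixed held-out set.
import numpy as np

def pairwise_bhattacharyya(probs):
    """Bhattacharyya distance between every pair of checkpoints, averaged over samples."""
    n = len(probs)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            bc = np.sum(np.sqrt(probs[i] * probs[j]), axis=1)   # per-sample affinity
            d = -np.log(np.clip(bc, 1e-12, None)).mean()
            D[i, j] = D[j, i] = d
    return D

def embedding_spectrum(D):
    """Eigenvalues of the double-centered squared-distance matrix (classical MDS);
    a few dominant eigenvalues indicate an effectively low-dimensional embedding."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    W = -0.5 * J @ (D ** 2) @ J
    return np.sort(np.abs(np.linalg.eigvalsh(W)))[::-1]

# Placeholder checkpoints: random probability vectors standing in for real trajectories.
rng = np.random.default_rng(0)
probs = [rng.dirichlet(np.ones(10), size=256) for _ in range(20)]
spectrum = embedding_spectrum(pairwise_bhattacharyya(probs))
explained = np.cumsum(spectrum) / spectrum.sum()
print("dimensions needed for 90% of the spectrum:", int(np.searchsorted(explained, 0.9)) + 1)
```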
In Chapter 3, we extend these techniques to the broader setting of multitask learning, which allows us to compare networks initially trained on different tasks. We show that the same low dimensionality appears in these scenarios and demonstrate how our methods can reveal structure in the space of multitask learning.
In Chapter 4, we connect the hyper-ribbon structure to sloppiness in the data covariance structure and in the Fisher Information Matrix of trained networks. We show that the input correlation matrix of typical classification datasets has a "sloppy" eigenspectrum: after a sharp initial drop, a large number of small eigenvalues are distributed uniformly over an exponentially large range. This structure is mirrored in the Hessian of a trained network, which allows us to compute non-vacuous generalization bounds via PAC-Bayes analysis.
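The following sketch shows one way to inspect such an eigenspectrum. It uses scikit-learn's small digits dataset purely as a convenient stand-in for the classification datasets studied in the chapter; the thresholds and diagnostics are assumptions for illustration, not the chapter's analysis.

```python
# Hedged illustration of a "sloppy" input eigenspectrum: eigenvalues of the
# input correlation matrix of a small image dataset.
import numpy as np
from sklearn.datasets import load_digits

X = load_digits().data                       # shape (1797, 64), pixel intensities
X = X - X.mean(axis=0)                       # center each input dimension
C = (X.T @ X) / X.shape[0]                   # input correlation matrix
evals = np.sort(np.linalg.eigvalsh(C))[::-1]
evals = evals[evals > 1e-12]                 # drop numerically zero directions

# A sloppy spectrum shows a sharp initial drop followed by eigenvalues spread
# roughly uniformly on a log scale over many decades.
decades = np.log10(evals[0] / evals[-1])
print(f"eigenvalues span roughly {decades:.1f} decades")
print("log10 of the top eigenvalues:", np.round(np.log10(evals[:10]), 2))
```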
In Chapter 5, we provide theoretical insight into training manifolds by analyzing the gradient-descent trajectories of linear models. We identify the key factors that determine the dimensionality of the training manifold and characterize conditions under which it is provably low-dimensional. We show how the analysis extends to variants such as stochastic gradient descent and kernel methods.
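A minimal numerical sketch of this setting, under assumptions chosen here (full-batch gradient descent on least squares, an anisotropic Gaussian input distribution, PCA of the predictions on a fixed probe set), is given below; it is an illustration of how one might probe trajectory dimensionality, not the chapter's derivation.

```python
# Hypothetical sketch: run gradient descent on a linear least-squares problem from
# many random initializations, record predictions on a fixed probe set along each
# trajectory, and inspect the PCA spectrum of the resulting point cloud.
import numpy as np

rng = np.random.default_rng(0)
n, d, steps, lr, n_inits = 200, 50, 100, 0.05, 30
scales = np.logspace(0, -2, d)                         # anisotropic inputs (an assumption)
X = rng.normal(size=(n, d)) * scales
w_star = rng.normal(size=d)
y = X @ w_star
X_probe = rng.normal(size=(100, d)) * scales

points = []
for _ in range(n_inits):
    w = rng.normal(size=d)
    for _ in range(steps):
        w = w - lr * (X.T @ (X @ w - y) / n)           # least-squares gradient step
        points.append(X_probe @ w)                     # trajectory point in function space

P = np.array(points)
P = P - P.mean(axis=0)
svals = np.linalg.svd(P, compute_uv=False)
explained = np.cumsum(svals**2) / np.sum(svals**2)
print("PCA dimensions for 95% of the variance:", int(np.searchsorted(explained, 0.95)) + 1)
```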
We conclude the thesis in Chapter 6 with a discussion of possible future directions, focusing on what our observations may imply for practical training.