A Dynamical Systems Perspective on Optimization Algorithms
Discipline
Computer Sciences
Engineering
Subject
Control
Discretization
Dynamical Systems
Lyapunov
Optimization
Abstract
The intersection of machine learning (ML) — including deep learning and reinforcement learning — and dynamical systems and control (S&C) has become a prominent area of research in recent years, as more problems naturally blur the lines between these fields. However, the majority of research at this intersection has focused on applying ML techniques to S&C problems, whereas this dissertation explores the reverse direction: utilizing tools from S&C in ML and optimization algorithms.

\textbf{Part I (Ch 2 and 3): Continuous-Time and Discrete-Time Optimization} We investigate the relationship between continuous-time dynamical systems and traditional iterative optimization algorithms. By reinterpreting iterative methods such as gradient descent as continuous-time systems described by differential equations or inclusions, we gain insights into their qualitative and quantitative behaviors, with convergence rates often matching in both time domains. However, the fundamental limits of this interplay are not well understood. To address this, we show that a simple modification (rescaling) of the gradient flow provably achieves finite-time convergence — a property fundamentally incompatible with iterative algorithms. Notably, this is possible only in continuous time, since this domain allows instantaneous corrections in velocity without risk of overshooting. When discretizing, we instead risk overshooting if step sizes are too large, with the admissible step sizes dictated directly by the geometric landscape of the function. Furthermore, we demonstrate that, under certain regularity conditions, any high-order off-the-shelf ordinary differential equation (ODE) solver can approximate the convergence rate of any optimization flow. Notably, certain ODE solvers — most prominently forward Euler — preserve exponential stability and thus linear convergence.
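The two discretization phenomena above can be illustrated numerically. The following is a minimal sketch (not taken from the dissertation, and with simple gradient normalization standing in for the rescaling): forward Euler applied to the gradient flow $\dot{x} = -\nabla f(x)$ is exactly gradient descent and retains linear convergence, while a fixed-step discretization of the normalized flow $\dot{x} = -\nabla f(x)/\|\nabla f(x)\|$ overshoots near the minimizer and can only hover within an $O(h)$ ball of it.

```python
import numpy as np

A = np.diag([1.0, 10.0])          # f(x) = 0.5 x^T A x, with mu = 1, L = 10

def grad_f(x):
    return A @ x

# 1) Forward Euler on the gradient flow x' = -grad f(x): this is exactly
#    gradient descent with step h, and it preserves exponential stability
#    (linear convergence) as long as h < 2/L.
x = np.array([1.0, 1.0])
h = 0.1
for _ in range(200):
    x = x - h * grad_f(x)
gd_error = np.linalg.norm(x)      # decays geometrically in the iteration count

# 2) Forward Euler on the normalized flow x' = -grad f(x)/||grad f(x)||:
#    every step has length exactly h, so progress is constant-speed far from
#    the minimizer, but close to it the fixed step overshoots and the iterate
#    stalls in an O(h) neighborhood instead of converging exactly.
x = np.array([1.0, 1.0])
h = 0.01
for _ in range(2000):
    g = grad_f(x)
    x = x - h * g / np.linalg.norm(g)
ngd_error = np.linalg.norm(x)     # small, but floored at roughly O(h)

print(gd_error, ngd_error)
```

The contrast matches the text: the continuous normalized flow reaches the minimizer in finite time, but any fixed-step discretization trades that property for an overshoot-induced error floor.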
\textbf{Part II (Ch 4): Generalizing Stability and Smoothness} We re-examine the connections between convexity and smoothness in convergence rates, motivated by the abrupt transition between the sublinear rate $\mathcal{O}(L/k)$ and the linear rate $\mathcal{O}(e^{-\frac{\mu}{L}k})$ of gradient descent for the class $\mathcal{F}_L$ of $L$-smooth convex functions and the subclass $\mathcal{F}_{\mu,L}$ of $\mu$-strongly convex functions. Using and expanding tools from Part I, we show that this abrupt transition is actually smooth if we quantify convexity in terms of the Bregman divergence, namely $D_f(x,y) \geq \frac{\mu}{p}\|x - y\|^p$, and view $\mathcal{F}_L$ and $\mathcal{F}_{\mu,L}$ as the edge cases $p\to\infty$ and $p=2$. More generally, introducing the \emph{modulus of convexity} ($\phi$) and the \emph{modulus of smoothness} ($\sigma$), characterized by $\phi(\|x-y\|) \leq D_f(x,y) \leq \sigma(\|x-y\|)$, we derive convergence rates in terms of these functions for a proposed algorithm called $\sigma$-rescaled gradient descent ($\sigma$-\texttt{RGD}), motivated by the earlier finite-time convergence findings. In particular, we show that linear convergence with rate $\mathcal{O}\left(\left(1 - \frac{1}{\kappa}\right)^k\right)$ is achieved, provided that the generalized condition number $\displaystyle\kappa = \sup_{s>0}\frac{\phi^*(s)}{\sigma^*(s)}$ is finite (e.g., for $\mu$-strongly convex, $L$-smooth functions, $\phi(s)=\frac{\mu}{2}s^2$ and $\sigma(s)=\frac{L}{2}s^2$ yield $\kappa = L/\mu$, the classical condition number).

\textbf{Part III (Ch 5 and 6): Applications to Machine Learning} We use tools from S&C for two ML problems. The first is to re-analyze the convergence of the Expectation-Maximization (EM) algorithm, now using Lyapunov stability theory. The second is to use standard tools from linear systems theory to provide a preliminary bias-variance tradeoff bound for a novel method we propose to reduce the variance of the \texttt{ConfTr} (conformal training) method proposed by~\cite{stutz2022learning}.
This method consists of simulating a quantile-threshold calibration procedure during training in order to learn conformal predictors that are more length-efficient. We identify a source of sample inefficiency, propose a fix for it, and use tools from S&C in key places to aid the theoretical analysis. Further, the insights gained from Part II are used to guide the training regime, particularly as some of the hyperparameters directly influence (in a predictable way) the smoothness of the cost function in the problem studied.
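For context, the calibration step that \texttt{ConfTr} simulates during training is the standard split-conformal quantile threshold. The sketch below is not the \texttt{ConfTr} implementation; it is a minimal illustration of that calibration step at test time, with the nonconformity score assumed for illustration to be $1 - p(y\mid x)$ for softmax outputs $p$.

```python
import numpy as np

def conformal_threshold(cal_scores, alpha):
    # Split-conformal calibration: take the ceil((n+1)(1-alpha))/n empirical
    # quantile of the nonconformity scores computed on a held-out set.
    n = len(cal_scores)
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(cal_scores, level, method="higher")

def prediction_set(probs, qhat):
    # Include every label whose nonconformity score 1 - p(y|x) falls at or
    # below the calibrated threshold; the set size is the "length" that
    # conformal training seeks to shrink.
    return np.flatnonzero(1.0 - probs <= qhat)

rng = np.random.default_rng(0)
cal_scores = rng.uniform(size=100)           # stand-in calibration scores
qhat = conformal_threshold(cal_scores, alpha=0.1)

probs = np.array([0.05, 0.60, 0.95, 0.30])   # toy softmax output
print(sorted(prediction_set(probs, qhat)))
```

Because the threshold is an empirical quantile of a finite calibration sample, it is itself a random variable; its sampling variability is precisely the kind of variance the proposed modification targets.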