Equivariant Learning for 3D Vision and Pattern Recognition
Discipline
Data Science
Subject
Equivariance
Machine Learning
Pattern Recognition
Abstract
Equivariance is an essential property in computer vision and pattern recognition, as it preserves the transformation structure of the input. Since symmetry is ubiquitous in real-world data, equivariance serves as an effective inductive bias in neural networks: it removes redundant intrinsic structure in the data, enables more efficient training, and improves model generalization. A classic example is the convolutional neural network (CNN), which achieves translational equivariance through its sliding-window design, ensuring that shifts in the input produce corresponding shifts in the output. In this dissertation, we embed equivariance with respect to the relevant symmetry groups into 3D vision and pattern recognition applications whose inputs carry geometric information. We advance the design of equivariant architectures along two key dimensions: efficiency and generalizability. For efficiency, we combine equivariance with powerful conventional architectures and diverse data modalities in ways that reduce model complexity without sacrificing the inductive bias: (1) We introduce equivariant multi-view networks for 3D shape analysis by relaxing SO(3) equivariance to icosahedral group equivariance, integrating the representational power of 2D CNNs with finite group convolution. (2) We extend equivariance to physical observations from inertial measurement units (IMUs) and design a subequivariant inertial odometry model, where SO(3) equivariance is reduced to SO(2) in the presence of gravity; equivariance is achieved through a canonicalization mechanism that enables the use of off-the-shelf non-equivariant backbones. (3) We explore equivariant multi-view priors and propose equivariant ray embeddings for implicit multi-view depth estimation, embedding equivariance into the Perceiver IO architecture to enable efficient transformer-based inference over ray space.
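The translational equivariance of CNNs noted above can be checked numerically: with circular (periodic) boundary conditions, convolving a shifted image yields the shifted convolution of the original. A minimal NumPy sketch of this check (an illustrative toy, not code from the dissertation):

```python
import numpy as np

def conv2d_circular(x, k):
    """2D cross-correlation with circular padding: equivariant to cyclic shifts."""
    H, W = x.shape
    kh, kw = k.shape
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            for a in range(kh):
                for b in range(kw):
                    out[i, j] += k[a, b] * x[(i + a) % H, (j + b) % W]
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))   # toy "image"
k = rng.standard_normal((3, 3))   # toy kernel
shift = (2, 3)

# Shift-then-convolve equals convolve-then-shift.
lhs = conv2d_circular(np.roll(x, shift, axis=(0, 1)), k)
rhs = np.roll(conv2d_circular(x, k), shift, axis=(0, 1))
assert np.allclose(lhs, rhs)
```

The same commutation test generalizes to the group settings studied here: replace `np.roll` with the group action and `conv2d_circular` with the group convolution.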
On the generalizability side: (1) We propose SE(3)-equivariant convolutions and transformers in ray space, generalizing the learning of equivariant multi-view priors in 3D vision to the broader setting of equivariant light field representations. (2) We develop a general Fourier-based formulation for both kernel and nonlinearity design in equivariant CNNs over homogeneous spaces, unifying their construction in the spectral domain.
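To illustrate the spectral viewpoint behind such Fourier-based formulations, consider the simplest homogeneous space, the cyclic group: a linear map on periodic signals commutes with every cyclic shift exactly when it acts pointwise on Fourier coefficients, i.e., the equivariant kernel is parameterized directly in the spectral domain. A minimal NumPy sketch of this fact (a toy analogy, not the dissertation's construction):

```python
import numpy as np

def apply_spectral(x, k_hat):
    """Apply a shift-equivariant linear map given by a spectral kernel k_hat."""
    # Pointwise multiplication in the Fourier domain = circular convolution.
    return np.real(np.fft.ifft(k_hat * np.fft.fft(x)))

n = 16
rng = np.random.default_rng(1)
x = rng.standard_normal(n)                                  # signal on C_n
k_hat = rng.standard_normal(n) + 1j * rng.standard_normal(n)  # spectral kernel

# The map commutes with cyclic shifts (here a shift by 5 positions).
shift = 5
lhs = apply_spectral(np.roll(x, shift), k_hat)
rhs = np.roll(apply_spectral(x, k_hat), shift)
assert np.allclose(lhs, rhs)
```

For non-commutative groups the Fourier coefficients become matrices and the pointwise product becomes a block-wise matrix product, but the same diagonalization principle underlies spectral kernel and nonlinearity design.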