Equivariant Learning for 3D Vision and Pattern Recognition
Discipline
Data Science
Subject
Equivariance
Machine Learning
Pattern Recognition
Abstract
Equivariance is an essential property in computer vision and pattern recognition, as it preserves the transformation structure of the input. Since symmetry is ubiquitous in real-world data, equivariance serves as an effective inductive bias in neural networks: it removes redundant intrinsic structure in the data, enables more efficient training, and improves model generalization. A classic example is the convolutional neural network (CNN), which achieves translational equivariance through its sliding-window design, ensuring that shifts in the input produce corresponding shifts in the output. In this dissertation, we embed equivariance with respect to the relevant symmetry groups into 3D vision and pattern recognition applications whose inputs carry geometric information. We advance the design of equivariant architectures along two key dimensions: efficiency and generalizability. For efficiency, we combine equivariance with powerful conventional architectures and diverse data modalities in ways that reduce model complexity without sacrificing the inductive bias: (1) We introduce equivariant multi-view networks for 3D shape analysis by relaxing SO(3) equivariance to icosahedral group equivariance, integrating the representational power of 2D CNNs with finite group convolution. (2) We extend equivariance to physical observations from inertial measurement units (IMUs) and design a subequivariant inertial odometry model, where SO(3) equivariance is reduced to SO(2) in the presence of gravity; equivariance is achieved through a canonicalization mechanism that enables the use of off-the-shelf non-equivariant backbones. (3) We explore equivariant multi-view priors and propose equivariant ray embeddings for implicit multi-view depth estimation, embedding equivariance into the Perceiver IO architecture to enable efficient transformer-based inference over ray space.
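The translational equivariance of CNNs noted above can be checked numerically: with circular (periodic) boundary conditions, convolving a shifted image yields the shifted convolution of the original. A minimal NumPy sketch of this check (an illustrative toy, not code from the dissertation):

```python
import numpy as np

def conv2d_circular(x, k):
    """2D cross-correlation with circular padding: equivariant to cyclic shifts."""
    H, W = x.shape
    kh, kw = k.shape
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            for a in range(kh):
                for b in range(kw):
                    out[i, j] += k[a, b] * x[(i + a) % H, (j + b) % W]
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))   # toy "image"
k = rng.standard_normal((3, 3))   # toy kernel
shift = (2, 3)

# Shift-then-convolve equals convolve-then-shift.
lhs = conv2d_circular(np.roll(x, shift, axis=(0, 1)), k)
rhs = np.roll(conv2d_circular(x, k), shift, axis=(0, 1))
assert np.allclose(lhs, rhs)
```

The same commutation test generalizes to the group settings studied here: replace `np.roll` with the group action and `conv2d_circular` with the group convolution.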
On the generalizability side: (1) We propose SE(3)-equivariant convolutions and transformers in ray space, generalizing the learning of equivariant multi-view priors in 3D vision to the broader setting of equivariant light field representations. (2) We develop a general Fourier-based formulation for both kernel and nonlinearity design in equivariant CNNs over homogeneous spaces, unifying their construction in the spectral domain.
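To illustrate the spectral viewpoint behind such Fourier-based formulations, consider the simplest homogeneous space, the cyclic group: a linear map on periodic signals commutes with every cyclic shift exactly when it acts pointwise on Fourier coefficients, i.e., the equivariant kernel is parameterized directly in the spectral domain. A minimal NumPy sketch of this fact (a toy analogy, not the dissertation's construction):

```python
import numpy as np

def apply_spectral(x, k_hat):
    """Apply a shift-equivariant linear map given by a spectral kernel k_hat."""
    # Pointwise multiplication in the Fourier domain = circular convolution.
    return np.real(np.fft.ifft(k_hat * np.fft.fft(x)))

n = 16
rng = np.random.default_rng(1)
x = rng.standard_normal(n)                                  # signal on C_n
k_hat = rng.standard_normal(n) + 1j * rng.standard_normal(n)  # spectral kernel

# The map commutes with cyclic shifts (here a shift by 5 positions).
shift = 5
lhs = apply_spectral(np.roll(x, shift), k_hat)
rhs = np.roll(apply_spectral(x, k_hat), shift)
assert np.allclose(lhs, rhs)
```

For non-commutative groups the Fourier coefficients become matrices and the pointwise product becomes a block-wise matrix product, but the same diagonalization principle underlies spectral kernel and nonlinearity design.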