Language Priors for Visual Intelligence
Degree type
Graduate group
Discipline
Data Science
Engineering
Subject
Data Efficiency
Interpretability
Multimodal Model
Natural Language Processing
Robustness
Funder
Grant number
License
Copyright date
Distributor
Related resources
Author
Contributor
Abstract
The intersection of language and vision is a fundamental aspect of human cognition. Our ability to interpret visual stimuli is often guided by linguistic constructs, which enable us to describe, understand, and interact with the world around us. This intrinsic connection suggests that language can serve as a powerful prior for creating systems with advanced visual intelligence. While existing approaches have achieved broad success by scaling data and model sizes, their applicability to critical domains such as healthcare remains limited, which we hypothesize is due to a lack of appropriate priors. Without effective priors, models may rely excessively on in-domain patterns, risking catastrophic failures when deployed in unseen scenarios. Moreover, these models typically require large-scale, high-quality human annotations, which can be costly and time-consuming to obtain in domains like robotics. To overcome these challenges, this thesis explores integrating language priors into both the architectural design and data synthesis of vision systems, aiming to enhance their interpretability, robustness, and data efficiency. With the rapid progress of large language models (LLMs), obtaining language priors to aid the development of vision models has become increasingly feasible. First, this thesis demonstrates that language priors from LLMs can be used to construct inherently interpretable image classifiers that achieve performance competitive with their black-box counterparts. Then, in medical imaging, incorporating interpretable structures enhanced by knowledge priors from medical documents significantly improves robustness to domain shifts, such as changes in patient populations or imaging protocols across hospitals. To address data scarcity in Embodied AI and Vision-Language Models (VLMs), the thesis further illustrates how language priors can be harnessed to synthesize visual training data. Specifically, this thesis proposes using LLMs to automate the generation of diverse 3D environments, allowing embodied agents to learn to navigate across a wide range of scenarios. Moreover, to improve VLMs' ability to interpret text-rich images, such as charts and documents, this thesis leverages the coding capabilities of text-only LLMs to create synthetic multimodal datasets for training more generalizable VLMs. In summary, by integrating language priors into model structures and data generation, this thesis contributes to developing more trustworthy, robust, and generalizable visual intelligence systems, enabling their broader and more reliable deployment across critical real-world applications.