Towards Ecologically Valid Evaluations of Language Models
Subject
natural language processing
Abstract
In recent years, language models have seen widespread adoption across diverse fields and have been applied to novel use cases. This creates a critical need to systematically evaluate how well these models perform in practical, user-centered applications. For these evaluations to be meaningful, they must possess high ecological validity (Brunswik, 1955), such that experimental results generalize to real-world contexts. However, prior evaluations do not always simulate these use cases appropriately: evaluation tasks may lack relevance to practical applications, and evaluation protocols often fail to capture the dynamic, context-dependent nature of user interactions. These limitations risk mischaracterizing the true capabilities and weaknesses of language models. At the same time, designing ecologically valid evaluations is challenging because real-world problems are inherently complex, involving implicit contexts and nuanced user requirements. In this thesis, I address these challenges by developing benchmarks and evaluation protocols tailored to realistic use cases. These benchmarks cover a variety of tasks, including information retrieval, question answering, and long-form writing, spanning multiple domains and contemporary use cases. I also propose evaluation protocols that account for the context-dependent nature of user interactions. Together, this work provides a framework for evaluating language models in real-world contexts, with a focus on improving alignment with diverse user needs.
Advisor
Roth, Dan