Human-Centered AI in Computational Social Science: Evaluating Automated Annotation with Large Language Models
Discipline
Computer Sciences
Data Science
Subject
Automated Annotation
Computational Social Science
Large Language Models
Abstract
Computational social scientists are increasingly incorporating text as data into their research. A typical framework for working with large text data sets involves hiring human annotators to read a subset of the text samples and then building a statistical model to annotate the remainder of the corpus. Because of their effectiveness at quantifying natural language, their ease of application, and their relatively low cost, artificial intelligence tools such as generative large language models (LLMs) may be used to automate these manual annotation procedures. This process, which I call "automated annotation," can dramatically improve research designs that involve text as data. For example, I demonstrate that automated annotation procedures can cost 11.6% of what standard annotation approaches cost and take 18.8% of the time. Although automated annotation has remarkable potential in social science, there are serious concerns about misuse and uncritical application. If practitioners use automated annotation without validation, for instance, they risk unknown bias and other inaccuracies in downstream applications. Thus, my dissertation aims to test strategies for developing effective and responsible automated annotation procedures. Specifically, I argue for a human-centered automated annotation framework, which gives human annotations a central role at each stage of the workflow. Across three studies, I develop and implement automated annotation techniques that all remain grounded in human reasoning. My empirical investigations cover a wide range of topics, from testing automated annotation strategies with generative LLMs to developing a multi-stage, human-in-the-loop annotation pipeline. As a whole, my findings underscore the potential of leveraging AI tools to enhance text-as-data methodologies and to help researchers explore important substantive questions. With proper validation techniques, generative LLMs can approximate human reasoning at a rapid pace and low cost.
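The validation step the abstract emphasizes can be sketched as a simple agreement check: label a human-annotated subset with the model, measure agreement, and only proceed to the full corpus if agreement clears a threshold. The example below is a minimal illustration of that idea, not the dissertation's actual pipeline; `llm_annotate` is a stand-in stub for a real LLM call, and the texts, labels, and threshold are invented for the sketch.

```python
def llm_annotate(text: str) -> str:
    # Stand-in for a generative LLM call; a real pipeline would prompt
    # an LLM and parse its response into a label.
    return "positive" if "good" in text.lower() else "negative"

def validate_against_humans(texts, human_labels, threshold=0.8):
    """Compare automated labels to a human-annotated validation subset.

    Returns (agreement_rate, proceed), where proceed indicates whether
    the automated annotator may be applied to the remaining corpus.
    """
    llm_labels = [llm_annotate(t) for t in texts]
    matches = sum(l == h for l, h in zip(llm_labels, human_labels))
    agreement = matches / len(texts)
    return agreement, agreement >= threshold

# Hypothetical human-annotated validation subset
texts = ["A good result", "Bad outcome", "Good work overall", "Terrible"]
humans = ["positive", "negative", "positive", "negative"]

agreement, ok = validate_against_humans(texts, humans)
print(f"agreement={agreement:.2f}, proceed={ok}")
```

In a real study, the agreement check would use a chance-corrected statistic (e.g., Cohen's kappa) and a larger, representative validation subset, but the control flow is the same: human judgments gate whether the automated annotator is trusted downstream.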