Human-Centered AI in Computational Social Science: Evaluating Automated Annotation with Large Language Models
Discipline
Computer Sciences
Data Science
Subject
Automated Annotation
Computational Social Science
Large Language Models
Abstract
Computational social scientists are increasingly incorporating text as data into their research. A typical framework for working with large text data sets involves hiring human annotators to read a subset of the text samples and then building a statistical model to annotate the remainder of the corpus. Because of their effectiveness at quantifying natural language, their ease of application, and their relatively low cost, artificial intelligence tools such as generative large language models (LLMs) may be used to automate these manual annotation procedures. This process, which I call "automated annotation," can dramatically improve research designs that involve text as data. For example, I demonstrate that automated annotation procedures can cost 11.6% of what standard annotation approaches cost and take 18.8% of the time. Although automated annotation has remarkable potential in social science, there are serious concerns about misuse and uncritical application. If practitioners use automated annotation without validation, for instance, they risk unknown bias and other inaccuracies in downstream applications. Thus, my dissertation aims to test strategies for developing effective and responsible automated annotation procedures. Specifically, I argue for a human-centered automated annotation framework, which gives human annotations a central role at each stage of the workflow. Across three studies, I develop and implement automated annotation techniques that all remain grounded in human reasoning. My empirical investigations cover a wide range of topics, from testing automated annotation strategies with generative LLMs to developing a multi-stage, human-in-the-loop annotation pipeline. As a whole, my findings underscore the potential of leveraging AI tools to enhance text-as-data methodologies and to help researchers explore important substantive questions. With proper validation techniques, generative LLMs can approximate human reasoning at a rapid pace and low cost.
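The validation step the abstract emphasizes can be sketched as a simple agreement check: label a human-annotated subset with the model, measure agreement, and only proceed to the full corpus if agreement clears a threshold. The example below is a minimal illustration of that idea, not the dissertation's actual pipeline; `llm_annotate` is a stand-in stub for a real LLM call, and the texts, labels, and threshold are invented for the sketch.

```python
def llm_annotate(text: str) -> str:
    # Stand-in for a generative LLM call; a real pipeline would prompt
    # an LLM and parse its response into a label.
    return "positive" if "good" in text.lower() else "negative"

def validate_against_humans(texts, human_labels, threshold=0.8):
    """Compare automated labels to a human-annotated validation subset.

    Returns (agreement_rate, proceed), where proceed indicates whether
    the automated annotator may be applied to the remaining corpus.
    """
    llm_labels = [llm_annotate(t) for t in texts]
    matches = sum(l == h for l, h in zip(llm_labels, human_labels))
    agreement = matches / len(texts)
    return agreement, agreement >= threshold

# Hypothetical human-annotated validation subset
texts = ["A good result", "Bad outcome", "Good work overall", "Terrible"]
humans = ["positive", "negative", "positive", "negative"]

agreement, ok = validate_against_humans(texts, humans)
print(f"agreement={agreement:.2f}, proceed={ok}")
```

In a real study, the agreement check would use a chance-corrected statistic (e.g., Cohen's kappa) and a larger, representative validation subset, but the control flow is the same: human judgments gate whether the automated annotator is trusted downstream.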