Multi-level Methods for Estimating Community Language from Social Media with User and Community Sociodemographics

Degree type
Doctor of Philosophy (PhD)
Graduate group
Computer and Information Science
Discipline
Computer Sciences
Copyright date
2023
Author
Giorgi, Salvatore
Abstract

Nowcasting based on social media text promises to provide unobtrusive, near real-time predictions of community-level outcomes ranging from subjective well-being and physical health to personality and opioid use. Early methods for predicting outcomes from community-level language (e.g., language on Twitter) tended to (1) focus on keyword-driven analyses, where manually selected sets of words were examined for their ability to predict real-world outcomes (e.g., a community's use of the word "opioids" on Twitter to predict opioid poisoning mortality), and (2) lack a person-centered focus, largely ignoring the fact that communities are groups of individuals who may share common attributes. Furthermore, the focus has typically been on prediction, where complex models are built to predict some community attribute from language, rather than on building and validating better language estimates directly. In this thesis, I develop and evaluate methods to estimate the language of spatial units (e.g., U.S. counties) that contextualize people within their communities and leverage the multi-level, bi-directional relationships between people and their environments. Using corpora that include billions of tweets from millions of geolocated Twitter users, I (1) construct community-level features from person-level linguistic features, (2) build tunable restratification methods to remove selection biases, (3) use deep hierarchical modeling to explore relationships between people and their environments, and (4) produce state-of-the-art accuracies across community-level prediction tasks in public health, geographic psychology, and substance use. These person-centered spatial language estimates are psychometrically valid, more representative of the socio-demographic makeup of their communities, generalizable across spatial units (e.g., prefectures in Japan and U.K. local authority districts), and robust to spatial dependencies. This thesis lays the foundation for using large public corpora for population-level tasks and opens up the possibility of real-time public health monitoring.
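
To make steps (1) and (2) of the abstract concrete, the following is a minimal illustrative sketch, not the thesis's implementation. It assumes relative word frequencies as the person-level feature and plain post-stratification (population share divided by sample share per demographic bin) as a stand-in for the tunable restratification methods. All function names, the demographic bins, and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def user_features(messages):
    """Relative word frequencies over all of one user's messages."""
    counts = Counter(word for m in messages for word in m.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def poststrat_weights(user_bins, population_share):
    """Weight each user by (population share of bin) / (sample share of bin)."""
    n = len(user_bins)
    sample_share = {b: c / n for b, c in Counter(user_bins.values()).items()}
    return {u: population_share[b] / sample_share[b] for u, b in user_bins.items()}


def community_features(user_feats, weights):
    """Weighted mean of person-level features across a community's users."""
    total_weight = sum(weights[u] for u in user_feats)
    agg = defaultdict(float)
    for user, feats in user_feats.items():
        for word, value in feats.items():
            agg[word] += weights[user] * value
    return {w: v / total_weight for w, v in agg.items()}


# Toy data (hypothetical): three users in one county, two demographic bins.
messages = {
    "a": ["happy day"],
    "b": ["sad day"],
    "c": ["happy happy"],
}
user_bins = {"a": "young", "b": "young", "c": "old"}
population_share = {"young": 0.5, "old": 0.5}  # assumed census shares

feats = {u: user_features(m) for u, m in messages.items()}
weights = poststrat_weights(user_bins, population_share)
county = community_features(feats, weights)
```

In this toy sample the "young" bin is over-represented (two of three users against an assumed population share of one half), so its users are down-weighted before their features are averaged into the county estimate. Aggregating per user first, rather than pooling all messages, also keeps a few highly active accounts from dominating the community estimate.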

Advisor
Ungar, Lyle H.
Date of degree
2023