Multi-level Methods for Estimating Community Language from Social Media with User and Community Sociodemographics

Giorgi, Salvatore

Files

Giorgi_upenngdas_0175C_16194.pdf (9.86 MB)

Degree type

Doctor of Philosophy (PhD)

Graduate group

Computer and Information Science

Discipline

Computer Sciences

Copyright date

2023

Permalink

https://repository.upenn.edu/handle/20.500.14332/59527

Author

Giorgi, Salvatore

Abstract

Nowcasting based on social media text promises to provide unobtrusive, near real-time predictions of community-level outcomes ranging from subjective well-being and physical health to personality and opioid use. Early methods for predicting outcomes from community-level language (e.g., from Twitter) tended to (1) focus on keyword-driven analyses, where manually selected sets of words were examined for their ability to predict real-world outcomes (e.g., a community's use of the word "opioids" on Twitter to predict opioid poisoning mortality), and (2) lack a person-centered focus, largely ignoring the fact that communities are groups of individuals who may share common attributes. Furthermore, the focus has typically been on prediction, where complex models are built to predict some community attribute from language, rather than on directly building and validating better language estimates. In this thesis, I develop and evaluate methods to estimate the language of spatial units (e.g., U.S. counties) that contextualize people within their communities and leverage the multi-level, bi-directional relationships between people and their environments. Using corpora including billions of tweets from millions of geolocated Twitter users, I (1) construct community-level features from person-level linguistic features, (2) build tunable restratification methods to remove selection biases, (3) use deep hierarchical modeling to explore relationships between people and their environments, and (4) produce state-of-the-art accuracies across community-level prediction tasks in public health, geographic psychology, and substance use. These person-centered spatial language estimates are psychometrically valid, more representative of the socio-demographic makeup of their communities, generalizable across spatial units (e.g., prefectures in Japan and U.K. local authority districts), and robust to spatial dependencies. This thesis lays the foundation for using large public corpora for population-level tasks and opens up the possibility of real-time public health monitoring.

Advisor

Ungar, Lyle H.

Date of degree

2023

Collection

Dissertations and Theses