Unmasking The Language Of Science Through Textual Analyses On Biomedical Preprints And Published Papers

Nicholson, David

Files

Nicholson_upenngdas_0175C_15370.pdf (3.36 MB)

Degree type

Doctor of Philosophy (PhD)

Graduate group

Genomics & Computational Biology

Subject

Natural Language Processing
Preprints
PubMed Central
Semantic Change
Text Mining
Web Resources
Bioinformatics
Computer Sciences
Linguistics

Copyright date

2022-09-17T20:22:00-07:00

Permalink

https://repository.upenn.edu/handle/20.500.14332/31796

Author

Nicholson, David

Abstract

Scientific communication is essential to science because it enables the field to grow. This communication often takes a written form, such as preprints and published papers. By analyzing these resources, we can obtain a high-level understanding of science and of how scientific trends change over time. This thesis presents multiple analyses of biomedical preprints and published papers. In Chapter 2, we explore the language contained within preprints and examine how this language changes due to the peer-review process. We find that token differences between published papers and preprints are stylistically based, suggesting that peer review results in modest textual changes. We also find that preprints are eventually published and are adopted quickly within the life science community. Chapter 3 investigates how biomedical terms and tokens change their meaning and usage through time. We show that multiple machine learning models can correct for the latent variation contained within biomedical text. We also provide the scientific community with a listing of over 43,000 potential change points. Tokens with notable change points, such as “sars” and “cas9”, appear within our listing, providing some validation for our approach. In Chapter 4, we use the weak supervision paradigm to examine the possibility of speeding up the labeling function generation process for multiple biomedical relationship types. We find that the language used to describe a biomedical relationship is often distinct, leading to modest performance in terms of transferability. The exceptions to this trend are the Compound-binds-Gene and Gene-interacts-Gene relationship types.

Advisor

Casey S. Greene

Date of degree

2022-01-01

Collection

Dissertations and Theses