Date of Award


Degree Type


Degree Name

Doctor of Philosophy (PhD)

Graduate Group

Genomics & Computational Biology

First Advisor

Casey S. Greene


Scientific communication is essential for science as it enables the field to grow. This task is often accomplished through a written form such as preprints and published papers. We can obtain a high-level understanding of science and how scientific trends adapt over time by analyzing these resources. This thesis focuses on conducting multiple analyses using biomedical preprints and published papers. In Chapter 2, we explore the language contained within preprints and examine how this language changes due to the peer-review process. We find that token differences between published papers and preprints are stylistically based, suggesting that peer-review results in modest textual changes. We also discovered that preprints are eventually published and adopted quickly within the life science community. Chapter 3 investigates how biomedical terms and tokens change their meaning and usage through time. We show that multiple machine learning models can correct for the latent variation contained within the biomedical text. Also, we provide the scientific community with a listing of over 43,000 potential change points. Tokens with notable changepoints such as “sars” and “cas9” appear within our listing, providing some validation for our approach. In Chapter 4, we use the weak supervision paradigm to examine the possibility of speeding up the labeling function generation process for multiple biomedical relationship types. We found that the language used to describe a biomedical relationship is often distinct, leading to a modest performance in terms of transferability. An exception to this trend is Compound-binds-Gene and Gene-interacts-Gene relationship types.

Files over 3MB may be slow to open. For best results, right-click and select "save as..."