FROM SIMULATIONS TO LANGUAGE MODELS: COMPUTATIONAL INNOVATIONS FOR CHEMICAL BIOLOGY AND DRUG DISCOVERY

Degree type
Doctor of Philosophy (PhD)
Graduate group
Chemistry
Discipline
Chemistry
Biochemistry, Biophysics, and Structural Biology
Subject
AI
Biophysics
Chemical biology
Machine learning
Proteins
Copyright date
2023
Author
Giannakoulias, Sam
Abstract

The exponential growth of computational power, coupled with rapid advances in artificial intelligence and machine learning, has transformed many disciplines. This thesis addresses the challenges of applying these techniques in chemical biology and drug discovery, where scientific datasets differ markedly from the vast, standardized datasets prevalent at major technology companies. These datasets are small, lack standardization, and carry heteroskedastic errors because they are compiled from diverse sources and experiments. Given the inherent time and resource demands of scientific research, there is a pressing need for computational strategies that leverage these datasets effectively to expedite scientific discovery. This thesis contributes innovative computational methods for different data types, including tables, graphs, and sequences. First, we pioneered a simulation-based machine learning strategy and applied it successfully in three chemical biology projects to predict the ΔΔG of mutations at protein-protein interfaces, the proteolytic resistance of thioamide-containing peptides, and the solubility of unnatural amino acid-containing proteins. Moving in a different but related direction, this thesis introduces a novel deep learning strategy called "hint token learning" for large language modeling. This approach highlights relevant information in mutant protein sequences, which can differ from the wild-type sequence by as little as a single token. Furthermore, we introduce a novel chemically informed graph downsampling technique for deep learning on chemical datasets, enabling improved analysis and prediction of protein-ligand binding interactions by capturing intricate relationships within molecular structures.
Collectively, these advancements contribute to the development of computational strategies tailored to the unique characteristics of chemical biology and drug discovery datasets and ultimately aim to expedite experimental research in these areas.

Advisor
Petersson, Ernest J.
Date of degree
2023