Researching a Two-Stage Machine Learning Model Pipeline to Predict Bacteriophage Interactions

Vargheese, Shishir; Lee, Hongzhe

Penn collection

Interdisciplinary Centers, Units and Projects::Center for Undergraduate Research and Fellowships (CURF)::Fall Research Expo

Discipline

Immunology and Infectious Disease
Bioinformatics
Microbiology
Computer Sciences

Subject

Machine Learning, ML, Bacteriophage Interactions, Modeling, Bioinformatics, Virus, Host, Phage

License

https://creativecommons.org/licenses/by/4.0/

Copyright date

2025

Related resources

https://doi.org/10.1038/s41564-024-01832-5
http://dx.doi.org/10.1214/11-STS368

Permalink

https://repository.upenn.edu/handle/20.500.14332/62155

Abstract

This paper examines the effectiveness of a two-stage machine-learning pipeline for predicting interactions between bacteria and the phages used in bacteriophage cocktails. Phage therapy depends on knowing which bacteriophages kill which bacterial strains, and how strongly they act. Wet-lab testing scales poorly across thousands of phage–host pairs, so accurate, fast in-silico screening is needed. We study an interaction matrix of Escherichia coli strains (rows) and phages (columns) whose entries, 0–4, represent Minimum Lytic Concentration (MLC; 0 = no kill, 1–4 = increasing kill). We enrich each strain–phage pair with interpretable biological features: bacterial serotypes (O/H), LPS type, ST-Warwick, capsule and ABC serotype, an 8-D UMAP embedding of phylogeny, and phage morphotype/genus/species, plus “same-as-host” match flags (e.g., same O-type as the phage’s original host). To cope with label noise and the crossed structure of the data (every strain and every phage has its own baseline), we adopt a two-stage pipeline: (1) a binary gate that predicts “any kill?” (0 vs. >0), and (2) a rating model (Netflix-style matrix factorization) that predicts intensity 1–4 among the predicted positives. This design keeps Stage 1 biologically interpretable and Stage 2 collaborative, improving end-to-end screening while remaining practical for fold-in of new strains. Through this research, we aim to find a more effective approach to testing and designing bacteriophage cocktail therapies for evolving infectious strains. The data and code are available at https://github.com/skvargh/coli_phage_mixed_model
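
The two-stage design described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). It assumes a plain logistic regression for the Stage 1 gate and a biased matrix factorization fitted by stochastic gradient descent as one common reading of "Netflix-style" for Stage 2. The data are synthetic, the pair features are random stand-ins for encoded serotype, LPS, phylogeny, and match-flag features, and all function names and hyperparameters are hypothetical. The model is fitted and evaluated on the same synthetic matrix, so the output says nothing about real predictive performance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT the study's data): 30 strains x 12 phages, scores 0-4.
n_strains, n_phages, n_feat = 30, 12, 5
X = rng.normal(size=(n_strains, n_phages, n_feat))   # hypothetical pair features
w_true = rng.normal(size=n_feat)
logit = X @ w_true - 0.5
kill = (logit + rng.normal(scale=0.5, size=logit.shape)) > 0
strength = rng.integers(1, 5, size=(n_strains, n_phages))
Y = np.where(kill, strength, 0)                      # interaction matrix


def add_intercept(features):
    return np.hstack([features, np.ones((features.shape[0], 1))])


def fit_gate(features, labels, lr=0.1, epochs=500, l2=1e-3):
    """Stage 1: logistic regression on pair features (assumed gate model)."""
    F = add_intercept(features)
    w = np.zeros(F.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-F @ w))
        w -= lr * (F.T @ (p - labels) / len(labels) + l2 * w)
    return w


def gate_proba(features, w):
    return 1.0 / (1.0 + np.exp(-add_intercept(features) @ w))


def fit_mf(rows, cols, ratings, n_rows, n_cols, k=3, lr=0.02, reg=0.05, epochs=200):
    """Stage 2: biased matrix factorization, SGD over positive entries only."""
    mu = ratings.mean()
    b_r, b_c = np.zeros(n_rows), np.zeros(n_cols)    # per-strain / per-phage baselines
    P = 0.1 * rng.normal(size=(n_rows, k))
    Q = 0.1 * rng.normal(size=(n_cols, k))
    for _ in range(epochs):
        for idx in rng.permutation(len(ratings)):
            i, j, r = rows[idx], cols[idx], ratings[idx]
            err = r - (mu + b_r[i] + b_c[j] + P[i] @ Q[j])
            b_r[i] += lr * (err - reg * b_r[i])
            b_c[j] += lr * (err - reg * b_c[j])
            # Right-hand side is evaluated with the old P[i] and Q[j] before assignment.
            P[i], Q[j] = (P[i] + lr * (err * Q[j] - reg * P[i]),
                          Q[j] + lr * (err * P[i] - reg * Q[j]))
    return mu, b_r, b_c, P, Q


def predict(features, w, mf, threshold=0.5):
    """Combine stages: 0 where the gate says no kill, else a clipped rating 1-4."""
    mu, b_r, b_c, P, Q = mf
    n_r, n_c = len(b_r), len(b_c)
    p_kill = gate_proba(features.reshape(-1, features.shape[-1]), w).reshape(n_r, n_c)
    rating = mu + b_r[:, None] + b_c[None, :] + P @ Q.T
    rating = np.clip(np.rint(rating), 1, 4).astype(int)
    return np.where(p_kill >= threshold, rating, 0)


w = fit_gate(X.reshape(-1, n_feat), (Y > 0).ravel().astype(float))
r_idx, c_idx = np.nonzero(Y)                         # Stage 2 sees positives only
mf = fit_mf(r_idx, c_idx, Y[r_idx, c_idx].astype(float), n_strains, n_phages)
Y_hat = predict(X, w, mf)
gate_accuracy = np.mean((Y_hat > 0) == (Y > 0))
print(f"in-sample gate accuracy: {gate_accuracy:.2f}")
```

The per-strain and per-phage bias terms in Stage 2 are one way to represent the crossed baselines the abstract mentions. Fold-in of a new strain would require estimating its bias and latent vector from a few observed entries, which this sketch does not attempt.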

Date of presentation

2025-09-15

Conference dates

2025-09-15

Comments

This project was supported with funding from the Penn Undergraduate Research Mentoring (PURM) program.

Collection

Presentations