Researching a Two-Stage Machine Learning Model Pipeline to Predict Bacteriophage Interactions

Penn collection
Interdisciplinary Centers, Units and Projects::Center for Undergraduate Research and Fellowships (CURF)::Fall Research Expo
Discipline
Immunology and Infectious Disease
Bioinformatics
Microbiology
Computer Sciences
Subject
Machine Learning, ML, Bacteriophage Interactions, Modeling, Bioinformatics, Virus, Host, Phage
Copyright date
2025
Distributor
Author
Vargheese, Shishir
Lee, Hongzhe
Contributor
Abstract

This work examines how well a two-stage machine-learning pipeline predicts interactions between bacteria and the phages used in bacteriophage cocktails. Phage therapy depends on knowing which bacteriophages kill which bacterial strains, and how strongly. Wet-lab testing scales poorly across thousands of phage–host pairs, so accurate and fast in-silico screening is needed. We study an interaction matrix of Escherichia coli strains (rows) and phages (columns) whose entries, 0–4, represent Minimum Lytic Concentration (MLC; 0 = no kill, 1–4 = increasing kill). We enrich each strain–phage pair with interpretable biological features: bacterial serotypes (O/H), LPS type, ST-Warwick, capsule and ABC serotype, an 8-D UMAP embedding of phylogeny, and phage morphotype/genus/species, plus “same-as-host” match flags (e.g., whether the strain has the same O-type as the phage’s original host). To cope with label noise and the crossed structure of the data (every strain and every phage has its own baseline), we adopt a two-stage pipeline: (1) a binary gate that predicts “any kill?” (0 vs >0), and (2) a rating model (Netflix-style matrix factorization) that predicts intensity 1–4 within the predicted positives. This design keeps Stage 1 biologically interpretable and Stage 2 collaborative, improving end-to-end screening while remaining practical for fold-in of new strains. Through this research, we aim to find a more efficient approach to testing and designing bacteriophage cocktail therapies for evolving infectious strains. The data and code used can be found at https://github.com/skvargh/coli_phage_mixed_model
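
The two-stage idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation (that lives in the linked repository). It assumes a plain logistic-regression gate for Stage 1 and a biased matrix factorization fit by stochastic gradient descent for Stage 2. The toy matrix `M`, the single `same_o_type` flag in `X`, and the helper names `train_gate`, `train_mf`, and `predict` are all hypothetical, and fold-in of new strains is not shown.

```python
import math
import random

# Toy MLC matrix, invented for illustration: rows = strains, columns = phages.
M = [[0, 3, 0],
     [2, 4, 0],
     [0, 0, 1],
     [1, 0, 2]]
N_STRAINS, N_PHAGES = len(M), len(M[0])

# Hypothetical pair feature: 1.0 if the strain shares an O-type with the
# phage's original host. In this toy it is constructed to match "any kill".
X = [[[1.0 if M[i][j] > 0 else 0.0] for j in range(N_PHAGES)]
     for i in range(N_STRAINS)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def train_gate(samples, epochs=200, lr=0.1):
    """Stage 1: logistic regression on pair features, predicting kill (1) vs none (0)."""
    w, b = [0.0] * len(samples[0][0]), 0.0
    for _ in range(epochs):
        for x, y in samples:
            g = sigmoid(dot(w, x) + b) - y  # gradient of log-loss w.r.t. the logit
            w = [wk - lr * g * xk for wk, xk in zip(w, x)]
            b -= lr * g
    return w, b

def train_mf(ratings, k=2, epochs=200, lr=0.02, reg=0.05, seed=0):
    """Stage 2: biased matrix factorization on positive entries (i, j, rating)."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)
    bs, bp = [0.0] * N_STRAINS, [0.0] * N_PHAGES  # per-strain / per-phage baselines
    P = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(N_STRAINS)]
    Q = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(N_PHAGES)]
    for _ in range(epochs):
        for i, j, r in ratings:
            e = r - (mu + bs[i] + bp[j] + dot(P[i], Q[j]))
            bs[i] += lr * (e - reg * bs[i])
            bp[j] += lr * (e - reg * bp[j])
            for f in range(k):
                pf, qf = P[i][f], Q[j][f]
                P[i][f] += lr * (e * qf - reg * pf)
                Q[j][f] += lr * (e * pf - reg * qf)
    return mu, bs, bp, P, Q

def predict(i, j, gate, mf, threshold=0.5):
    """Return 0 if the gate predicts no kill, else an intensity clipped to 1..4."""
    w, b = gate
    if sigmoid(dot(w, X[i][j]) + b) < threshold:
        return 0
    mu, bs, bp, P, Q = mf
    score = mu + bs[i] + bp[j] + dot(P[i], Q[j])
    return int(min(4, max(1, round(score))))

pairs = [(i, j) for i in range(N_STRAINS) for j in range(N_PHAGES)]
gate = train_gate([(X[i][j], 1 if M[i][j] > 0 else 0) for i, j in pairs])
mf = train_mf([(i, j, M[i][j]) for i, j in pairs if M[i][j] > 0])
pred = [[predict(i, j, gate, mf) for j in range(N_PHAGES)] for i in range(N_STRAINS)]
```

In this sketch the per-strain and per-phage bias terms stand in for the "own baseline" of each row and column, and the gate's weights remain directly readable as feature effects. A real analysis would use the full feature set and held-out evaluation rather than fitting and predicting on the same toy entries.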

Advisor
Date of presentation
2025-09-15
Conference name
Conference dates
2025-09-15
Comments
This project was supported with funding from the Penn Undergraduate Research Mentoring (PURM) program.