Researching a Two-Stage Machine Learning Model Pipeline to Predict Bacteriophage Interactions
Discipline
Bioinformatics
Microbiology
Computer Sciences
Abstract
This paper examines the effectiveness of a two-stage machine-learning pipeline for predicting interactions between bacteria and the phages used in bacteriophage cocktails. Phage therapy depends on knowing which bacteriophages kill which bacterial strains, and how strongly they act. Wet-lab testing scales poorly across thousands of phage–host pairs, so accurate, fast in-silico screening is needed. We study an interaction matrix of Escherichia coli strains (rows) and phages (columns) with entries 0–4 representing Minimum Lytic Concentration (MLC; 0 = no kill, 1–4 = increasing kill). We enrich each pair with interpretable biological features: bacterial serotypes (O/H), LPS type, ST-Warwick, capsule and ABC serotype, an 8-D UMAP embedding of phylogeny, and phage morphotype/genus/species, plus “same-as-host” match flags (e.g., same O-type as the phage’s original host). To cope with label noise and crossed structure (every strain and every phage has its own baseline), we adopt a two-stage pipeline: (1) a binary gate that predicts “any kill?” (0 vs >0), and (2) a rating model (Netflix-style matrix factorization) that predicts intensity 1–4 within the predicted positives. This design keeps Stage 1 biologically interpretable and Stage 2 collaborative, improving end-to-end screening while remaining practical for fold-in of new strains. Through this research, we aim to find a more effective approach to testing and designing bacteriophage cocktail therapies for evolving infectious strains. The data and code used can be found at https://github.com/skvargh/coli_phage_mixed_model
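The two-stage design described in the abstract can be illustrated with a minimal, dependency-free sketch. This is not the code from the linked repository: the toy matrix `M`, the match flags `F`, and every function name below are hypothetical, and plain logistic regression is used as a stand-in for the Stage-1 gate because the abstract does not name the gate model. Stage 2 is a small biased matrix factorization trained by stochastic gradient descent on the positive entries only.

```python
import math
import random


def gate_prob(w, x):
    """Stage-1 probability of any kill; w = feature weights plus a trailing bias."""
    z = w[-1] + sum(wj * xj for wj, xj in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))


def train_gate(X, y, epochs=500, lr=0.1):
    """Fit logistic regression (assumed stand-in gate) by plain SGD."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            g = gate_prob(w, xi) - yi          # gradient of the log-loss w.r.t. z
            for j, xj in enumerate(xi):
                w[j] -= lr * g * xj
            w[-1] -= lr * g
    return w


def train_mf(ratings, n_strains, n_phages, k=2, epochs=200, lr=0.05, reg=0.02, seed=0):
    """Stage 2: biased matrix factorization on (strain, phage, score 1-4) triples."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)   # global mean intensity
    bs, bp = [0.0] * n_strains, [0.0] * n_phages        # per-strain / per-phage baselines
    P = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_strains)]
    Q = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_phages)]
    for _ in range(epochs):
        for s, p, r in ratings:
            pred = mu + bs[s] + bp[p] + sum(a * b for a, b in zip(P[s], Q[p]))
            e = r - pred
            bs[s] += lr * (e - reg * bs[s])
            bp[p] += lr * (e - reg * bp[p])
            for f in range(k):
                ps, qp = P[s][f], Q[p][f]
                P[s][f] += lr * (e * qp - reg * ps)
                Q[p][f] += lr * (e * ps - reg * qp)
    return mu, bs, bp, P, Q


def predict_mlc(x, s, p, gate_w, mf, threshold=0.5):
    """End-to-end prediction: 0 if the gate says no kill, else a clipped 1-4 score."""
    if gate_prob(gate_w, x) < threshold:
        return 0
    mu, bs, bp, P, Q = mf
    raw = mu + bs[s] + bp[p] + sum(a * b for a, b in zip(P[s], Q[p]))
    return min(4, max(1, round(raw)))


# Toy data (hypothetical): rows = strains, columns = phages, entries 0-4.
M = [[0, 3, 4], [0, 2, 4], [1, 0, 0], [2, 0, 3]]
# Hypothetical match flags per pair: [same_O_type, same_H_type].
F = [
    [[0, 0], [1, 0], [1, 1]],
    [[0, 1], [1, 0], [1, 1]],
    [[1, 0], [0, 0], [0, 1]],
    [[1, 1], [0, 1], [1, 0]],
]

pairs = [(s, p) for s in range(len(M)) for p in range(len(M[0]))]
gate_w = train_gate([F[s][p] for s, p in pairs], [int(M[s][p] > 0) for s, p in pairs])
mf = train_mf([(s, p, M[s][p]) for s, p in pairs if M[s][p] > 0], len(M), len(M[0]))
pred = [[predict_mlc(F[s][p], s, p, gate_w, mf) for p in range(len(M[0]))]
        for s in range(len(M))]
print(pred)
```

In this sketch the gate sees only the interpretable per-pair features, while the rating model sees only strain and phage indices, which mirrors the interpretable/collaborative split. A strain that was not in the training matrix has no learned factors here, so scoring it in Stage 2 would need a separate fold-in step that the sketch omits.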