Optimal Phylogenetic Reconstruction

Penn collection
Statistics Papers
Discipline
Subject
optimal phylogenetic reconstruction
mutation probability
second author
markov chain
phylogenetic tree
underlying biology
special case
statistical physics
phase transition
reconstruction problem
evolutionary tree
genetic sequence
molecular data
cfn evolutionary model
evolutionary model
clear mathematical formulation
true evolutionary tree
major task
evolutionary biology
critical importance
free measure
Statistics and Probability
Theory and Algorithms
Author
Daskalakis, Constantinos
Mossel, Elchanan
Roch, Sébastien
Abstract

One of the major tasks of evolutionary biology is the reconstruction of phylogenetic trees from molecular data. This problem is of critical importance in almost all areas of biology and has a very clear mathematical formulation. The evolutionary model is given by a Markov chain on the true evolutionary tree. Given samples from this Markov chain at the leaves of the tree, the goal is to reconstruct the evolutionary tree. It is crucial to minimize the number of samples, i.e., the length of the genetic sequences, as it is constrained by the underlying biology, the cost of sequencing, etc. It is well known that in order to reconstruct a tree on n leaves, sequences of length Ω(log n) are needed. It was conjectured by M. Steel that for the CFN evolutionary model, if the mutation probability on all edges of the tree is less than p* = (√2 − 1)/2^{3/2}, then the tree can be recovered from sequences of length O(log n). This was proven by the second author in the special case where the tree is “balanced”. The second author also proved that if all edges have mutation probability larger than p*, then the length needed is n^{Ω(1)}. This “phase transition” in the number of samples needed is closely related to the phase transition for the reconstruction problem (or extremality of free measure) studied extensively in statistical physics and probability. Here we complete the proof of Steel’s conjecture and give a reconstruction algorithm using optimal (up to a multiplicative constant) sequence length. Our results further extend to yield an optimal reconstruction algorithm for the Jukes-Cantor model with short edges. All reconstruction algorithms run in time polynomial in the sequence length. The algorithm and the proofs are based on a novel combination of combinatorial, metric and probabilistic arguments.
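
To make the setup in the abstract concrete, the following is a minimal illustrative sketch, not the paper's reconstruction algorithm. It assumes a complete binary tree with the same mutation probability p on every edge, and treats the CFN model as a two-state symmetric Markov chain in which each site flips independently along each edge. The names `P_STAR`, `simulate_cfn_leaves` and `disagreement` are hypothetical, chosen only for this example.

```python
import math
import random

# Threshold quoted in the abstract: p* = (sqrt(2) - 1) / 2^(3/2), roughly 0.1464.
# It is the value of p at which 2 * (1 - 2p)^2 equals 1.
P_STAR = (math.sqrt(2) - 1) / 2 ** 1.5


def simulate_cfn_leaves(depth, p, seq_len, rng):
    """Evolve +/-1 sequences down a complete binary tree (an assumption of this sketch).

    Along every edge, each site flips independently with probability p.
    Returns the 2**depth leaf sequences, ordered so that leaves 2i and 2i+1 are siblings.
    """
    root = [rng.choice((-1, 1)) for _ in range(seq_len)]
    level = [root]
    for _ in range(depth):
        next_level = []
        for parent in level:
            for _child in range(2):
                # Copy the parent sequence, flipping each site with probability p.
                child = [-s if rng.random() < p else s for s in parent]
                next_level.append(child)
        level = next_level
    return level


def disagreement(a, b):
    """Fraction of sites at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


if __name__ == "__main__":
    rng = random.Random(0)
    leaves = simulate_cfn_leaves(depth=3, p=0.1, seq_len=2000, rng=rng)
    print("p* =", P_STAR)
    print("sibling disagreement:", disagreement(leaves[0], leaves[1]))
    print("distant-leaf disagreement:", disagreement(leaves[0], leaves[-1]))
```

In this sketch the sequence length `seq_len` plays the role of the number of samples: sibling leaves disagree less often than distant leaves, and longer sequences make that difference easier to estimate reliably.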

Publication date
2006-01-01
Journal title
Proceedings of the Thirty-Eighth Annual ACM Symposium on Theory of Computing