Bayesian Gene-tree Reconstruction and Learning in Phylogenomics

Matthew Rasmussen (February 25, 2010)

Abstract

Advances in genome sequencing have enabled the study of evolution across both genomes and large clades of species, and have been especially useful for studying gene families as they expand and contract over evolutionary time through gene duplication and loss events. Here, we present a new approach to the reconstruction of gene-tree phylogenies that simultaneously models gene and species evolution, and that significantly increases reconstruction accuracy by incorporating genome-wide information. We introduce SPIDIR v2, a Bayesian method for gene-tree reconstruction that incorporates, within a unified framework, models for gene duplication and loss, gene- and species-specific rate variation, and sequence substitution. The method includes an Expectation-Maximization procedure for inferring the distribution of species-specific and gene-specific rates across unambiguous orthologs, which we then use as priors in the reconstruction of any gene family. We have also developed a novel fast search algorithm that uses the Birth-Death process to explore tree space more efficiently, leading to faster and more accurate recovery of a maximum a posteriori (MAP) gene tree. We have applied our method to two clades of fully sequenced genomes, comprising 12 Drosophila and 16 fungal species, as well as to simulated data. We use these extensive benchmarks to study the overall and branch-level accuracy of SPIDIR and many other phylogenetic methods. We find that SPIDIR achieves dramatic improvements in reconstruction accuracy over both traditional phylogenetic methods (Maximum Parsimony, Maximum Likelihood, and Bayesian methods) and other species-aware methods (SYNERGY, PRIME-GSR) on both real and simulated datasets.
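
For readers unfamiliar with the Birth-Death process mentioned above, the following is a minimal sketch of the standard closed-form result for a linear birth-death process: the probability that a single gene lineage has n copies after time t, given a duplication (birth) rate and a loss (death) rate. The function name and the rate values are hypothetical and chosen only for illustration; this is the textbook formula, not SPIDIR's actual implementation.

```python
import math


def birth_death_prob(n, birth, death, t):
    """Probability that one lineage has exactly n copies after time t
    under a linear birth-death process (illustrative helper, not SPIDIR code)."""
    if n < 0 or birth < 0 or death < 0 or t < 0:
        raise ValueError("n, birth, death and t must be non-negative")
    if math.isclose(birth, death):
        # Limiting case when duplication and loss rates are equal.
        alpha = beta = birth * t / (1.0 + birth * t)
    else:
        growth = math.exp((birth - death) * t)
        denom = birth * growth - death
        alpha = death * (growth - 1.0) / denom  # probability of extinction by time t
        beta = birth * (growth - 1.0) / denom   # ratio of the geometric tail
    if n == 0:
        return alpha
    # Survive, then a geometric number of additional copies.
    return (1.0 - alpha) * (1.0 - beta) * beta ** (n - 1)


if __name__ == "__main__":
    # Hypothetical rates: 0.2 duplications and 0.1 losses per unit time.
    for copies in range(4):
        print(copies, birth_death_prob(copies, birth=0.2, death=0.1, t=1.0))
```

A species-aware method can use probabilities of this kind to score how plausible a proposed pattern of duplications and losses is along each branch of the species tree.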

Work done in collaboration with Manolis Kellis.