Current Topic Workshop: The Problems of Phylogenetic Analysis of Large Datasets

(December 1, 2005 - December 2, 2005)

Organizers


Daniel Janies
Department of Biomedical Informatics, The Ohio State University
Diego Pol
Independent Researcher, Museo Paleontologico E. Feruglio
Ward Wheeler
Division of Invertebrate Zoology, American Museum of Natural History

In this workshop, we will focus on recent advances in phylogenetic analysis of large datasets, considering two major aspects. The first is the problem of analyzing datasets with a large number of taxa (exceeding several hundred). This includes recent improvements in tree search algorithms and the synergistic application of parallelization strategies using Beowulf clusters and distributed computing. The workshop participants will also compare and discuss the use of improved search tools for competing methods (e.g., parsimony, maximum likelihood, Bayesian). This discussion will illustrate the advantages and problems specific to each method in the analysis of datasets with large numbers of taxa.

The second point of discussion will be the problems associated with the analysis of large comparative genomic datasets carrying various types of information (homology, rearrangement, origin, loss, and multiplication of loci between ancestor and descendant genomes). This problem includes the generalization of edit cost models to account for multiple kinds of genomic change rather than only DNA substitution.
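
As a hedged illustration of such a generalization (a generic formulation, not a specific model from the workshop), the cost of transforming an ancestral genome G into a descendant genome G' can be written as the cheapest weighted sequence of events, which reduces to the familiar substitution-only edit cost when all other event types are excluded:

    \[ c(G \to G') \;=\; \min_{E:\, G \xrightarrow{E} G'} \; \sum_{e \in E} w_{\mathrm{type}(e)}, \qquad \mathrm{type}(e) \in \{\text{substitution, indel, inversion, translocation, duplication, gene loss}\}. \]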

In summary, this workshop aims to gather leading researchers in this field to discuss common and specific problems, applications, and new models for large-scale phylogenetic and genomic analysis.

Accepted Speakers

James Farris
Molecular Systematics Laboratory, Swedish Museum of Natural History
Walter Fitch
Ecology & Evolutionary Biology, University of California, Irvine
Gonzalo Giribet
Museum of Comparative Zoology, Harvard University
Pablo Goloboff
INSUE, CONICET, Instituto Miguel Lillo
Daniel Janies
Department of Biomedical Informatics, The Ohio State University
Bret Larget
Department of Statistics, University of Wisconsin
Bernard Moret
Department of Computer Science, University of New Mexico
Usman Roshan
Computer Science Department, NJIT-CCS
Alexandros Stamatakis
Institute of Computer Science, Foundation for Research & Technology - Hellas
Andres Varon
Invertebrate Zoology Division, American Museum of Natural History
Ward Wheeler
Division of Invertebrate Zoology, American Museum of Natural History
Thursday, December 1, 2005
Time Session
09:15 AM - 10:00 AM
Walter Fitch - Inferring Migration from Phylogenies

I will present material based on the phylogenetic analysis of 181 hemagglutinin sequences from human influenza B viruses. Then, accepting the resulting phylogeny as substantially correct, I construct for those viruses a single character whose states are the geographic locations from which the viruses were isolated. One can then determine the most parsimonious accounting of this geographic character on the previously determined tree. A conventional nucleotide character would interpret a change from C to T as a mutation from cytosine to thymine; for my character it instead means a migration of a virus from California to Texas. One might hope that the resulting data would yield, over several concatenated migration paths, a general path by which influenza viruses spread. Initial results suggest multiple available paths, which in turn will require more sequences to tease the migration routes out of the noise.


Joint work with HoangMinh HoDac, Robert Wallace and Walter M. Fitch.
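
As a rough sketch of the character-optimization step described above (not the speakers' code), the small-parsimony count for a single unordered character, such as geographic location, can be computed on a fixed rooted binary tree with Fitch's algorithm; the toy tree, tip locations, and function name below are hypothetical:

    # Minimal Fitch parsimony for one unordered character on a fixed rooted
    # binary tree. Leaves carry observed states (here, geographic locations);
    # the bottom-up pass counts the minimum number of state changes (migrations).
    def fitch_count(tree, tip_states):
        """tree: nested tuples with tip names at the leaves."""
        changes = 0

        def state_set(node):
            nonlocal changes
            if isinstance(node, str):              # leaf: observed location
                return {tip_states[node]}
            left, right = node
            a, b = state_set(left), state_set(right)
            if a & b:                              # children agree: intersect
                return a & b
            changes += 1                           # conflict: one inferred migration
            return a | b                           # union of candidate states

        state_set(tree)
        return changes

    # Hypothetical example: four influenza B isolates with known locations.
    tree = ((("CA_1", "CA_2"), "TX_1"), "NY_1")
    locations = {"CA_1": "California", "CA_2": "California",
                 "TX_1": "Texas", "NY_1": "New York"}
    print(fitch_count(tree, locations))            # -> 2 inferred migrations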

10:30 AM - 11:15 AM
Diego Pol - Strategies for Parallelizing Heuristic Tree Searches Using Parsimony

Parallel heuristic tree searches for phylogenetic analyses are explored for datasets of 2,000 to 3,000 taxa under the maximum parsimony criterion. The efficiency of alternative strategies is assessed for several benchmark datasets. The searches were conducted in parallel on Beowulf clusters (using 10 to 60 processors) with several heuristic algorithms implemented in TNT, and were driven with the TNT scripting language.


Identifying an efficient combination of heuristic algorithms, and tuning their parameters for the parallel environment, is critical to the overall efficiency of the parallel tree search. Stopping rules for the heuristic search are discussed, including convergence to the best-known score of each dataset during independent trials and stabilization of the strict consensus of most parsimonious trees.


The problem of large datasets with poor phylogenetic structure is discussed through the analysis of 2359 hemagglutinin genes of influenza type A, which required the use of iterative constraints to achieve independent hits to minimum length in reasonable times.


Joint work with Pablo Goloboff and Daniel Janies.
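
One of the stopping rules mentioned above, stabilization of the strict consensus across independent hits, can be sketched as follows; the bipartition representation, the callable run_one_search, and the stopping threshold are illustrative assumptions rather than the TNT implementation:

    # Sketch of a stopping rule based on stabilization of the strict consensus.
    # Trees are represented as sets of bipartitions (frozensets of taxon names);
    # the strict consensus keeps only groups present in every tree found so far.
    def strict_consensus(trees):
        """trees: list of bipartition sets; returns the groups shared by all."""
        groups = set(trees[0])
        for t in trees[1:]:
            groups &= t
        return groups

    def search_until_stable(run_one_search, max_rounds=100, stable_rounds=3):
        """run_one_search() is assumed to return the bipartition set of the best
        tree found by one independent search (e.g. one parallel job)."""
        trees, unchanged, previous = [], 0, None
        for _ in range(max_rounds):
            trees.append(run_one_search())
            consensus = strict_consensus(trees)
            unchanged = unchanged + 1 if consensus == previous else 0
            previous = consensus
            if unchanged >= stable_rounds:         # consensus has stopped changing
                break
        return previous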

11:45 AM - 12:30 PM
Daniel Janies - Applications of Large-scale Phylogenetic Analysis for Research in Emerging Infectious Disease

Here we report on the application of large-scale phylogenetic analysis to track host shifts and measure surveillance quality for influenza A. Emerging infectious diseases and organisms present critical issues for public health and economic welfare and are thus important to monitor.


For example, antigenic typing of the hemagglutinin (HA) and neuraminidase (N) genes has historically been the lens through which the World Health Organization tracks influenza A and recommends vaccines. In recent years, highly pathogenic H5N1 subtype strains of avian influenza have emerged in wild and domestic birds, humans, and other mammals in Asia. Alarmingly, in the past few months H5N1 has spread through avian populations in Eastern Europe and by some accounts threatens to become pandemic among humans. Related viruses of the H5N2 subtype also circulate among birds in the Americas, but these have so far only affected birds.


Recently, as demonstrated by the coordinated international response to severe acute respiratory syndrome and avian influenza, emerging infectious diseases are now being addressed via the collection of nucleotide sequence data. However, our ability to derive information from large sequence datasets lags far behind their acquisition. Our work is directed at addressing this gap between data collection and analysis.


To perform a longitudinal study of influenza A, we use the best data currently in the public domain. We have performed phylogenetic and character optimization analyses of 2359 hemagglutinin nucleotide sequences from isolates of influenza A using heuristic searches in parallel TNT. These sequences were derived from viral isolates that range temporally from 1902 to the present, occur in human, avian, and several mammalian hosts, and represent 16 antigenic subtypes.


Direct human infection by avian strains of influenza A is considered rare. It is hypothesized that swine act as an intermediate host. However, our phylogenetic analysis reveals multiple independent events of avian-to-human transmission without intermediate hosts.


We also use this large, comprehensive analysis to assess the quality of surveillance of influenza A. Compared to a null hypothesis of no correspondence between the dates of isolation of viruses and the temporal pattern implied by their phylogenetic hierarchy, our ability to reconstruct the spread of H5 viruses is better than expected. However, the quality of surveillance of H5 viruses in Asia is much better than in the Americas, and this has been true since 1975.


Joint work with Pablo Goloboff and Diego Pol.
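
One simple way to frame such a null comparison (shown only as an illustration; this is not necessarily the statistic used in the study) is to score how smoothly isolation dates change over the tree and compare that score against trees whose tip dates have been randomly shuffled:

    # Hedged sketch of a permutation test for correspondence between isolation
    # dates and phylogenetic structure. The "smoothness" score (sum of absolute
    # date differences between sister subtrees, via subtree averages) is an
    # illustrative choice only.
    import random

    def date_score(tree, dates):
        """tree: nested tuples with tip names at the leaves; dates: tip -> year.
        Returns (score, mean date of the subtree)."""
        if isinstance(tree, str):
            return 0.0, float(dates[tree])
        (s1, d1), (s2, d2) = (date_score(child, dates) for child in tree)
        return s1 + s2 + abs(d1 - d2), (d1 + d2) / 2.0

    def permutation_p_value(tree, dates, n_perm=1000, seed=0):
        rng = random.Random(seed)
        observed = date_score(tree, dates)[0]
        tips, years = list(dates), list(dates.values())
        hits = 0
        for _ in range(n_perm):
            rng.shuffle(years)
            if date_score(tree, dict(zip(tips, years)))[0] <= observed:
                hits += 1                          # shuffled dates fit at least as well
        return hits / n_perm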

02:00 PM - 02:45 PM
Usman Roshan - Co-evolution of DCMs and Base-methods for Phylogeny Reconstruction

The Disk-Covering Method (DCM) is a divide-and-conquer booster technique for improving upon a given base method. It computes a decomposition of the input set of species into smaller overlapping subproblems, applies an external base method to compute subtrees, merges them into a supertree, and further refines the supertree with a local search method if necessary. DCMs have co-evolved with different base methods and optimization criteria since they were first introduced.


In this talk I will present the general framework of DCMs and discuss in detail the dataset decomposition aspect. I will present the successful application of DCMs to large-scale phylogeny reconstruction using (1) the neighbor joining method (under distance-based reconstruction), (2) GRAPPA (under gene-order data), (3) PAUP* and TNT (under maximum parsimony), (4) RAxML (under maximum likelihood), and (5) POY (under generalized tree alignment). I will highlight how DCMs evolved with the different base methods and optimization criteria, accompanied by various experimental performance studies.
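
The boosting loop described above can be sketched generically; the decomposition, base method, supertree merge, and refinement steps are passed in as functions because the talk surveys several concrete choices for each (the interface below is an assumption for illustration, not the actual DCM code):

    # Generic sketch of a Disk-Covering Method (DCM) boosting iteration:
    # decompose the taxon set into overlapping subproblems, solve each with a
    # base method, merge the subtrees into a supertree, then refine it.
    def dcm_boost(taxa, data, decompose, base_method, merge, refine, rounds=5):
        guide_tree = None
        for _ in range(rounds):
            subsets = decompose(taxa, data, guide_tree)    # overlapping taxon subsets
            subtrees = [base_method(subset, data) for subset in subsets]
            supertree = merge(subtrees)                    # e.g. a supertree/consensus merger
            guide_tree = refine(supertree, data)           # e.g. local search on the full data
        return guide_tree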

03:15 PM - 04:00 PM
Pablo Goloboff - On Divide and Conquer Strategies for Parsimony Analysis of Large Data Sets

A recent paper (Roshan et al., 2004) described a "divide-and-conquer" technique for the analysis of large data sets, rec-i-dcm3, and stated that it compares very favorably to results using TNT (the fastest parsimony program; Goloboff et al., 2003). The technique is based on creating reduced data sets, finding most parsimonious trees for each reduced data set, and then merging all parts together. The software Roshan et al. developed submits the reduced data sets to TNT; all their software does is create the reduced data sets (selecting disjoint sets of terminals) and then combine the results produced by TNT. Although not stated in the paper, the programs for rec-i-dcm3 actually do a complete round of TBR swapping on the complete data set (via TNT) every time the results from the reduced data sets are combined (optimality for a reduced data set rarely produces score improvements for the entire data set). The divide-and-conquer method in TNT (sectorial searches; Goloboff, 1999) avoids this by selecting non-disjoint sets of HTUs (which produces exact length calculations).


In this talk, I will show that Roshan et al.'s claim that rec-i-dcm3 outperforms the techniques in TNT is poorly substantiated. First, the settings they used for the TNT runs were extremely poor; very simple settings for TNT would have done a much better job. Second, having TNT analyze larger subproblems with more exhaustive algorithms (earlier versions only used multiple random addition sequences with TBR, or a few rounds of tree-drifting) produces much better results (which is merely a difference in implementation; the basic algorithms are the same). When this is done, the combined techniques used in TNT clearly outperform rec-i-dcm3.


On the other hand, rec-i-dcm3 depends on a global round of TBR after each cycle of subdivision, so that using any program other than TNT as the search engine becomes infeasible for very large data sets (e.g. for Roshan et al.'s largest data set, the TBR swapper in PAUP* runs about 800 times slower than the one in TNT). The global TBR becomes more and more critical as the data set is divided into smaller subproblems (because the combination of the results then produces a more suboptimal tree), so there is a clear limit to the number of taxa that can be reasonably analyzed with rec-i-dcm3. Additionally, since creating reduced data sets invariably produces a worse tree (which is subsequently improved by global TBR), rec-i-dcm3 is not truly a divide-and-conquer strategy (as publicized), but rather a technique of cyclic perturbations and improvements.


References:



  1. Goloboff, P. 1999. Analyzing large data sets in reasonable times. Cladistics 15:415-428.

  2. Goloboff, P., J. Farris, K. Nixon. 2003. T.N.T.: Tree Analysis Using New Technology. Program and documentation, available at http://www.zmuc.dk/public/phylogeny/tnt.

  3. Roshan, U., B. M. E. Moret, T. L. Williams, T. Warnow. 2004. Rec-I-DCM3: a fast algorithmic technique for reconstructing large phylogenetic trees. Proceedings of the IEEE Computational Systems Bioinformatics Conference (CSB 2004), Stanford, CA. Software available at http://www.cs.njit.edu/usman/RecIDCM3.html.

04:30 PM - 05:15 PM
James Farris - To Process Large Data Sets, Use a Fast Method...

To Process Large Data Sets, Use a Fast Method...

Friday, December 2, 2005
Time Session
09:00 AM - 09:45 AM
Bernard Moret - Large-scale Phylogenetic Reconstruction, the Tree of Life, and CIPRES

In this talk, I will review current computational activities aimed at reconstructing the Tree of Life, the evolutionary history of all living organisms (an estimated 10-100 million). Researchers and funding agencies worldwide have put renewed emphasis on the establishment of evolutionary relationships among living species; such relationships are fundamental to research in medicine, drug design, agriculture, ecology, and many other areas.


The CIPRES project (Cyberinfrastructure for Phylogenetic Research), which I direct, was founded to develop the informatics tools required to attempt a reconstruction of the Tree of Life. I will sketch the goals and current achievements of CIPRES, comment on future needs, and relate its work to that of other research efforts in phylogeny.


I will then focus on specific challenges that arise in reconstructing trees with 100,000 or more leaves, with particular emphasis on sources of error and on the methodological advances we will need to evaluate the quality of such reconstructions.

10:15 AM - 11:00 AM
Andres Varon - Minimum Description Length Phylogenetic Analysis

With the growing size of molecular datasets, additional transformation events must be considered under certain, not clearly defined, conditions. Hybridizations, duplications, inversions, tandem repeats, and horizontal gene transfers can be taken into account, but their relevance for a particular dataset is difficult to assess. We propose the Minimum Description Length (MDL) principle as a more fundamental optimality criterion for phylogenetic inference. We define a framework for its application in both its crude and refined versions; certain machine-level details that matter in the original MDL formulation are not relevant for biological research, and so we elaborate (and prefer) the crude version for full phylogenetic analysis. Finally, both MP and ML, under static and dynamic homologies, are shown to be particular cases of this more general framework.
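
In its crude two-part form, the criterion proposed above chooses the hypothesis (here, a tree together with its implied transformation events) that minimizes the total description length; the following is the standard statement of crude MDL, not the specific coding scheme of the talk:

    \[ T^{*} \;=\; \arg\min_{T} \; \bigl[\, L(T) + L(D \mid T) \,\bigr], \]

where L(T) is the number of bits needed to describe the tree and its transformation events, and L(D | T) is the number of bits needed to describe the data with the help of that hypothesis. Maximum parsimony and maximum likelihood then correspond to particular choices of these code lengths (e.g. \( L(D \mid T) = -\log_2 P(D \mid T) \) for a probabilistic model).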

11:30 AM - 12:15 PM
Ward Wheeler - Kolmogorov Complexity, Links with Parsimony and Likelihood, and Tests of Methods and Monophyly

Kolmogorov Complexity and the Minimum Description Length Principle provide a foundation for comparing and understanding the relationship between parsimony and likelihood methods. The case of binary characters is presented, showing that the conditional Kolmogorov complexity (K) of a cladogram given the root is equal to the parsimony score, and that the Tuffley and Steel (1997) likelihood can be derived from this value via the Universal Distribution. Furthermore, the Farris (1973) "maximum evolutionary path" likelihood can be viewed as the composition of edge transformation functions. The role of root complexity is presented in light of tests of monophyly and multiple origins of taxa. An example using metazoan ribosomal data is presented.
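
The bridge from description length to likelihood invoked above rests on the coding theorem for the Universal Distribution \( \mathbf{m} \): for any strings x and y, \( -\log_2 \mathbf{m}(x \mid y) = K(x \mid y) + O(1) \). Granting the abstract's premise that the relevant conditional complexity equals the parsimony score s, this gives, up to constants,

    \[ \mathbf{m}(x \mid y) \;\approx\; 2^{-s}, \]

so maximizing this universal likelihood over cladograms is equivalent to minimizing the parsimony score.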

01:45 PM - 02:30 PM
Gonzalo Giribet - What is Large? Analyzing Sets of Unaligned Sequence Data

While work on "large data sets" generally refers to hundreds or thousands of terminals, the problem of analyzing unaligned data considerably reduces the number of terminals that can be analyzed efficiently. The reason is the relationship between the number of possible alignments for a given set of sequences and the number of possible trees for a given number of terminals. Tree search for unaligned data therefore comprises a nested series of NP-hard problems. In this talk I will explore issues in analyzing medium-sized data sets that are not pre-aligned and the strategies we are exploring for such data sets.
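
To give a sense of scale for this nesting, the number of unrooted binary trees alone grows as a double factorial in the number of terminals n, and each candidate tree in turn carries its own combinatorial space of alignments; the tree counts below are the standard figures, quoted only for illustration:

    \[ B(n) \;=\; (2n-5)!! \;=\; \prod_{k=3}^{n} (2k-5), \qquad B(10) = 2{,}027{,}025, \qquad B(50) \approx 2.8 \times 10^{74}. \]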

03:00 PM - 03:45 PM
Alexandros Stamatakis - Computing Huge Trees with Maximum Likelihood: An HPC Perspective

The inference of phylogenies based on the Maximum Likelihood (ML) criterion has recently been demonstrated to be NP-hard. However, significant algorithmic advances have been achieved in recent years, which currently allow for ML-based analyses of 1,000 organisms within less than 24 hours on a single CPU. To date, most state-of-the-art ML programs are limited by their memory consumption to tree sizes of approximately 3,000 to 5,000 taxa. Thus, a new category of performance problems arises, which mainly concerns technical issues such as memory consumption and organization, cache efficiency, and optimization of the likelihood functions.


Within this context, the talk will initially focus, using RAxML (Randomized Axelerated ML) as an example, on rarely documented technical implementation details, which are of growing importance. RAxML has inherited and extended the technically highly efficient implementation of fastDNAml. A novel (unpublished) and very simple technical optimization of RAxML, which yields run-time improvements of approximately a factor of 10 on huge alignments comprising 25,000 taxa, will also be presented. In addition, the exploitation of fine-grained loop-level parallelism on SMPs (symmetric multiprocessing systems) and GPUs (graphics processing units) will be addressed. Finally, potential algorithmic and technical challenges, as well as solutions, for future large-scale inferences of 100,000 taxa will be briefly discussed.
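
The kind of fine-grained, per-site work being optimized can be illustrated with a generic Felsenstein-pruning step (a NumPy sketch, not RAxML's implementation): at each internal node the conditional likelihood arrays of the two children are combined across all alignment sites, which is both where the memory goes (one sites-by-states array per node) and where loop-level parallelism is available (sites are independent).

    # One step of Felsenstein's pruning algorithm, vectorized over sites.
    # cl_left, cl_right: conditional likelihood arrays of shape (sites, states)
    # P_left, P_right:   per-branch transition probability matrices (states, states)
    import numpy as np

    def combine_children(cl_left, cl_right, P_left, P_right):
        # For each site i and parent state a:
        #   L[i, a] = (sum_b P_left[a, b] * cl_left[i, b])
        #           * (sum_c P_right[a, c] * cl_right[i, c])
        # The site axis is independent, hence amenable to loop-level parallelism.
        return (cl_left @ P_left.T) * (cl_right @ P_right.T)

    # Illustrative sizes: 4 nucleotide states, 1,000 alignment sites.
    sites, states = 1000, 4
    rng = np.random.default_rng(0)
    cl_l, cl_r = rng.random((sites, states)), rng.random((sites, states))
    P = 0.80 * np.eye(states) + 0.05                   # toy transition matrix (rows sum to 1)
    parent = combine_children(cl_l, cl_r, P, P)        # shape (1000, 4)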

04:15 PM - 05:00 PM
Bret Larget - Bayesian MCMC Approaches for Large Gene-order Phylogenies

Methods for the analysis of genome arrangements for phylogenetic inference are complicated by the relative size of the space of possible arrangements. Unlike DNA, amino acid, or codon sequences where sites can be modeled as independent and there are 4, 20, or 61 possible states, genome arrangements must be considered as a single complicated character. In the case of animal mitochondrial genomes with 37 genes arranged on a circle, there are over 2*10^{52} arrangements. This massive state space requires extensive modifications to the algorithms for both likelihood and parsimony based analysis.
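
The figure quoted above is consistent with the standard count of signed circular gene orders: fixing the rotational position of the circle and identifying the two reading directions gives, for 37 signed genes (the arithmetic is shown only to unpack the order of magnitude),

    \[ \frac{2^{37} \cdot 36!}{2} \;=\; 2^{36} \cdot 36! \;\approx\; 6.9 \times 10^{10} \,\times\, 3.7 \times 10^{41} \;\approx\; 2.6 \times 10^{52}. \]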


In this talk I will describe the Bayesian approach to phylogenetic inference from genome arrangements with several example data sets. I will make comparisons between the Bayesian approach and the parsimony approach for genome arrangement data.

Participants

Name Email Affiliation
Abdul-Salim, Kobinah abdul-salim.1@osu.edu Evolution, Ecology, and Organismal Biology, The Ohio State University
Ast, Jennifer jca@umich.edu Ecology & Evolutionary Biology, University of Michigan
Bader, David bader@cc.gatech.edu College of Computing, Georgia Institute of Technology
Barrett, Craig barrett.586@osu.edu Evolution, Ecology, and Organismal Biology, The Ohio State University
Best, Janet jbest@mbi.osu.edu Mathematics, The Ohio State University
Brandley, Matthew brandley@berkeley.edu Museum of Vertebrate Zoology, University of California, Berkeley
Brown, Joseph josephwb@umich.edu Bird Division, Ecology & Evolutionary Biology, University of Michigan
Bucheli, Sibyl Rae bucheli.1@osu.edu Entomology, The Ohio State University
Chakrabarty, Prosanta pchakrab@umich.edu Fish Division, Ecology & Evolutionary Biology, University of Michigan
Cintron-Arias, Ariel acintro@unity.ncsu.edu Cornell University
Coddington, Jonathan Coddington@si.edu Invertebrate Zoology, Smithsonian Institution
Deckelman, Steven deckelmans@uwstout.edu Mathematical Biosciences Institute, The Ohio State University
Djordjevic, Marko mdjordjevic@mbi.osu.edu Department of Biophysics, University of Belgrade
Enciso, German German_Enciso@hms.harvard.edu Mathematics Department, University of California, Irvine
Farrington, Heather farrinhl@email.uc.edu Department of Biological Sciences, University of Cincinnati
Farris, James msl-farris@nrm.se Molecular Systematics Laboratory, Swedish Museum of Natural History
Fitch, Walter wfitch@uci.edu Ecology & Evolutionary Biology, University of California, Irvine
Galovich, Jennifer jgalovich@csbsju.edu Mathematics and Statistics, St. John's University
Giribet, Gonzalo gonzalo.giribet@gmail.com Museum of Comparative Zoology, Harvard University
Goel, Pranay goelpra@helix.nih.gov NIDDK, Indian Institute of Science Education and Research
Goloboff, Pablo pablogolo@csnat.unt.edu.ar INSUE, CONICET, Instituto Miguel Lillo
Grajdeanu, Paula pgrajdeanu@mbi.osu.edu Mathematics, Shenandoah University
Guralnick, Robert Robert.Guralnick@colorado.edu EEB and CU Museum, University of Colorado
Howard, Laura laurah@umich.edu Ecology & Evolutionary Biology, University of Michigan
Hsing, Tailen hsing@stat.ohio-state.edu Department of Statistics, The Ohio State University
Janies, Daniel Daniel.Janies@osumc.edu Department of Biomedical Informatics, The Ohio State University
Jin, Ruoming jinr@cse.ohio-state.edu Department of Computer Science, Kent State University
Just, Winfried just@math.ohio.edu Department of Mathematics, Ohio University
Krosnick, Shawn Evolution, Ecology and Organismal Biology, The Ohio State University
Larget, Bret larget@stat.wisc.edu Department of Statistics, University of Wisconsin
Lekveishvili, Mariam lekveishvili.1@osu.edu Entomology, The Ohio State University
Lim, Sookkyung limsk@math.uc.edu Department of Mathematical Sciences, University of Cincinnati
Lin, Shili lin.328@osu.edu Department of Statistics, The Ohio State University
Lindgren, Annie lindgren.11@osu.edu Evolution, Ecology and Organismal Biology, The Ohio State University
Lipscomb, Diana biodl@gwu.edu Department of Biological Sciences, George Washington University
Lou, Yuan lou@math.ohio-state.edu Department of Mathematics, The Ohio State University
Martin, Floyd martin.687@osu.edu Evolution, Ecology, and Organismal Biology, The Ohio State University
McClellan, David david_mcclellan@byu.edu Department of Integrative Biology, Brigham Young University
Miller, Mark mmiller@sdsc.edu Department of Biology, University of California, San Diego
Mishler, Brent bmishler@calmail.berkeley.edu Department of Integrative Biology, University of California, Berkeley
Moore, Michael mjmoore1@ufl.edu Florida Museum of Natural History, University of Florida
Moret, Bernard moret@cs.unm.edu Department of Computer Science, University of New Mexico
Nevai, Andrew anevai@mbi.osu.edu Department of Mathematics, University of Central Florida
Nixon, Kevin kcn2@cornell.edu Plant Biology, Cornell University
Perkins, Susan perkins@amnh.org Division of Invertebrate Zoology, American Museum of Natural History
Piontkivska, Helen opiontki@kent.edu Biological Sciences, Kent State University
Pol, Diego dpol@mbi.osu.edu Independent Researcher, Museo Paleontologico E. Feruglio
Potter, Dustin potter.153@osu.edu The Ohio State University
Raczkowski, Joe raczkowski.2@osu.edu Entomology, The Ohio State University
Roshan, Usman usman@oak.njit.edu Computer Science Department, NJIT-CCS
Salter Kubatko, Laura salter@math.unm.edu Department of Statistics, University of New Mexico
Schugart, Richard richard.schugart@wku.edu Department of Mathematics, Western Kentucky University
Srinivasan, Parthasarathy psrinivasan@mbi.osu.edu Department of Mathematics, Cleveland State University
Stamatakis, Alexandros stamatak@ics.forth.gr Institute of Computer Science, Foundation for Research & Technology - Hellas
Stigler, Brandilyn bstigler@mbi.osu.edu Department of Mathematics, Southern Methodist University
Terman, David terman@math.ohio-state.edu Mathematics Department, The Ohio State University
Tian, Jianjun (Paul) tianjj@mbi.osu.edu Mathematical Biosciences Institute, The Ohio State University
Vakalis, Ignatios ivakalis@capital.edu Mathematics & Computer Sc, Capital University
Varon, Andres avaron@verizon.net Invertebrate Zoology Division, American Museum of Natural History
Verducci, Joseph verducci.1@osu.edu Department of Statistics, The Ohio State University
Wang, Zailong zlwang@mbi.osu.edu Integrated Information Sciences, Novartis
Wheeler, Ward wheeler@amnh.org Division of Invertebrate Zoology, American Museum of Natural History
Zhou, Jin jzhou@mbi.osu.edu Department of Mathematics, Northern Michigan University