Organizers
In this workshop, we will focus on recent advances in phylogenetic analysis of large datasets considering two major aspects. The first is the problem of analysis of datasets with a large number of taxa (exceeding several hundreds). This includes recent improvements in tree search algorithms and synergistic application of parallelization strategies using Beowulf clusters and distributed computing. The workshop participants will also compare and discuss the use of improved search tools for competing methods (e.g., parsimony, maximum likelihood, Bayesian). This discussion will illustrate advantages and problems associated specific to various methods with the analysis of datasets with large number of taxa.
The second point of discussion will be the problems associated with analysis of large comparative genomic datasets with various types of information (homology, rearrangement, origin, loss, and multiplication among loci between ancestor and descendant genomes). This problem includes the generalization of edit cost models to account for of multiple kinds of genomic change rather than only DNA substitution.
In summary, this workshop aims to gather leading researchers in this field to discuss common and specific problems, applications, and new models for largescale phylogenetic and genomic analysis.
Accepted Speakers
Thursday, December 1, 2005  

Time  Session 
09:15 AM 10:00 AM  Walter Fitch  Inferring Migration from Phylogenies I will be presenting material based on the phylogenetic analysis of 181 hemagglutinin sequences from human viruses of the B type. Then, accepting the resulting phylogeny as substantially correct, I construct for those viruses a single character whose character states are the geographic locations from which the viruses were isolated. One can then determine the most parsimonious accounting of the geographic vector on the previously determined tree. A common vector consisting of nucleotides would interpret a change from C to T as a mutation from Cytidine to Thymine. But for my vector it would mean a migration of a virus from California to Texas. One might hope that the resulting data would lend, over several concatenated migration paths a general path by which influenza viruses spread. Initial results suggest multiple available paths which, in turn, will require more sequences to tease out the migration routes from the noise. Joint work with HoangMinh HoDac, Robert Wallace and Walter M. Fitch. 
10:30 AM 11:15 AM  Diego Pol  Strategies for Parallelizing of Heuristic Tree Searches Using Parsimony Parallel heuristic tree searches for phylogenetic analyses are explored for datasets consisting of 2,000 to 3,000 taxa using the maximum parsimony criterion. The efficiency of alternative strategies is assessed for several benchmark datasets. The searches were conducted in parallel in Beowulf clusters (using 10 to 60 processors) using several heuristic algorithms implemented in TNT. The searches were driven with the scripting language of TNT. The identification of an efficient combination of heuristic algorithms and tuning of the algorithms parameters in the context of the parallel environment is critical to the overall efficiency of the parallel tree search. Stopping rules for the heuristic search are discussed, including the convergence to the bestknown score of each dataset during independent trials and the stabilization of the strict consensus of most parsimonious trees. The problems of large datasets with poor phylogenetic structure is discussed through the analysis of 2359 hemagglutinin genes of influenza type A that required the use of iterative constraints to achieve independent hits to minimum lengths in reasonable times. Joint work with Pablo Goloboff and Daniel Janies. 
11:45 AM 12:30 PM  Daniel Janies  Applications of Largescale Phylogenetic Analysis for Research in Emerging Infectious Disease Here we report on the application of largescale phylogenetic analysis to track host shifts and measure surveillance quality among influenza A. Emerging infectious diseases and organisms present critical issues for public health and economic welfare and are thus important to monitor. For example, historically antigenic typing of the hemagglutinin (HA) and neuraminidase (N) genes is the lens through which the World Health Organization tracks influenza A and recommends vaccines. In recent years, highly pathogenic H5N1 subtype strains of avian influenza have emerged in wild and domestic birds and humans and other mammals in Asia. Alarmingly, in the past few months H5N1 has spread through avian populations in Eastern Europe and by some accounts threatens to become pandemic among humans. Related viruses of H5N2 subtype also circulate among birds in the Americas but these so far have only affected birds. Recently, as demonstrated by the coordinated international response to severe acute respiratory syndrome and avian influenza, emerging infectious diseases are now being addressed via the collection of nucleotide sequence data. However, our ability to derive information from large sequence datasets lags far behind their acquisition. Our work is directed at addressing this gap between data collection and analysis. To perform a longitudinal study of influenza A, we use the best data currently in the public domain. We have performed phylogenetic and character optimization analysis of 2359 hemagglutinin nucleotide sequences from isolates of influenza A using heuristic searches in parallel TNT. These sequences were derived from viral isolates that range temporally from 1902 to present, occur in human, avian, and several mammal hosts, and represent 16 antigenic subtypes. Direct human infection by avian strains of influenza A is considered rare. It is hypothesized swine act as an intermediate host. Our phylogenetic analysis of reveals multiple independent events of avian to human transmission without intermediate hosts. We also use this large comprehensive analysis to assess the quality of surveillance of influenza A. Compared to a null hypothesis of no correspondence among date of isolation of viruses and the temporal pattern implied by the phylogenetic hierarchy of viruses, our ability to reconstruct the spread of H5 viruses is better than expected. However, surveillance quality of H5 viruses in Asia is much better than in the Americas and this has been true since 1975. Joint work with Pablo Goloboff and Diego Pol. 
02:00 PM 02:45 PM  Usman Roshan  Coevolution of DCMs and Basemethods for Phylogeny Reconstruction The DiskCovering Method (DCM) is a divideandconquer booster technique for improving upon a given base method. It computes a decomposition on the input set of species into smaller overlapping subproblems, applies an external base method to compute subtrees, merges them into a supertree, and further refines the supertree with a local search method if necessary. DCMs have coevolved with different base methods and optimization criteria since they were first introduced. In this talk I will discuss the general framework of DCMs and discuss in detail the dataset decomposition aspect. I will present the successful application of DCMs for largescale phylogeny reconstruction using (1) the neighbor joining method (under distancebased reconstruction), (2) GRAPPA (under geneorder data), (3) PAUP* and TNT (under maximum parsimony), (4) RAxML (under maximum likelihood), and (5) Poy (under generalized tree alignment). I will highlight how DCM evolved with the different basemethods and optimization criteria accompanied by various experimental performance studies. 
03:15 PM 04:00 PM  Pablo Goloboff  On Divide and Conquer Strategies for Parsimony Analysis of Large Data Sets A recent paper (Roshan et al., 2004) described a "divideandconquer" technique for analysis of large data sets, recidcm3, and stated that it compares very favorably to results using TNT (the fastest parsimony program; Goloboff et al., 2003). The technique is based on creating reduced data sets, finding most parsimonious trees for the reduced data set, and then merging all parts together. The software Roshan et al. developed submits the reduced data sets to TNT; all their software does is creating the reduced data sets (selecting disjoint sets of terminals) and then combining the results produced by TNT. Although not stated in the paper, the programs for recidcm3 actually do a complete round of TBR swapping to the complete data set (via TNT) every time the results from the reduced data sets are combined (optimality for a reduced data set rarely produces score improvements for the entire data set). The divideandconquer method in TNT (sectorial searches; Goloboff, 1999), avoids this by selecting nondisjoint sets of HTU's (which produces exact length calculations). In this talk, I will show that Roshan et al.'s claims that recidcm3 outperforms the techniques in TNT is poorly substantiated. First, the settings they used for the TNT runs were extremely poor; very simple settings for TNT would have done a much better job. Second, having TNT analyze larger subproblems, with more exhaustive algorithms (earlier versions only used multiple random addition sequences with TBR, or a few rounds of treedrifting), produces much better results (which is merely a difference in implementation; the basic algorithms are the same). When this is done, the combined techniques used in TNT clearly outperform recidcm3. On the other side, recidcm3 depends on a global round of TBR after each cycle of subdivision, so that using as search engine any program other than TNT becomes unfeasible in the case of very large data sets (e.g. for Roshan et al.'s largest data set, the TBR swapper in PAUP* runs about 800 times slower than the one in TNT). The global TBR becomes more and more critical as the data set is divided into smaller subproblems (because then the combination of the results produces a more suboptimal tree), so that there is a clear limit to the number of taxa that can be reasonably analyzed with recidcm3. Additionally, since creating reduced data sets to improve the results produces invariably a worse tree (which is subsequently improved by global TBR), recidcm3, is not truly a divideandconquer strategy (as publicized), but instead a technique for cyclic perturbations and improvements. References:

04:30 PM 05:15 PM  James Farris  To Process Large Data Sets, Use a Fast Method... To Process Large Data Sets, Use a Fast Method... 
Friday, December 2, 2005  

Time  Session 
09:00 AM 09:45 AM  Bernard Moret  Largescale Phylogenetic Reconstruction, the Tree of Life, and CIPRES In this talk, I will review current computational activities aimed at reconstructing the Tree of Life, the evolutionary history of all living organisms (an estimated 10100 million). Researchers and funding agencies worldwide have put renewed emphasis on the establishment of evolutionary relationships among living species; such relationships are fundamental to research in medicine, drug design, agriculture, ecology, and many other areas. The CIPRES (Cyber Infrastructure for Phylogenetic Research), which I direct, was founded to develop the informatics tools required to attempt a reconstruction of the Tree of Life. I will sketch the goal and current achievements of CIPRES, comment on future needs, and relate its work to that of other research efforts in phylogeny. I will then focus on specific challenges that arise in reconstructing trees with 100,000 or more leaves, with particular emphasis on sources of error and on the methodological advances we will need to evaluate the quality of such reconstructions. 
10:15 AM 11:00 AM  Andres Varon  Minimum Description Length Phylogenetic Analysis With the growing size of molecular datasets, additional transformation events are required under certain, not clearly defined, conditions. Hybridizations, duplications, inversions, tandem repeats, horizontal gene transfers, can be taken into consideration, but their relevance for a particular dataset is difficult to assess. We propose the Minimum Description Length Principle (MDL) as a more fundamental optimality criterion for phylogenetic inference. We define a framework for the application in its crude and refined versions; for biological research purposes certain machine details of importance are not relevant in the original MDL, and so, we elaborate (and prefer) the crude version to perform full phylogenetic analysis. Finally both MP and ML under static and dynamic homologies are shown to be two particular cases of this more general framework. 
11:30 AM 12:15 PM  Ward Wheeler  Kolmogorov Complexity, Links with Parsimony and Likelihood, and Tests of Methods and Monophyly Kolmogorov Cmplexity and the Minimum Description Length Principle provide a foundation to compare and understand the relationship between parsimony and likelihood methods. The case of binary characters is presented showing that condition Kolmogorov Complexity (K) of a cladogram given the root, is equal to the parsimony score and that the Tuffley and Steel (1997) likelihood can be derived from this value via the Universal Distribution. Furthermore, the Farris (1973) "maximum evolutionary path" likelihood can be viewed as the composition of edge transformation functions. The role of root complexity is presented in light of tests of monophyly and multiple origins of taxa. An example using metazoan ribosomal data is presented. 
01:45 PM 02:30 PM  Gonzalo Giribet  What is Large? Analyzing Sets of Unaligned Sequence Data While people focusing on "large data sets" generally refers to hundreds or thousands of terminals, the problem of analyzing sets of unaligned data considerably reduces the number of terminals that can be analyzed in an efficient manner. The reason for that is the relationship between the number of possible alignments that exist for a given set of sequences and the number of possible trees for a given number of terminals. Therefore, tree search for unaligned data includes a nested series of NPhard problems. In this talk I will explore issues about analyzing mediumsize data sets when they are not prealigned and the strategies we are exploring for such data sets. 
03:00 PM 03:45 PM  Alexandros Stamatakis  Computing Huge Trees with Maximum Likelihood: An HPC Perspective The inference of phylogenies based on the Maximum Likelihood (ML) criterion has recently been demonstrated to be NPhard. However, significant algorithmic advances have been achieved over the last years which currently allow for MLbased analyses of 1,000 organisms within less than 24 hours on a single CPU. To date, most stateoftheart ML programs are limited by their memory consumption to tree sizes of approximately 3,000 to 5,000 taxa. Thus, a new category of performance problems arises which mainly concerns technical issues such as memory consumption/organization, cache efficiency, and optimization of the likelihood functions. Within this context, the talk will by example of RAxML (Randomized Axelerated ML) initially focus on the rarely documented technical implementation details, which are of growing importance. RAxML has inherited and extended the technically extremely efficient implementation of fastDNAml. A novel (unpublished) and very simple technical optimization of RAxML, which yields runtime improvements of approximately factor 10 on huge alignments comprising 25,000 taxa, will also be presented. In addition, the exploitation of finegrained looplevel parallelism on SMPs (Symmetric MultiProcessing) and GPUs (Graphics Processing Units) will be addressed. Finally, potential algorithmic and technical challenges as well as solutions for future largescale inferences of 100,000 taxa will briefly be discussed. 
04:15 PM 05:00 PM  Bret Larget  Bayesian MCMC Approaches for Large Geneorder Phylogenies Methods for the analysis of genome arrangements for phylogenetic inference are complicated by the relative size of the space of possible arrangements. Unlike DNA, amino acid, or codon sequences where sites can be modeled as independent and there are 4, 20, or 61 possible states, genome arrangements must be considered as a single complicated character. In the case of animal mitochondrial genomes with 37 genes arranged on a circle, there are over 2*10^{52} arrangements. This massive state space requires extensive modifications to the algorithms for both likelihood and parsimony based analysis. In this talk I will describe the Bayesian approach to phylogenetic inference from genome arrangements with several example data sets. I will make comparisons between the Bayesian approach and the parsimony approach for genome arrangement data. 
Name  Affiliation  

AbdulSalim, Kobinah  abdulsalim.1@osu.edu  Evolution, Ecology, and Organismal Biology, The Ohio State University 
Ast, Jennifer  jca@umich.edu  Ecology & Evolutionary Biology, University of Michigan 
Bader, David  bader@cc.gatech.edu  College of Computing, Georgia Institute of Technology 
Barrett, Craig  barrett.586@osu.edu  Evolution, Ecology, and Organismal Biology, The Ohio State University 
Best, Janet  jbest@mbi.osu.edu  Mathematics, The Ohio State University 
Brandley, Matthew  brandley@berkeley.edu  Museum of Vertebrate Zoology, University of California, Berkeley 
Brown, Joseph  josephwb@umich.edu  Bird Division, Ecology & Evolutionary Biology, University of Michigan 
Bucheli, Sibyl Rae  bucheli.1@osu.edu  Entomology, The Ohio State University 
Chakrabarty, Prosanta  pchakrab@umich.edu  Fish Division, Ecology & Evolutionary Biology, University of Michigan 
CintronArias, Ariel  acintro@unity.ncsu.edu  Cornell University, Cornell University 
Coddington, Jonathan  Coddington@si.edu  Invertebrate Zoology, Smithsonian Institution 
Deckelman, Steven  deckelmans@uwstout.edu  Mathematical Biosciences Institute, The Ohio State University 
Djordjevic, Marko  mdjordjevic@mbi.osu.edu  Department of Biophysics, University of Belgrade 
Enciso, German  German_Enciso@hms.harvard.edu  Mathematics Department, University of California, Irvine 
Farrington, Heather  farrinhl@email.uc.edu  Department of Biological Sciences, University of Cincinnati 
Farris, James  mslfarris@nrm.se  Molecular Systematics Laboratory, Swedish Museum of Natural History 
Fitch, Walter  wfitch@uci.edu  Ecology & Evolutionary Biology, University of California, Irvine 
Galovich, Jennifer  jgalovich@csbsju.edu  Mathematics and Statistics, St. John's University 
Giribet, Gonzalo  gonzalo.giribet@gmail.com  Museum of Comparative Zoology, Harvard University 
Goel, Pranay  goelpra@helix.nih.gov  NIDDK, Indian Institute of Science Education and Research 
Goloboff, Pablo  pablogolo@csnat.unt.edu.ar  INSUE, CONICET, Instituto Miguel Lillo 
Grajdeanu, Paula  pgrajdeanu@mbi.osu.edu  Mathematics, Shenandoah University 
Guralnick, Robert  Robert.Guralnick@colorado.edu  EEB and CU Museum, University of Colorado 
Howard, Laura  laurah@umich.edu  Ecology & Evolutionary Biology, University of Michigan 
Hsing, Tailen  hsing@stat.ohiostate.edu  Department of Statistics, The Ohio State University 
Janies, Daniel  Daniel.Janies@osumc.edu  Department of Biomedical Informatics, The Ohio State University 
Jin, Ruoming  jinr@cse.ohiostate.edu  Department of Computer Science, Kent State University 
Just, Winfried  just@math.ohio.edu  Department of Mathematics, Ohio University 
Krosnick, Shawn  Evolution, Ecology and Organismal Biology, The Ohio State University  
Larget, Bret  larget@stat.wisc.edu,  Department of Statistics, University of Wisconsin 
Lekveishvili, Mariam  lekveishvili.1@osu.edu  Entomology, The Ohio State University 
Lim, Sookkyung  limsk@math.uc.edu  Department of Mathematical Sciences, University of Cincinnati 
Lin, Shili  lin.328@osu.edu  Department of Statistics, The Ohio State University 
Lindgren, Annie  lindgren.11@osu.edu  Evolution, Ecology and Organismal Biology, The Ohio State University 
Lipscomb, Diana  biodl@gwu.edu  Department of Biological Sciences, George Washington University 
Lou, Yuan  lou@math.ohiostate.edu  Department of Mathematics, The Ohio State University 
Martin, Floyd  martin.687@osu.edu  Evolution, Ecology, and Organismal Biology, The Ohio State University 
McClellan, David  david_mcclellan@byu.edu  Department of Integrative Biology, Brigham Young University 
Miller, Mark  mmiller@sdsc.edu  Department of Biology, University of California, San Diego 
Mishler, Brent  bmishler@calmail.berkeley.edu  Department of Integrative Biology, University of California, Berkeley 
Moore, Michael  mjmoore1@ufl.edu  Florida Museum of Natural History, University of Florida 
Moret, Bernard  moret@cs.unm.edu  Department of Computer Science, University of New Mexico 
Nevai, Andrew  anevai@mbi.osu.edu  Department of Mathematics, University of Central Florida 
Nixon, Kevin  kcn2@cornell.edu  Plant Biology, Cornell University 
Perkins, Susan  perkins@amnh.org  Division of Invertebrate Zoology, American Museum of Natural History 
Piontkivska, Helen  opiontki@kent.edu  Biological Sciences, Kent State University 
Pol, Diego  dpol@mbi.osu.edu  Independent Researcher, Museo Paleontologico E. Feruglio 
Potter, Dustin  potter.153@osu.edu  The Ohio State University 
Raczkowski, Joe  raczkowski.2@osu.edu  Entomology, The Ohio State University 
Roshan, Usman  usman@oak.njit.edu  Computer Science Department, NJITCCS 
Salter Kubatko, Laura  salter@math.unm.edu  Department of Statistics, University of New Mexico 
Schugart, Richard  richard.schugart@wku.edu  Department of Mathematics, Western Kentucky University 
Srinivasan, Parthasarathy  psrinivasan@mbi.osu.edu  Department of Mathematics, Cleveland State University 
Stamatakis, Alexandros  stamatak@ics.forth.gr  Institute of Computer Science, Foundation for Research & Technology  Hellas 
Stigler, Brandilyn  bstigler@mbi.osu.edu  Department of Mathematics, Southern Methodist University 
Terman, David  terman@math.ohiostate.edu  Mathemathics Department, The Ohio State University 
Tian, Jianjun (Paul)  tianjj@mbi.osu.edu  Mathematical Biosciences Institute, The Ohio State University 
Vakalis, Ignatios  ivakalis@capital.edu  Mathematics & Computer Sc, Capital University 
Varon, Andres  avaron@verizon.net  Invertebrate Zoology Division, American Museum of Natural History 
Verducci, Joseph  verducci.1@osu.edu  Department of Statistics, The Ohio State University 
Wang, Zailong  zlwang@mbi.osu.edu  Integrated Information Sciences, Novartis 
Wheeler, Ward  wheeler@amnh.org  Division of Invertebrate Zoology, American Museum of Natural History 
Zhou, Jin  jzhou@mbi.osu.edu  Department of Mathematics, Northern Michigan University 
To Process Large Data Sets, Use a Fast Method...
I will be presenting material based on the phylogenetic analysis of 181 hemagglutinin sequences from human viruses of the B type. Then, accepting the resulting phylogeny as substantially correct, I construct for those viruses a single character whose character states are the geographic locations from which the viruses were isolated. One can then determine the most parsimonious accounting of the geographic vector on the previously determined tree. A common vector consisting of nucleotides would interpret a change from C to T as a mutation from Cytidine to Thymine. But for my vector it would mean a migration of a virus from California to Texas. One might hope that the resulting data would lend, over several concatenated migration paths a general path by which influenza viruses spread. Initial results suggest multiple available paths which, in turn, will require more sequences to tease out the migration routes from the noise.
Joint work with HoangMinh HoDac, Robert Wallace and Walter M. Fitch.
While people focusing on "large data sets" generally refers to hundreds or thousands of terminals, the problem of analyzing sets of unaligned data considerably reduces the number of terminals that can be analyzed in an efficient manner. The reason for that is the relationship between the number of possible alignments that exist for a given set of sequences and the number of possible trees for a given number of terminals. Therefore, tree search for unaligned data includes a nested series of NPhard problems. In this talk I will explore issues about analyzing mediumsize data sets when they are not prealigned and the strategies we are exploring for such data sets.
A recent paper (Roshan et al., 2004) described a "divideandconquer" technique for analysis of large data sets, recidcm3, and stated that it compares very favorably to results using TNT (the fastest parsimony program; Goloboff et al., 2003). The technique is based on creating reduced data sets, finding most parsimonious trees for the reduced data set, and then merging all parts together. The software Roshan et al. developed submits the reduced data sets to TNT; all their software does is creating the reduced data sets (selecting disjoint sets of terminals) and then combining the results produced by TNT. Although not stated in the paper, the programs for recidcm3 actually do a complete round of TBR swapping to the complete data set (via TNT) every time the results from the reduced data sets are combined (optimality for a reduced data set rarely produces score improvements for the entire data set). The divideandconquer method in TNT (sectorial searches; Goloboff, 1999), avoids this by selecting nondisjoint sets of HTU's (which produces exact length calculations). In this talk, I will show that Roshan et al.'s claims that recidcm3 outperforms the techniques in TNT is poorly substantiated. First, the settings they used for the TNT runs were extremely poor; very simple settings for TNT would have done a much better job. Second, having TNT analyze larger subproblems, with more exhaustive algorithms (earlier versions only used multiple random addition sequences with TBR, or a few rounds of treedrifting), produces much better results (which is merely a difference in implementation; the basic algorithms are the same). When this is done, the combined techniques used in TNT clearly outperform recidcm3. On the other side, recidcm3 depends on a global round of TBR after each cycle of subdivision, so that using as search engine any program other than TNT becomes unfeasible in the case of very large data sets (e.g. for Roshan et al.'s largest data set, the TBR swapper in PAUP* runs about 800 times slower than the one in TNT). The global TBR becomes more and more critical as the data set is divided into smaller subproblems (because then the combination of the results produces a more suboptimal tree), so that there is a clear limit to the number of taxa that can be reasonably analyzed with recidcm3. Additionally, since creating reduced data sets to improve the results produces invariably a worse tree (which is subsequently improved by global TBR), recidcm3, is not truly a divideandconquer strategy (as publicized), but instead a technique for cyclic perturbations and improvements.
References:
 Goloboff, P. 1999. Analyzing large data sets in reasonable times. Cladistics 15:415428.
 Goloboff, P., J. Farris, K. Nixon. 2003. T.N.T.: Tree Analysis Using New Technology. Program and documentation, available at http://www.zmuc.dk/public/phylogeny/tnt.
 U. Roshan, B. M. E. Moret, T. L. Williams, T. Warnow, RecIDCM3: A Fast Algorithmic Technique for Reconstructing Large Phylogenetic Trees, Proceedings of the IEEE Computational Systems Bioinformatics (CSB04) 2004, Stanford (CA), USA (2004). Software available at http://www.cs.njit.edu/usman/RecIDCM3.html.
Here we report on the application of largescale phylogenetic analysis to track host shifts and measure surveillance quality among influenza A. Emerging infectious diseases and organisms present critical issues for public health and economic welfare and are thus important to monitor.
For example, historically antigenic typing of the hemagglutinin (HA) and neuraminidase (N) genes is the lens through which the World Health Organization tracks influenza A and recommends vaccines. In recent years, highly pathogenic H5N1 subtype strains of avian influenza have emerged in wild and domestic birds and humans and other mammals in Asia. Alarmingly, in the past few months H5N1 has spread through avian populations in Eastern Europe and by some accounts threatens to become pandemic among humans. Related viruses of H5N2 subtype also circulate among birds in the Americas but these so far have only affected birds.
Recently, as demonstrated by the coordinated international response to severe acute respiratory syndrome and avian influenza, emerging infectious diseases are now being addressed via the collection of nucleotide sequence data. However, our ability to derive information from large sequence datasets lags far behind their acquisition. Our work is directed at addressing this gap between data collection and analysis.
To perform a longitudinal study of influenza A, we use the best data currently in the public domain. We have performed phylogenetic and character optimization analysis of 2359 hemagglutinin nucleotide sequences from isolates of influenza A using heuristic searches in parallel TNT. These sequences were derived from viral isolates that range temporally from 1902 to present, occur in human, avian, and several mammal hosts, and represent 16 antigenic subtypes.
Direct human infection by avian strains of influenza A is considered rare. It is hypothesized swine act as an intermediate host. Our phylogenetic analysis of reveals multiple independent events of avian to human transmission without intermediate hosts.
We also use this large comprehensive analysis to assess the quality of surveillance of influenza A. Compared to a null hypothesis of no correspondence among date of isolation of viruses and the temporal pattern implied by the phylogenetic hierarchy of viruses, our ability to reconstruct the spread of H5 viruses is better than expected. However, surveillance quality of H5 viruses in Asia is much better than in the Americas and this has been true since 1975.
Joint work with Pablo Goloboff and Diego Pol.
Methods for the analysis of genome arrangements for phylogenetic inference are complicated by the relative size of the space of possible arrangements. Unlike DNA, amino acid, or codon sequences where sites can be modeled as independent and there are 4, 20, or 61 possible states, genome arrangements must be considered as a single complicated character. In the case of animal mitochondrial genomes with 37 genes arranged on a circle, there are over 2*10^{52} arrangements. This massive state space requires extensive modifications to the algorithms for both likelihood and parsimony based analysis.
In this talk I will describe the Bayesian approach to phylogenetic inference from genome arrangements with several example data sets. I will make comparisons between the Bayesian approach and the parsimony approach for genome arrangement data.
In this talk, I will review current computational activities aimed at reconstructing the Tree of Life, the evolutionary history of all living organisms (an estimated 10100 million). Researchers and funding agencies worldwide have put renewed emphasis on the establishment of evolutionary relationships among living species; such relationships are fundamental to research in medicine, drug design, agriculture, ecology, and many other areas.
The CIPRES (Cyber Infrastructure for Phylogenetic Research), which I direct, was founded to develop the informatics tools required to attempt a reconstruction of the Tree of Life. I will sketch the goal and current achievements of CIPRES, comment on future needs, and relate its work to that of other research efforts in phylogeny.
I will then focus on specific challenges that arise in reconstructing trees with 100,000 or more leaves, with particular emphasis on sources of error and on the methodological advances we will need to evaluate the quality of such reconstructions.
Parallel heuristic tree searches for phylogenetic analyses are explored for datasets consisting of 2,000 to 3,000 taxa using the maximum parsimony criterion. The efficiency of alternative strategies is assessed for several benchmark datasets. The searches were conducted in parallel in Beowulf clusters (using 10 to 60 processors) using several heuristic algorithms implemented in TNT. The searches were driven with the scripting language of TNT.
The identification of an efficient combination of heuristic algorithms and tuning of the algorithms parameters in the context of the parallel environment is critical to the overall efficiency of the parallel tree search. Stopping rules for the heuristic search are discussed, including the convergence to the bestknown score of each dataset during independent trials and the stabilization of the strict consensus of most parsimonious trees.
The problems of large datasets with poor phylogenetic structure is discussed through the analysis of 2359 hemagglutinin genes of influenza type A that required the use of iterative constraints to achieve independent hits to minimum lengths in reasonable times.
Joint work with Pablo Goloboff and Daniel Janies.
The DiskCovering Method (DCM) is a divideandconquer booster technique for improving upon a given base method. It computes a decomposition on the input set of species into smaller overlapping subproblems, applies an external base method to compute subtrees, merges them into a supertree, and further refines the supertree with a local search method if necessary. DCMs have coevolved with different base methods and optimization criteria since they were first introduced.
In this talk I will discuss the general framework of DCMs and discuss in detail the dataset decomposition aspect. I will present the successful application of DCMs for largescale phylogeny reconstruction using (1) the neighbor joining method (under distancebased reconstruction), (2) GRAPPA (under geneorder data), (3) PAUP* and TNT (under maximum parsimony), (4) RAxML (under maximum likelihood), and (5) Poy (under generalized tree alignment). I will highlight how DCM evolved with the different basemethods and optimization criteria accompanied by various experimental performance studies.
The inference of phylogenies based on the Maximum Likelihood (ML) criterion has recently been demonstrated to be NPhard. However, significant algorithmic advances have been achieved over the last years which currently allow for MLbased analyses of 1,000 organisms within less than 24 hours on a single CPU. To date, most stateoftheart ML programs are limited by their memory consumption to tree sizes of approximately 3,000 to 5,000 taxa. Thus, a new category of performance problems arises which mainly concerns technical issues such as memory consumption/organization, cache efficiency, and optimization of the likelihood functions.
Within this context, the talk will by example of RAxML (Randomized Axelerated ML) initially focus on the rarely documented technical implementation details, which are of growing importance. RAxML has inherited and extended the technically extremely efficient implementation of fastDNAml. A novel (unpublished) and very simple technical optimization of RAxML, which yields runtime improvements of approximately factor 10 on huge alignments comprising 25,000 taxa, will also be presented. In addition, the exploitation of finegrained looplevel parallelism on SMPs (Symmetric MultiProcessing) and GPUs (Graphics Processing Units) will be addressed. Finally, potential algorithmic and technical challenges as well as solutions for future largescale inferences of 100,000 taxa will briefly be discussed.
With the growing size of molecular datasets, additional transformation events are required under certain, not clearly defined, conditions. Hybridizations, duplications, inversions, tandem repeats, horizontal gene transfers, can be taken into consideration, but their relevance for a particular dataset is difficult to assess. We propose the Minimum Description Length Principle (MDL) as a more fundamental optimality criterion for phylogenetic inference. We define a framework for the application in its crude and refined versions; for biological research purposes certain machine details of importance are not relevant in the original MDL, and so, we elaborate (and prefer) the crude version to perform full phylogenetic analysis. Finally both MP and ML under static and dynamic homologies are shown to be two particular cases of this more general framework.
Kolmogorov Cmplexity and the Minimum Description Length Principle provide a foundation to compare and understand the relationship between parsimony and likelihood methods. The case of binary characters is presented showing that condition Kolmogorov Complexity (K) of a cladogram given the root, is equal to the parsimony score and that the Tuffley and Steel (1997) likelihood can be derived from this value via the Universal Distribution. Furthermore, the Farris (1973) "maximum evolutionary path" likelihood can be viewed as the composition of edge transformation functions. The role of root complexity is presented in light of tests of monophyly and multiple origins of taxa. An example using metazoan ribosomal data is presented.