Workshop 1: Phylogeography and Phylogenetics

(November 26, 2005 - November 30, 2005)

Organizers


Michael Hickerson
Integrative Biology, University of California, Berkeley
Craig Moritz
Integrative Biology, University of California, Berkeley
Dennis Pearl
Department of Statistics, The Ohio State University

The workshop in phylogeography and phylogenetics will focus on the maturation of quantitative techniques that needs to occur in these fields. Analytical development is a challenge for researchers seeking clear and unambiguous inferences because both fields use complicated, multiparameter models. A given pattern of genetic diversity among species or populations can usually be produced, and therefore explained, by several different scenarios. Maturation of phylogenetic methodologies will be critical if we hope to study such things as the tree of life, the links between phenotypic and historical evolution, ancestral character state reconstruction, viral evolution, and the evolution of regulation in protein expression. Likewise, solving the analytical and computational challenges of phylogeographic inference will be critical for studying dispersal distances, mating systems, sex-biased dispersal, pathogen history, speciation, selection, local adaptation, hybridization, community history, food web stability, the origin of human pathogens, and the evolutionary history of humans.

Accepted Speakers

Elizabeth Allman
Department of Mathematics & Statistics, University of Southern Maine
Stuart Baird
Campus International de Baillarguet, Centre de Biologie et de Gestion des Populations
Mark Beaumont
School of Animal & Microbial Sciences, University of Reading
Chuck Cannon
Department of Biological Sciences, Texas Tech University
Robert Griffiths
Department of Statistics, University of Oxford
Susan Holmes
Department of Statistics, Stanford University
John Huelsenbeck
Division of Biological Sciences, University of California, San Diego
Lacey Knowles
Department of Ecology & Evolutionary Biology, University of Michigan
Dennis Pearl
Department of Statistics, The Ohio State University
Antonis Rokas
HHMI and Laboratory of Molecular Biology, University of Wisconsin
Marc Suchard
Biomathematics and Human Genetics, University of California, Los Angeles
Tandy Warnow
Department of Computer Sciences, University of Texas
Name Email Affiliation
Allman, Elizabeth eallman@maine.edu Department of Mathematics & Statistics, University of Southern Maine
Baird, Stuart stuart@holyrood.ed.ac.uk Campus International de Baillarguet, Centre de Biologie et de Gestion des Populations
Barrett, Craig barrett.586@osu.edu Evolution, Ecology, and Organismal Biology, The Ohio State University
Beaumont, Mark m.a.beaumont@reading.ac.uk School of Animal & Microbial Sciences, University of Reading
Beerli, Peter beerli@mac.com Computational Evolutionary Biology Group, Florida State University
Belfiore, Natalia nmb@berkeley.edu Museum of Vertebrate Zoology, University of California, Berkeley
Best, Janet jbest@mbi.osu.edu Mathematics, The Ohio State University
Brandley, Matthew brandley@berkeley.edu Museum of Vertebrate Zoology, University of California, Berkeley
Calvino, Carolina ccalvino@life.uiuc.edu Department of Plant Biology, University of California, Berkeley
Cannon, Chuck chuck.cannon@ttu.edu Department of Biological Sciences, Texas Tech University
Carnaval, Ana Carolina carnaval@berkeley.edu Museum of Vertebrate Zoology, University of California, Berkeley
Carstens, Bryan bcarsten@umich.edu Ecology and Evolutionary Biology, University of Michigan
Corey, Sarah Jean corey.14@osu.edu Evolution, Ecology, and Organismal Biology, The Ohio State University
Deckelman, Steven deckelmans@uwstout.edu Mathematical Biosciences Institute, The Ohio State University
Degnan, James jdegnan@hsph.harvard.edu Department of Biostatistics, Harvard University
Edwards, Scott sedwards@fas.harvard.edu Organismic & Evolutionary Biology, Harvard University
Enciso, German German_Enciso@hms.harvard.edu Mathematics Department, University of California, Irvine
Farrington, Heather farrinhl@email.uc.edu Department of Biological Sciences, University of Cincinnati
Fuchs de Jesus, Flavia Genetica & Evolucao, State University of Campinas (UNICAMP)
Galovich, Jennifer jgalovich@csbsju.edu Mathematics and Statistics, St. John's University
Goel, Pranay goelpra@helix.nih.gov NIDDK, Indian Institute of Science Education and Research
Grajdeanu, Paula pgrajdeanu@mbi.osu.edu Mathematics, Shenandoah University
Griffiths, Robert griff@stats.ox.ac.uk Department of Statistics, University of Oxford
Hickerson, Michael mhick@socrates.Berkeley.edu Integrative Biology, University of California, Berkeley
Holmes, Susan susan@stat.stanford.edu Department of Statistics, Stanford University
Huelsenbeck, John johnh@biomail.ucsd.edu Division of Biological Sciences, University of California, San Diego
Jolles, Diana jolles.1@osu.edu Evolution, Ecology, and Organismal Biology, The Ohio State University
Just, Winfried just@math.ohio.edu Mathematical Biosciences Institute, The Ohio State University
Juswara, Lina juswara.1@osu.edu Evolution, Ecology, and Organismal Biology, The Ohio State University
King, Nicole nking@uclink.berkeley.edu Department of Molecular & Cell Biology, University of California, Berkeley
Knowles, Lacey knowlesl@umich.edu Department of Ecology & Evolutionary Biology, University of Michigan
Kuhner, Mary mkkuhner@genetics.washington.edu Department of Genome Sciences, University of Washington
Larget, Bret larget@stat.wisc.edu Department of Botany, University of Wisconsin
Lim, Sookkyung limsk@math.uc.edu Department of Mathematical Sciences, University of Cincinnati
Liu, Liang LiuLiang@stat.ohio-state.edu Department of Statistics, The Ohio State University
Marschall, Elizabeth marschall.2@osu.edu Evolution, Ecology, and Organismal Biology, The Ohio State University
Martin, Floyd martin.687@osu.edu Evolution, Ecology, and Organismal Biology, The Ohio State University
Mateiu, Ligia Medical Genetics, University of Alberta
McLachlan, Jason jsmclach@fas.harvard.edu Harvard Forest, Harvard University
Moritz, Craig cmoritz@socrates.berkeley.edu Integrative Biology, University of California, Berkeley
Niedzwiecki, John niedzwjh@UCMAIL.UC.EDU Department of Biological Sciences, University of Cincinnati
Oakley, Todd oakley@lifesci.ucsb.edu Ecology, Evolution, & Marine Biology, University of California, Santa Barbara
Pan, Xueliang (Jeff) xpan@stat.ohio-state.edu Department of Statistics, The Ohio State University
Pearl, Dennis dkp@stat.ohio-state.edu Department of Statistics, The Ohio State University
Petren, Kenneth ken.petren@uc.edu Department of Biological Sciences, University of Cincinnati
Pol, Diego dpol@mbi.osu.edu Independent Researcher, Museo Paleontologico E. Feruglio
Pollack, D. Dennis pollack.1@osu.edu Molecular Virology, Immunology, & Medical Genetics, The Ohio State University
Porter, Mason mason@caltech.edu Department of Physics, California Institute of Technology
Randle, Chris randle@ku.edu Ecology and Evolutionary Biology, University of Kansas
Rhodes, John j.rhodes@alaska.edu Department of Mathematics, Bates College
Rokas, Antonis arokas@wisc.edu HHMI and Laboratory of Molecular Biology, University of Wisconsin
Rosenberg, Noah noahr@cmb.usc.edu College of Biological Sciences, University of Southern California
Russell, Amy alr2@email.arizona.edu Arizona Research Laboratories, University of Arizona
Salter Kubatko, Laura salter@math.unm.edu Department of Statistics, University of New Mexico
Schugart, Richard richard.schugart@wku.edu Department of Mathematics, Western Kentucky University
Srinivasan, Parthasarathy psrinivasan@mbi.osu.edu Department of Mathematics, Cleveland State University
Stahl, Eli estahl@umassd.edu Department of Biology, University of Massachusetts
Steel, Mike Math and Statistics, University of Canterbury
Stigler, Brandilyn bstigler@mbi.osu.edu Department of Mathematics, Southern Methodist University
Stubna, Michael stubna@mbi.osu.edu Engineering Team Leader, Pulsar Informatics
Suchard, Marc msuchard@gmail.com Biomathematics and Human Genetics, University of California, Los Angeles
Tay, David tay.9@osu.edu Evolution, Ecology, and Organismal Biology, The Ohio State University
Taylor, Amelia amelia.taylor@coloradocollege.edu Department of Math, St. Olaf College
Terman, David terman@math.ohio-state.edu Mathematics Department, The Ohio State University
Thornton, Kevin kt234@cornell.edu Molecular Biology and Genetics, Cornell University
Tian, Jianjun (Paul) tianjj@mbi.osu.edu Mathematical Biosciences Institute, The Ohio State University
Vakalis, Ignatios ivakalis@capital.edu Mathematics & Computer Sc, Capital University
Wang, Zailong zlwang@mbi.osu.edu Integrated Information Sciences, Novartis
Warnow, Tandy tandy@cs.utexas.edu Department of Computer Sciences, University of Texas
Webb, Campbell campbell.webb@yale.edu Ecology and Evolutionary Biology, Yale University
Williams, Joseph williams.1020@osu.edu Evolution, Ecology, and Organismal Biology, The Ohio State University
Yoder, Anne anne.yoder@duke.edu Ecology and Evolutionary Biology, Duke University
Yoshida, Ruriko ruriko.yoshida@uky.edu Department of Mathematics, Duke University
Zhou, Jin jzhou@mbi.osu.edu Department of Mathematics, Northern Michigan University
Progress and Potential for Phylogenetic Invariants

This talk will highlight recent developments in the study of phylogenetic invariants. In particular, assuming a general model of the mutation process of orthologous sequences, 'most' polynomial relationships in expected pattern frequencies can be explicitly constructed. These constructions are tied to specific topological features (edges and nodes) of a phylogenetic tree. This new understanding of invariants leads to theoretical results on the identifiability of the tree topology for models with increased biological realism, such as the covarion model and certain mixture models.

A Lattice Implementation of Wright's Neighborhood Model

Wright's neighborhood size can be seen as a statement about the probability of coalescence of lineages integrated over space. Moving backwards in time, neighborhood size increases and the probability of coalescence decreases. As such, Wright's neighborhood model could potentially be used for coalescent inference over structured populations parameterised by parent-offspring dispersal and population density. This is in contrast to models parameterised by the size of panmictic units and the migration vectors between them. If we wish to use coalescent inference over a study system, and lack prior knowledge of the scale at which panmixis can be assumed, Wright's neighborhood model seems appropriate. Here I show how Wright's neighborhood model can be implemented on a lattice, allowing sampling of the properties of genealogies in space and time for a set of georeferenced field observations. I contrast two sampling approaches that allow Bayesian inference over these genealogies and discuss the implications for inference over recent timescales (gene flow, population structure) and deeper timescales (phylogeographic process).

Joint Determination of Topology, Time of Splitting and Immigration in Population Trees

I describe a Bayesian method that uses summary statistics measured from microsatellite loci to make inferences about demographic parameters in 2- and 3-population models. Preliminary results with an infinite sites model of sequence evolution are also described. The method can be used to infer effective sizes of current and ancestral populations, immigration rates, splitting times and tree topology (in the 3-population case). A novel method for model selection is introduced. Comparisons are made with the IM program of Hey and Nielsen, and a data set of 19 microsatellite loci from Channel Island foxes is analysed. It is concluded that the method is competitive with IM on 2-population data. There appears to be little scope for accurate inference with microsatellite data unless very large numbers of loci are used.

Applying Phylogenies to Practical Problems in SE Asia: data, methods, and speculation

The results of human action, from the scale of the climate to the niche, will dominate our evolutionary future. Meaningful ways of intersecting theoretical and empirical studies with conservation and management of our natural resources are important. Using tropical tree communities as an example, the application of phylogenetic and biogeographic evidence to mitigating some of this change will be discussed. Emergent questions, with implications for the utility of these data, will be explored. These questions have inspired the development of a DNA microarray-based technique for gathering genomic samples of neutral variation in previously unstudied organisms. The approach, called Hyperdispersed Illiterate Primer Screening (HIPS), will be particularly effective for developing a database of genomic signatures that can allow phylogenetically scalable queries, virtual subtractive hybridizations, and the rapid development of simple downstream bioassays for screening large numbers of individuals.

From Gene Trees to Species Trees: empirical data sets from birds and priorities for new implementations of theory

The problem of inferring trees of closely related species from multilocus data sets suffers from a lack of robust implementations of existing theory and from a lack of empirical data to help set priorities for new directions. We have been accumulating multilocus DNA sequence data sets of anonymous, noncoding regions of Australian songbird genomes to examine the historical demography of speciation and population structure. Using two data sets from northern Australia, one from grassfinches (Poephila) and one from treecreepers (Climacteris), I illustrate the potential of anonymous loci to provide higher resolving power for current and ancestral population parameters than mitochondrial DNA, and to infer relationships among closely related species when gene trees conflict with one another. However, our studies also pinpoint several gaps in existing software packages that prevent full exploration of the data. In particular, our data reveal the need for an integrated approach to estimating the sequence of speciation events (species phylogeny) from multilocus data sets that does not require a priori assumptions. In addition, the data sets reveal a need for analyses of gene flow that can encompass more than one species even when there is no current gene flow between those species. These studies, like those in Drosophila and humans, show that even phylogeographic analyses focused on single species will in general require analysis of sequence data from multiple species, especially those that continue to share residual polymorphisms with the focal species, and will require implementations of theory that can accommodate multispecies data sets.

Ancestral Inference from Gene Trees

A unique gene tree describing the mutation history of a sample of DNA sequences can be constructed as a perfect phylogeny under an assumption of non-recurrent point mutations. An empirical distribution of the stochastic history of the gene tree, conditional on its topology, can be found by an advanced simulation technique of importance sampling on coalescent histories. The distribution of the time to the most recent common ancestor and of the ages of mutations in the gene tree, conditional on its topology, can be found from this empirical distribution. This talk will present examples of ancestral inference from gene trees and from microsatellite data, and will sketch the importance sampling technique.
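
As a small illustration of the perfect-phylogeny condition mentioned above, the sketch below applies the standard four-gamete test to binary haplotypes: under non-recurrent point mutations, no pair of sites can show all four combinations 00, 01, 10 and 11. The function name and the string encoding of haplotypes are assumptions made for this example; it is not the speaker's software and does not perform the importance sampling step.

```python
from itertools import combinations


def passes_four_gamete_test(haplotypes):
    """Return True if no pair of sites shows all four gametes.

    `haplotypes` is a list of equal-length strings over '0' and '1'
    (a hypothetical encoding: one string per sampled sequence,
    one character per segregating site).
    """
    n_sites = len(haplotypes[0])
    for i, j in combinations(range(n_sites), 2):
        # Collect the distinct two-site combinations seen in the sample.
        gametes = {(h[i], h[j]) for h in haplotypes}
        if len(gametes) == 4:
            # 00, 01, 10 and 11 all present: incompatible with a
            # perfect phylogeny under non-recurrent mutation.
            return False
    return True


compatible = ["000", "010", "011", "100"]
incompatible = ["00", "01", "10", "11"]
print(passes_four_gamete_test(compatible))
print(passes_four_gamete_test(incompatible))
```

A sample that passes the test can be represented by a gene tree in which each mutation occurs exactly once; a sample that fails it requires recurrent mutation or recombination.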

Using Multivariate and Phylogenetic decompositions in the search for Drug Resistant Mutations in HIV

After conditioning out the phylogenetic information in HIV sequences, we performed multivariate studies of potential drug-resistant mutations using multidimensional scaling and correspondence analysis. We propose several approaches to the problem of correlated variables in this context.

Detecting Positive Natural Selection in Protein-coding DNA under a Dirichlet Process Prior

Most methods for detecting Darwinian natural selection at the molecular level rely on estimating the rates or numbers of nonsynonymous and synonymous changes in an alignment of protein-coding DNA sequences. In some of these methods, the nonsynonymous rate of substitution is allowed to vary across the sequence, permitting the identification of single amino-acid positions that are under positive natural selection. However, it is unclear which probability distribution should be used to describe how the nonsynonymous rate of substitution varies across the sequence. One widely used solution is to model variation in the nonsynonymous rate across the sequence as a mixture of several discrete or continuous probability distributions. Unfortunately, there is little population genetics theory to inform us of the appropriate probability distribution for among-site variation in the nonsynonymous rate of substitution. Here, we describe an approach to modeling variation in the nonsynonymous rate of substitution using a Dirichlet process mixture model. The Dirichlet process allows there to be a countably infinite number of nonsynonymous rate classes, and is very flexible in accommodating different potential distributions for the nonsynonymous rate of substitution. We implemented the model in a fully Bayesian approach, with all parameters of the model considered as random variables.
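
To give a feel for how a Dirichlet process prior leaves the number of rate classes open, the sketch below draws a partition of sites into classes using the Chinese restaurant process, a standard representation of the Dirichlet process. It samples from the prior only; the function name, the concentration value and the use of integer class labels are assumptions for illustration and do not describe the authors' implementation.

```python
import random


def crp_assignments(n_sites, alpha, rng):
    """Assign each site to a rate class under a Chinese restaurant process.

    Site i (0-based) joins existing class k with probability
    counts[k] / (i + alpha) and opens a new class with probability
    alpha / (i + alpha). `alpha` is the concentration parameter.
    """
    counts = []   # number of sites currently in each class
    labels = []   # class label chosen for each site
    for i in range(n_sites):
        u = rng.random() * (i + alpha)
        acc = 0.0
        for k, c in enumerate(counts):
            acc += c
            if u < acc:
                counts[k] += 1
                labels.append(k)
                break
        else:
            # No existing class was selected: open a new one.
            counts.append(1)
            labels.append(len(counts) - 1)
    return labels


rng = random.Random(2005)
labels = crp_assignments(n_sites=100, alpha=1.0, rng=rng)
print("number of rate classes in this draw:", len(set(labels)))
```

Larger values of the concentration parameter tend to produce more classes, which is how the prior can accommodate anything from a single shared rate to many site-specific rates.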

Inferring Species Histories Despite Incomplete Lineage Sorting

It is now well known that incomplete lineage sorting can cause serious difficulties for phylogenetic and phylogeographic inference. Yet, little attention has been paid to methods that attempt to overcome these difficulties by explicitly considering the processes that produce them. Here I explore approaches to historical inference designed to consider retention and sorting of ancestral polymorphism. I examine how the reconstructability of species (or population) histories is affected by (a) the number of loci used to estimate the phylogeny and (b) the number of individuals sampled per species (or population). Even in difficult cases with considerable incomplete lineage sorting (divergence times separated by less than 1Ne generations), accurate historical reconstructions are possible, as long as reasonable numbers of individuals and loci are sampled. Moreover, tradeoffs between sampling more loci versus more individuals shift depending on the depth of the species history under study. Taken together, these results demonstrate that gene sequences retain enough signal to achieve an accurate estimate of history despite widespread incomplete lineage sorting. Continued methodological improvements for inference near the species level require not only a statistical framework for evaluating the likelihood of particular gene trees, but also a shift to compound models that consider the molecular evolutionary process of nucleotide substitutions, as well as the population genetic processes of lineage sorting.
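
The effect of closely spaced divergences can be made concrete with the standard coalescent result for three species with one sampled lineage each: the gene tree matches the species tree with probability 1 - (2/3) exp(-T), where T is the length of the internal branch in coalescent units. The sketch below assumes a diploid population of constant effective size Ne, so that one coalescent unit is 2Ne generations; the function name and example values are hypothetical, and this is background illustration rather than the method presented in the talk.

```python
import math


def prob_gene_tree_matches_species_tree(interval_generations, ne):
    """Probability that a three-taxon gene tree matches the species tree.

    Assumes one lineage per species and a diploid ancestral population
    of constant effective size `ne`, so the internal branch length in
    coalescent units is interval_generations / (2 * ne).
    """
    t = interval_generations / (2.0 * ne)
    return 1.0 - (2.0 / 3.0) * math.exp(-t)


ne = 10_000
for multiple in (0.1, 0.5, 1.0, 2.0, 5.0):
    p = prob_gene_tree_matches_species_tree(multiple * ne, ne)
    print(f"interval = {multiple} Ne generations: P(match) = {p:.3f}")
```

When the interval shrinks to zero the probability falls to 1/3, the value expected if the three possible topologies are equally likely, which is why single-locus estimates are unreliable in this regime and sampling more loci or individuals becomes important.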

Animal Evolution and the Molecular Signature of Radiations Compressed in Time

The phylogenetic relationships among most metazoan phyla remain uncertain. Here, we obtained large numbers of gene sequences from metazoans, including key understudied taxa. Despite the amount of data and breadth of taxa analyzed, relationships among most metazoan phyla remained unresolved. In contrast, the same genes robustly resolved phylogenetic relationships within a major clade of Fungi of approximately the same age as the Metazoa. The differences in resolution within the two Kingdoms suggest that the early history of metazoans was a radiation compressed in time, in agreement with paleontological inferences. Furthermore, simulation analyses as well as studies of other radiations in deep time indicate that, given adequate sequence data, the lack of resolution in phylogenetic trees is a signature of closely spaced series of cladogenetic events.

Random Models of Speciation and Extinction, and their Relevance for Phylogeny

Random models for species formation and loss have played an important role in evolutionary biology since Yule's pioneering work in the 1920s. More recently these models have been investigated for the light they shed on both the topological properties (shape, balance, clade distribution, discrete tree reconstruction, tree rooting) and metric properties (branch length distribution, phylogenetic diversity) of phylogenies. In this talk I describe how these models are relevant for tree reconstruction and rooting, and the distribution of clade sizes, as well as the loss (and optimization) of phylogenetic diversity as taxa go extinct. The talk will include some historical survey, as well as some recent (and new) results.
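
One classical property of the Yule (pure-birth) model is that the number of taxa on one side of the root is uniformly distributed on 1, ..., n-1 for a tree with n taxa. The sketch below checks this by simulation, tracking only how many extant lineages descend from each child of the root; the function name and the sample sizes are assumptions chosen for illustration, not results from the talk.

```python
import random
from collections import Counter


def yule_root_split(n_taxa, rng):
    """Simulate the size of the left root clade under the Yule model.

    Starts with the two lineages below the root. At each step one
    extant lineage is chosen uniformly at random to speciate, so the
    left clade grows with probability left / (left + right).
    """
    left, right = 1, 1
    while left + right < n_taxa:
        if rng.random() < left / (left + right):
            left += 1
        else:
            right += 1
    return left


rng = random.Random(1920)
n_taxa, n_trees = 8, 7000
sizes = Counter(yule_root_split(n_taxa, rng) for _ in range(n_trees))
for size in range(1, n_taxa):
    print(f"left clade of size {size}: frequency {sizes[size] / n_trees:.3f}")
```

Each frequency should be close to 1/(n-1), so highly unbalanced root splits are as likely as balanced ones under this model, which is relevant when judging tree balance or candidate root positions.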

Joint Inference of Alignment and Phylogeny from Molecular Sequence Data

Genomics research is generating vast molecular sequence data ranging from single genes to whole genomes across an increasing number of species. However, a fundamental difficulty in evolutionary studies emerges as the availability of sequences expands. Phylogenetic methods that reconstruct the evolutionary tree relating the sequences traditionally condition on a single, sometimes poorly estimated, sequence alignment, where an alignment specifies which residues in the sequences derive from a common origin. This conditioning can cause bias and inappropriate inference in genomic studies, particularly when the sequences are highly diverse. For example, the early branching order of Bacteria, Archaea and Eukaryotes, the three major domains of life, is troublesome to determine.


As a solution, I describe a novel Bayesian model for simultaneously estimating alignments and the phylogenetic trees that relate the sequences. This sidesteps the bias issue inherent in sequential estimation. Joint estimation also allows one to model rate variation between sites when estimating the alignment and to use the evidence in shared insertion/deletions (indels) in the sequences to group sister species in the tree. I base this indel process on a Hidden Markov Model that makes use of affine gap penalties and considers indels of multiple residues.


I develop a Markov chain Monte Carlo (MCMC) method to sample from the posterior of the joint model, estimating the most probable alignment and tree and their support simultaneously. I describe a new MCMC transition kernel based on the Forward-Backward algorithm and a careful choice of parameter marginalization that improves our algorithm's mixing efficiency, allowing the MCMC chains to converge even when started from arbitrary alignments. Finally, my software implementation can estimate alignment uncertainty and I describe a method for summarizing this uncertainty in a single plot.

The Disk-Covering Method for Phylogenetic Tree Reconstruction

Phylogenetic trees, also known as evolutionary trees, model the evolution of biological species or genes from a common ancestor. Most computational problems associated with phylogenetic tree reconstruction are very hard (specifically, they are NP-hard, and are practically hard, as real datasets can take years of analysis, without provably optimal solutions being found). Finding ways of speeding up the solutions to these problems is of major importance to systematic biologists. Other approaches take only polynomial time and have provable performance guarantees under Markov models of evolution; however, our recent work shows that the sequence lengths that suffice for these methods to be accurate with high probability grow exponentially in the diameter of the underlying tree.


In this talk, we will describe new dataset decomposition techniques, called the Disk-Covering Methods, for phylogenetic tree reconstruction. This basic algorithmic technique uses interesting graph theory and can be used to reduce the sequence length requirement of polynomial-time methods, so that polynomial-length sequences (instead of exponential-length ones) suffice for accuracy with high probability. We also use this technique to speed up the solution of NP-hard optimization problems, such as maximum likelihood and maximum parsimony.