Workshop 4: Inference in Stochastic Models of Sequence Evolution

(February 22,2010 - February 26,2010 )

Organizers


Ian Holmes
Computational & Genomic Biology, UC Berkeley
Gerton Lunter
Dept. of Physiology Anatomy & Genetics, University of Oxford

The past decade has witnessed the transition of computational biology, population genetics, and evolutionary biology, from relatively data-sparse and theory-driven subjects, into highly empirical and data-driven disciplines. The continuing data-explosion has meant that descriptive studies have tended to outpace more in-depth theory development and statistical modeling. Nevertheless, in the past decade our understanding of genome and population biology and evolution has increased dramatically, and combined with the availability of data, it seems that the time is ripe to reap the benefits of these developments, by giving a new impetus to statistical modeling and theory development in computational biology, broadly defined.

In this workshop, we aim to bring together leading experts on genome biology and evolution, with an interest in quantitative modeling. We hope to create an interesting mix of, on the one hand, researchers whose main focus is on biology or evolution, with researchers who are primarily interested in the modeling aspects of these biological problems, from a mathematical, statistical, or algorithmic perspective.

The program is organized around the following five interest areas:

  • Viruses and pathogens
  • Paleogenetics and reconstruction
  • Genome and pathway evolution
  • Human diversity and populations
  • Comparative genomics

The workshop will feature about 20 short talks spread over 5 days, with plenty of time in between. In these slots it will be possible (and strongly encouraged) to organize informal breakout sessions in smaller groups, to discuss topical problems in more depth.

Accepted Speakers

Kees Albers
Dept Haematology, University of Cambridge
Ana Arribas-Gil
Departamento de Estadística, Universidad Carlos III de Madrid
Stephen Chamberlain
FfAME, Foundation for Applied Molecular Evolution
Graham Coop
Department of Evolution and Ecology, University of California, Davis
Daniel Falush
Department of Microbiology, Environmental Research Institute
Cedric Feschotte
Department of Biology, The University of Texas at Arlington
Robin Gutell
Center for Computational Biology and Bioinformatics, The University of Texas at Austin
Matthew Hahn
School of informatics & Computing, Indiana University
Asger Hobolth
Bioinformatics Research Center (BiRC), Aarhus University
Anna Kedzierska
Applied Mathematics, Politechnical Univ. of Catalonia
Carolin Kosiol
Group Leader in Bioinformatics, Institut für Populationsgenetik
David Liberles
Department of Molecular Biology, University of Wyoming
Michael Lässig
Institute for Theoretical Physics, University of Cologne
Thomas Mailund
Bioinformatics Research Center, Aarhus University
Katie Pollard
Gladstone Institutes, UCSF
Mikkel Schierup
Bioinformatics Research Center, University of Aarhus
Adam Siepel
Biological Statistics & Computational Biology, Cornell University
Marc Suchard
Departments of Biomathematics and Human Genetics, UCLA School of Public Health
Monday, February 22, 2010
Time Session
01:00 PM
02:00 PM
Graham Coop - Parallel mutations and partial sweeps
Parallel mutations and partial sweeps
02:30 PM
03:15 PM
Thomas Mailund - A coalescent process Markov in time and space
A coalescent process Markov in time and space
03:30 PM
04:30 PM
Mikkel Schierup - Coalescent HMMs for inference of population genetics processes in ancestral species
Coalescent HMMs for inference of population genetics processes in ancestral species
Tuesday, February 23, 2010
Time Session
09:30 AM
10:15 AM
Asger Hobolth - Efficient calculation of various summary statistics for end-point conditioned continuous-time Markov chains
Continuous-time Markov chains are a key modelling tool in sequence evolution. One major application is estimating the substitution rates, but detailed knowledge about the substitution process itself can also be of interest. I will discuss various approaches for calculating summary statistics for continuous-time Markov chains.
10:15 AM
11:00 AM
Michael Lässig - Molecular Fitness Land- and Seascapes
Molecular Fitness Land- and Seascapes
11:15 AM
12:00 PM
Carolin Kosiol - Using empirical codon models to understand the patterns and pressures of natural variation reflected in genomic re-sequencing data
In the past, two kinds of Markov models have been considered to describe protein sequence evolution. Codon-level models have been mechanistic, with a small number of parameters designed to take into account features such as transition-transversion bias, codon frequency bias and synonymous-nonsynonymous amino acid substitution bias. Amino acid models have been empirical, attempting to summarize the replacement patterns observed in large quantities of data and not explicitly considering the distinct factors that shape protein evolution.

We have estimated the first empirical codon model using Maximum Likelihood techniques. Our results show that modelling the evolutionary process is improved by allowing for single, double and triple nucleotide changes; the affiliation between DNA triplets and the amino acid they encode is a main factor driving evolution; and the nonsynonymous-synonymous rate ratio is a suitable measure to classify substitution patterns observed for different proteins. However, are the double and triple changes instantaneous? We aim to take advantage of newly available re-sequencing data to further improve our models and understanding of the evolution of protein coding sequences. Specifically, we estimate empirical codon substitution models from re-sequencing data of several Drosophila species.
01:00 PM
02:00 PM
Ana Arribas-Gil - Parameter estimation in models for sequence alignment
Models for pairwise alignment based on the TKF (Thorne, Kishino and Felsenstein 1991) indel process fit into the pair-Hidden Markov Model (pair-HMM). Observations in a pair-HMM are formed by the couple of sequences to be aligned and the hidden alignment is a Markov chain. Many efficient algorithms have been developed to estimate alignments and evolution parameters in this context. From a theoretical point of view, it is also interesting to investigate the statistical properties of the estimators computed by these algorithms. In particular, we will discuss consistency of maximum likelihood and Bayesian estimators. In the context of multiple alignment, the classical indel evolution process for the sequences (TKF) provides a complex hidden variable model for the alignment in which the phylogenetic relationships between the sequences must be taken into account. We provide a theoretical framework for this model and study, as for the pairwise alignment, the consistency of estimators.
02:30 PM
03:15 PM
Kees Albers - A Bayesian method for genome-wide identification of small insertions and deletions
In my talk I will describe how we compiled a genome-wide set of over a million insertions and deletions ranging from 1 to 50 nucleotides in the 1000 Genomes Project low coverage data set using the novel Bayesian method Dindel. The main idea is to consider candidate indels and other sequence variants obtained from different approaches such as read-mappers and assembly-based methods, and then evaluate support for these candidates by performing a Bayesian gapped realignment of reads to a set of haplotypes potentially segregating in the population. With a Bayesian EM algorithm we estimate allele frequency and the posterior probability that a non-reference indel variant segregates in the population.

The method has a number of features that distinguishes it from currently available methods. First, we observed that the rate of indels due to sequencing errors is as high as 1% in long homonucleotide runs, and that ~10 % of the called indels occurred in these. Therefore we explicitly account for these context-specific error rates in the realignment. Second, we use read-mapping quality as a prior probability that a read should align to a potential haplotype to avoid spurious calls.

The method successfully identified all capillary-validated indels in a 3.8 kb region from a candidate gene study in 96 individuals without false-positives on artificially downsampled coverage ranging from 10X to 100X. We applied the method to the Illumina data of the CEU, YRI and JPT/CHB populations in the pilot 1 of the 1000 genomes project. I will describe the properties of the called indels and how they are supported by LD structure in the population.
03:15 PM
04:00 PM
Abdoulaye Diallo - Interence of Insertions and Deletions for Ancestral Genome Reconstruction and Phylogenetics Analyses
Interence of Insertions and Deletions for Ancestral Genome Reconstruction and Phylogenetics Analyses
Wednesday, February 24, 2010
Time Session
09:30 AM
10:30 AM
Matthew Hahn - Models for Gene Gain, Gene Loss, and Gene Movement
Models for Gene Gain, Gene Loss, and Gene Movement
11:00 AM
12:00 PM
Marc Suchard - Many-Core Algorithms for Statistical Phylogenetics
Massive numerical integration plagues the statistical inference of partially observed stochastic processes. An important biological example entertains partially observed continuous-time Markov chains (CTMCs) to model molecular sequence evolution. Joint inference of phylogenetic trees and codon-based substitution models of sequence evolution remains computationally impractical. Parallelizing data likelihood calculations is an obvious strategy; however, across a cluster-computer, this scales with the total number of processing cores, incurring considerable cost to achieve reasonable run-time.

To solve this problem, I describe many-core computing algorithms that harness inexpensive graphics processing units (GPUs) for calculation of the likelihood under CTMC models of evolution. High-end GPUs containing hundreds of cores and are low-cost. These novel algorithms are particularly efficient for large state-spaces, including codon models, and large data sets, such as full genome alignments where we demonstrate up to 150-fold speed-up. I conclude with a discussion of the future of many-core computing in statistics and touch upon recent experiences with massively large and high-dimensional mixture models.
01:00 PM
02:00 PM
Cedric Feschotte - Transposable elements: germline invaders with a lasting impact on genome evolution
I will provide an overview of our studies of the evolutionary dynamics of transposable elements and their impact on eukaryotic evolution. The emphasis will be on class 2 or DNA transposons in mammalian genomes. Genomic analyses reveal that a vast diversity of DNA transposons exists in mammalian genomes and that large cohorts of elements have been integrated in a lineage-specific fashion, generating extensive genomic variation among and within the major lineages of placental mammals. The evolutionary dynamics of DNA transposons in mammals is characterized by repeated episodes of horizontal transfer across widely diverged species, a mode of transmission that appears to be facilitated by blood-borne invertebrate parasites. Through several examples, I will illustrate how DNA transposons have been a recurrent source of genetic innovation throughout mammalian evolution, contributing to the shuffling of pre-existing genes and to the emergence of new genes and concomitantly to the assembly of lineage-specific regulatory circuits.
02:30 PM
03:30 PM
Daniel Falush - Inference of population structure using dense genotype data
Inference of population structure using dense genotype data
Thursday, February 25, 2010
Time Session
09:30 AM
10:30 AM
Anna Kedzierska - Model fitting for mixture models
Phylogenetic data arising on different tree topologies might be mixed on a given alignment. Models taking into it account usually contain a large number of parameters and the usual tests for model fitting cannot deal with them. Here we discuss an approach for model fitting of algebraic models on m-tree mixtures. This is joint work in progress with Marta Casanellas and Jesus Fernandez-Sanchez
11:00 AM
12:00 PM
Adam Siepel - Estimation of ancestral population divergence times and effective population sizes from complete human genome sequences
Complete genome sequences are now available for individuals representing several distinct human populations. The primary motivation for collecting these sequences has been to characterize human diversity, facilitate disease association studies, and pave the way for an era of "personalized medicine". However, these data also contain valuable information about human evolution. Here I will describe an effort currently in progress to estimate evolutionary parameters such as the times at which major population groups diverged and the effective sizes of ancestral human populations, based on sequence data for two West Africans, two East Asians, two individuals of European descent, and a Khoisan hunter gatherer from the Kalahari Desert in Southern Africa. This work involves an interesting mixture of traditional phylogenetics and population genetics, and also requires a number of challenging technical issues -- concerning alignment and sequencing error, and genotype estimation -- to be addressed. I will describe our statistical model, which is derived from the MCMCCOAL model by Rannala and Yang, and our Markov chain Monte Carlo methods for inference. I will also present preliminary results from our analysis and discuss their relationship with what is currently known about early human evolution.
01:00 PM
02:00 PM
Katie Pollard - Accelerated and biased nucleotide evolution in the human genome
Comparative genomics provides an opportunity to investigate the evolutionary forces that have shaped the human genome. I will first describe phyloP, a new statistical method for identifying potentially lineage-specific shifts in the rate of nucleotide evolution. We applied this approach in a genome-wide scan for evolutionarily conserved sequences that were extensively changed in the last ~6 million years since divergence of humans from our common ancestor with chimpanzee. Many of these Human Accelerated Regions (HARs) appear to be regulatory regions controlling important developmental transcription factors, with human-specific substitutions changing the predicted ability for particular factors to bind. Interestingly, the HARs show a striking substitution bias resulting from preferential fixation of GC alleles. Fast evolving protein-coding exons and other rapidly evolving regions of the human genome have a similar pattern of GC fixation bias. This bias is highly correlated with recombination rates, suggesting the involvement of neutral evolutionary forces such as biased gene conversion (BGC). To disentangle the effects of BGC and positive selection, we implemented an extension of phyloP that enables tests for lineage-specific changes in the rate and/or pattern of substitutions. Using this approach, we were able to approximate the percentages of mammalian conserved elements that have been altered by selective versus neutral evolutionary forces in humans.
02:30 PM
03:30 PM
Matthew Rasmussen - Bayesian Gene-tree Reconstruction and Learning in Phylogenomics
Advances in genome sequencing have enabled the study of evolution across both genomes and large clades of species, and has been especially useful for studying gene families as they expand and contract over evolutionary time by gene duplication and loss events. Here, we present a new approach for the reconstruction of gene-tree phylogenies that models simultaneously gene and species evolution, which enables significantly increased reconstruction accuracy by incorporating genome-wide information. We introduce SPIDIR v2, a Bayesian method for gene tree reconstruction that incorporates within a unified framework models for gene duplication and loss, gene- and species-specific rate variation, and sequence substitution. The method includes an Expectation-Maximization method for inferring the distribution of species-specific and gene-specific rates across unambiguous orthologs, which we then use as priors in the reconstruction of any family. We have also developed a novel fast search algorithm that uses the Birth-Death process to more efficiently explore tree space leading to faster and more accurate recovery of a Maximum a posteriori (MAP) gene tree. We have applied our method in two clades of fully-sequenced genomes consisting of 12 Drosophila and 16 fungal species as well as simulated data. We use these extensive benchmarks to study the overall and branch-level accuracy for SPIDIR and many other phylogenetic methods. We find that SPIDIR achieves dramatic improvements in reconstruction accuracy over both traditional phylogenetic methods (Maximum Parsimony, Maximum Likelihood, and Bayesian methods) as well as other species-aware methods (SYNERGY, PRIME-GSR) in both real and simulated datasets.

Work done in collaboration with Manolis Kelli.
Friday, February 26, 2010
Time Session
09:30 AM
10:30 AM
Robin Gutell - Independence and dependence of the nucleotide positions in the Evolution of ribosomal RNA structure
Covariation analysis has been utilized to accurately predict RNA structure since the first set of tRNA sequences were determined in the early 1960's, 5S rRNA in the mid 1970's, 16S and 23S rRNA in the early 1980's, and many other RNA molecules within the past 30 years. The molecular biologists did this analysis with the intent of determining a higher-order structure that was similar to a set of sequences within any one RNA family. This fundamental assumption that RNA (and protein) sequences that perform similar functions and members of the same macromolecular family will have the same secondary and other higher-order structure has been substantiated many times.

With significantly larger collections of RNA sequences and the development of a novel database system that provides us with the foundation to perform more sophisticated comparative analysis, we are now identifying more positions within RNA sequences that are dependent on one another during their evolution. Those positions that are most dependent, or have the strongest Covariation, are always base paired in the RNA secondary structure models and high-resolution crystal structures. Other positions in the sequence form a base pair have weaker but still significant Covariation. We are also identifying positions with weak Covariation (less evolutionary dependence on one another) that do not form a base pair but instead are close together in three-dimensional structures and are usually part of a unique structural motif. And last, we have identified many tertiary base pairs that do not have a typical nucleotide Covariation between the two positions that base pair with one another. In total, this analysis is revealing that the evolution of RNA structure contains many positions that are dependent on one another. An evolutionary biologist might be interested in determining a network of fitness functions for the same set of data.

Work done in collaboration with Lei Shang, Weijia Xu, and Stuart Ozer.
11:00 AM
12:00 PM
David Liberles - Models for Amino Acid Substitution and Gene Duplication With Roots in Molecular and Evolutionary Processes
In Darwinian evolution, mutations occur approximately at random in a gene, turned into amino acid mutations by the genetic code. Some mutations are fixed to become substitutions and some are eliminated from the population. Partitioning pairs of closely related species with complete genome sequences by population size, we look at the PAM matrices generated for these partitions and compare the substitution patterns between species. A population genetic model is generated that relates the relative fixation probabilities of different types of mutations to the selective pressure and population size. Parameterizations of the average and distribution of selective pressures for different amino acid substitution types in different population size comparisons are generated using a Bayesian framework. We find that partitions in population size as well as in substitution type are required to explain the substitution data. Mechanistic explanations of this will be discussed.

In a second line of work, models for gene duplication will be discussed, with potential extension to the Gene tree/species tree reconciliation problem. This second trajectory involves an extension of birth death models based upon an exponential distribution using a generalization based upon a Weibull distribution. Distinct parameterizations of the Weibull distribution are expected to be consistent with the neofunctionalization, subfunctionalization, and dosage compensation processes. Ongoing work involves the extension of the Weibull-based models to the gene tree/species tree reconciliation approach using a Bayesian framework, with potential utility in additionally identifying underlying evolutionary processes.
01:00 PM
02:00 PM
Stephen Chamberlain - Extracting useful information from protein sequences

The application of mathematical, statistical, and algorithmic perspectives to problems in genome and population biology, physiology, and evolution yields quantitative models of the "spherical cow" variety. Even in their most sophisticated forms, such models cannot capture the individuality of protein sequences as proxies for an underlying chemical reality, where our well-developed understanding of molecular reactivity strongly denies the notion that molecular behavior is the sum of the behavior of the parts of a molecule. This need not be defeating, however. The details of how proteins divergently evolve under functional constraints differ from how quantitative models expect them to evolve holds a signal of form and function. Thus, functional information can be extracted from that difference, providing a value for quantitative models, even if their ability to "inch towards reality" is imperfect. Already, those who exploit those differences are able to predict the folded structure of proteins, identify functional residues and interaction sites, detect errors in gene finding and annotation, identify protein pathways, guide protein engineering, and correlate function of individual proteins with organismic physiology and ecology, all from a set of homologous sequences and a reasonable quantitative model. Indeed, quantitative models should not be exceptionally perfect for them to be most useful. This workshop contribution will describe how mis-predictions of quantitative models can be placed within evolutionary, chemical, and organismic contexts, to extract useful information from protein sequences.

Name Affiliation
Albers, Kees caa@sanger.ac.uk Dept Haematology, University of Cambridge
Arribas-Gil, Ana aarribas@est-econ.uc3m.es Departamento de Estadística, Universidad Carlos III de Madrid
Becquet, Celine celine.becquet@gmail.com Institute for Human Genetics, UCSF
Boussau, Bastien boussau@berkeley.edu Department of Integrative Biology, UC Berkeley
Bundschuh, Ralf bundschuh@mbi.osu.edu Department of Physics, Ohio State University
Chamberlain, Stephen Romaine FfAME, Foundation for Applied Molecular Evolution
Coop, Graham gcoop79@gmail.com Department of Evolution and Ecology, University of California, Davis
Crow, Karen crow@sfsu.edu Biology, San Francisco State University
Diallo, Abdoulaye diallo.abdoulaye@uqam.ca Computer science department, Université du Québec in Montreal
Falush, Daniel d.falush@ucc.ie Department of Microbiology, Environmental Research Institute
Fatakia, Sarosh N. fatakias@mail.nih.gov Laboratory of Biological Modeling, NIDDK/NIH
Feschotte , Cedric cedric@uta.edu Department of Biology, The University of Texas at Arlington
Gao, Kun kgao@oist.jp Physics and Biology Unit, Okinawa Institute of Science and Technology
Gutell, Robin robin.gutell@gmail.com Center for Computational Biology and Bioinformatics, The University of Texas at Austin
Hahn, Matthew mwh.indiana.edu School of informatics & Computing, Indiana University
Han, Mira mirhan@indiana.edu School of Informatics, Indiana University
Hobolth, Asger asger@cs.au.dk Bioinformatics Research Center (BiRC), Aarhus University
Holmes, Ian ihh@berkeley.edu Computational & Genomic Biology, UC Berkeley
Jhwueng , Dwueng Chwuan djhwueng@umail.iu.edu Mathematics, Indiana University Bloomington
Kedzierska, Anna malapoczta@yahoo.com Applied Mathematics, Politechnical Univ. of Catalonia
Kosiol, Carolin Carolin.Kosiol@vu-wien.ac.at Group Leader in Bioinformatics, Institut für Populationsgenetik
Liberles, David liberles@uwyo.edu Department of Molecular Biology, University of Wyoming
Lunter, Gerton gerton.lunter@well.ox.ac.uk Dept. of Physiology Anatomy & Genetics, University of Oxford
Lssig, Michael mlaessig@uni-koeln.de Institute for Theoretical Physics, University of Cologne
Mailund, Thomas mailund@birc.au.dk Bioinformatics Research Center, Aarhus University
Mitra, Indranil imitra@clemson.edu Mathematical Sciences, Clemson University
Osadolor, Anthony a_osadololor@msn.com 'Ecole Polytechnique
Palacios, Julia Adela jpalacio@stat.washington.edu Department of Statistics, University of Washington
Pollard, Dan dp96@nyu.edu Biology Department, New York University
Pollard, Katie kpollard@gladstone.ucsf.edu Gladstone Institutes, UCSF
Ralph, Peter plralph@ucdavis.edu Evolution and Ecology, UC Davis
Rasmussen, Matthew rasmus@mit.edu Computer Science & Artificial Int. Lab, MIT
Redelings, Benjamin benjamin_redelings@ncsu.edu NESCent, Duke University
Riesenfeld, Samantha sriesenfeld@gmail.com Gladstone Institute for Cardiovascular Disease, UCSF
Schierup, Mikkel mheide@birc.au.dk Bioinformatics Research Center, University of Aarhus
Siepel, Adam acs4@cornell.edu Biological Statistics & Computational Biology, Cornell University
Suchard, Marc msuchard@gmail.com Departments of Biomathematics and Human Genetics, UCLA School of Public Health
Taslim, Cenny taslim.2@osu.edu Statistics/Comprehensive Cancer Center, The Ohio State University
Venkatesan, Sathish sathish.venkatesan@oist.jp Physics and Biology Unit, Okinawa Institute of Science and Technology
Welch, Lonnie welch@ohio.edu Bioinformatics Laboratory, Ohio University
A Bayesian method for genome-wide identification of small insertions and deletions
In my talk I will describe how we compiled a genome-wide set of over a million insertions and deletions ranging from 1 to 50 nucleotides in the 1000 Genomes Project low coverage data set using the novel Bayesian method Dindel. The main idea is to consider candidate indels and other sequence variants obtained from different approaches such as read-mappers and assembly-based methods, and then evaluate support for these candidates by performing a Bayesian gapped realignment of reads to a set of haplotypes potentially segregating in the population. With a Bayesian EM algorithm we estimate allele frequency and the posterior probability that a non-reference indel variant segregates in the population.

The method has a number of features that distinguishes it from currently available methods. First, we observed that the rate of indels due to sequencing errors is as high as 1% in long homonucleotide runs, and that ~10 % of the called indels occurred in these. Therefore we explicitly account for these context-specific error rates in the realignment. Second, we use read-mapping quality as a prior probability that a read should align to a potential haplotype to avoid spurious calls.

The method successfully identified all capillary-validated indels in a 3.8 kb region from a candidate gene study in 96 individuals without false-positives on artificially downsampled coverage ranging from 10X to 100X. We applied the method to the Illumina data of the CEU, YRI and JPT/CHB populations in the pilot 1 of the 1000 genomes project. I will describe the properties of the called indels and how they are supported by LD structure in the population.
Parameter estimation in models for sequence alignment
Models for pairwise alignment based on the TKF (Thorne, Kishino and Felsenstein 1991) indel process fit into the pair-Hidden Markov Model (pair-HMM). Observations in a pair-HMM are formed by the couple of sequences to be aligned and the hidden alignment is a Markov chain. Many efficient algorithms have been developed to estimate alignments and evolution parameters in this context. From a theoretical point of view, it is also interesting to investigate the statistical properties of the estimators computed by these algorithms. In particular, we will discuss consistency of maximum likelihood and Bayesian estimators. In the context of multiple alignment, the classical indel evolution process for the sequences (TKF) provides a complex hidden variable model for the alignment in which the phylogenetic relationships between the sequences must be taken into account. We provide a theoretical framework for this model and study, as for the pairwise alignment, the consistency of estimators.
Extracting useful information from protein sequences

The application of mathematical, statistical, and algorithmic perspectives to problems in genome and population biology, physiology, and evolution yields quantitative models of the "spherical cow" variety. Even in their most sophisticated forms, such models cannot capture the individuality of protein sequences as proxies for an underlying chemical reality, where our well-developed understanding of molecular reactivity strongly denies the notion that molecular behavior is the sum of the behavior of the parts of a molecule. This need not be defeating, however. The details of how proteins divergently evolve under functional constraints differ from how quantitative models expect them to evolve holds a signal of form and function. Thus, functional information can be extracted from that difference, providing a value for quantitative models, even if their ability to "inch towards reality" is imperfect. Already, those who exploit those differences are able to predict the folded structure of proteins, identify functional residues and interaction sites, detect errors in gene finding and annotation, identify protein pathways, guide protein engineering, and correlate function of individual proteins with organismic physiology and ecology, all from a set of homologous sequences and a reasonable quantitative model. Indeed, quantitative models should not be exceptionally perfect for them to be most useful. This workshop contribution will describe how mis-predictions of quantitative models can be placed within evolutionary, chemical, and organismic contexts, to extract useful information from protein sequences.

Parallel mutations and partial sweeps
Parallel mutations and partial sweeps
Interence of Insertions and Deletions for Ancestral Genome Reconstruction and Phylogenetics Analyses
Interence of Insertions and Deletions for Ancestral Genome Reconstruction and Phylogenetics Analyses
Inference of population structure using dense genotype data
Inference of population structure using dense genotype data
Transposable elements: germline invaders with a lasting impact on genome evolution
I will provide an overview of our studies of the evolutionary dynamics of transposable elements and their impact on eukaryotic evolution. The emphasis will be on class 2 or DNA transposons in mammalian genomes. Genomic analyses reveal that a vast diversity of DNA transposons exists in mammalian genomes and that large cohorts of elements have been integrated in a lineage-specific fashion, generating extensive genomic variation among and within the major lineages of placental mammals. The evolutionary dynamics of DNA transposons in mammals is characterized by repeated episodes of horizontal transfer across widely diverged species, a mode of transmission that appears to be facilitated by blood-borne invertebrate parasites. Through several examples, I will illustrate how DNA transposons have been a recurrent source of genetic innovation throughout mammalian evolution, contributing to the shuffling of pre-existing genes and to the emergence of new genes and concomitantly to the assembly of lineage-specific regulatory circuits.
Independence and dependence of the nucleotide positions in the Evolution of ribosomal RNA structure
Covariation analysis has been utilized to accurately predict RNA structure since the first set of tRNA sequences were determined in the early 1960's, 5S rRNA in the mid 1970's, 16S and 23S rRNA in the early 1980's, and many other RNA molecules within the past 30 years. The molecular biologists did this analysis with the intent of determining a higher-order structure that was similar to a set of sequences within any one RNA family. This fundamental assumption that RNA (and protein) sequences that perform similar functions and members of the same macromolecular family will have the same secondary and other higher-order structure has been substantiated many times.

With significantly larger collections of RNA sequences and the development of a novel database system that provides us with the foundation to perform more sophisticated comparative analysis, we are now identifying more positions within RNA sequences that are dependent on one another during their evolution. Those positions that are most dependent, or have the strongest Covariation, are always base paired in the RNA secondary structure models and high-resolution crystal structures. Other positions in the sequence form a base pair have weaker but still significant Covariation. We are also identifying positions with weak Covariation (less evolutionary dependence on one another) that do not form a base pair but instead are close together in three-dimensional structures and are usually part of a unique structural motif. And last, we have identified many tertiary base pairs that do not have a typical nucleotide Covariation between the two positions that base pair with one another. In total, this analysis is revealing that the evolution of RNA structure contains many positions that are dependent on one another. An evolutionary biologist might be interested in determining a network of fitness functions for the same set of data.

Work done in collaboration with Lei Shang, Weijia Xu, and Stuart Ozer.
Models for Gene Gain, Gene Loss, and Gene Movement
Models for Gene Gain, Gene Loss, and Gene Movement
Efficient calculation of various summary statistics for end-point conditioned continuous-time Markov chains
Continuous-time Markov chains are a key modelling tool in sequence evolution. One major application is estimating the substitution rates, but detailed knowledge about the substitution process itself can also be of interest. I will discuss various approaches for calculating summary statistics for continuous-time Markov chains.
Model fitting for mixture models
Phylogenetic data arising on different tree topologies might be mixed on a given alignment. Models taking into it account usually contain a large number of parameters and the usual tests for model fitting cannot deal with them. Here we discuss an approach for model fitting of algebraic models on m-tree mixtures. This is joint work in progress with Marta Casanellas and Jesus Fernandez-Sanchez
Using empirical codon models to understand the patterns and pressures of natural variation reflected in genomic re-sequencing data
In the past, two kinds of Markov models have been considered to describe protein sequence evolution. Codon-level models have been mechanistic, with a small number of parameters designed to take into account features such as transition-transversion bias, codon frequency bias and synonymous-nonsynonymous amino acid substitution bias. Amino acid models have been empirical, attempting to summarize the replacement patterns observed in large quantities of data and not explicitly considering the distinct factors that shape protein evolution.

We have estimated the first empirical codon model using Maximum Likelihood techniques. Our results show that modelling the evolutionary process is improved by allowing for single, double and triple nucleotide changes; the affiliation between DNA triplets and the amino acid they encode is a main factor driving evolution; and the nonsynonymous-synonymous rate ratio is a suitable measure to classify substitution patterns observed for different proteins. However, are the double and triple changes instantaneous? We aim to take advantage of newly available re-sequencing data to further improve our models and understanding of the evolution of protein coding sequences. Specifically, we estimate empirical codon substitution models from re-sequencing data of several Drosophila species.
Models for Amino Acid Substitution and Gene Duplication With Roots in Molecular and Evolutionary Processes
In Darwinian evolution, mutations occur approximately at random in a gene, turned into amino acid mutations by the genetic code. Some mutations are fixed to become substitutions and some are eliminated from the population. Partitioning pairs of closely related species with complete genome sequences by population size, we look at the PAM matrices generated for these partitions and compare the substitution patterns between species. A population genetic model is generated that relates the relative fixation probabilities of different types of mutations to the selective pressure and population size. Parameterizations of the average and distribution of selective pressures for different amino acid substitution types in different population size comparisons are generated using a Bayesian framework. We find that partitions in population size as well as in substitution type are required to explain the substitution data. Mechanistic explanations of this will be discussed.

In a second line of work, models for gene duplication will be discussed, with potential extension to the Gene tree/species tree reconciliation problem. This second trajectory involves an extension of birth death models based upon an exponential distribution using a generalization based upon a Weibull distribution. Distinct parameterizations of the Weibull distribution are expected to be consistent with the neofunctionalization, subfunctionalization, and dosage compensation processes. Ongoing work involves the extension of the Weibull-based models to the gene tree/species tree reconciliation approach using a Bayesian framework, with potential utility in additionally identifying underlying evolutionary processes.
Molecular Fitness Land- and Seascapes
Molecular Fitness Land- and Seascapes
A coalescent process Markov in time and space
A coalescent process Markov in time and space
Accelerated and biased nucleotide evolution in the human genome
Comparative genomics provides an opportunity to investigate the evolutionary forces that have shaped the human genome. I will first describe phyloP, a new statistical method for identifying potentially lineage-specific shifts in the rate of nucleotide evolution. We applied this approach in a genome-wide scan for evolutionarily conserved sequences that were extensively changed in the last ~6 million years since divergence of humans from our common ancestor with chimpanzee. Many of these Human Accelerated Regions (HARs) appear to be regulatory regions controlling important developmental transcription factors, with human-specific substitutions changing the predicted ability for particular factors to bind. Interestingly, the HARs show a striking substitution bias resulting from preferential fixation of GC alleles. Fast evolving protein-coding exons and other rapidly evolving regions of the human genome have a similar pattern of GC fixation bias. This bias is highly correlated with recombination rates, suggesting the involvement of neutral evolutionary forces such as biased gene conversion (BGC). To disentangle the effects of BGC and positive selection, we implemented an extension of phyloP that enables tests for lineage-specific changes in the rate and/or pattern of substitutions. Using this approach, we were able to approximate the percentages of mammalian conserved elements that have been altered by selective versus neutral evolutionary forces in humans.
Bayesian Gene-tree Reconstruction and Learning in Phylogenomics
Advances in genome sequencing have enabled the study of evolution across both genomes and large clades of species, and has been especially useful for studying gene families as they expand and contract over evolutionary time by gene duplication and loss events. Here, we present a new approach for the reconstruction of gene-tree phylogenies that models simultaneously gene and species evolution, which enables significantly increased reconstruction accuracy by incorporating genome-wide information. We introduce SPIDIR v2, a Bayesian method for gene tree reconstruction that incorporates within a unified framework models for gene duplication and loss, gene- and species-specific rate variation, and sequence substitution. The method includes an Expectation-Maximization method for inferring the distribution of species-specific and gene-specific rates across unambiguous orthologs, which we then use as priors in the reconstruction of any family. We have also developed a novel fast search algorithm that uses the Birth-Death process to more efficiently explore tree space leading to faster and more accurate recovery of a Maximum a posteriori (MAP) gene tree. We have applied our method in two clades of fully-sequenced genomes consisting of 12 Drosophila and 16 fungal species as well as simulated data. We use these extensive benchmarks to study the overall and branch-level accuracy for SPIDIR and many other phylogenetic methods. We find that SPIDIR achieves dramatic improvements in reconstruction accuracy over both traditional phylogenetic methods (Maximum Parsimony, Maximum Likelihood, and Bayesian methods) as well as other species-aware methods (SYNERGY, PRIME-GSR) in both real and simulated datasets.

Work done in collaboration with Manolis Kelli.
Coalescent HMMs for inference of population genetics processes in ancestral species
Coalescent HMMs for inference of population genetics processes in ancestral species
Estimation of ancestral population divergence times and effective population sizes from complete human genome sequences
Complete genome sequences are now available for individuals representing several distinct human populations. The primary motivation for collecting these sequences has been to characterize human diversity, facilitate disease association studies, and pave the way for an era of "personalized medicine". However, these data also contain valuable information about human evolution. Here I will describe an effort currently in progress to estimate evolutionary parameters such as the times at which major population groups diverged and the effective sizes of ancestral human populations, based on sequence data for two West Africans, two East Asians, two individuals of European descent, and a Khoisan hunter gatherer from the Kalahari Desert in Southern Africa. This work involves an interesting mixture of traditional phylogenetics and population genetics, and also requires a number of challenging technical issues -- concerning alignment and sequencing error, and genotype estimation -- to be addressed. I will describe our statistical model, which is derived from the MCMCCOAL model by Rannala and Yang, and our Markov chain Monte Carlo methods for inference. I will also present preliminary results from our analysis and discuss their relationship with what is currently known about early human evolution.
Many-Core Algorithms for Statistical Phylogenetics
Massive numerical integration plagues the statistical inference of partially observed stochastic processes. An important biological example entertains partially observed continuous-time Markov chains (CTMCs) to model molecular sequence evolution. Joint inference of phylogenetic trees and codon-based substitution models of sequence evolution remains computationally impractical. Parallelizing data likelihood calculations is an obvious strategy; however, across a cluster-computer, this scales with the total number of processing cores, incurring considerable cost to achieve reasonable run-time.

To solve this problem, I describe many-core computing algorithms that harness inexpensive graphics processing units (GPUs) for calculation of the likelihood under CTMC models of evolution. High-end GPUs containing hundreds of cores and are low-cost. These novel algorithms are particularly efficient for large state-spaces, including codon models, and large data sets, such as full genome alignments where we demonstrate up to 150-fold speed-up. I conclude with a discussion of the future of many-core computing in statistics and touch upon recent experiences with massively large and high-dimensional mixture models.
video image

Using empirical codon models to understand the patterns and pressures of natural variation reflected in genomic re-sequencing data
Carolin Kosiol In the past, two kinds of Markov models have been considered to describe protein sequence evolution. Codon-level models have been mechanistic, with a small number of parameters designed to take into account features such as transition-transversion bi

video image

Parameter estimation in models for sequence alignment
Ana Arribas-Gil Models for pairwise alignment based on the TKF (Thorne, Kishino and Felsenstein 1991) indel process fit into the pair-Hidden Markov Model (pair-HMM). Observations in a pair-HMM are formed by the couple of sequences to be aligned and the hidden alignm

video image

Many-Core Algorithms for Statistical Phylogenetics
Marc Suchard Massive numerical integration plagues the statistical inference of partially observed stochastic processes. An important biological example entertains partially observed continuous-time Markov chains (CTMCs) to model molecular sequence evolution. Joi

video image

Transposable elements: germline invaders with a lasting impact on genome evolution
Cedric Feschotte I will provide an overview of our studies of the evolutionary dynamics of transposable elements and their impact on eukaryotic evolution. The emphasis will be on class 2 or DNA transposons in mammalian genomes. Genomic analyses reveal that a vast div

video image

Inference of population structure using dense genotype data
Daniel Falush Inference of population structure using dense genotype data

video image

Model fitting for mixture models
Anna Kedzierska Phylogenetic data arising on different tree topologies might be mixed on a given alignment. Models taking into it account usually contain a large number of parameters and the usual tests for model fitting cannot deal with them. Here we discuss an app

video image

Estimation of ancestral population divergence times and effective population sizes from complete human genome sequences
Adam Siepel Complete genome sequences are now available for individuals representing several distinct human populations. The primary motivation for collecting these sequences has been to characterize human diversity, facilitate disease association studies, and p

video image

Bayesian Gene-tree Reconstruction and Learning in Phylogenomics
Matthew Rasmussen Advances in genome sequencing have enabled the study of evolution across both genomes and large clades of species, and has been especially useful for studying gene families as they expand and contract over evolutionary time by gene duplication and lo

video image

Models for Amino Acid Substitution and Gene Duplication With Roots in Molecular and Evolutionary Processes
David Liberles In Darwinian evolution, mutations occur approximately at random in a gene, turned into amino acid mutations by the genetic code. Some mutations are fixed to become substitutions and some are eliminated from the population. Partitioning pairs of close

video image

Parallel mutations and partial sweeps
Graham Coop Parallel mutations and partial sweeps

video image

A coalescent process Markov in time and space
Thomas Mailund A coalescent process Markov in time and space