## Workshop 6: Recombination: Hotspots and Haplotype Structure

### Organizers

Rick Durrett
Department of Mathematics, Cornell University
Paul Fuerst
College of Biological Sciences, The Ohio State University

Understanding the nature and causes of linkage disequilibrium in the human genome is important for mapping complex disease loci through association studies. The sequencing of the human genome revealed a remarkable haplotype structure and led to the HapMap project, whose goal is to understand the patterns of DNA sequence variation. In a parallel development, recent studies have shown that much recombination occurs in hotspots. This workshop will concentrate on mathematical, statistical, and computational approaches to estimating recombination rates and determining the causes of haplotype structure.

### Accepted Speakers

Carlos Bustamante
Biostatistics & Computational Biology, Cornell University
Ranajit Chakraborty
Environmental Health, University of Cincinnati
Rick Durrett
Department of Mathematics, Cornell University
Paul Fearnhead
Department of Mathematics & Statistics, Lancaster University
Eran Halperin
International Computer Science Institute (ICSI)
Paul Joyce
Mathematics, Statistics, & Bioinformatics, University of Idaho
Rasmus Nielsen
Center for Bioinformatics
Jonathan Pritchard
Department of Human Genetics, University of Chicago
Susan Ptak
Evolutionary Genetics, Max Planck Institute for Evolutionary Anthropology
Noah Rosenberg
Molecular and Computational Biology, University of Southern California
Fengzhu Sun
Department of Mathematics, University of Southern California
Marcy Uyenoyama
Department of Biology, Duke University

### Monday, June 13, 2005

09:15 AM - 10:15 AM
Paul Fearnhead - Likelihood-based Methods for Detecting Recombination Hotspots from Population Data

Recent evidence suggests that recombination hotspots are common across the human genome. We show how (approximate) likelihood-based methods for estimating recombination rates from population data can be adapted to the problem of detecting recombination hotspots. We extend an existing method for detecting recombination hotspots, which uses likelihood curves for the recombination rate across small sub-regions of the genome. This new method appears more powerful than existing methods at detecting hotspots: simulation results suggest a power of around 60-80% with a false positive rate of 1-5%.

Analysis of Seattle SNP genes suggests that recombination hotspots are randomly distributed across the genome, with an average spacing of around one per 30-40 kb. Many genes contain more than one hotspot. There is little evidence for hotspots that occur in only one of the two populations (European American and African American).

10:45 AM - 11:45 AM
Noah Rosenberg - Population Structure and Homozygosity-based Measures of Linkage Disequilibrium

Inferences about linkage disequilibrium are often based on haplotypes estimated from genotype data. To avoid using estimated haplotypes in measuring pairwise linkage disequilibrium, it is possible to employ statistics of Ohta (1980) and Sabatti and Risch (2002) that utilize the difference between the observed proportion of double homozygotes and the double homozygosity predicted from the homozygosities of the individual loci. In this talk, I investigate some properties of homozygosity-based linkage disequilibrium statistics, paying particular attention to how the statistics are affected by population structure.
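
The quantity at the core of these statistics can be illustrated with a minimal sketch. It computes only the raw difference between the observed double-homozygote proportion and the product of the two single-locus homozygosities; the published statistics of Ohta and of Sabatti and Risch apply their own normalizations, which are not reproduced here. The function names and the genotype encoding (a pair of allele labels per individual) are assumptions made for this illustration.

```python
from typing import Sequence, Tuple

# Hypothetical encoding: one unphased genotype is a pair of allele labels.
Genotype = Tuple[str, str]


def is_homozygous(genotype: Genotype) -> bool:
    """True when both alleles of the genotype are identical."""
    return genotype[0] == genotype[1]


def homozygosity_excess(locus_a: Sequence[Genotype],
                        locus_b: Sequence[Genotype]) -> float:
    """Observed double homozygosity minus the product of single-locus homozygosities.

    Index i of both sequences must refer to the same individual.
    No haplotype phase is needed, only per-locus genotypes.
    """
    if len(locus_a) != len(locus_b) or len(locus_a) == 0:
        raise ValueError("need two non-empty genotype lists of equal length")
    n = len(locus_a)
    hom_a = sum(is_homozygous(g) for g in locus_a) / n
    hom_b = sum(is_homozygous(g) for g in locus_b) / n
    double_hom = sum(
        is_homozygous(ga) and is_homozygous(gb)
        for ga, gb in zip(locus_a, locus_b)
    ) / n
    return double_hom - hom_a * hom_b


if __name__ == "__main__":
    a = [("A", "A"), ("A", "a"), ("a", "a"), ("A", "a")]
    b = [("B", "B"), ("B", "b"), ("b", "b"), ("B", "b")]
    print(homozygosity_excess(a, b))
```

A value near zero is what independence of homozygosity at the two loci would predict; how population structure shifts this difference is the question the talk addresses.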

02:00 PM - 03:00 PM
Fengzhu Sun - Haplotype Block Partition and Tag SNP Selection and Their Applications to Association Studies

The HapMap project will generate an enormous amount of data on human genomic variation. Due to high linkage disequilibrium in many regions, a small fraction of SNPs (tag SNPs) is sufficient to capture most of the haplotype structure of the human genome. We developed a suite of dynamic algorithms for haplotype block partition and tag SNP selection to minimize the total number of tag SNPs across the region of interest or the whole genome. Our algorithms can be applied to both haplotype and genotype data, as well as to any pedigree structure. We also studied power issues in association studies related to tag SNP selection, using both simulated and real data.
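
To make the idea of a tag SNP concrete, the sketch below finds, for a single block of phased haplotypes, a smallest set of SNP positions that still distinguishes every distinct haplotype in the block. It is a brute-force illustration under simplifying assumptions (one block, haplotypes given as equal-length strings, exhaustive search), not the algorithms presented in the talk; the function name `minimal_tag_snps` is hypothetical.

```python
from itertools import combinations
from typing import Sequence, Tuple


def minimal_tag_snps(haplotypes: Sequence[str]) -> Tuple[int, ...]:
    """Return a smallest set of SNP positions that separates all distinct haplotypes.

    Each haplotype is a string with one character per SNP (for example "0101").
    Subsets are tried in order of increasing size, so the first hit is minimal.
    """
    distinct = sorted(set(haplotypes))
    if not distinct:
        raise ValueError("need at least one haplotype")
    n_snps = len(distinct[0])
    if any(len(h) != n_snps for h in distinct):
        raise ValueError("all haplotypes must cover the same SNPs")
    for size in range(n_snps + 1):
        for subset in combinations(range(n_snps), size):
            # Project every haplotype onto the candidate tag positions.
            projected = {tuple(h[i] for i in subset) for h in distinct}
            if len(projected) == len(distinct):
                return subset
    # The full SNP set always separates distinct haplotypes.
    raise AssertionError("unreachable")


if __name__ == "__main__":
    block = ["0000", "0011", "1100", "1111"]
    print(minimal_tag_snps(block))
```

In this example SNPs 0 and 1 carry the same information, as do SNPs 2 and 3, so two tags suffice for four SNPs. Exhaustive search grows exponentially with the number of SNPs, which is one reason efficient algorithms are needed at genome scale.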

03:30 PM - 04:30 PM
Susan Ptak - Fine-scale Recombination Patterns Differ between Chimpanzees and Humans

Two recent studies examined single human recombination hotspots in other primates and found no evidence for an increased rate of recombination. These findings raised the question of how conserved recombination rates are among closely related species. To address this, we estimated recombination rates from 14 Mb of linkage disequilibrium data in chimpanzees and in humans. The results suggest that recombination hotspots are not conserved between the two species and that recombination rates in larger (50 kb) genomic regions are only weakly conserved. Thus, the recombination landscape has changed dramatically between the two species.

### Tuesday, June 14, 2005

09:00 AM - 10:00 AM
Paul Joyce - Efficient Simulation Methods for a Class of Nonneutral Population Genetics Models

Many of the current methods for uncovering the genetic basis of common complex diseases in humans aim to exploit linkage disequilibrium (LD). Patterns of LD depend crucially on the shape of genealogical trees at the loci involved, so there is considerable interest in understanding how these would be affected by selection. An algorithm for exact simulation from the genealogical history of a sample, for population genetics models with general diploid selection and parent-independent mutation, was developed by Stephens and Donnelly (2003) based on the work of Slade (2000). Central to their approach is the need to calculate the constant of integration for the $K$ allele model with selection. Donnelly, Nordborg, and Joyce (DNJ 2001) developed methods for likelihood analysis under the $K$ allele model with selection. Here we present a new method for likelihood analysis that is substantially more efficient than DNJ (2001) and can be used to improve the efficiency of Stephens and Donnelly (2003). The method uses numerical analysis techniques, including fast Fourier transforms, to calculate the intractable constant of integration. The method provides a perfect simulation approach for directly drawing allele frequency samples from the distribution under selection. This research is joint work with Alan Genz, Washington State University.

10:30 AM - 11:30 AM
Rasmus Nielsen - Analysis of Ascertained SNP Data

Most available human Single Nucleotide Polymorphism (SNP) data have been obtained through a complicated process in which SNPs are first discovered in a small sample and then genotyped in a larger sample. This process fundamentally affects many properties of the data, including linkage disequilibrium, the frequency spectrum, and levels of population differentiation. It also implies that standard population genetic analyses are not applicable to the vast majority of the human SNP data. I will discuss the ascertainment process in some of the major SNP data sets available (Perlegen and HapMap), how the ascertainment process has affected the data, and how various correction methods can be implemented to allow valid population genetic inferences.
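
A small calculation shows why discovery in a small sample distorts the frequency spectrum. The sketch assumes a deliberately simple ascertainment scheme that is not taken from the talk: a site counts as discovered when both alleles appear among `panel_size` chromosomes drawn independently from a population where one allele has frequency `freq`. Real ascertainment schemes, including those behind the data sets named above, are more involved; the function name and parameters are hypothetical.

```python
def ascertainment_probability(freq: float, panel_size: int) -> float:
    """Probability that a SNP is polymorphic in a discovery panel.

    Assumes panel_size chromosomes sampled independently (binomial sampling)
    and that a SNP is discovered only if both alleles are observed.
    """
    if not 0.0 <= freq <= 1.0:
        raise ValueError("freq must lie in [0, 1]")
    if panel_size < 1:
        raise ValueError("panel_size must be at least 1")
    # One minus the chance that the panel shows only one of the two alleles.
    return 1.0 - freq ** panel_size - (1.0 - freq) ** panel_size


if __name__ == "__main__":
    for freq in (0.01, 0.05, 0.20, 0.50):
        print(freq, round(ascertainment_probability(freq, panel_size=4), 4))
```

Under this assumption, rare variants are discovered far less often than common ones, so an ascertained data set over-represents intermediate-frequency SNPs. Dividing observed counts by such a discovery probability is one simple form of correction, valid only when the assumed scheme matches how the SNPs were actually found.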

### Wednesday, June 15, 2005

09:00 AM - 10:00 AM
Jonathan Pritchard - Detecting Partial Selective Sweeps from SNP Data

The new HapMap and Perlegen data sets offer the first opportunities to scan the human genome for signatures of natural selection. One promising approach is to search for polymorphic variants that have undergone recent directional selection using patterns of long-range LD (e.g., as in Sabeti et al. 2002). In this talk I outline a new approach that extends the PAC-likelihood model of Li and Stephens (2003) in order to test for this type of signal in an approximate likelihood framework. The new test controls appropriately for local recombination rate heterogeneity, which may confound simpler approaches. I discuss applications to the genome-wide data sets.

10:30 AM - 11:30 AM
Marcy Uyenoyama - Likelihoods from Summary Statistics

I address methods for inferring population parameters from multiple summary statistics. We generated a maximum-likelihood estimate of the rate of recombination between a neutral marker locus and the target of strong balancing selection to which it shows nearly complete linkage. Recently, we developed an importance sampling (IS) approximation to the time-consuming computation of exact likelihoods on which this method relies. In a study of the demographic history of closely related Drosophila species, we found that the IS approach yielded accurate estimates under a much reduced computational burden. I end with a discussion of ongoing explorations of the effects of genomic location, including cosegregation with incompatibility genes, targets of selection, or centromeres, on rates of introgression.

02:00 PM - 03:00 PM
Eran Halperin - Estimating Haplotype Frequencies Efficiently

In this talk I will introduce a new method, HAPLOFREQ, to estimate haplotype frequencies over a short genomic region given the genotypes or haplotypes with missing data and/or sequencing errors. Our approach is based on rigorous analysis of the likelihood function, and in particular the method is guaranteed to efficiently converge to the global optimum of the likelihood function. Finally, I will discuss the relations between haplotype frequency estimation and tag SNP selection.

03:30 PM - 04:30 PM
Rick Durrett - The Impact of Spatial Structure on Genetic Data

We will review recent results on the stepping stone model which show that it has a much different impact on genetic data than the often-used island model. We will describe theoretical results for coalescing times obtained in collaboration with Ted Cox and Iljana Zahle, as well as simulation results of Arkendra De for the site frequency spectrum and the decay of linkage disequilibrium along a chromosome.

### Thursday, June 16, 2005

09:00 AM - 10:00 AM
Ranajit Chakraborty - Effects of Mutation and Population Demography on the Dynamics of Linkage Disequilibria and their Relevance for Mapping Complex Disease Genes

Complex diseases constitute the major public health burden in all societies around the world. However, the success in determining the etiology of such diseases has been rather limited for several reasons. This presentation starts with a brief outline of possible reasons for the difficulties involved in elucidating the genetic basis of complex diseases. From these discussions, we argue that population-based association studies are more likely to provide insights into the genetic basis of complex diseases than traditional family-based study designs. However, since disease-gene association at the population level stems from inter-locus association of alleles, a thorough understanding of the population genetic properties of linkage disequilibrium (LD) is needed for appropriate genetic interpretation of disease-gene association data. To this effect, some properties of genome-wide background LD are examined through a coalescence-based simulation study. We show that when microsatellite loci are used as genomic markers for disease-gene association studies, the expectation of the weighted normalized LD between two loci decreases with the recombination distance between the loci. However, the extent and trend of such decay depend on the rate and pattern of mutations as well as on the demographic history of populations. For example, for any specified recombination distance, the simulation results show that the power of detection of LD is larger in populations of smaller constant size. In a growing population, the power of detecting LD is substantially reduced, making it comparable to that expected in a constant population of the largest size reached by the population. In the presence of population growth, the enhancement of LD detection power with increasing sample size is less conspicuous than in populations of constant size.
Power of detection of LD is also larger for loci with a higher mutation rate in populations of constant size, although under population growth the effect of mutation rate is reversed, particularly for markers at larger recombination distances. Multistep forward-backward mutations at microsatellite loci actually increase the power. Finally, the presence of multiple alleles at microsatellite loci makes such markers more powerful for detecting LD than the common single nucleotide polymorphism sites (SNPs) residing at the same recombination distance. (Research supported by US Public Health Research Grants from the National Institutes of Health.)

10:30 AM - 11:30 AM
Carlos Bustamante - Inferring the Distribution of Selective Effects among Mutation, SNPs, and Fixed Differences using Polymorphism and Divergence Data

I will present recent work on frequentist and Bayesian approaches to the problem of inferring the distribution of selection coefficients on newly arising mutations within the context of "shift-models". Methods that use whole-genome SNP frequency data, polymorphism and divergence across protein-coding gene data, and combined SNP frequency, invariant, and divergence data will be presented. Forward simulations with selection and recombination are used to gauge the sensitivity, robustness, and accuracy of our models. Lastly, we apply the method to human polymorphism and divergence data to estimate the proportion of mutations, SNPs, and nucleotide substitutions in the human genome that are deleterious, neutral, and adaptive.

[Joint work with Rasmus Nielsen, Andy Clark, Scott Williamson, Adi Fledel-Alon, and Ryan Hernandez]

| Name | Email | Affiliation |
| --- | --- | --- |
| Andres-Moran, Aida | aa335@cornell.edu | Molecular Biology and Evolution, Cornell University |
| Bansal, Vikas | vibansal@cs.ucsd.edu | Computer Science and Engineering, University of California, San Diego |
| Bazaliy, Borys | | Institute of Applied Mathematics and Mechanics, National Academy of Sciences of Ukraine |
| Best, Janet | jbest@mbi.osu.edu | Mathematics, The Ohio State University |
| Boni, Maciej | maciek@stanford.edu | College of Biological Sciences, Stanford University |
| Borisyuk, Alla | borisyuk@mbi.osu.edu | Mathematical Biosciences Institute, The Ohio State University |
| Bustamante, Carlos | cdb28@cornell.edu | Biostatistics & Computational Biology, Cornell University |
| Chakraborty, Ranajit | ranajit.chakraborty@uc.edu | Environmental Health, University of Cincinnati |
| Chen, Hua | hchen@berkeley.edu | Integrative Biology, University of California, Berkeley |
| Chen, Yuguo | yuguo@stat.duke.edu | Institute of Statistics & Decision Sciences, Duke University |
| Coop, Graham | gcoop79@gmail.com | Department of Human Genetics, University of Chicago |
| Craciun, Gheorghe | craciun@math.wisc.edu | Dept. of Mathematics, University of Wisconsin-Madison |
| Davison, Dan | davison@uchicago.edu | Committee on Evolutionary Biology, University of Chicago |
| De, Arkendra | ade1@cam.cornell.edu | Department of Statistical Sciences, Cornell University |
| Dougherty, Daniel | dpdoughe@mbi.osu.edu | Mathematical Biosciences Institute, The Ohio State University |
| Durrett, Rick | rtd1@cornell.edu | Department of Mathematics, Cornell University |
| Fuerst, Paul | fuerst.1@osu.edu | College of Biological Sciences, The Ohio State University |
| Goel, Pranay | goelpra@helix.nih.gov | NIDDK, Indian Institute of Science Education and Research |
| Guo, Yixin | yixin@math.drexel.edu | Department of Psychology, The Ohio State University |
| Halperin, Eran | heran@icsi.berkeley.edu | International Computer Science Institute (ICSI) |
| Joyce, Paul | joyce@uidaho.edu | Mathematics, Statistics, & Bioinformatics, University of Idaho |
| Kim, Su | skim@galton.uchicago.edu | Department of Human Genetics, University of Chicago |
| Lim, Sookkyung | limsk@math.uc.edu | Department of Mathematical Sciences, University of Cincinnati |
| Lin, Shili | lin.328@osu.edu | Department of Statistics, The Ohio State University |
| Ma, Xiaotu | xiaotuma@hotmail.com | Computational Molecular Biology, University of Southern California |
| Melfi, Vincent | melfi@mbi.osu.edu | Mathematics, Michigan State University |
| Nielsen, Rasmus | rn28@cornell.edu | Center for Bioinformatics |
| Plagnol, Vincent | vincent.plagnol@normalesup.org | Computational Biology, University of Southern California |
| Pol, Diego | dpol@mbi.osu.edu | Independent Researcher, Museo Paleontologico E. Feruglio |
| Pritchard, Jonathan | tcharley@bsd.uchicago.edu | Department of Human Genetics, University of Chicago |
| Ptak, Susan | ptak@eva.mpg.de | Evolutionary Genetics, Max Planck Institute for Evolutionary Anthropology |
| Rassoul-Agha, Firas | firas@math.ohio-state.edu | Department of Mathematics, University of Utah |
| Rejniak, Katarzyna | rejniak@mbi.osu.edu | Mathematical Biosciences Institute, The Ohio State University |
| Rosenberg, Noah | noahr@cmb.usc.edu | Molecular and Computational Biology, University of Southern California |
| Schmidt, Deena | dschmidt@mbi.osu.edu | |
| Stubna, Michael | stubna@mbi.osu.edu | Engineering Team Leader, Pulsar Informatics |
| Sun, Fengzhu | fsun@hto.usc.edu | Department of Mathematics, University of Southern California |
| Tayo, Bamidele | botayo@buffalo.edu | Social and Preventive Medicine, University at Buffalo (SUNY) |
| Terman, David | terman@math.ohio-state.edu | Mathematics Department, The Ohio State University |
| Tian, Jianjun Paul | tianjj@mbi.osu.edu | Mathematics, College of William and Mary |
| Uyenoyama, Marcy | marcy@duke.edu | Department of Biology, Duke University |
| Wang, Zailong | zlwang@mbi.osu.edu | Integrated Information Sciences, Novartis |
| Wechselberger, Martin | wm@mbi.osu.edu | Mathematical Biosciences Institute, The Ohio State University |
| Williamson, Scott | sw292@cornell.edu | Biological Statistics & Computational Biology, Cornell University |
| Xu, Haiyan | haiyan@stat.ohio-state.edu | Department of Statistics, The Ohio State University |
| Zhang, Kui | kzhang@ms.soph.uab.edu | Department of Biostatistics, University of Alabama at Birmingham |
| Zhou, Jin | jzhou@mbi.osu.edu | Department of Mathematics, Northern Michigan University |
| Zoellner, Sebastian | szoellne@bsd.uchicago.edu | Department of Human Genetics, University of Chicago |