I will present recent work on frequentist and Bayesian approaches to the problem of inferring the distribution of selection coefficients on newly arising mutations within the context of "shift-models". Methods that use whole-genome SNP frequency data, polymorphism and divergence data across protein-coding genes, and combined SNP frequency, invariant, and divergence data will be presented. Forward simulations with selection and recombination are used to gauge the sensitivity, robustness, and accuracy of our models. Lastly, we apply the method to human polymorphism and divergence data to estimate the proportions of mutations, SNPs, and nucleotide substitutions in the human genome that are deleterious, neutral, and adaptive.
[Joint work w/ Rasmus Nielsen, Andy Clark, Scott Williamson, Adi Fledel-Alon, and Ryan Hernandez]
Complex diseases constitute the major public health burden in all societies around the world. However, success in determining the etiology of such diseases has been rather limited for several reasons. This presentation starts with a brief outline of possible reasons for the difficulties involved in elucidating the genetic basis of complex diseases. From this discussion, we argue that population-based association studies are more likely than traditional family-based study designs to provide insights into the genetic basis of complex diseases. However, since disease-gene association at the population level stems from inter-locus association of alleles, a thorough understanding of the population genetic properties of linkage disequilibrium (LD) is needed for appropriate genetic interpretation of disease-gene association data. To this end, some properties of genome-wide background LD are examined through a coalescent-based simulation study. We show that when microsatellite loci are used as genomic markers for disease-gene association studies, the expectation of the weighted normalized LD between two loci decreases with the recombination distance between the loci. However, the extent and trend of this decay depend on the rate and pattern of mutations as well as on the demographic history of the populations. For example, for any specified recombination distance, the simulation results show that the power to detect LD is larger in populations of smaller constant size. In a growing population, the power to detect LD is substantially reduced, making it comparable to that expected in a constant population of the largest size reached by the growing population. In the presence of population growth, the gain in LD detection power with increasing sample size is less conspicuous than in populations of constant size.
The power to detect LD is also larger for loci with higher mutation rates in populations of constant size, although under population growth the effect of mutation rate is reversed, particularly for markers at larger recombination distances. Multistep forward-backward mutations at microsatellite loci actually increase the power. Finally, the presence of multiple alleles at microsatellite loci makes such markers more powerful for detecting LD than the common single nucleotide polymorphism sites (SNPs) residing at the same recombination distance. (Research supported by US Public Health Research Grants from the National Institutes of Health).
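As an illustration of what a "weighted normalized LD" between two multiallelic loci can look like, the following is a minimal Python sketch of one common definition: the allele-pair disequilibria D_ij are each normalized by their maximum attainable magnitude and then averaged with weights p_i q_j. The abstract does not specify which weighting or normalization the study used, so this definition, the function name `weighted_normalized_ld`, and the input format (a list of phased two-locus haplotypes) are assumptions made for illustration only.

```python
from collections import Counter
from typing import Hashable, Sequence, Tuple

# A phased two-locus haplotype: (allele at locus A, allele at locus B).
Haplotype = Tuple[Hashable, Hashable]


def weighted_normalized_ld(haplotypes: Sequence[Haplotype]) -> float:
    """Weighted average of |D_ij| / D_max over all allele pairs (i, j).

    Assumed definition (not taken from the abstract):
      D_ij = x_ij - p_i * q_j, normalized by its maximum possible
      magnitude given p_i and q_j, then weighted by p_i * q_j.
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("need at least one haplotype")

    pair_counts = Counter(haplotypes)
    a_counts = Counter(a for a, _ in haplotypes)
    b_counts = Counter(b for _, b in haplotypes)

    total = 0.0
    for a, count_a in a_counts.items():
        for b, count_b in b_counts.items():
            p, q = count_a / n, count_b / n
            d = pair_counts[(a, b)] / n - p * q
            if d < 0:
                d_max = min(p * q, (1 - p) * (1 - q))
            else:
                d_max = min(p * (1 - q), (1 - p) * q)
            # Pairs involving a fixed allele have d_max == 0 and carry no LD.
            if d_max > 0:
                total += p * q * abs(d) / d_max
    return total


# Hypothetical toy data: complete association vs. independence.
associated = [("1", "x"), ("1", "x"), ("2", "y"), ("2", "y")]
independent = [("1", "x"), ("1", "y"), ("2", "x"), ("2", "y")]
ld_associated = weighted_normalized_ld(associated)
ld_independent = weighted_normalized_ld(independent)
```

With many alleles per locus, as at microsatellites, the double loop simply runs over more allele pairs; the statistic stays between 0 and 1.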
We will review recent results on the stepping stone model, which show that it has a much different impact on genetic data than the often-used island model. We will describe theoretical results for coalescence times obtained in collaboration with Ted Cox and Iljana Zahle, as well as simulation results of Arkendra De for the site frequency spectrum and the decay of linkage disequilibrium along a chromosome.
Recent evidence suggests that recombination hotspots are common across the human genome. We show how (approximate) likelihood-based methods for estimating recombination rates from population data can be adapted to the problem of detecting recombination hotspots. We extend an existing method for detecting recombination hotspots, which uses likelihood curves for the recombination rate across small sub-regions of the genome. This new method appears more powerful than existing methods at detecting hotspots - simulation results suggest a power of around 60-80% with a false positive rate of 1-5%.
Analysis of Seattle SNP genes suggests that recombination hotspots are randomly distributed across the genome, with an average spacing of around 1 per 30-40kb. Many genes contain more than one hotspot. There is little evidence for hotspots which occur in only one of the two (European American and African American) populations.
In this talk I will introduce a new method, HAPLOFREQ, to estimate haplotype frequencies over a short genomic region given the genotypes or haplotypes with missing data and/or sequencing errors. Our approach is based on rigorous analysis of the likelihood function, and in particular the method is guaranteed to efficiently converge to the global optimum of the likelihood function. Finally, I will discuss the relations between haplotype frequency estimation and tag SNP selection.
Many of the current methods for uncovering the genetic basis of common complex diseases in humans aim to exploit linkage disequilibrium (LD). Patterns of LD depend crucially on the shape of genealogical trees at the loci involved, so there is considerable interest in understanding how these would be affected by selection. An algorithm for exact simulation from the genealogical history of a sample, for population genetics models with general diploid selection and parent-independent mutation, was developed by Stephens and Donnelly (2003) based on the work of Slade (2000). Central to their approach is the need to calculate the constant of integration for the $K$ allele model with selection. Donnelly, Nordborg, and Joyce (DNJ 2001) developed methods for likelihood analysis under the $K$ allele model with selection. Here we present a new method for likelihood analysis that is substantially more efficient than DNJ (2001) and can be used to improve the efficiency of Stephens and Donnelly (2003). The method uses numerical analysis techniques, including fast Fourier transforms, to calculate the intractable constant of integration. The method provides a perfect simulation approach for directly drawing allele frequency samples from the distribution under selection. This research is joint work with Alan Genz, Washington State University.
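For orientation, the constant of integration mentioned above can be written out under the standard diffusion form of the $K$ allele model with parent-independent mutation and diploid selection. The notation here is an assumption and not taken from the abstract: $x_i$ is the frequency of allele $i$, $\theta_i$ is the scaled mutation rate to allele $i$, and $\sigma_{ij}$ is the scaled selection coefficient of genotype $(i,j)$, whose scaling convention varies between authors. With that caveat, the stationary density on the simplex $\Delta_K$ has the form

$$
f(x_1,\dots,x_K) \;=\; \frac{1}{c(\sigma,\theta)}\,\exp\!\Big(\sum_{i=1}^{K}\sum_{j=1}^{K}\sigma_{ij}\,x_i x_j\Big)\prod_{i=1}^{K} x_i^{\theta_i-1},
\qquad
c(\sigma,\theta) \;=\; \int_{\Delta_K}\exp\!\Big(\sum_{i=1}^{K}\sum_{j=1}^{K}\sigma_{ij}\,x_i x_j\Big)\prod_{i=1}^{K} x_i^{\theta_i-1}\,dx .
$$

When all $\sigma_{ij}=0$ the density reduces to a Dirichlet distribution and $c$ is available in closed form; with selection, $c(\sigma,\theta)$ is the intractable integral that has to be evaluated numerically.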
Most available human single nucleotide polymorphism (SNP) data have been obtained through a complicated process in which SNPs are first discovered in a small sample and then genotyped in a larger sample. This ascertainment process fundamentally affects many properties of the data, including linkage disequilibrium, the frequency spectrum, and levels of population differentiation. It also implies that standard population genetic analyses are not applicable to the vast majority of the human SNP data. I will discuss the ascertainment process in some of the major SNP data sets available (Perlegen and HapMap), how the ascertainment process has affected the data, and how various correction methods can be implemented to allow valid population genetic inferences.
The new HapMap and Perlegen data sets offer the first opportunities to scan the human genome for signatures of natural selection. One promising approach is to search for polymorphic variants that have undergone recent directional selection using patterns of long-range LD (e.g., as in Sabeti et al 2002). In this talk I outline a new approach that extends the PAC-likelihood model of Li and Stephens (2003) in order to test for this type of signal in an approximate likelihood framework. The new test controls appropriately for local recombination rate heterogeneity, which may confound simpler approaches. I discuss applications to the genome-wide data sets.
Two recent studies examined single human recombination hotspots in other primates and found no evidence for an increased rate of recombination. These findings raised the question of how conserved recombination rates are among closely related species. To address this, we estimated recombination rates from 14 Mb of linkage disequilibrium data in chimpanzees and in humans. The results suggest that recombination hotspots are not conserved between the two species and that recombination rates in larger (50 kb) genomic regions are only weakly conserved. Thus, the recombination landscape has changed dramatically between the two species.
Inferences about linkage disequilibrium are often based on haplotypes estimated from genotype data. To avoid using estimated haplotypes in measuring pairwise linkage disequilibrium, it is possible to employ statistics of Ohta (1980) and Sabatti and Risch (2002) that utilize the difference between the observed proportion of double homozygotes and the prediction made about double homozygosity from the homozygosities of individual loci. In this talk, I investigate some properties of homozygosity-based linkage disequilibrium statistics, paying particular attention to how the statistics are affected by population structure.
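The quantity underlying these statistics can be illustrated with a minimal Python sketch that computes the raw difference between the observed proportion of double homozygotes and the product of the single-locus homozygosities, directly from unphased genotypes. This is only the unnormalized difference described above, not the specific statistics of Ohta (1980) or Sabatti and Risch (2002); the function name `homozygosity_excess` and the genotype encoding are hypothetical choices made for illustration.

```python
from typing import Sequence, Tuple

# An unphased genotype at one locus: the two alleles carried by an individual.
Genotype = Tuple[str, str]


def is_homozygous(genotype: Genotype) -> bool:
    """True when both alleles at the locus are identical."""
    return genotype[0] == genotype[1]


def homozygosity_excess(
    locus_a: Sequence[Genotype], locus_b: Sequence[Genotype]
) -> float:
    """Observed double homozygosity minus its expectation under independence.

    locus_a[k] and locus_b[k] are the genotypes of individual k at the
    two loci. No haplotype phase is required.
    """
    if len(locus_a) != len(locus_b) or len(locus_a) == 0:
        raise ValueError("both loci need the same non-zero number of individuals")

    n = len(locus_a)
    hom_a = sum(is_homozygous(g) for g in locus_a) / n
    hom_b = sum(is_homozygous(g) for g in locus_b) / n
    hom_ab = sum(
        is_homozygous(ga) and is_homozygous(gb)
        for ga, gb in zip(locus_a, locus_b)
    ) / n
    return hom_ab - hom_a * hom_b


# Hypothetical toy genotypes for four individuals.
locus_a = [("A", "A"), ("A", "a"), ("a", "a"), ("A", "a")]
locus_b_linked = [("B", "B"), ("B", "b"), ("b", "b"), ("B", "b")]
locus_b_unlinked = [("B", "B"), ("B", "B"), ("B", "b"), ("B", "b")]

excess_linked = homozygosity_excess(locus_a, locus_b_linked)
excess_unlinked = homozygosity_excess(locus_a, locus_b_unlinked)
```

Because the calculation pools individuals, a sample drawn from several subpopulations with different homozygosities can show a non-zero difference even without LD within any subpopulation, which is one way population structure can affect such statistics.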
The HapMap project will generate an enormous amount of data on human genomic variation. Due to high linkage disequilibrium in many regions, a small fraction of SNPs (tag SNPs) is sufficient to capture most of the haplotype structure of the human genome. We developed a suite of dynamic algorithms for haplotype block partition and tag SNP selection to minimize the total number of tag SNPs across the region of interest or the whole genome. Our algorithms can be applied to both haplotype and genotype data, as well as to any pedigree structure. We also studied power issues in association studies related to tag SNP selection using simulated as well as real data.
I address methods for inferring population parameters from multiple summary statistics. We generated a maximum-likelihood estimate of the rate of recombination between a neutral marker locus and the target of strong balancing selection to which it shows nearly complete linkage. Recently, we developed an importance sampling (IS) approximation to the time-consuming computation of exact likelihoods on which this method relies. In a study of the demographic history of closely related Drosophila species, we found that the IS approach yielded accurate estimates under a much reduced computational burden. I end with a discussion of ongoing explorations of the effects of genomic location, including cosegregation with incompatibility genes, targets of selection, or centromeres, on rates of introgression.
Motivated by the statistical inference problem in population genetics, we present a new sequential importance sampling with resampling strategy. The idea of resampling is key to the recent surge of popularity of sequential Monte Carlo methods in the statistics and engineering communities, but existing resampling techniques do not work well for coalescent-based inference problems in population genetics. We develop a new method called "stopping-time resampling," which allows one to compare partially simulated samples at different stages so as to terminate unpromising partial samples and multiply promising ones early on.
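To make the role of resampling concrete, here is a minimal Python sketch of the conventional resampling step used in sequential importance sampling: partial samples ("particles") are redrawn in proportion to their current weights, so low-weight ones tend to be dropped and high-weight ones duplicated. This is plain multinomial resampling at a fixed step, i.e. the kind of existing technique the abstract contrasts with, and not the stopping-time resampling method itself. The function names and the effective-sample-size diagnostic are illustrative assumptions.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def effective_sample_size(weights: Sequence[float]) -> float:
    """(sum w)^2 / sum w^2: equals n for equal weights, 1 if one weight dominates."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)


def multinomial_resample(
    particles: Sequence[T], weights: Sequence[float], rng: random.Random
) -> Tuple[List[T], List[float]]:
    """Redraw particles in proportion to their weights.

    Every resampled particle receives the mean of the old weights, so the
    total weight (the running likelihood estimate) is unchanged.
    """
    n = len(particles)
    if n == 0 or n != len(weights):
        raise ValueError("particles and weights must be non-empty and equal in length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    chosen = rng.choices(particles, weights=weights, k=n)
    return chosen, [total / n] * n


# Hypothetical partial samples with very uneven importance weights.
rng = random.Random(0)
particles = ["partial_a", "partial_b", "partial_c"]
weights = [0.0, 0.0, 3.0]
new_particles, new_weights = multinomial_resample(particles, weights, rng)
```

In coalescent-based problems, partial samples at the same step may have progressed to very different stages of the genealogy, which is one reason comparing their weights at a fixed step can be misleading; the stopping-time idea described above addresses when such comparisons are made.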
There is increasing evidence that hotspots of meiotic recombination in humans, as well as in other organisms, are transient features of the genome. This observation is commonly believed to be the result of biased gene conversion in favour of alleles that locally disrupt hotspots. We investigate the effect of such alleles on the short-term evolution of hotspots through population genetic models. Our results indicate that a lack of sharing of intense hotspots between species is to be expected even if there are few sites where hotspot-disrupting alleles can arise. Effective population size is found to play a significant role in the fate of hotspots. The distribution of hotspot intensities in a population under different models of hotspot genesis is discussed. The effect of alleles that influence the intensity of a hotspot on patterns of diversity is explored; we find that alleles that reduce the intensity of a hotspot leave little trace of their presence in the patterns found in population data.
We focus in this paper on model fitting for the history of two human populations: European and African-American. Most of the analysis is based on the Seattle SNPs database, but it also highlights the issue of ascertainment bias when dealing with the HapMap dataset. We designed a simple model whose features can explain the pattern of variation observed in the data. After estimating its parameters and assessing its goodness of fit, we use our model to understand the relation between the frequency and age of SNPs in each subpopulation. The model illustrates key differences between the genealogical histories of African and European populations. These findings have implications for disease mapping and scans for selection.
Haplotype reconstruction for tightly linked markers in general pedigrees remains a challenging problem. Few methods are available to efficiently estimate haplotype frequencies and accurately infer haplotype configurations in general pedigrees with a large number of tightly linked SNPs, especially in the presence of missing data, and the performance of those methods has not been carefully evaluated. We have developed an efficient computer program, HAPLORE, for haplotype frequency estimation and reconstruction in general pedigrees with tightly linked SNP markers. In this report, we compare and contrast HAPLORE with two other previously published methods. We review the methods and point out the differences between them in terms of the models and computational strategies they use. Their performance is assessed using simulated haplotypes based on real pedigrees. Our results indicate that HAPLORE outperforms the other methods.