Haplotype Inference Using a Hidden Markov Model with Efficient Markov Chain Sampling
Advisers: Radford Neal and Celia Greenwood

Shuying Sun
Mathematical Biosciences Institute (MBI), The Ohio State University

(March 6, 2006 10:30 AM - 11:30 AM)

Abstract

Knowledge of haplotypes is useful for understanding block structure and disease risk associations. Direct measurement of haplotypes in the absence of family data is presently impractical, so several methods have been developed for reconstructing haplotypes from population data. We have developed a new population-based method that uses a Hidden Markov Model (HMM) for the source of the ancestral haplotype segments. For the ancestral haplotypes themselves, a higher-order Markov model is used to account for linkage disequilibrium. Our model includes parameters for the genotyping error rate, the mutation rate and the recombination rate at each position. The parameters are inferred by Bayesian methods, specifically Markov Chain Monte Carlo (MCMC). Crucial to the efficiency of the Markov chain sampling is the use of a forward-backward algorithm to sum over all possible state sequences of the HMM. We have used the model to reconstruct the haplotypes of 129 children in the data set of Daly et al. (2001) and of 30 children in the CEU and YRI data of the HapMap project. For these data sets, family-based reconstructions were obtained using Merlin (Abecasis et al. 2002). Our haplotype reconstruction method does not require dividing the loci into small blocks. It produces results that are quite close to the family-based reconstructions and comparable to those of the state-of-the-art PHASE program of Stephens et al. (2001, 2003). The recombination rates inferred by our model can help to identify recombination hotspots, such as those in the data set of Daly et al. (2001) and in the YRI data of the HapMap project.
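
To illustrate what summing over all possible state sequences of an HMM means in practice, the following is a minimal sketch of the forward pass for a generic discrete HMM. It is not the haplotype model described in the abstract: the function name, the two-state toy parameters and the observation sequence are all hypothetical, chosen only to show how the sum over exponentially many paths reduces to a recursion that is linear in the sequence length.

```python
import math


def forward_log_likelihood(init, trans, emit, obs):
    """Log-probability of obs under a discrete HMM, summed over all state paths.

    init[k]     : probability of starting in state k
    trans[i][k] : probability of moving from state i to state k
    emit[k][m]  : probability that state k emits symbol m
    obs         : sequence of observed symbol indices
    """
    n_states = len(init)
    # alpha[k] = P(observations so far, current state = k), up to rescaling
    alpha = [init[k] * emit[k][obs[0]] for k in range(n_states)]
    log_lik = 0.0
    for symbol in obs[1:]:
        # Rescale at every step to avoid numerical underflow on long sequences
        scale = sum(alpha)
        log_lik += math.log(scale)
        alpha = [
            sum(alpha[i] / scale * trans[i][k] for i in range(n_states))
            * emit[k][symbol]
            for k in range(n_states)
        ]
    return log_lik + math.log(sum(alpha))


# Toy two-state example with made-up numbers (illustrative assumptions only)
init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.2, 0.8]]
emit = [[0.8, 0.2], [0.3, 0.7]]
obs = [0, 1, 1, 0]
print(forward_log_likelihood(init, trans, emit, obs))
```

A matching backward pass, combined with these forward quantities, gives the posterior distribution over hidden states at each position, from which a sampler can draw whole state sequences in one step.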