Workshop 1: Analysis of Gene Expression Data: Principles and Applications

(October 11, 2004 - October 15, 2004)

Organizers


Shili Lin
Department of Statistics, The Ohio State University
Terence Speed
Department of Statistics, University of California, Berkeley

A (protein coding) gene is determined to be expressed in a cell or group of cells when its transcribed messenger RNA (mRNA), or the resulting protein product, is detected. There is a wide variety of techniques for determining and quantifying gene expression, and most of these have substantial analytical components.

We measure gene expression in order to compare the expression levels of one or more genes in cells from different sources. Comparisons of interest include tumor versus normal cells, cells from a specific organ in a mutant or genetically modified organism versus cells from the same organ in a normal organism of the same strain, and cells before and after an intervention such as a drug treatment.

There are many techniques for measuring gene expression, but perhaps most common at the moment are ones which rely on DNA-RNA or DNA-DNA hybridization. This is the process through which single-stranded DNA and RNA molecules find and base-pair with their complementary sequences amidst a complex mixture of many molecules of the same kind.

The older cell-wide method for measuring gene expression at the protein level was two-dimensional gel (2D-Gel) analysis, where complex mixtures were separated by pH and size using isoelectric focusing and polyacrylamide gel electrophoresis (PAGE). The technique was combined with mass spectrometry (MS) in the 1990s, and now there are a number of electrophoresis-free, MS-based approaches to measuring protein levels. More recently, protein arrays have been developed, and some of these will be discussed later in the year in Workshop 4.

On what scale do we measure gene expression? Much of the recent interest by statisticians in this area stems from the availability of data sets giving expression measurements on tens of thousands of genes, so-called microarray gene expression data. However, nylon membrane filters with thousands of genes spotted on them have been around for over a decade, and smaller-scale quantitative expression data for much longer. Similarly, 2D-Gel data are quite extensive, and MS techniques, especially when done in conjunction with other separation techniques, can produce up to 10^8 data points per sample. There are many differences between these technologies but, from the analytical viewpoint, many similarities as well.

In this workshop, we will survey some of the computational, mathematical, and statistical models and methods used in analyzing gene expression data. Much of our focus will be on approaches quantifying mRNA, as that is the best developed. We shall also present a small sample of the extensive biological and technological background to gene expression analysis.

Accepted Speakers

David Allison
Department of Biostatistics, University of Alabama at Birmingham
Harmen Bussemaker
Computational Biology & Bioinformatics, Columbia University
Raymond Carroll
Department of Statistics, Texas A & M University
Susmita Datta
Department of Mathematics & Statistics, Georgia State University
Kim-Anh Do
Department of Biostatistics, University of Texas M. D. Anderson Cancer Center
Darlene Goldstein
Institute of Mathematics, EPFL
Ina Hoeschele
Virginia Bioinformatics Institute, Virginia Polytechnic Institute and State University
Jason Hsu
Department of Statistics, The Ohio State University
Earl Hubbell
Department of Statistics, Affymetrix
William Evan Johnson
Department of Biostatistics, Harvard University
M. Kathleen Kerr
Department of Biostatistics, University of Washington
David Kreil
European Bioinformatics Institute, University of Cambridge
Eric Schadt
Research Genetics, Rosetta Inpharmatics
Rainer Spang
Computational Diagnostics Group, Max Planck Institute for Molecular Genetics
Terence Speed
Department of Statistics, University of California, Berkeley
Hongyu Zhao
EPID/Public Health & Genetics, Yale University
Monday, October 11, 2004
Time Session
09:30 AM
10:30 AM
Earl Hubbell - Designing Estimators for Low Level Expression Analysis

The analysis of gene expression using oligonucleotide arrays commonly requires estimating the expression level of a transcript using information from multiple probes. Many transcripts are expressed at such low levels that the nonspecific hybridization is a significant proportion of the observed probe intensity, and so it is an interesting problem to design estimators that function well on transcripts that have concentrations near or at zero. Working from simple assumptions about the behavior of probes, PLIER is an M-estimator model-based framework for finding expression estimates that is designed to handle near-background probe intensities well with minimal positive bias to the results. While the estimates from PLIER are by design not variance stabilized, PLIER shows good performance at detecting differential change, and can be variance stabilized by standard means.

11:00 AM
11:30 AM
M. Kathleen Kerr - Comparison of Affymetrix and Quantitative rtPCR Measurements of Relative Gene Expression

In the spirit of empirical evaluation, we compare measurements of relative gene expression from quantitative rtPCR to measurements from Affymetrix gene chips. Our particular interest is how different methodologies for processing Affy data influence the agreement between Affy and qrtPCR measurements.


Note: Joint work with Dick Beyer, Noel Hudson, Nancy Linford, Li-Xuan Qin.

02:00 PM
03:00 PM
David Kreil - From Spot to Biology: Challenges in microarray data analysis

I will introduce our variational Bayesian implementation of Independent Component Analysis and report on successes and problems experienced in specific microarray data analysis applications. I will further discuss how the complex experimental process underlying microarray experiments affects data, and why 'low level' data analysis is still a major challenge in the field that often limits the detection of subtle interactions, particularly in typical size experimental designs.

Tuesday, October 12, 2004
Time Session
09:00 AM
10:00 AM
Darlene Goldstein - Strategies for Quantifying GeneChip Expression for Large Studies

Studies of gene expression using Affymetrix GeneChips have become standard in several areas of clinical research, particularly cancer. Of the expression measures that have been proposed to quantify GeneChip expression, multiarray-based measures have been shown to perform well. As clinical gene expression studies increase in size, however, utilizing multiarray strategies is more challenging in terms of computing memory requirements and time. I will report on results of a study examining properties and tradeoffs of single and multiarray strategies for quantifying expression in large studies.

10:30 AM
11:30 AM
William Evan Johnson - Adjusting for the Batch Effect: An Empirical Bayes Approach to Combining Microarray Data from Multiple Sources

Meta-analyses are often important tools for the generation of scientific hypotheses and the discovery of new science. In this light, there is increasing interest in the feasibility of conducting meta-analytic research using the existing and ever increasing pool of gene microarray data sets. This research could play an important role in the advancement of biological research, possibly defraying the time, expense and other difficulties associated with collecting and analyzing such data.


However, there are many problems associated with conducting meta-analyses on microarray data. In addition to natural biological variation across batches of such data, data are often subject to differences attributable to inconsistent data collection methods, such as different experimental conditions, researchers, or even lab protocols. Often the non-biological variation across batches of data is greater than the variation between the tissue types or treatment groups of interest. As a result, comparability is often suspect across studies.


Here we describe an empirical Bayes method of adjusting data for batch effects. This method consists of individual adjustments for each gene within each batch. The Bayesian framework shrinks gene-wise batch adjustments by pooling information across genes within batches, adjusting for gene-by-batch interactions while respecting the systematic nature of differences in expression estimates across batches. This method is compared with univariate gene-wise location-scale adjustments and with other methods present in the literature.


The empirical Bayes adjustments are very robust in the presence of outlying observations, even when batch sample sizes are small (n < 15). Additionally, the empirical Bayes method often improves the consistency of within-batch fold changes for treatment effects, as compared with the unadjusted data.
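
As a rough illustration of the univariate gene-wise location-scale adjustment mentioned above as a comparison method, the following Python sketch standardizes each gene within each batch and then puts all batches on a common location and scale. It is not the empirical Bayes method of the talk: there is no shrinkage across genes. The function name `location_scale_adjust`, the genes-by-samples layout, and the choice of the pooled within-batch standard deviation as the target scale are assumptions made for this example.

```python
import numpy as np

def location_scale_adjust(expr, batches):
    """Gene-wise location-scale batch adjustment (illustrative sketch only).

    expr:    array of shape (n_genes, n_samples), e.g. log-scale expression.
    batches: length n_samples sequence of batch labels.
    Assumes every batch has at least two samples and that no gene is
    constant within a batch (otherwise the batch SD is zero).
    """
    expr = np.asarray(expr, dtype=float)
    batches = np.asarray(batches)
    labels = np.unique(batches)

    # Common targets: the overall mean of each gene and its pooled
    # within-batch standard deviation.
    grand_mean = expr.mean(axis=1, keepdims=True)
    resid_ss = np.zeros((expr.shape[0], 1))
    for b in labels:
        block = expr[:, batches == b]
        centered = block - block.mean(axis=1, keepdims=True)
        resid_ss += (centered ** 2).sum(axis=1, keepdims=True)
    pooled_sd = np.sqrt(resid_ss / (expr.shape[1] - len(labels)))

    # Standardize within each batch, then map onto the common targets.
    adjusted = np.empty_like(expr)
    for b in labels:
        cols = batches == b
        block = expr[:, cols]
        mu = block.mean(axis=1, keepdims=True)
        sd = block.std(axis=1, ddof=1, keepdims=True)
        adjusted[:, cols] = (block - mu) / sd * pooled_sd + grand_mean
    return adjusted
```

With small batches the per-gene batch means and standard deviations used here are noisy, which is the situation in which pooling information across genes, as the abstract describes, is intended to help.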


Note: Joint work with Cheng Li, Department of Biostatistics, Harvard School of Public Health.

02:00 PM
03:00 PM
Raymond Carroll - Efficient Estimation of Gene-Environment Interactions in Case-Control Studies with Quantitative Gene Information

A recent paper by Morley, et al. (Nature, 2004, 743-747) used microarray data to perform gene-at-a-time linkage analysis in a family-based study. The phenotype in question was the gene-expression level. They conclude that the human gene expression phenotype "is a trait like many others ... amenable to genetic analysis". The discussion by N. Cox (Nature, 2004, 733-734) notes this point and states that "To veterans of linkage-mapping studies on complex human phenotypes, this may seem to be an understatement on a par with Watson and Crick's 'it has not escaped our notice'."


While not even remotely as earth-shattering, this paper works in the same type of context. We consider the analysis of standard and family-based case-control studies of a disease. We are particularly interested in gene-environment interactions and main effects when the gene is quantitative, as it would be for a gene-expression level from a microarray. In some cases, it is reasonable to assume that the gene-expression level and the environmental covariates are independent in the population (marginally). Under this assumption, and if the distribution of the gene-expression level is modeled parametrically while that of the environmental factors is not modeled, we construct a semiparametric profile likelihood, profiled over the nonparametric distribution of the environmental factors. We show that this profile likelihood acts as if it were a proper likelihood. In addition, it leads to estimation of main effects and interactions that is more efficient than a standard logistic regression approach. If the marginal probability in the population of case status is known, even greater gains in efficiency are possible. We describe potential parametric models for the gene-expression levels. It appears that the profile likelihood approach also allows for single-index modeling of multiple gene-expression levels.


Note: Joint work with Nilanjan Chatterjee of the National Cancer Institute.

Wednesday, October 13, 2004
Time Session
09:00 AM
10:00 AM
Jason Hsu - Statistically Designing Microarray Experiments and Analyzing Gene Expression Data in a Decision-Making Process

Microarray experiments are no longer exploratory in nature; they are fast becoming parts of well-defined decision processes. For example, they may be used to select genes toward the fabrication of prognostic chips, or the elimination of patient subpopulations prone to serious adverse events. So we believe microarray experimental design and gene expression analysis should be viewed as integral parts of specific decision-making processes.


As an example, in choosing genes to fabricate a prognostic microarray, one should keep in mind microarray manufacturing economics, as well as FDA regulatory requirements. In this presentation, I will report on a joint project to statistically design a microarray experiment with randomization, replication, and blocking, which will allow assessment of the sensitivity and specificity of genetic profiling prognostic chips.


As part of the decision process, we believe the analysis of gene expressions from a microarray experiment should control a statistical error rate appropriate for the purpose of the experiment. Controlling familywise error rate vs. false discovery rate will be discussed in the context of specific applications. The principle of partition multiple testing, a form of conditional frequentist inference, will be given. We will also describe the conditions under which stepwise testing becomes a valid computational shortcut to partition testing. Subtleties of these conditions have not always been appreciated in bioinformatics.
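
To make the contrast between the two error rates concrete, here is a minimal Python sketch that applies two textbook procedures to the same p-values: the Bonferroni rule, which controls the familywise error rate, and the Benjamini-Hochberg step-up rule, which controls the false discovery rate under independence. These are standard stand-ins chosen for illustration and are not the partition testing approach of the talk; the function names and the example p-values are assumptions.

```python
import numpy as np

def bonferroni_reject(pvals, alpha=0.05):
    """Reject H_i when p_i <= alpha / m (controls the familywise error rate)."""
    pvals = np.asarray(pvals, dtype=float)
    return pvals <= alpha / pvals.size

def benjamini_hochberg_reject(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up rule (controls the false discovery rate)."""
    pvals = np.asarray(pvals, dtype=float)
    m = pvals.size
    order = np.argsort(pvals)
    sorted_p = pvals[order]
    # Compare the k-th smallest p-value with alpha * k / m.
    thresholds = alpha * np.arange(1, m + 1) / m
    below = np.nonzero(sorted_p <= thresholds)[0]
    reject = np.zeros(m, dtype=bool)
    if below.size:
        # Reject every hypothesis up to the largest k that passes.
        reject[order[: below[-1] + 1]] = True
    return reject

# Made-up p-values, for illustration only.
example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
n_fwer = int(bonferroni_reject(example_p).sum())          # 1 rejection
n_fdr = int(benjamini_hochberg_reject(example_p).sum())   # 2 rejections
```

On these values the FDR-controlling rule rejects one more hypothesis than the FWER-controlling rule, which is the usual trade: more discoveries in exchange for a weaker guarantee.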


Work done in collaboration with Jane Chang, Tao Wang, and Yifan Huang.

10:30 AM
11:30 AM
Susmita Datta - Significant Analysis Using P-values for Multiple Hypotheses Testing in Microarray Experiments

An empirical Bayes adjustment to multiple t-tests has been shown to improve the sensitivity of the overall procedure (Datta et al., Bioinformatics, 2004). Here we propose an empirical Bayes adjustment to the p-values rather than the test statistics. As a result, it is applicable to other types of multiple tests (e.g., F, chi-squared) as well. Both parametric and nonparametric versions of the empirical Bayes adjustment are considered. Thus, each p-value, in turn, borrows evidence from other p-values across the tests. A new set of accept/reject decisions is reached for each null hypothesis using the empirical Bayes adjusted p-values through a resampling based step-down p-value calculation that protects the analyst against the overall (familywise) type 1 error rate. The new procedure is shown to produce further improvement in sensitivity in a number of examples.
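
For readers unfamiliar with step-down p-value adjustment, the sketch below computes Holm's adjusted p-values in Python. Holm's method is used here only as a simple, non-resampling example of a step-down procedure that controls the familywise error rate; it is not the resampling-based calculation or the empirical Bayes adjustment described in the abstract, and the function name is an assumption.

```python
import numpy as np

def holm_adjusted_pvalues(pvals):
    """Holm step-down adjusted p-values (familywise error rate control)."""
    pvals = np.asarray(pvals, dtype=float)
    m = pvals.size
    order = np.argsort(pvals)
    # Step-down multipliers: m for the smallest p-value, m - 1 for the next, ...
    scaled = (m - np.arange(m)) * pvals[order]
    # Adjusted values must be non-decreasing along the ordering and at most 1.
    adj_sorted = np.minimum(np.maximum.accumulate(scaled), 1.0)
    adjusted = np.empty(m)
    adjusted[order] = adj_sorted  # return in the original order
    return adjusted
```

A hypothesis is rejected at level alpha when its adjusted p-value is at most alpha. For the inputs 0.01, 0.04, 0.03, 0.005 the adjusted values are 0.03, 0.06, 0.06, 0.02.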

02:00 PM
03:00 PM
David Allison - Opportunities, Challenges, and Issues Posed by Massive Multiple Inference in High Dimensional Biology

In microarray research and other types of high dimensional biological research, multiple testing issues arise in new ways, on new orders of magnitude, and present new opportunities and challenges. These features call for new ways of thinking about some aspects of inferential testing. Herein, I will discuss methods for multiple testing control that involve capitalizing on, rather than penalizing oneself for, the large number of tests conducted through mixture modeling procedures. A novel extension to power and sample size estimation will be presented. Finally, issues involving composite hypothesis testing involving both union-intersection testing and intersection-union testing will be presented.

03:30 PM
04:30 PM
Eric Schadt - Complex Systems to Understand Complex Traits: Beyond Reagent Driven Science

The reconstruction of genetic networks in mammalian systems is one of the primary goals in biological research, especially as such reconstructions relate to elucidating not only common, polygenic human diseases, but living systems more generally. Here I present a statistical procedure for inferring causal relationships between gene expression traits and more classic clinical traits, including complex disease traits. This procedure has been generalized to the gene network reconstruction problem, where naturally occurring genetic variations in segregating mouse populations are used as a source of perturbations to elucidate tissue-specific gene networks. Differences in the extent of genetic control between genders and among four different tissues are highlighted. I also demonstrate that the networks derived from expression data in segregating mouse populations using the novel network reconstruction algorithm are able to capture causal associations between genes that result in increased predictive power, compared to more classically reconstructed networks derived from the same data. This approach to causal inference in large segregating mouse populations over multiple tissues not only elucidates fundamental aspects of transcriptional control, it also allows for the objective identification of key drivers of common human diseases.

Thursday, October 14, 2004
Time Session
09:00 AM
10:00 AM
Kim-Anh Do - A Bayesian Mixture Model for Differential Gene Expression

Model-based inference is proposed for differential gene expression, using a non-parametric Bayesian probability model for the distribution of gene intensities under different conditions. The probability model is essentially a mixture of normals. Specifically, it is a variation of traditional Dirichlet process (DP) mixture models. The model includes an additional mixture corresponding to the assumption that transcription levels arise as a mixture over non-differentially and differentially expressed genes.


Inference proceeds as in DP mixture models, with an additional set of latent indicators to resolve this additional mixture. The use of fully model-based inference mitigates some of the necessary limitations of the empirical Bayes method (Efron, JASA 2001). However, the increased generality of our method comes at a price. Computation is not as straightforward as in the empirical Bayes scheme. But we argue that inference is no more difficult than posterior simulation in a traditional nonparametric mixture of normal models. We illustrate the proposed method in two examples, including a simulation study and a microarray experiment to screen for genes with differential expression in colon cancer versus normal tissue.


We will illustrate the ease of making joint inference about a sub-group of genes being differentially expressed and of estimating the total number of significantly expressing genes. Further, we also elaborate on how the control of false positive rates can be automatically incorporated into this approach.


Collaborators: Peter Mueller and Feng Tang.

10:30 AM
11:30 AM
Rainer Spang - Differential Co-Expression of Genes

Gene expression is a tightly regulated process, crucial for the proper functioning of a cell. In microarray data, coregulation is reflected by strong correlations between expression levels. Molecular disease mechanisms typically constitute abnormalities in the coregulation of genes. Resulting changes in expression profiles help identify disease-related genes, and in several cases facilitate improved diagnosis and even prognosis of disease outcome. Alteration of gene regulation often results in up or down regulated genes.


Common analysis strategies look for these differentially expressed genes, or for genes that play an important role in some supervised learning algorithm applied to the data. We opt to use a complementary approach. Not all changes in coregulation are manifested by up or down regulation of individual genes. Alterations of the correlation structure of expression should be investigated as well.


We address the problem of detecting sets of differentially co-expressed genes in two phenotypically distinct sets of expression profiles. We introduce a score for differential coexpression, and suggest a computationally efficient algorithm for finding high scoring sets of genes. The use of our novel method is demonstrated in the context of simulations and on real expression data from a clinical study.
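
The abstract does not define the score, so the following Python sketch shows only one simple way such a quantity could be constructed: the difference, between two phenotype groups, in the mean absolute pairwise Pearson correlation among a candidate set of genes. The function names, the genes-by-samples layout, and the score itself are hypothetical choices for illustration and should not be read as the method presented in the talk.

```python
import numpy as np

def mean_abs_correlation(expr):
    """Mean absolute pairwise Pearson correlation among the rows (genes) of expr.

    expr has shape (n_genes, n_samples) with at least two genes.
    """
    corr = np.corrcoef(expr)
    upper = np.triu_indices_from(corr, k=1)  # each gene pair counted once
    return np.abs(corr[upper]).mean()

def differential_coexpression_score(expr_a, expr_b, gene_idx):
    """Hypothetical score: co-expression of a gene set in group A minus group B.

    expr_a, expr_b: arrays of shape (n_genes, n_samples_a) and
    (n_genes, n_samples_b) for the two phenotype groups.
    gene_idx: indices of the candidate gene set.
    """
    gene_idx = np.asarray(gene_idx)
    return (mean_abs_correlation(np.asarray(expr_a, dtype=float)[gene_idx])
            - mean_abs_correlation(np.asarray(expr_b, dtype=float)[gene_idx]))
```

A score near zero means the gene set is about equally correlated in both groups, while a large positive or negative value means the correlation structure differs, even if no single gene changes its mean expression. Searching over gene sets for high scores is a separate combinatorial problem that this sketch does not address.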

02:00 PM
03:00 PM
Ina Hoeschele - Genetical Genomics Analysis to Infer Gene Regulatory Networks

Genetic analysis of gene expression in a segregating population, which is expression profiled and genotyped at DNA markers throughout the genome, can reveal regulatory networks of polymorphic genes. We propose an analysis strategy with several steps: (1) Genome-wide QTL analysis of all expression profiles to identify eQTL confidence regions, followed by fine-mapping of identified eQTL; (2) identification of regulatory candidate genes in each eQTL region; (3) correlation analysis of the expression profiles of the candidates in any eQTL region with the gene affected by the eQTL to reduce the number of candidates; (4) drawing directional links from retained regulatory candidate genes to genes affected by the eQTL and joining links to form networks, and (5) statistical validation and refinement of the inferred network structure via structural equation modeling. Here, we apply an initial implementation of this strategy to a segregating yeast population. In 65%, 7%, and 28% of the identified eQTL regions, a single candidate regulatory gene, no gene, or several (at most six) genes were retained in step (3), respectively. Overall, 768 putative regulatory links were retained, 331 of which are the strongest candidate links, as they were retained in the expression correlation analysis and also were located within or near an eQTL sub-region identified by a multi-marker analysis separating multiple linked QTL. Biological processes were statistically over-represented in highly interconnected sub-networks of genes. Transcription factors had few connections. To statistically validate the reconstructed networks and to compare alternative structures, we have begun investigating Structural Equation Modeling by simulating artificial data under nonlinear models of gene regulation with alternative network topologies. The asymptotic chi-square distribution of the model fit criterion, which is based on the assumption of multivariate normality, is well approximated in the simulated data, and in an initial comparison of alternative structures of small networks the correct models, or similar models, had the smallest Bayesian Information Criterion.

Friday, October 15, 2004
Time Session
09:00 AM
10:00 AM
Harmen Bussemaker - Inferring Regulatory Circuitry through Model-based Analysis of mRNA Expression and ChIP Data

Functional genomics studies are yielding information about regulatory processes in the biological cell at an unprecedented scale. Not only have DNA microarrays been used to measure, for all genes simultaneously, the mRNA abundance in a variety of conditions, but the level of occupancy of their promoter regions by a large number of transcription factors has also been determined. The challenge is to extract useful information about the global regulatory network from these data. We present an integrative modeling framework that combines libraries of expression and occupancy data to define the functional targets of each transcription factor: Multivariate regression analysis is used to infer transcription factor activity levels for each condition, and the correlation between the mRNA expression profile of an individual gene and the inferred activity profile of a transcription factor is interpreted as regulatory coupling strength. Applying our method to the yeast S. cerevisiae, we find that on average 58% of the genes whose promoter region is bound by a transcription factor are true regulatory targets. Moreover, our results enable us to assign directionality to transcription factors controlling divergently transcribed genes that share the same promoter region.
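
The two modeling steps named above can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions, not the speaker's implementation: expression in each condition is modeled as an intercept plus a linear combination of promoter occupancies, fitted by ordinary least squares, and coupling strength is taken to be the plain Pearson correlation. The function names and array layouts are assumptions made for the example.

```python
import numpy as np

def infer_tf_activities(occupancy, expression):
    """Least-squares estimate of transcription factor activities per condition.

    occupancy:  (n_genes, n_tfs) promoter occupancy of each gene by each factor.
    expression: (n_genes, n_conditions) mRNA expression, e.g. log-ratios.
    Returns an (n_tfs, n_conditions) array of inferred activities.
    """
    occupancy = np.asarray(occupancy, dtype=float)
    expression = np.asarray(expression, dtype=float)
    # One regression per condition, all solved at once; the first column
    # of the design matrix is an intercept.
    design = np.column_stack([np.ones(occupancy.shape[0]), occupancy])
    coef, *_ = np.linalg.lstsq(design, expression, rcond=None)
    return coef[1:]  # drop the intercept row

def coupling_strength(expression, activities):
    """Pearson correlation of each gene's profile with each activity profile.

    Returns an (n_genes, n_tfs) array. Assumes no profile is constant
    across conditions (otherwise its norm is zero).
    """
    e = np.asarray(expression, dtype=float)
    a = np.asarray(activities, dtype=float)
    e = e - e.mean(axis=1, keepdims=True)
    a = a - a.mean(axis=1, keepdims=True)
    e = e / np.linalg.norm(e, axis=1, keepdims=True)
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    return e @ a.T
```

In this picture, a gene whose promoter is bound by a factor but whose expression profile shows little correlation with that factor's inferred activity would be a candidate for a non-functional binding event.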

10:30 AM
11:30 AM
Hongyu Zhao - Integrated Statistical Analysis of Gene Expression Data

Recent advances in large-scale RNA expression measurements, DNA-protein interactions, protein-protein interactions and the availability of genome sequences from many organisms have opened the opportunity for massively parallel biological data acquisition and integrated understanding of the genetic networks underlying complex biological phenotypes. Many existing statistical procedures have been proposed to analyze a single data type, e.g. clustering algorithms for microarray data and motif finding methods for sequence data. Different data sources offer different perspectives on the same underlying system, and they can be combined to increase our chance of uncovering underlying biological mechanisms. In this talk, we will describe our attempts to develop a statistical framework to integrate diverse genomics and proteomics information to dissect transcriptional regulatory networks and signal transduction pathways. The developed methods will be illustrated through their applications in yeast. This is joint work with Ning Sun, Liang Chen, Baolin Wu, and Yin Liu.

02:00 PM
03:00 PM
Terence Speed - Some Problems in the Statistical Analysis of Microarray Data


Name Affiliation
Alexandridis, Roxana roxana@stat.ohio-state.edu Department of Statistics, The Ohio State University
Allison, David Dallison@UAB.edu Department of Biostatistics, University of Alabama at Birmingham
Bazaliy, Borys Institute of Applied Mathematics and Mechanics, National Academy of Sciences of Ukraine
Best, Janet jbest@mbi.osu.edu Mathematics, The Ohio State University
Borisyuk, Alla borisyuk@mbi.osu.edu Mathematical Biosciences Institute, The Ohio State University
Buechler, Steven buechler.1@nd.edu Department of Mathematics, University of Notre Dame
Bundschuh, Ralf bundschuh@mbi.osu.edu Department of Physics, The Ohio State University
Bussemaker, Harmen hjb2004@columbia.edu Computational Biology & Bioinformatics, Columbia University
Carl, Joe carl.30@osu.edu IBGP, The Ohio State University
Carroll, Raymond joyce@stat.tamu.edu Department of Statistics, Texas A & M University
Chang, Jane changj@bgnet.bgsu.edu Applied Statistics and Operations Research, Bowling Green State University
Craciun, Gheorghe craciun@math.wisc.edu Dept. of Mathematics, University of Wisconsin-Madison
Datta, Susmita sdatta@mathstat.gsu.edu Department of Mathematics & Statistics, Georgia State University
Davuluri, Ramana ramana.davuluri@osumc.edu Biomedical Informatics, The Ohio State University
Do, Kim-Anh kim@mdanderson.org Department of Biostatistics, University of Texas M. D. Anderson Cancer Center
Doss, Hani Department of Statistics, The Ohio State University
Dougherty, Daniel dpdoughe@mbi.osu.edu Mathematical Biosciences Institute, The Ohio State University
Dudala, Kalyan kdudala@iastate.edu Genetics/Ecology & Evolutionary Biology, Iowa State University
Edwards, David Mathematical Biosciences Institute, The Ohio State University
Erdal, Selnur erdal.3@osu.edu Department of Biophysics, The Ohio State University
Fang, Chih gt1091b@yahoo.com Molecular Virology, Immunology & Med Genetics, The Ohio State University
Gifford, David gifford@lcs.mit.edu Programming Systems Research Group, Massachusetts Institute of Technology
Goel, Pranay goelpra@helix.nih.gov NIDDK, Indian Institute of Science Education and Research
Goldstein, Darlene Darlene.Goldstein@epfl.ch Institute of Mathematics, EPFL
Gu, Weisong weisong@osc.edu Ohio Supercomputer Center, The Ohio State University
Guo, Yixin yixin@math.drexel.edu Department of Psychology, The Ohio State University
Hassanali, Ali hassanali@osu.edu Department of Biophysics, The Ohio State University
Hoeschele, Ina inah@vt.edu Virginia Bioinformatics Institute, Virginia Polytechnic Institute and State University
Hsu, Jason hsu.1@osu.edu Department of Statistics, The Ohio State University
Hu, Bei Department of Mathematics, University of Notre Dame
Huang, Yifan huang.338@osu.edu Department of Statistics, The Ohio State University
Hubbell, Earl Earl_Hubbell@affymetrix.com Department of Statistics, Affymetrix
Huebner, Marianne Huebner.Marianne@mayo.edu Statistics and Probability, Michigan State University
Johnson, William Evan wjohnson@hsph.harvard.edu Department of Biostatistics, Harvard University
Jung, Peter Quantitative Biology Institute, Ohio University
Kannan, Dan kannan@uga.edu Department of Mathematics, University of Georgia
Kendziorski, Christina kendzior@biostat.wisc.edu Biostatistics and Medical Informatics, University of Wisconsin
Kerr, M. Kathleen katiek@u.washington.edu Department of Biostatistics, University of Washington
Kreil, David kreil@ebi.ac.uk European Bioinformatics Institute, University of Cambridge
Lee, Yoonkyung Department of Statistics, The Ohio State University
Lim, Sookkyung limsk@math.uc.edu Department of Mathematical Sciences, University of Cincinnati
Lin, Shili lin.328@osu.edu Department of Statistics, The Ohio State University
Melfi, Vincent melfi@mbi.osu.edu Mathematics, Michigan State University
Nagaraja, Haikady Department of Statistics, The Ohio State University
Parrish, Rudolph recumm01@gwise.louisville.edu Bioinformatics and Biostatistics, University of Louisville
Pol, Diego dpol@mbi.osu.edu Mathematical Biosciences Institute, The Ohio State University
Polanska, Joanna jkp@stat.rice.edu System Engineering Group, Silesian University of Technology
Rassoul-Agha, Firas firas@math.ohio-state.edu Mathematical Biosciences Institute, The Ohio State University
Ray, William ray.29@osu.edu Pediatrics, The Ohio State University
Rejniak, Katarzyna rejniak@mbi.osu.edu Mathematical Biosciences Institute, The Ohio State University
Rodriguez, Ben Rodriquez-1@medctr.osu.edu Molecular Virology & Immunology, The Ohio State University
Rosa, Guilherme Department of Animal Science, Michigan State University
Santner, Thomas tjs@stat.ohio-state.edu Department of Statistics, The Ohio State University
Schadt, Eric kristina_brysz@merck.com Research Genetics, Rosetta Inpharmatics
Spang, Rainer spang@molgen.mpg.de Computational Diagnostics Group, Max Planck Institute for Molecular Genetics
Speed, Terence terry@stat.Berkeley.edu Department of Statistics, University of California, Berkeley
Stubna, Michael stubna@mbi.osu.edu Engineering Team Leader, Pulsar Informatics
Sun, Junfeng sun@stat.ohio-state.edu Department of Statistics, The Ohio State University
Terman, David terman@math.ohio-state.edu Mathematics Department, The Ohio State University
Tian, Jianjun Paul tianjj@mbi.osu.edu Mathematics, College of William and Mary
Varbanov, Alex varbanov.ar@pg.com Department of Statistics, Procter & Gamble
Verducci, Joseph verducci.1@osu.edu Department of Statistics, The Ohio State University
Wang, Huixia (Judy) Department of Statistics, University of Illinois at Urbana-Champaign
Wang, Tao wangtao@stat.ohio-state.edu Department of Statistics, The Ohio State University
Wang, Zailong zlwang@mbi.osu.edu Integrated Information Sciences, Novartis
Wechselberger, Martin wm@mbi.osu.edu Mathematical Biosciences Institute, The Ohio State University
Wright, Geraldine wright.572@osu.edu School of Biology, Newcastle University
Xu, Haiyan haiyan@stat.ohio-state.edu Department of Statistics, The Ohio State University
Yin, Lijie yin.39@osu.edu Department of Pathology, The Ohio State University
Zacharaki, Evangelia ezachar@biosim.ntua.gr
Zhang, Xuan zhangx@cse.ohio-state.edu Computer Science & Engineering, The Ohio State University
Zhao, Hongyu hz27@email.med.yale.edu EPID/Public Health & Genetics, Yale University
Zhou, Jianhui zhou@uiuc.edu Department of Statistics, University of Illinois at Urbana-Champaign
Zhou, Jin jzhou@mbi.osu.edu Department of Mathematics, Northern Michigan University
Zhou, Penghui zhou.154@osu.edu Department of Molecular Genetics, The Ohio State University
Zinner, Bertram zinnebe@auburn.edu Mathematics and Statistics, Auburn University
Opportunities, Challenges, and Issues Posed by Massive Multiple Inference in High Dimensional Biology

In microarray research and other types of high dimensional biological research, multiple testing issues arise in new ways, on new orders of magnitude, and present new opportunities and challenges. These features call for new ways of thinking about some aspects of inferential testing. Herein, I will discuss methods for multiple testing control that involve capitalizing on, rather than penalizing oneself for, the large number of tests conducted through mixture modeling procedures. A novel extension to power and sample size estimation will be presented. Finally, issues involving composite hypothesis testing involving both union-intersection testing and intersection-union testing will be presented.

Inferring Regulatory Circuitry through Model-based Analysis of mRNA Expression and ChIP Data

Functional genomics studies are yielding information about regulatory processes in the biological cell at an unprecedented scale. Not only have DNA microarrays been used to measure, for all genes simultaneously, the mRNA abundance in a variety of conditions, but the level of occupancy of their promoter regions by a large number of transcription factors has also been determined. The challenge is to extract useful information about the global regulatory network from these data. We present an integrative modeling framework that combines libraries of expression and occupancy data to define the functional targets of each transcription factor: multivariate regression analysis is used to infer transcription factor activity levels for each condition, and the correlation between the mRNA expression profile of an individual gene and the inferred activity profile of a transcription factor is interpreted as regulatory coupling strength. Applying our method to the yeast S. cerevisiae, we find that on average 58% of the genes whose promoter region is bound by a transcription factor are true regulatory targets. Moreover, our results enable us to assign directionality to transcription factors controlling divergently transcribed genes that share the same promoter region.
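
The two-stage idea in this abstract (regress expression on occupancy to get activities, then correlate each gene with each activity profile) can be written down roughly as follows. This is only a sketch: the symbols E, O, A, and r are notation assumed here for illustration and do not come from the abstract, and the exact form of the authors' model may differ.

```latex
% Assumed notation (not from the abstract):
%   E_{gc}: log-expression of gene g in condition c
%   O_{gf}: promoter occupancy of gene g by transcription factor f
%   A_{fc}: inferred activity of factor f in condition c
E_{gc} = A_{0c} + \sum_{f} O_{gf}\, A_{fc} + \varepsilon_{gc}
\qquad \text{(fit across genes $g$, separately for each condition $c$)}

r_{gf} = \operatorname{corr}_{c}\bigl(E_{gc},\, A_{fc}\bigr)
\qquad \text{(coupling strength between gene $g$ and factor $f$)}
```

Under this reading, a gene bound by factor f would be called a functional target when r_{gf} is large in magnitude.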

Efficient Estimation of Gene-Environment Interactions in Case-Control Studies with Quantitative Gene Information

A recent paper by Morley, et al. (Nature, 2004, 743-747) used microarray data to perform gene-at-a-time linkage analysis in a family-based study. The phenotype in question was the gene-expression level. They conclude that the human gene expression phenotype "is a trait like many others ... amenable to genetic analysis". The discussion by N. Cox (Nature, 2004, 733-734) notes this point and states that "To veterans of linkage-mapping studies on complex human phenotypes, this may seem to be an understatement on a par with Watson and Crick's 'it has not escaped our notice'."


While not even remotely as earth-shattering, this paper works in the same type of context. We consider the analysis of standard and family-based case-control studies of a disease. We are particularly interested in gene-environment interactions and main effects when the gene is quantitative, as it would be for a gene-expression level from a microarray. In some cases, it is reasonable to assume that the gene-expression level and the environmental covariates are independent in the population (marginally). Under this assumption, and if the distribution of the gene-expression level is modeled parametrically while that of the environmental factors is not modeled, we construct a semiparametric profile likelihood, profiled over the nonparametric distribution of the environmental factors. We show that this profile likelihood acts as if it were a proper likelihood. In addition, it leads to estimation of main effects and interactions that is more efficient than a standard logistic regression approach. If the marginal probability in the population of case status is known, even greater gains in efficiency are possible. We describe potential parametric models for the gene-expression levels. It appears that the profile likelihood approach also allows for single-index modeling of multiple gene-expression levels.


Note: Joint work with Nilanjan Chatterjee of the National Cancer Institute.

Significant Analysis Using P-values for Multiple Hypotheses Testing in Microarray Experiments

An empirical Bayes adjustment to multiple t-tests has been shown to improve the sensitivity of the overall procedure (Datta et al, Bioinformatics, 2004). Here we propose an empirical Bayes adjustment to the p-values rather than the test statistics. As a result, it is applicable to other types of multiple tests (e.g., F, chi-squared) as well. Both parametric and nonparametric versions of the empirical Bayes adjustment are considered. Thus, each p-value, in turn, borrows evidence from other p-values across the tests. A new set of accept/reject decisions is reached for the null hypotheses using the empirical Bayes adjusted p-values through a resampling based step-down p-value calculation that controls the overall (familywise) type I error rate. The new procedure is shown to produce further improvement in sensitivity in a number of examples.

A Bayesian Mixture Model for Differential Gene Expression

Model-based inference is proposed for differential gene expression, using a non-parametric Bayesian probability model for the distribution of gene intensities under different conditions. The probability model is essentially a mixture of normals. Specifically, it is a variation of traditional Dirichlet process (DP) mixture models. The model includes an additional mixture corresponding to the assumption that transcription levels arise as a mixture over non-differentially and differentially expressed genes.
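
The "additional mixture" over non-differentially and differentially expressed genes can be pictured with the generic two-group form below. This is a sketch in assumed notation (x, p_0, f_0, f_1 are not from the abstract); in the proposed model each component density would itself be a nonparametric mixture of normals.

```latex
% Assumed notation (not from the abstract):
%   x   : a per-gene summary of the observed intensities
%   p_0 : proportion of non-differentially expressed genes
%   f_0 : density of x for non-differentially expressed genes
%   f_1 : density of x for differentially expressed genes
f(x) = p_0\, f_0(x) + (1 - p_0)\, f_1(x)

\Pr(\text{differentially expressed} \mid x)
  = \frac{(1 - p_0)\, f_1(x)}{f(x)}
```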


Inference proceeds as in DP mixture models, with an additional set of latent indicators to resolve this additional mixture. The use of fully model-based inference mitigates some of the necessary limitations of the empirical Bayes method (Efron, JASA 2001). However, the increased generality of our method comes at a price: computation is not as straightforward as in the empirical Bayes scheme. But we argue that inference is no more difficult than posterior simulation in a traditional nonparametric mixture of normal models. We illustrate the proposed method in two examples, including a simulation study and a microarray experiment to screen for genes with differential expression in colon cancer versus normal tissue.


We will illustrate the ease of making joint inference about a sub-group of genes being differentially expressed and of estimating the total number of significantly expressing genes. Further, we also elaborate on how the control of false positive rates can be automatically incorporated into this approach.


Collaborators: Peter Mueller and Feng Tang.

Strategies for Quantifying GeneChip Expression for Large Studies

Studies of gene expression using Affymetrix GeneChips have become standard in several areas of clinical research, particularly cancer. Of the expression measures that have been proposed to quantify GeneChip expression, multiarray-based measures have been shown to perform well. As clinical gene expression studies increase in size, however, utilizing multiarray strategies is more challenging in terms of computing memory requirements and time. I will report on results of a study examining properties and tradeoffs of single and multiarray strategies for quantifying expression in large studies.

Genetical Genomics Analysis to Infer Gene Regulatory Networks

Genetic analysis of gene expression in a segregating population, which is expression profiled and genotyped at DNA markers throughout the genome, can reveal regulatory networks of polymorphic genes. We propose an analysis strategy with several steps: (1) genome-wide QTL analysis of all expression profiles to identify eQTL confidence regions, followed by fine-mapping of identified eQTL; (2) identification of regulatory candidate genes in each eQTL region; (3) correlation analysis of the expression profiles of the candidates in any eQTL region with the gene affected by the eQTL to reduce the number of candidates; (4) drawing directional links from retained regulatory candidate genes to genes affected by the eQTL and joining links to form networks; and (5) statistical validation and refinement of the inferred network structure via structural equation modeling. Here, we apply an initial implementation of this strategy to a segregating yeast population. In 65%, 7%, and 28% of the identified eQTL regions, a single candidate regulatory gene, no gene, or several (at most six) genes were retained in step (3), respectively. Overall, 768 putative regulatory links were retained, 331 of which are the strongest candidate links, as they were retained in the expression correlation analysis and also were located within or near an eQTL sub-region identified by a multi-marker analysis separating multiple linked QTL. Biological processes were statistically over-represented in highly interconnected sub-networks of genes. Transcription factors had few connections. To statistically validate the reconstructed networks and to compare alternative structures, we have begun investigating Structural Equation Modeling by simulating artificial data under nonlinear models of gene regulation with alternative network topologies. The asymptotic chi-square distribution of the model fit criterion, which is based on the assumption of multivariate normality, is well approximated in the simulated data, and in an initial comparison of alternative structures of small networks the correct models, or similar models, had the smallest Bayesian Information Criterion.

Statistically Designing Microarray Experiments and Analyzing Gene Expression Data in a Decision-Making Process

Microarray experiments are no longer exploratory in nature; they are fast becoming parts of well-defined decision processes. For example, they may be used to select genes toward the fabrication of prognostic chips, or the elimination of patient subpopulations prone to serious adverse events. So we believe microarray experimental design and gene expression analysis should be viewed as integral parts of specific decision-making processes.


As an example, in choosing genes to fabricate a prognostic microarray, one should keep in mind microarray manufacturing economics, as well as FDA regulatory requirements. In this presentation, I will report on a joint project to design a microarray experiment statistically with randomization, replication, and blocking which will allow assessment of the sensitivity and specificity of genetic profiling prognostic chips.


As part of the decision process, we believe the analysis of gene expressions from a microarray experiment should control a statistical error rate appropriate for the purpose of the experiment. Controlling familywise error rate vs. false discovery rate will be discussed in the context of specific applications. The principle of partition multiple testing, a form of conditional frequentist inference, will be given. We will also describe the conditions under which stepwise testing becomes a valid computational shortcut to partition testing. Subtleties of these conditions have not always been appreciated in bioinformatics.
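
To make the familywise error rate versus false discovery rate contrast concrete, here is a minimal sketch of two standard textbook procedures: Holm's step-down method (familywise control) and the Benjamini-Hochberg step-up method (false discovery rate control). These are not the partition testing procedures of the talk; the function names and the example p-values are made up for illustration.

```python
def holm_reject(pvalues, alpha=0.05):
    """Holm step-down procedure: controls the familywise error rate."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # Compare the (rank + 1)-th smallest p-value with alpha / (m - rank).
        if pvalues[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # step-down: stop at the first non-rejection
    return reject


def bh_reject(pvalues, alpha=0.05):
    """Benjamini-Hochberg step-up procedure: controls the false discovery rate."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest rank k such that p_(k) <= k * alpha / m.
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * alpha / m:
            cutoff = rank
    reject = [False] * m
    for i in order[:cutoff]:
        reject[i] = True
    return reject


# Hypothetical p-values for six genes.
pvals = [0.001, 0.008, 0.012, 0.030, 0.20, 0.74]
holm = holm_reject(pvals)
bh = bh_reject(pvals)
print(sum(holm), "rejections under Holm;", sum(bh), "under Benjamini-Hochberg")
```

On these made-up p-values Holm rejects three hypotheses and Benjamini-Hochberg rejects four, which illustrates why the choice of error rate should follow from the purpose of the experiment.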


Work done in collaboration with Jane Chang, Tao Wang, and Yifan Huang.

Designing Estimators for Low Level Expression Analysis

The analysis of gene expression using oligonucleotide arrays commonly requires estimating the expression level of a transcript using information from multiple probes. Many transcripts are expressed at such low levels that the nonspecific hybridization is a significant proportion of the observed probe intensity, and so it is an interesting problem to design estimators that function well on transcripts that have concentrations near or at zero. Working from simple assumptions about the behavior of probes, PLIER is an M-estimator model-based framework for finding expression estimates that is designed to handle near-background probe intensities well with minimal positive bias to the results. While the estimates from PLIER are by design not variance stabilized, PLIER shows good performance at detecting differential change, and can be variance stabilized by standard means.

Adjusting for the Batch Effect: An Empirical Bayes Approach to Combining Microarray Data from Multiple Sources

Meta-analyses are often important tools for the generation of scientific hypotheses and the discovery of new science. In this light, there is increasing interest in the feasibility of conducting meta-analytic research using the existing and ever increasing pool of gene microarray data sets. This research could play an important role in the advancement of biological research, possibly defraying the time, expense and other difficulties associated with collecting and analyzing such data.


However, there are many problems associated with conducting meta-analyses on microarray data. In addition to natural biological variation across batches of such data, data are often subject to differences attributable to inconsistent data collection methods, such as different experimental conditions, researchers, or even lab protocols. Often the non-biological variation across batches of data is greater than the variation between the tissue types or treatment groups of interest. As a result, comparability is often suspect across studies.


Here we describe an empirical Bayes method of adjusting data for batch effects. This method consists of individual adjustments for each gene within each batch. The Bayesian framework shrinks gene-wise batch adjustments by pooling information across genes within batches, adjusting for gene-by-batch interactions while respecting the systematic nature of differences in expression estimates across batches. This method is compared with univariate gene-wise location-scale adjustments and with other methods present in the literature.
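
For orientation, the univariate gene-wise location-scale adjustment used as a comparator can be sketched as below for a single gene. This is an illustrative version under stated assumptions, not the authors' implementation: there are no treatment covariates, every batch has at least two samples with nonzero variance, and the function name and data are hypothetical. The empirical Bayes method described above would additionally shrink the per-batch estimates by pooling information across genes.

```python
from math import sqrt
from statistics import mean, variance


def location_scale_adjust(values, batches):
    """Standardize one gene within each batch, then restore the gene's
    overall mean and pooled within-batch standard deviation."""
    groups = {}
    for x, b in zip(values, batches):
        groups.setdefault(b, []).append(x)

    grand_mean = mean(values)
    # Pooled within-batch variance, weighted by degrees of freedom.
    dof = sum(len(g) - 1 for g in groups.values())
    pooled_sd = sqrt(sum((len(g) - 1) * variance(g) for g in groups.values()) / dof)

    # Per-batch location (mean) and scale (standard deviation).
    stats = {b: (mean(g), sqrt(variance(g))) for b, g in groups.items()}

    adjusted = []
    for x, b in zip(values, batches):
        batch_mean, batch_sd = stats[b]
        adjusted.append(grand_mean + pooled_sd * (x - batch_mean) / batch_sd)
    return adjusted


# Hypothetical expression values for one gene measured in two batches.
values = [1.0, 2.0, 3.0, 11.0, 13.0, 15.0]
batches = ["A", "A", "A", "B", "B", "B"]
adjusted = location_scale_adjust(values, batches)
print(adjusted)
```

After adjustment both batches share the same mean and spread for this gene. With small batches the per-batch mean and variance are noisy, which is the situation where shrinkage across genes is meant to help.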


The empirical Bayes adjustments are very robust in the presence of outlying observations, even when batch sample sizes are small (n < 15). Additionally, the empirical Bayes method often improves the consistency of within-batch fold changes for treatment effects, as compared with the unadjusted data.


Note: Joint work with Cheng Li, Department of Biostatistics, Harvard School of Public Health.

Comparison of Affymetrix and Quantitative rtPCR Measurements of Relative Gene Expression

In the spirit of empirical evaluation, we compare measurements of relative gene expression from quantitative rtPCR to measurements from Affymetrix gene chips. Our particular interest is how different methodologies for processing Affy data influence the agreement between Affy and qrtPCR measurements.


Note: Joint work with Dick Beyer, Noel Hudson, Nancy Linford, Li-Xuan Qin.

From Spot to Biology: Challenges in microarray data analysis

I will introduce our variational Bayesian implementation of Independent Component Analysis and report on successes and problems experienced in specific microarray data analysis applications. I will further discuss how the complex experimental process underlying microarray experiments affects data, and why 'low level' data analysis is still a major challenge in the field that often limits the detection of subtle interactions, particularly in experimental designs of typical size.

Complex Systems to Understand Complex Traits: Beyond Reagent Driven Science

The reconstruction of genetic networks in mammalian systems is one of the primary goals in biological research, especially as such reconstructions relate to elucidating not only common, polygenic human diseases, but living systems more generally. Here I present a statistical procedure for inferring causal relationships between gene expression traits and more classic clinical traits, including complex disease traits. This procedure has been generalized to the gene network reconstruction problem, where naturally occurring genetic variations in segregating mouse populations are used as a source of perturbations to elucidate tissue-specific gene networks. Differences in the extent of genetic control between genders and among four different tissues are highlighted. I also demonstrate that the networks derived from expression data in segregating mouse populations using the novel network reconstruction algorithm are able to capture causal associations between genes that result in increased predictive power, compared to more classically reconstructed networks derived from the same data. This approach to causal inference in large segregating mouse populations over multiple tissues not only elucidates fundamental aspects of transcriptional control, it also allows for the objective identification of key drivers of common human diseases.

Differential Co-Expression of Genes

Gene expression is a tightly regulated process, crucial for the proper functioning of a cell. In microarray data, coregulation is reflected by strong correlations between expression levels. Molecular disease mechanisms typically constitute abnormalities in the coregulation of genes. Resulting changes in expression profiles help identify disease-related genes, and in several cases facilitate improved diagnosis and even prognosis of disease outcome. Alteration of gene regulation often results in up- or down-regulated genes.


Common analysis strategies look for these differentially expressed genes, or for genes that play an important role in some supervised learning algorithm applied to the data. We opt to use a complementary approach. Not all changes in coregulation are manifested by up- or down-regulation of individual genes. Alterations of the correlation structure of expression should be investigated as well.


We address the problem of detecting sets of differentially co-expressed genes in two phenotypically distinct sets of expression profiles. We introduce a score for differential coexpression, and suggest a computationally efficient algorithm for finding high scoring sets of genes. The use of our novel method is demonstrated in the context of simulations and on real expression data from a clinical study.
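
The abstract does not define its score, so the following is only a hypothetical stand-in to show what a differential co-expression score for a given gene set might look like: the difference between the two groups in mean absolute pairwise Pearson correlation. All function names and data are invented for illustration, and the search for high-scoring gene sets is not covered.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mean_abs_correlation(profiles, genes):
    """Average |correlation| over all gene pairs in `genes`.
    `profiles` maps a gene name to its expression values across samples."""
    pairs = list(combinations(genes, 2))
    return sum(abs(pearson(profiles[g], profiles[h])) for g, h in pairs) / len(pairs)


def diff_coexpression_score(group1, group2, genes):
    """Hypothetical score: co-expression in group1 minus co-expression in group2."""
    return mean_abs_correlation(group1, genes) - mean_abs_correlation(group2, genes)


# Made-up data: three genes tightly coupled in one phenotype, uncoupled in the other.
group1 = {"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [4, 3, 2, 1]}
group2 = {"a": [1, 2, 3, 4], "b": [1, -1, -1, 1], "c": [-1, 3, -3, 1]}
score = diff_coexpression_score(group1, group2, ["a", "b", "c"])
print(score)
```

Note that gene "a" has the same values in both groups, so it would not be flagged as differentially expressed, yet the gene set scores highly because its correlation structure differs between the groups.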

Some Problems in the Statistical Analysis of Microarray Data

Integrated Statistical Analysis of Gene Expression Data

Recent advances in large-scale RNA expression measurements, DNA-protein interactions, protein-protein interactions and the availability of genome sequences from many organisms have opened the opportunity for massively parallel biological data acquisition and integrated understanding of the genetic networks underlying complex biological phenotypes. Many existing statistical procedures have been proposed to analyze a single data type, e.g. clustering algorithms for microarray data and motif finding methods for sequence data. Different data sources offer different perspectives on the same underlying system, and they can be combined to increase our chance of uncovering underlying biological mechanisms. In this talk, we will describe our attempts to develop a statistical framework to integrate diverse genomics and proteomics information to dissect transcriptional regulatory networks and signal transduction pathways. The developed methods will be illustrated through their applications in yeast. This is joint work with Ning Sun, Liang Chen, Baolin Wu, and Yin Liu.