In microarray research and other types of high-dimensional biological research, multiple testing issues arise in new ways and on new orders of magnitude, presenting new opportunities and challenges. These features call for new ways of thinking about some aspects of inferential testing. Herein, I will discuss methods for multiple testing control that capitalize on, rather than penalize oneself for, the large number of tests conducted, using mixture modeling procedures. A novel extension to power and sample size estimation will be presented. Finally, I will discuss composite hypothesis testing, involving both union-intersection and intersection-union tests.
Functional genomics studies are yielding information about regulatory processes in the biological cell at an unprecedented scale. Not only have DNA microarrays been used to measure, for all genes simultaneously, the mRNA abundance in a variety of conditions, but the level of occupancy of their promoter regions by a large number of transcription factors has also been determined. The challenge is to extract useful information about the global regulatory network from these data. We present an integrative modeling framework that combines libraries of expression and occupancy data to define the functional targets of each transcription factor: multivariate regression analysis is used to infer transcription factor activity levels for each condition, and the correlation between the mRNA expression profile of an individual gene and the inferred activity profile of a transcription factor is interpreted as regulatory coupling strength. Applying our method to the yeast S. cerevisiae, we find that on average 58% of the genes whose promoter region is bound by a transcription factor are true regulatory targets. Moreover, our results enable us to assign directionality to transcription factors controlling divergently transcribed genes that share the same promoter region.
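The two-step scheme described above can be sketched in a few lines of NumPy: solve the multivariate regression E ≈ C A for the activity matrix A, then correlate each gene's expression profile with each inferred activity profile. The toy dimensions, data, and noise level below are illustrative assumptions, not the authors' actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 50 genes, 8 conditions, 3 transcription factors.
n_genes, n_cond, n_tf = 50, 8, 3
occupancy = rng.random((n_genes, n_tf))          # promoter occupancy C (genes x TFs)
activity_true = rng.normal(size=(n_tf, n_cond))  # unknown TF activity A (TFs x conditions)
expression = occupancy @ activity_true + 0.1 * rng.normal(size=(n_genes, n_cond))

# Step 1: multivariate regression E ~ C A, solved by least squares,
# infers one activity profile per transcription factor.
activity_hat, *_ = np.linalg.lstsq(occupancy, expression, rcond=None)

# Step 2: the correlation between a gene's expression profile and an
# inferred activity profile is read as regulatory coupling strength.
coupling_matrix = np.array(
    [[np.corrcoef(expression[g], activity_hat[t])[0, 1] for t in range(n_tf)]
     for g in range(n_genes)]
)
print(coupling_matrix.shape)  # (50, 3)
```

With clean simulated data the inferred activities track the true ones closely; on real data the occupancy matrix is noisy and the regression is the harder step.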
A recent paper by Morley et al. (Nature, 2004, 743-747) used microarray data to perform gene-at-a-time linkage analysis in a family-based study. The phenotype in question was the gene-expression level. They conclude that the human gene expression phenotype "is a trait like many others ... amenable to genetic analysis". The discussion by N. Cox (Nature, 2004, 733-734) notes this point and states that "To veterans of linkage-mapping studies on complex human phenotypes, this may seem to be an understatement on a par with Watson and Crick's 'it has not escaped our notice'."
While not even remotely as earth-shattering, this paper works in the same type of context. We consider the analysis of standard and family-based case-control studies of a disease. We are particularly interested in gene-environment interactions and main effects when the gene is quantitative, as it would be for a gene-expression level from a microarray. In some cases, it is reasonable to assume that the gene-expression level and the environmental covariates are independent in the population (marginally). Under this assumption, and if the distribution of the gene-expression level is modeled parametrically while that of the environmental factors is not modeled, we construct a semiparametric profile likelihood, profiled over the nonparametric distribution of the environmental factors. We show that this profile likelihood acts as if it were a proper likelihood. In addition, it leads to estimation of main effects and interactions that is more efficient than a standard logistic regression approach. If the marginal probability in the population of case status is known, even greater gains in efficiency are possible. We describe potential parametric models for the gene-expression levels. It appears that the profile likelihood approach also allows for single-index modeling of multiple gene-expression levels.
Note: Joint work with Nilanjan Chatterjee of the National Cancer Institute.
An empirical Bayes adjustment to multiple t-tests has been shown to improve the sensitivity of the overall procedure (Datta et al., Bioinformatics, 2004). Here we propose an empirical Bayes adjustment to the p-values rather than to the test statistics. As a result, it is applicable to other types of multiple tests (e.g., F, chi-squared) as well. Both parametric and nonparametric versions of the empirical Bayes adjustment are considered. Thus, each p-value, in turn, borrows evidence from the other p-values across the tests. A new set of accept/reject decisions is reached for each null hypothesis using the empirical Bayes adjusted p-values through a resampling-based step-down p-value calculation that protects the analyst against the overall (familywise) type I error rate. The new procedure is shown to produce further improvement in sensitivity in a number of examples.
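The resampling-based step-down calculation is not spelled out in the abstract; as a minimal deterministic stand-in for step-down familywise error control, here is Holm's step-down adjustment of a vector of p-values (the talk's procedure additionally uses resampling and the empirical Bayes adjustment).

```python
def holm_stepdown(pvals):
    """Holm step-down adjusted p-values controlling the familywise error rate.

    Walk the p-values from smallest to largest, multiply by the number of
    hypotheses still in play, and enforce monotonicity of the adjustments.
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running = 0.0
    for rank, i in enumerate(order):
        running = max(running, (m - rank) * pvals[i])
        adjusted[i] = min(1.0, running)
    return adjusted
```

For example, `holm_stepdown([0.01, 0.04, 0.03, 0.2])` yields adjusted values of about 0.04, 0.09, 0.09, and 0.2, in the original order.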
Model-based inference is proposed for differential gene expression, using a non-parametric Bayesian probability model for the distribution of gene intensities under different conditions. The probability model is essentially a mixture of normals. Specifically, it is a variation of traditional Dirichlet process (DP) mixture models. The model includes an additional mixture corresponding to the assumption that transcription levels arise as a mixture over non-differentially and differentially expressed genes.
Inference proceeds as in DP mixture models, with an additional set of latent indicators to resolve this additional mixture. The use of fully model-based inference mitigates some of the inherent limitations of the empirical Bayes method (Efron, JASA, 2001). However, the increased generality of our method comes at a price: computation is not as straightforward as in the empirical Bayes scheme. But we argue that inference is no more difficult than posterior simulation in a traditional nonparametric mixture of normals model. We illustrate the proposed method in two examples, including a simulation study and a microarray experiment to screen for genes with differential expression in colon cancer versus normal tissue.
We will illustrate the ease of making joint inference about a subgroup of genes being differentially expressed and of estimating the total number of differentially expressed genes. Further, we elaborate on how control of false positive rates can be automatically incorporated into this approach.
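The full DP mixture requires posterior simulation, but the role of the latent differential-expression indicators can be illustrated with a fixed two-component normal mixture: the posterior probability that a gene's statistic comes from the differentially expressed component. All parameter values below are hypothetical, not estimates from the talk's model.

```python
import math

def posterior_diff_prob(z, pi0=0.9, mu1=2.0, sigma=1.0):
    """Posterior probability that statistic z arises from the differentially
    expressed component of the mixture pi0*N(0, sigma^2) + (1-pi0)*N(mu1, sigma^2).

    In the fully Bayesian model, pi0, mu1, and the component shapes are
    themselves random; here they are fixed for illustration.
    """
    def normal_pdf(x, mu):
        return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

    f_null = pi0 * normal_pdf(z, 0.0)
    f_alt = (1 - pi0) * normal_pdf(z, mu1)
    return f_alt / (f_null + f_alt)
```

Summing these probabilities over a gene subgroup gives the expected number of differentially expressed genes in that subgroup, which is the kind of joint inference the abstract refers to.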
Collaborators: Peter Mueller and Feng Tang.
Studies of gene expression using Affymetrix GeneChips have become standard in several areas of clinical research, particularly cancer. Of the expression measures that have been proposed to quantify GeneChip expression, multiarray-based measures have been shown to perform well. As clinical gene expression studies increase in size, however, utilizing multiarray strategies becomes more challenging in terms of computing memory requirements and time. I will report results of a study examining the properties and tradeoffs of single-array and multiarray strategies for quantifying expression in large studies.
Genetic analysis of gene expression in a segregating population, which is expression profiled and genotyped at DNA markers throughout the genome, can reveal regulatory networks of polymorphic genes. We propose an analysis strategy with several steps: (1) Genome-wide QTL analysis of all expression profiles to identify eQTL confidence regions, followed by fine-mapping of identified eQTL; (2) identification of regulatory candidate genes in each eQTL region; (3) correlation analysis of the expression profiles of the candidates in any eQTL region with the gene affected by the eQTL to reduce the number of candidates; (4) drawing directional links from retained regulatory candidate genes to genes affected by the eQTL and joining links to form networks, and (5) statistical validation and refinement of the inferred network structure via structural equation modeling. Here, we apply an initial implementation of this strategy to a segregating yeast population. In 65%, 7%, and 28% of the identified eQTL regions, a single candidate regulatory gene, no gene, or several (at most six) genes were retained in step (3), respectively. Overall, 768 putative regulatory links were retained, 331 of which are the strongest candidate links, as they were retained in the expression correlation analysis and also were located within or near an eQTL sub-region identified by a multi-marker analysis separating multiple linked QTL. Biological processes were statistically over-represented in highly interconnected sub-networks of genes. Transcription factors had few connections. To statistically validate the reconstructed networks and to compare alternative structures, we have begun investigating Structural Equation Modeling by simulating artificial data under nonlinear models of gene regulation with alternative network topologies. 
The asymptotic chi-square distribution of the model fit criterion, which is based on the assumption of multivariate normality, is well approximated in the simulated data, and in an initial comparison of alternative structures of small networks the correct models, or similar models, had the smallest Bayesian Information Criterion.
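Step (3) of the strategy above, correlating candidate-gene expression profiles in an eQTL region with the profile of the gene affected by the eQTL, can be sketched as follows. The gene names, data, and cutoff `r_min` are invented for illustration; the actual analysis uses genome-wide profiles from the segregating yeast population.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: expression across 40 segregants for the eQTL-affected
# gene and for three candidate regulators located in the eQTL region.
n_segregants = 40
target = rng.normal(size=n_segregants)
candidates = {
    "YFG1": target * 0.9 + 0.3 * rng.normal(size=n_segregants),  # correlated candidate
    "YFG2": rng.normal(size=n_segregants),                        # unrelated
    "YFG3": rng.normal(size=n_segregants),                        # unrelated
}

def retained(target_profile, candidate_profiles, r_min=0.5):
    """Keep candidates whose |correlation| with the affected gene exceeds r_min."""
    keep = []
    for name, profile in candidate_profiles.items():
        r = np.corrcoef(target_profile, profile)[0, 1]
        if abs(r) >= r_min:
            keep.append(name)
    return keep

kept = retained(target, candidates)
```

On this toy data only the correlated candidate survives, mirroring the 65% of eQTL regions in which a single candidate regulator was retained.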
Microarray experiments are no longer exploratory in nature; they are fast becoming parts of well-defined decision processes. For example, they may be used to select genes toward the fabrication of prognostic chips, or the elimination of patient subpopulations prone to serious adverse events. We therefore believe microarray experimental design and gene expression analysis should be viewed as integral parts of specific decision-making processes.
As an example, in choosing genes to fabricate a prognostic microarray, one should keep in mind microarray manufacturing economics, as well as FDA regulatory requirements. In this presentation, I will report on a joint project to design a microarray experiment statistically, with randomization, replication, and blocking, which will allow assessment of the sensitivity and specificity of genetic-profiling prognostic chips.
As part of the decision process, we believe the analysis of gene expression from a microarray experiment should control a statistical error rate appropriate to the purpose of the experiment. Controlling the familywise error rate vs. the false discovery rate will be discussed in the context of specific applications. The principle of partition multiple testing, a form of conditional frequentist inference, will be given. We will also describe the conditions under which stepwise testing becomes a valid computational shortcut to partition testing. Subtleties of these conditions have not always been appreciated in bioinformatics.
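As a concrete reminder of the FWER-vs-FDR distinction discussed above, here are the single-step Bonferroni (FWER) and Benjamini-Hochberg step-up (FDR) procedures. Neither is the partition testing principle itself, which the talk develops; they are the standard baselines the discussion contrasts.

```python
def bonferroni(pvals, alpha=0.05):
    """Indices rejected under familywise error control: p <= alpha / m."""
    m = len(pvals)
    return [i for i, p in enumerate(pvals) if p <= alpha / m]

def benjamini_hochberg(pvals, q=0.05):
    """Indices rejected by the BH step-up procedure controlling FDR at level q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0  # largest rank whose p-value clears its step-up threshold
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * q / m:
            k = rank
    return sorted(order[:k])
```

On `[0.001, 0.012, 0.02, 0.038, 0.3]`, Bonferroni rejects only the first hypothesis while BH rejects the first four, which is the typical trade of strictness for discovery.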
Work done in collaboration with Jane Chang, Tao Wang, and Yifan Huang.
The analysis of gene expression using oligonucleotide arrays commonly requires estimating the expression level of a transcript using information from multiple probes. Many transcripts are expressed at such low levels that nonspecific hybridization is a significant proportion of the observed probe intensity, so it is an interesting problem to design estimators that function well on transcripts with concentrations near or at zero. Working from simple assumptions about the behavior of probes, PLIER is an M-estimation, model-based framework for finding expression estimates that is designed to handle near-background probe intensities well, with minimal positive bias in the results. While the estimates from PLIER are by design not variance-stabilized, PLIER shows good performance at detecting differential change and can be variance-stabilized by standard means.
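PLIER's actual objective is a multiplicative probe-response model; as a minimal illustration of the M-estimation idea of downweighting outlying probes, here is a Huber location estimate computed by iteratively reweighted least squares. This is not PLIER's estimator, only the robustness mechanism in miniature.

```python
def huber_mestimate(values, k=1.345, tol=1e-8, max_iter=100):
    """Huber M-estimate of location via iteratively reweighted least squares.

    Probes with residuals inside the threshold k get full weight; outlying
    probes are downweighted in proportion to how far outside they fall.
    """
    mu = sorted(values)[len(values) // 2]  # start at (an element near) the median
    for _ in range(max_iter):
        weights = [1.0 if abs(v - mu) <= k else k / abs(v - mu) for v in values]
        new_mu = sum(w * v for w, v in zip(weights, values)) / sum(weights)
        if abs(new_mu - mu) < tol:
            break
        mu = new_mu
    return mu
```

For probe intensities `[10, 11, 9, 10.5, 100]`, the plain mean is pulled to 28.1 by the single aberrant probe, while the M-estimate stays near 10.5.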
Meta-analyses are often important tools for the generation of scientific hypotheses and the discovery of new science. In this light, there is increasing interest in the feasibility of conducting meta-analytic research using the existing and ever-increasing pool of gene microarray data sets. Such research could play an important role in the advancement of biological research, possibly defraying the time, expense, and other difficulties associated with collecting and analyzing such data.
However, there are many problems associated with conducting meta-analyses on microarray data. In addition to natural biological variation across batches of such data, data are often subject to differences attributable to inconsistent data collection methods, such as different experimental conditions, researchers, or even lab protocols. Often the non-biological variation across batches of data is greater than the variation between the tissue types or treatment groups of interest. As a result, comparability is often suspect across studies.
Here we describe an empirical Bayes method for adjusting data for batch effects. The method consists of individual adjustments for each gene within each batch. The Bayesian framework shrinks the gene-wise batch adjustments by pooling information across genes within batches, adjusting for gene-by-batch interactions while respecting the systematic nature of differences in expression estimates across batches. This method is compared with univariate gene-wise location-scale adjustments and with other methods in the literature.
The empirical Bayes adjustments are very robust in the presence of outlying observations, even when batch sample sizes are small (n < 15). Additionally, the empirical Bayes method often improves the consistency of within-batch fold changes for treatment effects, as compared with the unadjusted data.
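As a toy version of shrinking gene-wise batch adjustments by pooling across genes, the sketch below shrinks per-gene batch means toward their across-gene average with a fixed weight before subtracting them. The real empirical Bayes method estimates the amount of shrinkage from the data and also adjusts scale; the fixed `shrink` weight here is an illustrative simplification.

```python
import numpy as np

def shrink_batch_means(batch_data, shrink=0.5):
    """Location-only batch adjustment with fixed shrinkage.

    batch_data: genes x samples array for one batch, assumed already
    centered on grand means estimated from all batches. Per-gene batch
    effects are shrunk toward the across-gene average effect, which keeps
    a small batch from producing wild gene-wise corrections.
    """
    gene_effects = batch_data.mean(axis=1)   # gene-wise batch effect estimates
    pooled = gene_effects.mean()             # across-gene (pooled) batch effect
    shrunk = shrink * pooled + (1 - shrink) * gene_effects
    return batch_data - shrunk[:, None]
```

When every gene shares the same batch shift, the pooled and gene-wise estimates agree and the adjustment removes the shift exactly; the shrinkage matters precisely when small within-batch sample sizes make the gene-wise estimates noisy.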
Note: Joint work with Cheng Li, Department of Biostatistics, Harvard School of Public Health.
In the spirit of empirical evaluation, we compare measurements of relative gene expression from quantitative RT-PCR to measurements from Affymetrix® GeneChips. Our particular interest is how different methodologies for processing the Affymetrix data influence the agreement between the GeneChip and quantitative RT-PCR measurements.
Note: Joint work with Dick Beyer, Noel Hudson, Nancy Linford, Li-Xuan Qin.
I will introduce our variational Bayesian implementation of Independent Component Analysis and report on successes and problems experienced in specific microarray data analysis applications. I will further discuss how the complex experimental process underlying microarray experiments affects the data, and why 'low-level' data analysis remains a major challenge in the field, one that often limits the detection of subtle interactions, particularly in experimental designs of typical size.
The reconstruction of genetic networks in mammalian systems is one of the primary goals in biological research, especially as such reconstructions relate to elucidating not only common, polygenic human diseases, but living systems more generally. Here I present a statistical procedure for inferring causal relationships between gene expression traits and more classic clinical traits, including complex disease traits. This procedure has been generalized to the gene network reconstruction problem, where naturally occurring genetic variations in segregating mouse populations are used as a source of perturbations to elucidate tissue-specific gene networks. Differences in the extent of genetic control between genders and among four different tissues are highlighted. I also demonstrate that the networks derived from expression data in segregating mouse populations using the novel network reconstruction algorithm are able to capture causal associations between genes that result in increased predictive power, compared to more classically reconstructed networks derived from the same data. This approach to causal inference in large segregating mouse populations over multiple tissues not only elucidates fundamental aspects of transcriptional control, it also allows for the objective identification of key drivers of common human diseases.
Gene expression is a tightly regulated process, crucial for the proper functioning of a cell. In microarray data, coregulation is reflected by strong correlations between expression profiles. Molecular disease mechanisms typically constitute abnormalities in the coregulation of genes. The resulting changes in expression profiles help identify disease-related genes and, in several cases, facilitate improved diagnosis and even prognosis of disease outcome. Alteration of gene regulation often results in up- or down-regulated genes.
Common analysis strategies look for these differentially expressed genes, or for genes that play an important role in some supervised learning algorithm applied to the data. We opt for a complementary approach: not all changes in coregulation are manifested as up- or down-regulation of individual genes, so alterations in the correlation structure of expression should be investigated as well.
We address the problem of detecting sets of differentially co-expressed genes in two phenotypically distinct sets of expression profiles. We introduce a score for differential coexpression, and suggest a computationally efficient algorithm for finding high scoring sets of genes. The use of our novel method is demonstrated in the context of simulations and on real expression data from a clinical study.
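A simple instance of a differential coexpression score, the difference in mean absolute pairwise correlation of a gene set between two phenotype groups, can be computed as follows. Both the data and this particular score are illustrative; the abstract does not specify the authors' statistic or their search algorithm for high-scoring sets.

```python
import numpy as np

def diff_coexpression_score(expr_a, expr_b, genes):
    """Difference in mean absolute pairwise correlation of a gene set
    between two groups of expression profiles (genes x samples arrays)."""
    def mean_abs_corr(expr):
        corr = np.corrcoef(expr[genes])
        upper = np.triu_indices_from(corr, k=1)   # each gene pair once
        return float(np.abs(corr[upper]).mean())
    return mean_abs_corr(expr_a) - mean_abs_corr(expr_b)

rng = np.random.default_rng(2)
n_samples = 30
base = rng.normal(size=n_samples)
# Group A: three genes tightly co-expressed; group B: the same genes uncorrelated.
expr_a = np.vstack([base + 0.05 * rng.normal(size=n_samples) for _ in range(3)])
expr_b = rng.normal(size=(3, n_samples))
score = diff_coexpression_score(expr_a, expr_b, [0, 1, 2])
```

A large positive score flags a gene set that is coherently coregulated in one phenotype but not the other, which is exactly the signal missed by gene-at-a-time differential expression tests.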
Recent advances in large-scale RNA expression measurements, DNA-protein interactions, and protein-protein interactions, together with the availability of genome sequences from many organisms, have opened the opportunity for massively parallel biological data acquisition and an integrated understanding of the genetic networks underlying complex biological phenotypes. Many existing statistical procedures have been proposed to analyze a single data type, e.g., clustering algorithms for microarray data and motif-finding methods for sequence data. Different data sources offer different perspectives on the same underlying system, and they can be combined to increase our chance of uncovering underlying biological mechanisms. In this talk, we will describe our attempts to develop a statistical framework to integrate diverse genomics and proteomics information to dissect transcriptional regulatory networks and signal transduction pathways. The developed methods will be illustrated through their applications in yeast. This is joint work with Ning Sun, Liang Chen, Baolin Wu, and Yin Liu.