Workshop 1 Abstracts and Lecture Materials:
Author: David B. Allison, Department of Biostatistics, University
of Alabama at Birmingham
Title: Opportunities, Challenges, and Issues Posed by Massive Multiple
Inference in High Dimensional Biology
Streaming Video: Real
Media
In microarray research and other types of high dimensional biological
research, multiple testing issues arise in new ways, on new orders
of magnitude, and present new opportunities and challenges. These
features call for new ways of thinking about some aspects of inferential
testing. Herein, I will discuss methods for multiple testing control
that involve capitalizing on, rather than penalizing oneself for,
the large number of tests conducted through mixture modeling procedures.
A novel extension to power and sample size estimation will be presented.
Finally, issues in composite hypothesis testing, covering both
union-intersection testing and intersection-union testing, will
be presented.
Author: Harmen Bussemaker, Department of Biological Sciences, Center
for Computational Biology and Bioinformatics, Columbia University
Title: Inferring regulatory circuitry through model-based analysis
of mRNA expression and ChIP data
Presentation Materials: PDF
Streaming Video: Real
Media
Functional genomics studies are yielding information about regulatory
processes in the biological cell at an unprecedented scale. Not
only have DNA microarrays been used to measure, for all genes simultaneously,
the mRNA abundance in a variety of conditions, but the level of
occupancy of their promoter regions by a large number of transcription
factors has also been determined. The challenge is to extract useful
information about the global regulatory network from these data.
We present an integrative modeling framework that combines libraries
of expression and occupancy data to define the functional targets
of each transcription factor: Multivariate regression analysis is
used to infer transcription factor activity levels for each condition,
and the correlation between the mRNA expression profile of an individual
gene and the inferred activity profile of a transcription factor
is interpreted as regulatory coupling strength. Applying our method
to the yeast S. cerevisiae, we find that on average 58% of the
genes whose promoter region is bound by a transcription factor are
true regulatory targets. Moreover, our results enable us to assign
directionality to transcription factors controlling divergently
transcribed genes that share the same promoter region.
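The two steps described in this abstract (regression to infer activities, then correlation as coupling strength) can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' implementation: the function names `infer_activities` and `coupling_strength`, the array shapes, and the use of ordinary least squares without an intercept are all assumptions made for the example.

```python
import numpy as np

def infer_activities(expression, occupancy):
    """Least-squares transcription factor activities for each condition.

    expression: (genes, conditions) array of expression values
    occupancy:  (genes, factors) array of promoter occupancy values
    returns:    (factors, conditions) array of inferred activities
    """
    # Solve occupancy @ activities ~= expression for all conditions at once.
    activities, *_ = np.linalg.lstsq(occupancy, expression, rcond=None)
    return activities

def coupling_strength(expression, activities):
    """Pearson correlation between every gene's expression profile and
    every factor's inferred activity profile; shape (genes, factors)."""
    e = expression - expression.mean(axis=1, keepdims=True)
    a = activities - activities.mean(axis=1, keepdims=True)
    e = e / np.linalg.norm(e, axis=1, keepdims=True)
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    return e @ a.T

# Synthetic example (hypothetical sizes: 200 genes, 3 factors, 12 conditions).
rng = np.random.default_rng(0)
occupancy = rng.random((200, 3))
true_activities = rng.normal(size=(3, 12))
expression = occupancy @ true_activities + 0.01 * rng.normal(size=(200, 12))

activities = infer_activities(expression, occupancy)
coupling = coupling_strength(expression, activities)
```

In this sketch a gene bound by a factor would be called a functional target only if its row of `coupling` shows a strong correlation with that factor; the threshold for "strong" is left open here.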
Author: Raymond J. Carroll, Department of Statistics, Texas A&M
University
Title: Efficient Estimation of Gene-Environment Interactions in
Case-Control Studies with Quantitative Gene Information
Presentation Materials: PPT
Streaming Video: Real
Media
A recent paper by Morley, et al. (Nature, 2004, 743-747) used microarray
data to perform gene-at-a-time linkage analysis in a family-based
study. The phenotype in question was the gene-expression level.
They conclude that the human gene expression phenotype "is a
trait like many others ... amenable to genetic analysis". The
discussion by N. Cox (Nature, 2004, 733-734) notes this point and
states that "To veterans of linkage-mapping studies on complex
human phenotypes, this may seem to be an understatement on a par
with Watson and Crick's 'it has not escaped our notice'".
While not even remotely as earth-shattering, this paper works in
the same type of context. We consider the analysis of standard and
family-based case-control studies of a disease. We are particularly
interested in gene-environment interactions and main effects when
the gene is quantitative, as it would be for a gene-expression level
from a microarray. In some cases, it is reasonable to assume that
the gene-expression level and the environmental covariates are independent
in the population (marginally). Under this assumption, and if the
distribution of the gene-expression level is modeled parametrically
while that of the environmental factors is not modeled, we construct
a semiparametric profile likelihood, profiled over the nonparametric
distribution of the environmental factors. We show that this profile
likelihood acts as if it were a proper likelihood. In addition,
it leads to estimation of main effects and interactions that is
more efficient than a standard logistic regression approach. If
the marginal probability in the population of case status is known,
even greater gains in efficiency are possible. We describe potential
parametric models for the gene-expression levels. It appears that
the profile likelihood approach also allows for single-index modeling
of multiple gene-expression levels.
Note: Joint work with Nilanjan Chatterjee of the National
Cancer Institute.
Author: Susmita Datta, Department of Mathematics and Statistics,
Georgia State University
Title: Significant Analysis Using P-values for Multiple Hypotheses
Testing in Microarray Experiments
Presentation Materials: PDF
Streaming Video: Real
Media
An empirical Bayes adjustment to multiple t-tests has been shown
to improve the sensitivity of the overall procedure (Datta et al,
Bioinformatics, 2004). Here we propose an empirical Bayes adjustment
to the p-values rather than the test statistics. As a result, it
is applicable to other types of multiple tests (e.g., F, chi-squared)
as well. Both parametric and nonparametric versions of the empirical
Bayes adjustment are considered. Thus, each p-value, in turn, borrows
evidence from other p-values across the tests. A new set of accept/reject
decisions is reached for each null hypothesis using the empirical
Bayes adjusted p-values through a resampling-based step-down p-value
calculation that controls the overall (familywise)
type I error rate. The new procedure is shown to produce further
improvement in sensitivity in a number of examples.
Speaker: Kim-Anh Do, Department of Biostatistics & Applied Mathematics,
The University of Texas M.D. Anderson Cancer Center
Collaborators: Peter Mueller and Feng Tang
Title: A Bayesian Mixture Model for Differential Gene Expression
Presentation Materials: PDF
Streaming Video: Real
Media
Model-based inference is proposed for differential gene expression,
using a non-parametric Bayesian probability model for the distribution
of gene intensities under different conditions. The probability
model is essentially a mixture of normals. Specifically, it is a variation
of traditional Dirichlet process (DP) mixture models. The model
includes an additional mixture corresponding to the assumption that
transcription levels arise as a mixture over non-differentially
and differentially expressed genes.
Inference proceeds as in DP mixture models, with an additional set
of latent indicators to resolve this additional mixture. The use
of fully model-based inference mitigates some of the necessary limitations
of the empirical Bayes method (Efron, JASA 2001). However, the increased
generality of our method comes at a price. Computation is not as
straightforward as in the empirical Bayes scheme. But we argue that
inference is no more difficult than posterior simulation in a traditional
nonparametric mixture of normal models. We illustrate the proposed
method in two examples: a simulation study and a microarray
experiment to screen for genes with differential expression in colon
cancer versus normal tissue.
We will illustrate the ease of making joint inference about a sub-group
of genes being differentially expressed and of estimating the total
number of significantly expressing genes. Further, we also elaborate
on how the control of false positive rates can be automatically
incorporated into this approach.
Author: Darlene Goldstein, Institute of Mathematics, EPFL
Title: Strategies for quantifying GeneChip expression for large
studies
Presentation Materials: PDF
Streaming Video: Real
Media
Studies of gene expression using Affymetrix GeneChips have become
standard in several areas of clinical research, particularly cancer.
Of the expression measures that have been proposed to quantify GeneChip
expression, multiarray-based measures have been shown to perform
well. As clinical gene expression studies increase in size, however,
utilizing multiarray strategies is more challenging in terms of
computing memory requirements and time. I will report on results
of a study examining properties and tradeoffs of single and multiarray
strategies for quantifying expression in large studies.
Author: Ina Hoeschele, Virginia Bioinformatics Institute and Department
of Statistics, Virginia Tech
Title: Genetical genomics analysis to infer gene regulatory networks
Genetic analysis of gene expression in a segregating population,
which is expression profiled and genotyped at DNA markers throughout
the genome, can reveal regulatory networks of polymorphic genes.
We propose an analysis strategy with several steps: (1) Genome-wide
QTL analysis of all expression profiles to identify eQTL confidence
regions, followed by fine-mapping of identified eQTL; (2) identification
of regulatory candidate genes in each eQTL region; (3) correlation
analysis of the expression profiles of the candidates in any eQTL
region with the gene affected by the eQTL to reduce the number of
candidates; (4) drawing directional links from retained regulatory
candidate genes to genes affected by the eQTL and joining links
to form networks, and (5) statistical validation and refinement
of the inferred network structure via structural equation modeling.
Here, we apply an initial implementation of this strategy to a segregating
yeast population. In 65%, 7%, and 28% of the identified eQTL regions,
a single candidate regulatory gene, no gene, or several (at most
six) genes were retained in step (3), respectively. Overall, 768
putative regulatory links were retained, 331 of which are the strongest
candidate links, as they were retained in the expression correlation
analysis and also were located within or near an eQTL sub-region
identified by a multi-marker analysis separating multiple linked
QTL. Biological processes were statistically over-represented in
highly interconnected sub-networks of genes. Transcription factors
had few connections. To statistically validate the reconstructed
networks and to compare alternative structures, we have begun investigating
Structural Equation Modeling by simulating artificial data under
nonlinear models of gene regulation with alternative network topologies.
The asymptotic chi-square distribution of the model fit criterion,
which is based on the assumption of multivariate normality, is well
approximated in the simulated data, and in an initial comparison
of alternative structures of small networks the correct models,
or similar models, had the smallest Bayesian Information Criterion.
Speaker: Jason Hsu, Department of Statistics, The Ohio State University
Authors: Jason Hsu, Jane Chang, Tao Wang, and Yifan Huang
Title: Statistically Designing Microarray Experiments and Analyzing
Gene Expression Data in a Decision-Making Process
Microarray experiments are no longer exploratory in nature; they
are fast becoming parts of well-defined decision processes. For example,
they may be used to select genes toward the fabrication of prognostic
chips, or the elimination of patient subpopulations prone to serious
adverse events. So we believe microarray experimental design and
gene expression analysis should be viewed as integral parts of specific
decision-making processes.
As an example, in choosing genes to fabricate a prognostic microarray,
one should keep in mind microarray manufacturing economics, as well
as FDA regulatory requirements. In this presentation, I will report
on a joint project to design a microarray experiment statistically
with randomization, replication, and blocking, which will allow assessment
of the sensitivity and specificity of genetic profiling prognostic
chips.
As part of the decision process, we believe the analysis of gene
expressions from a microarray experiment should control a statistical
error rate appropriate for the purpose of the experiment. Controlling
familywise error rate vs. false discovery rate will be discussed
in the context of specific applications. The principle of partition
multiple testing, a form of conditional frequentist inference, will
be given. We will also describe the conditions under which stepwise
testing becomes a valid computational shortcut to partition testing.
Subtleties of these conditions have not always been appreciated
in bioinformatics.
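The practical difference between controlling the familywise error rate and the false discovery rate can be seen in a small sketch. The code below uses two standard textbook procedures, Holm's step-down method (familywise error rate) and the Benjamini-Hochberg step-up method (false discovery rate); it does not implement the partition testing principle discussed in the talk, and the p-values are made up for illustration.

```python
import numpy as np

def holm_reject(pvals, alpha=0.05):
    """Holm step-down procedure; controls the familywise error rate."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Compare the i-th smallest p-value (i = 0..m-1) with alpha / (m - i).
    passed = p[order] <= alpha / (m - np.arange(m))
    # Step down: stop at the first hypothesis that fails its threshold.
    k = m if passed.all() else int(np.argmin(passed))
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject

def bh_reject(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up procedure; controls the false discovery rate."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Compare the i-th smallest p-value (i = 1..m) with alpha * i / m.
    passed = p[order] <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        # Step up: reject everything up to the largest passing rank.
        k = int(np.nonzero(passed)[0].max()) + 1
        reject[order[:k]] = True
    return reject

# Illustrative p-values (not from any real experiment).
pvals = [0.001, 0.008, 0.012, 0.03, 0.04, 0.2, 0.5, 0.9]
holm = holm_reject(pvals)
bh = bh_reject(pvals)
print(int(holm.sum()), int(bh.sum()))  # prints "1 3"
```

On these p-values the familywise procedure rejects one hypothesis and the false discovery rate procedure rejects three, which is the kind of tradeoff that has to be weighed against the purpose of the experiment.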
Author: Earl Hubbell, Affymetrix
Title: Designing Estimators for Low Level Expression Analysis
Presentation Materials: PPT
Streaming Video: Real
Media
The analysis of gene expression using oligonucleotide arrays commonly
requires estimating the expression level of a transcript using information
from multiple probes. Many transcripts are expressed at such low
levels that the nonspecific hybridization is a significant proportion
of the observed probe intensity, and so it is an interesting problem
to design estimators that function well on transcripts that have
concentrations near or at zero. Working from simple assumptions
about the behavior of probes, PLIER is an M-estimator model-based
framework for finding expression estimates that is designed to handle
near-background probe intensities well with minimal positive bias
to the results. While the estimates from PLIER are by design not
variance stabilized, PLIER shows good performance at detecting differential
change, and can be variance stabilized by standard means.
Author: W. Evan Johnson, Department of Biostatistics, Harvard School
of Public Health
Title: Adjusting for the Batch Effect: An Empirical Bayes Approach
to Combining Microarray Data from Multiple Sources
Presentation Materials: PDF
Streaming Video: Real
Media
Meta-analyses are often important tools for the generation of scientific
hypotheses and the discovery of new science. In this light, there
is increasing interest in the feasibility of conducting
meta-analytic research using the existing and ever increasing pool
of gene microarray data sets. This research could play an important
role in the advancement of biological research, possibly defraying
the time, expense and other difficulties associated with collecting
and analyzing such data.
However, there are many problems associated with conducting meta-analyses
on microarray data. In addition to natural biological variation
across batches of such data, data are often subject to differences
attributable to inconsistent data collection methods, such as different
experimental conditions, researchers, or even lab protocols. Often
the non-biological variation across batches of data is greater than
the variation between the tissue types or treatment groups of interest.
As a result, comparability is often suspect across studies.
Here we describe an empirical Bayes method of adjusting data for
batch effects. This method consists of individual adjustments for
each gene within each batch. The Bayesian framework shrinks gene-wise
batch adjustments by pooling information across genes within batches,
adjusting for gene-by-batch interactions while respecting the systematic
nature of differences in expression estimates across batches. This
method is compared with univariate gene-wise location-scale adjustments
and with other methods present in the literature.
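For orientation, the univariate gene-wise location-scale adjustment used as a comparator can be sketched as below. This is only the simple baseline, not the empirical Bayes method of the abstract: there is no shrinkage across genes and no treatment covariate, so any biological difference confounded with batch would also be removed. The function name `location_scale_adjust` and the synthetic data are assumptions made for the example.

```python
import numpy as np

def location_scale_adjust(data, batches):
    """Gene-wise location-scale batch adjustment.

    data:    (genes, samples) expression matrix
    batches: length-samples array of batch labels
    Each gene is standardized within each batch, then rescaled to the
    gene's pooled within-batch SD and shifted to its overall mean.
    """
    data = np.asarray(data, dtype=float)
    batches = np.asarray(batches)
    labels = np.unique(batches)
    grand_mean = data.mean(axis=1, keepdims=True)

    # Pooled within-batch standard deviation for each gene.
    sum_sq = np.zeros((data.shape[0], 1))
    dof = 0
    for b in labels:
        block = data[:, batches == b]
        centered = block - block.mean(axis=1, keepdims=True)
        sum_sq += (centered ** 2).sum(axis=1, keepdims=True)
        dof += block.shape[1] - 1
    pooled_sd = np.sqrt(sum_sq / dof)

    adjusted = np.empty_like(data)
    for b in labels:
        cols = batches == b
        block = data[:, cols]
        mu = block.mean(axis=1, keepdims=True)
        sd = block.std(axis=1, ddof=1, keepdims=True)
        adjusted[:, cols] = (block - mu) / sd * pooled_sd + grand_mean
    return adjusted

# Synthetic example: the second batch is shifted and rescaled.
rng = np.random.default_rng(1)
data = rng.normal(size=(50, 20))
batches = np.array([0] * 10 + [1] * 10)
data[:, 10:] = data[:, 10:] * 2 + 3
adjusted = location_scale_adjust(data, batches)
```

With small batches the per-gene means and standard deviations in this baseline are noisy, which is the situation where pooling information across genes, as the abstract proposes, would be expected to help.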
The empirical Bayes adjustments are very robust in the presence
of outlying observations, even when batch sample sizes are small
(n<15). Additionally, the empirical Bayes method often improves
the consistency of within-batch fold changes for treatment effects,
as compared with the unadjusted data.
Note: Joint work with Cheng Li, Department of Biostatistics,
Harvard School of Public Health
Speaker: Kathleen Kerr, Department of Biostatistics, University
of Washington
Title: Comparison of Affymetrix and quantitative rtPCR measurements
of relative gene expression
Presentation Materials: PPT
Streaming Video: Real
Media
In the spirit of empirical evaluation, we compare measurements
of relative gene expression from quantitative rtPCR to measurements
from Affymetrix® gene chips. Our particular interest is how different
methodologies for processing Affy data influence the agreement between
Affy and qrtPCR measurements.
Note: Joint work with Dick Beyer, Noel Hudson, Nancy Linford,
Li-Xuan Qin
Author: David Kreil, University of Cambridge
Title: From Spot to Biology: Challenges in microarray data analysis
Streaming Video: Real
Media
I will introduce our variational Bayesian implementation of Independent
Component Analysis and report on successes and problems experienced
in specific microarray data analysis applications. I will further
discuss how the complex experimental process underlying microarray
experiments affects data, and why 'low level' data analysis is still
a major challenge in the field that often limits the detection of
subtle interactions, particularly in experimental designs of typical size.
Author: Eric Schadt, Research Genetics, Rosetta Inpharmatics/Merck
Title: Complex Systems to Understand Complex Traits: Beyond Reagent
Driven Science
The reconstruction of genetic networks in mammalian systems is
one of the primary goals in biological research, especially as such
reconstructions relate to elucidating not only common, polygenic
human diseases, but living systems more generally. Here I present
a statistical procedure for inferring causal relationships between
gene expression traits and more classic clinical traits, including
complex disease traits. This procedure has been generalized to the
gene network reconstruction problem, where naturally occurring genetic
variations in segregating mouse populations are used as a source
of perturbations to elucidate tissue-specific gene networks. Differences
in the extent of genetic control between genders and among four
different tissues are highlighted. I also demonstrate that the networks
derived from expression data in segregating mouse populations using
the novel network reconstruction algorithm are able to capture causal
associations between genes that result in increased predictive power,
compared to more classically reconstructed networks derived from
the same data. This approach to causal inference in large segregating
mouse populations over multiple tissues not only elucidates fundamental
aspects of transcriptional control, it also allows for the objective
identification of key drivers of common human diseases.
Author: Rainer Spang, Computational Diagnostics Group, Max Planck
Institute for Molecular Genetics
Title: Differential Co-Expression of Genes
Presentation Materials: PPT
Streaming Video: Real
Media
Gene expression is a tightly regulated process, crucial for the
proper functioning of a cell. In microarray data, coregulation is
reflected by strong correlations between expression. Molecular disease
mechanisms typically constitute abnormalities in the coregulation
of genes. The resulting changes in expression profiles help identify
disease related genes, and in several cases facilitate improved
diagnosis and even prognosis of disease outcome. Alteration of gene
regulation often results in up or down regulated genes.
Common analysis strategies look for these differentially expressed
genes, or for genes that play an important role in some supervised
learning algorithm applied to the data. We opt to use a complementary
approach. Not all changes in coregulation are manifested by up or
down regulation of individual genes. Alterations of the correlation
structure of expression should be investigated as well.
We address the problem of detecting sets of differentially co-expressed
genes in two phenotypically distinct sets of expression profiles.
We introduce a score for differential coexpression, and suggest
a computationally efficient algorithm for finding high scoring sets
of genes. The use of our novel method is demonstrated in the context
of simulations and on real expression data from a clinical study.
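The idea of scoring a gene set for differential co-expression can be illustrated with a small sketch. The abstract does not define the score, so the one used here (mean absolute pairwise Pearson correlation within the set in one group, minus the same quantity in the other group) is purely an assumption for illustration, as are the function names and the synthetic data. The search algorithm for high-scoring sets is not shown.

```python
import numpy as np

def mean_abs_correlation(profiles):
    """Mean |Pearson r| over all distinct gene pairs.

    profiles: (genes, samples) expression matrix for one group
    """
    corr = np.corrcoef(profiles)
    upper = np.triu_indices_from(corr, k=1)
    return float(np.abs(corr[upper]).mean())

def differential_coexpression(group_a, group_b, gene_set):
    """Hypothetical score: co-expression of gene_set in group A minus group B."""
    idx = np.asarray(gene_set)
    return mean_abs_correlation(group_a[idx]) - mean_abs_correlation(group_b[idx])

# Synthetic example: genes 0-4 share a common driver in group A only.
rng = np.random.default_rng(2)
n_samples = 40
group_a = rng.normal(size=(20, n_samples))
group_b = rng.normal(size=(20, n_samples))
driver = rng.normal(size=n_samples)
group_a[:5] = driver + 0.3 * rng.normal(size=(5, n_samples))

score_set = differential_coexpression(group_a, group_b, [0, 1, 2, 3, 4])
score_background = differential_coexpression(group_a, group_b, [10, 11, 12, 13, 14])
```

In this toy setting the co-regulated set receives a clearly higher score than an arbitrary background set, even though none of its genes need be up or down regulated on average.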
Author: Terry Speed, Department of Statistics, University of California,
Berkeley
Title: Some problems in the statistical analysis of microarray data
Presentation Materials: PPT
Streaming Video: Real
Media
Author: Hongyu Zhao, School of Public Health, Yale University
Title: Integrated Statistical Analysis of Gene Expression Data
Presentation Materials: PDF
Streaming Video: Real
Media
Recent advances in large-scale RNA expression measurements, DNA-protein
interactions, protein-protein interactions, and the availability
of genome sequences from many organisms have opened the opportunity
for massively parallel biological data acquisition and integrated
understanding of the genetic networks underlying complex biological
phenotypes. Many existing statistical procedures have been proposed
to analyze a single data type, e.g. clustering algorithms for microarray
data and motif finding methods for sequence data. Different data sources
offer different perspectives on the same underlying system, and they
can be combined to increase our chance of uncovering underlying biological
mechanisms. In this talk, we will describe our attempts to develop
a statistical framework to integrate diverse genomics and proteomics
information to dissect transcriptional regulatory networks and signal
transduction pathways. The developed methods will be illustrated through
their applications in yeast. This is joint work with Ning Sun, Liang
Chen, Baolin Wu, and Yin Liu.