
CTW: Systems Biology of Biological Processes and Diseases: Biological Problems and Statistical Solutions: Abstracts and Lecture Materials

High-throughput assays in epigenomics research
John Greally, Albert Einstein College of Medicine

Both microarray and massively-parallel sequencing technologies are offering insights never previously possible into chromatin biology, cytosine methylation and transcriptional regulation. This new field, loosely referred to as "epigenomics", is being studied not only in terms of the means by which these processes influence normal cellular physiology, but also how they become dysregulated in human disease. Of the challenges faced in this field, by far the most significant is the computational analysis of the massive datasets generated when performing epigenome-wide assays. We describe how we have developed systematic approaches for computational analysis of cytosine methylation data, and how we have exploited these resources to gain insights into the normal physiology of the epigenome and its dysregulation in disease.

Weighted Gene Co-expression Network Analysis: Theory, Applications, and Software
Steve Horvath, Assoc Prof, Human Genetics and Biostatistics, University of California, Los Angeles

Weighted gene co-expression network analysis (WGCNA) facilitates a systems biologic view of gene expression data. The network framework makes it straightforward to integrate gene expression data with other types of data, e.g. clinical traits and genetic marker data. This talk covers several theoretical topics including network construction, module definition, network-based gene screening, and differential network analysis. The methods are illustrated using several applications including i) screening for cancer genes, ii) comparing human and chimp brains, and iii) complex disease gene mapping. Related articles and material can be found at the following webpage: http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/

Citations:

  1. Horvath S, Dong J (2008) Geometric Interpretation of Gene Coexpression Network Analysis. PLoS Comput Biol 4(8): e1000117
  2. Zhang B, Horvath S (2005) A General Framework for Weighted Gene Co-Expression Network Analysis. Statistical Applications in Genetics and Molecular Biology 4(1): Article 17. http://www.bepress.com/sagmb/vol4/iss1/art17
  3. Langfelder P, Horvath S (2008) WGCNA: an R package for weighted gene co-expression network analysis. BMC Bioinformatics 9:559
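
As a rough illustration of the network construction step mentioned in the abstract, the following minimal sketch builds an unsigned weighted adjacency by raising absolute pairwise correlations to a power and then sums each row to obtain a connectivity score. It is written in Python with NumPy rather than the WGCNA R package, and the function names, the simulated data, and the power beta = 6 are illustrative assumptions, not material from the talk.

```python
import numpy as np

def soft_threshold_adjacency(expr, beta=6):
    """expr: samples x genes array. Returns a genes x genes weighted adjacency.

    Unsigned soft thresholding: a_ij = |cor(x_i, x_j)| ** beta.
    The value of beta is an illustrative assumption.
    """
    cor = np.corrcoef(expr, rowvar=False)  # gene-gene Pearson correlations
    adj = np.abs(cor) ** beta              # small correlations shrink toward 0
    np.fill_diagonal(adj, 0.0)             # drop self-connections for this sketch
    return adj

def connectivity(adj):
    """Whole-network connectivity: row sums of the adjacency matrix."""
    return adj.sum(axis=1)

# Simulated data (hypothetical): genes 0-3 share a common driver, genes 4-7 are noise.
rng = np.random.default_rng(0)
n_samples, n_genes = 50, 8
driver = rng.normal(size=n_samples)
expr = rng.normal(size=(n_samples, n_genes))
expr[:, :4] += 3.0 * driver[:, None]

adj = soft_threshold_adjacency(expr, beta=6)
k = connectivity(adj)
print(np.round(k, 3))
```

In this toy setting the four co-expressed genes end up with much higher connectivity than the noise genes, which is the kind of signal that module definition and network-based gene screening build on.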

A systems biology approach to model breast tumor microenvironment
Kun Huang, Biomedical Informatics, OSUCCC BISR

The tumor microenvironment (TME) is a significant contributor to the progression of cancer. The TME in breast cancer consists of a multitude of cell types such as endothelial cells, fibroblasts, and immune cells. To design a realistic computational model for cancer, we need to model both the intracellular reactions and the cell-cell interactions in the TME. Our goal is to develop a "geographical information system" of the breast TME by integrating spatial information of cells (from microscopic imaging) with molecular information (e.g., gene expression and ChIP-seq) using a systems biology approach. While we will discuss a wide spectrum of algorithms involved in this work, we will focus on the microscopic image segmentation problem using the two-point correlation function and two hybrid linear model fitting algorithms.

Factorial cumulants as dynamical variables for molecular biocircuits
Ovidiu Lipan, University of Richmond

Mathematical formalisms in use to describe molecular biocircuits range from deterministic Boolean to discrete stochastic models. Between these extremes lie models based on deterministic ordinary differential equations or on continuous stochastic differential equations. This talk focuses on the master equation approach to discrete stochastic molecular biocircuits. The master equation, with time-dependent transition probabilities per unit of time, captures the signal flow through the molecular biocircuit.

The theoretical approach will be used to analyze the experimental data collected at single-cell level for the activity of the heat shock protein 70 promoter in Chinese hamster ovary cells.

Citations:

  1. Lipan O. and Wong W.H.(2005) The use of oscillatory signals in the study of genetic networks. Proc Natl Acad Sci U S A. 2005 May 17;102(20):7063-8.
  2. Achimescu S. and Lipan O.(2006) Signal propagation in stochastic nonlinear genetic networks. IEE Systems Biology. May;153(3):120-34.
  3. Lipan O., Navenot J-M., Wang Z., Huang L. and Peiper S. C. Heat shock response in CHO mammalian cells is controlled by a nonlinear stochastic process. PLoS Comput. Biol. 3(10) (2007) e187.
  4. Raffard R., Lipan O., Wong W. H. and Tomlin J. C., Optimal discovery of a stochastic genetic network. Proc. of the American Control Conference (Seattle, Washington, 2008).
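
To make the master equation with time-dependent transition rates more concrete, here is a minimal sketch for a single molecular species that is produced at an oscillating rate k(t) and degraded at rate gamma per molecule. The model, the rate function, the parameter values, the state-space truncation, and the forward Euler integrator are all illustrative assumptions and are not taken from the talk. For this linear example the distribution stays Poisson when started from zero molecules, so the second factorial cumulant (variance minus mean) should stay close to zero.

```python
import numpy as np

def master_rhs(p, k, gamma):
    """dP/dt for a birth-death process on the truncated states 0..N-1.

    Birth n -> n+1 at rate k, death n -> n-1 at rate gamma * n.
    Births out of the top state are blocked so total probability is conserved.
    """
    n = np.arange(p.size)
    dp = np.zeros_like(p)
    birth = k * p[:-1]             # probability flux from n to n+1
    death = gamma * n[1:] * p[1:]  # probability flux from n to n-1
    dp[1:] += birth
    dp[:-1] -= birth
    dp[:-1] += death
    dp[1:] -= death
    return dp

def production_rate(t):
    """Hypothetical oscillatory input signal."""
    return 10.0 * (1.0 + 0.5 * np.sin(2.0 * np.pi * t))

def integrate(n_states=60, gamma=1.0, t_end=5.0, dt=1e-3):
    p = np.zeros(n_states)
    p[0] = 1.0                     # start with zero molecules
    mean_ode = 0.0                 # deterministic mean: dm/dt = k(t) - gamma * m
    for i in range(int(round(t_end / dt))):
        k = production_rate(i * dt)
        p = p + dt * master_rhs(p, k, gamma)       # forward Euler step
        mean_ode += dt * (k - gamma * mean_ode)
    return p, mean_ode

p, mean_ode = integrate()
n = np.arange(p.size)
mean = (n * p).sum()
kappa2 = (n * (n - 1) * p).sum() - mean**2  # second factorial cumulant
print(mean, mean_ode, kappa2)
```

For nonlinear circuits the higher factorial cumulants no longer vanish, which is one reason they are natural candidates for dynamical variables.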

Sparsity and Factor Models for Associating Expression with Genetic and Epigenetic Changes
Joseph Lucas, Department of Statistics, Duke University

High throughput biological assays such as mass spec proteomics and gene expression arrays offer the potential to revolutionize our understanding of biological pathways. However, describing the patterns of expression and understanding what they tell us are daunting challenges. We describe a statistical model which may be fit to data that is characterized by high dimensionality and few observations. We show how this model may be used to discover associations between gene expression data and other types of high throughput assays, and thereby generate testable biological hypotheses.

Network Based Gene Set Analysis
George Michailidis, Department of Statistics, University of Michigan

In this work, we consider the problem of assessing differential expression of entire gene sets in complex biological experiments. We propose a latent variable model that directly incorporates the underlying biological network structure. Subsequently, using the theory of mixed linear models, we develop the necessary inference framework for addressing the task at hand. Several test procedures are examined, and a network-based method for testing changes in the expression levels of gene sets, as well as in the structure of the network, is presented. The performance of the proposed methodology is assessed through a simulation study and applied to a number of real data sets.

Regression and Classification with Networked Predictors
Wei Pan, Division of Biostatistics, School of Public Health, University of Minnesota

We consider the problem of conducting regression or classification analysis with predictors whose relationships are described a priori by a network. A class of motivating examples is to model a quantitative or categorical phenotype using gene expression profiles while accounting for coordinated functioning of genes in the form of biological pathways or networks. We introduce our new methods and compare them with some existing ones.
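
The abstract does not say which penalties or estimators the new methods use, so the sketch below is not the speaker's method. It shows one generic way to use an a priori network in regression: least squares with a graph Laplacian penalty, which pulls the coefficients of connected predictors toward each other. The function names, the toy network, and all parameter values are hypothetical.

```python
import numpy as np

def graph_laplacian(n_nodes, edges):
    """Unweighted graph Laplacian L = D - A for an undirected edge list."""
    A = np.zeros((n_nodes, n_nodes))
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    return np.diag(A.sum(axis=1)) - A

def network_ridge(X, y, L, lam, eps=1e-8):
    """Minimize ||y - X b||^2 + lam * b' L b + eps * ||b||^2 in closed form.

    b' L b equals the sum over edges of (b_i - b_j)^2, so a large lam
    smooths coefficients across the network. eps only guards against
    a singular system.
    """
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * L + eps * np.eye(p), X.T @ y)

# Toy data (hypothetical): predictors 0-2 form a pathway with a shared effect.
rng = np.random.default_rng(1)
n, p = 40, 6
edges = [(0, 1), (1, 2), (3, 4)]
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

L = graph_laplacian(p, edges)
b_plain = network_ridge(X, y, L, lam=0.0)     # essentially ordinary least squares
b_smooth = network_ridge(X, y, L, lam=1e6)    # connected coefficients nearly equal
print(np.round(b_plain, 3))
print(np.round(b_smooth, 3))
```

In practice the penalty weight would be chosen by cross-validation, and sparsity-inducing network penalties are a common alternative to this quadratic one.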

Statistical challenges with next-generation sequence data
Terry Speed, Department of Statistics and Program in Biostatistics, University of California, Berkeley

Since their first appearance just over a decade ago, microarrays have become the assays of choice for high-throughput genome-wide studies of gene expression. At the same time the use of microarrays has broadened to include studies of DNA polymorphism, DNA copy-number, DNA binding proteins, DNA (re)sequencing, and more. In the course of these developments, we have learned a lot about the many non-biological aspects of microarray data, and have devised methods which attempt to deal with them. Also, many novel statistical methods have been developed to address the challenges posed by the availability of large amounts of microarray data for answering biological questions.

Recent improvements in the efficiency, quality, and cost of genome-wide sequencing are prompting biologists to abandon microarrays in favor of next-generation sequencers, e.g., Applied Biosystems' SOLiD, Helicos BioSciences' HeliScope, Illumina's Solexa, and Roche's 454 Life Sciences sequencing systems. These high-throughput sequencing technologies have already been applied to studying genome-wide transcription levels (mRNA-Seq), transcription factor binding sites (ChIP-Seq), chromatin structure, DNA copy number, and DNA methylation status.

While we might hope that these new sequencing-based studies have overcome many of the limitations of microarray-based studies, realistically we should expect that these new technologies raise problems of their own similar to the ones we met with microarrays. If so, there will be a need for statisticians and others to understand and deal with non-biological features of the data, and to modify existing or develop novel statistical methods to get the best out of these data, when helping biologists address the questions of interest to them. This talk, which draws heavily on recent, unpublished work of Sandrine Dudoit and her students, reports on early findings, work in progress, and promising directions.

Enhancing Signal Detection Ability through Information Sharing
Naisyin Wang, Texas A&M University

It is of great interest to identify genes that play a crucial role in the promotion stage of tumor formation. However, the differential signals at this stage tend to be much weaker than those obtained in comparisons between tumor and normal tissues. One strategy in the study of dietary prevention effects in tumorigenesis is to collect multivariate information, for example, microRNA and various types of mRNA measurements, from the same animals under different experimental setups. This practice allows researchers to borrow strength from the related variables to detect the weak but practically important diet differences at the early stage of tumorigenesis. I will present some challenges we encountered during the study and the methods we developed.

Statistical False Positive Control in Genome-wide Tiling Arrays
Yu Zhang, Department of Statistics, The Penn State University

Genome-wide tiling array studies require millions of simultaneous comparisons of binding signals for significance. Controlling statistical false positives in tiling array studies is very important, because the number of identified binding regions can easily go beyond the capability of experimental verification. Using ChIP-chip transcription factor binding data as an example, we introduce a novel and efficient method for accurate evaluation of the statistical significance of peaks. We further introduce a modified FDR control method that is more appropriate for tiling arrays. Using a moving window approach, we further demonstrate how to combine results from various window sizes to increase the detection power while maintaining a specified type I error rate or FDR. Our approach is general and can potentially be applied to many large genomic and genetic studies.
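
The modified FDR method described in the abstract is not specified here, so as a point of reference the sketch below implements only the standard Benjamini-Hochberg step-up procedure that such modifications typically start from. The function name and the example p-values are hypothetical, and the procedure assumes independent or positively dependent tests, which overlapping tiling-array windows may not satisfy.

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Standard Benjamini-Hochberg step-up procedure.

    Returns a boolean array marking which hypotheses are rejected
    at target FDR level alpha.
    """
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)                         # indices of p-values, ascending
    thresholds = alpha * np.arange(1, m + 1) / m  # alpha * rank / m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])          # largest rank meeting its threshold
        reject[order[:k + 1]] = True              # reject that rank and all smaller ones
    return reject

# Hypothetical peak p-values, deliberately unsorted.
pvals = [0.212, 0.001, 0.060, 0.008, 0.039, 0.041, 0.042, 0.074, 0.205, 0.216]
reject = benjamini_hochberg(pvals, alpha=0.05)
print(reject)
```

With these ten p-values only the two smallest fall below their rank-dependent thresholds, so two peaks would be called significant at an FDR of 0.05.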

Integrated Approaches to Mapping Genome to Phenome
Xianghong Jasmine Zhou, Associate Professor, Molecular and Computational Biology, University of Southern California

In this talk, we will report our recent effort in utilizing the rapidly accumulating body of genomics data, especially the enormous amount of public microarray data, together with the associated phenotypic and environmental context information to reconstruct the biological basis of phenotypes. Traditional association studies have been relatively successful at relating genetic polymorphisms to phenotypes. However, they have met difficulties in elucidating the gene-gene interactions that contribute to complex phenotypes. Here, we develop novel methods aimed at deriving genome-wide molecular networks of genotype-phenotype associations. Furthermore, we develop methods to perform phenotype prediction and computational diagnosis utilizing public genomics databases, particularly the large public microarray repositories, to create an automated disease diagnosis database.

Integrating diverse data to elucidate multi-level regulations of biological systems: A systems approach for complex human diseases
Jun Zhu, Ph.D., Associate Scientific Director, Rosetta Inpharmatics, Merck Research Laboratories

Common human diseases are driven by multiple coherent networks interacting within and between tissues, not by simple changes in single genes. There are transcriptional, protein-protein interaction, phosphorylation, and metabolite networks, to name just a few, in biological systems. In addition, many different genetic and environmental factors can affect these networks and in turn lead to phenotypic change at the organism level. Multiple types of high-throughput data are available for constructing networks, including RNA microarray data, ChIP-chip data, protein array data, siRNA screening data, and DNA variation data, among several other types. These different types of networks interact with each other both within and between multiple tissues, and these interactions in turn contribute to disease risk and progression. We demonstrate here, from yeast, mouse and human systems, how to integrate different types of genomic data to derive networks that elucidate complex disease traits like obesity. We demonstrate how these networks aid in the identification of new drug targets and biomarkers for common human diseases like obesity, diabetes, and heart disease.