Workshop 4: Emerging Genomic Technologies and Data Integration Problems

(February 21, 2005 - February 24, 2005)

Organizers


Terence Speed
Genetics and Bioinformatics, University of California, Berkeley
Hongyu Zhao
Epidemiology and Public Health, Yale University

It is arguable that the genomics revolution is largely technology-driven. Whatever one's view on this question, it is hard to imagine genomics without the polymerase chain reaction (PCR), invented as recently as the mid-1980s, or without high-throughput DNA sequencing, which emerged a little later. More recently, we have had the advent of the microarray (DNA chip) and high-throughput mass spectrometry (MS), which have greatly enriched functional genomics and proteomics, respectively. An inevitable consequence of the wider perspective of genomics and proteomics is the desire to extend assays, once carried out with one gene or one protein, to be as effective with hundreds of thousands of genes at a time, aiming at genome-wide or proteome-wide coverage. Thus, we now have a wide variety of high-throughput assays for measuring gene expression, at both the mRNA and protein levels, emerging ones for measuring DNA-protein and protein-protein interactions, and a constant drive to narrow the focus of the assay (e.g., to a single cell) and reduce the quantity of reagents needed.

Each advance of this kind brings with it many computational, mathematical, and statistical questions, both in the generation and initial processing of the data and in their analysis and interpretation. While the details of the different technologies necessarily differ, many common themes emerge. These include issues such as signal processing, signal manipulation, and quantification algorithms, as well as a host of common analysis tasks, such as classification, clustering, and the analysis of time course data. The purpose of this workshop is to introduce participants to some of these emerging technologies, and to have talks that outline their quantitative needs so that we can highlight common analytical themes.

Accepted Speakers

Joel Bader
Department of Biomedical Engineering, Johns Hopkins University
Keith Baggerly
Department of Bioinformatics & Biostatistics, University of Texas M. D. Anderson Cancer Center
Julia Brettschneider
Department of Statistics & Biostatistics, University of California, Berkeley
Martha Bulyk
Department of Genetics, Harvard Medical School
Ramana Davuluri
Molecular Virology, Immunology, & Medical Genetics, The Ohio State University
Steve Horvath
Department of Biostatistics, University of California, San Diego
Earl Hubbell
Department of Statistics, Affymetrix, Inc.
Fiona Hyland
Department of Computational Genetics, Applied Biosystems
Xiaole (Shirley) Liu
Department of Biostatistics, Harvard University
Joakim Lundeberg
Department of Biotechnology, Royal Institute of Technology (KTH)
Richard McGehee
Department of Mathematics, University of Minnesota
Andre Rzetsky
Medical Bioinformatics Unit, Columbia University
Eric Schadt
Research Genetics, Rosetta Inpharmatics
Terence Speed
Genetics and Bioinformatics, University of California, Berkeley
Paul Spellman
Life Sciences Division, Lawrence Berkeley National Laboratory
Ning Sun
Epidemiology and Biostatistics, Yale University
Michael Uhler
Department of Biological Chemistry, University of Michigan
Monday, February 21, 2005
Time Session
09:15 AM
10:00 AM
Eric Schadt - Forward Genetics in Reverse: Integrating genotypic and expression data in a segregating mouse population to map a novel susceptibility locus with pleiotropic effects on obesity, bone density, and cholesterol traits

Forward genetics approaches to identify genes for complex traits such as common human diseases have met with limited success. Fine mapping of linkage regions associated with complex traits and validation of positional candidates are time-consuming and often hit-or-miss. Here we detail a hybrid procedure to map loci for complex traits that leverages the strengths of forward and reverse genetics approaches. By intersecting genotypic and expression data in a segregating mouse population, we demonstrate how clusters of expression quantitative trait loci (eQTL) linking to regions of the genome controlling for complex traits accurately reflect the underlying perturbation to the transcriptional network induced by DNA variations in genes that control for the complex traits. By matching patterns of gene expression in a segregating population with gene expression responses induced by single gene perturbation experiments, we demonstrate how genes controlling for clusters of expression and clinical QTL (QTL "hot spot" regions) can be directly mapped. The utility of this approach is demonstrated by mapping a novel susceptibility locus for a previously identified QTL in an F2 cross between strains C57BL/6J (B6) and DBA/2J (DBA), with pleiotropic effects on body fat, lipid levels, and bone density. Our results demonstrate that integrating microarray analysis with genetic and clinical data in segregating populations is a powerful approach for directly identifying genes underlying QTLs.

10:30 AM
11:15 AM
Ning Sun - A Measurement Error Model for Inferring Transcriptional Regulatory Networks

Transcriptional regulatory networks are highly complex: they involve a large number of genes and gene products and their associations, unknown protein-DNA interaction mechanisms, and the transient activation of the networks by signal transduction pathways. Recent advances in genomic studies have provided many types of data related to transcriptional regulation, such as large-scale RNA expression measurements, in vivo DNA-protein binding data, protein-protein interactions, and genome sequences. However, each type of data contains only partial information on transcriptional regulation, and is often accompanied by large measurement errors. In this study, our goal is to develop a statistical framework that explicitly separates the mechanism model from the measurement error model, allows flexible integration of various data types, and assists the learning of mechanisms from data. Here, we will present a measurement error model to integrate RNA expression data and protein-DNA binding data. In this model, a linear system model is assumed to describe gene expression as the response to regulation by a set of proteins (transcription factors). Our simulation results showed that this data integration model may reduce the measurement errors in protein-DNA interaction data. We also applied the method to yeast cell cycle data, and our results will be discussed in this talk.

02:00 PM
02:45 PM
Xiaole (Shirley) Liu - A Hidden Markov Model for Analyzing ChIP-chip Experiments on Genome Tiling Arrays and Its Application to p53 Binding Sequences

Motivation: Transcription factors (TFs) regulate gene expression by recognizing and binding to specific regulatory regions on the genome, which in higher eukaryotes can occur far away from the regulated genes. Recently, Affymetrix developed high-density oligonucleotide arrays that tile all the non-repetitive sequences of the human genome at 35-bp resolution. This new array platform allows for the unbiased mapping of in vivo TF binding sequences (TFBSs) using Chromatin ImmunoPrecipitation followed by microarray experiments (ChIP-chip). The massive data generated from these experiments pose great challenges for data analysis.


Results: We developed a fast, scalable and sensitive method to extract TFBSs from ChIP-chip experiments on genome tiling arrays. Our method takes advantage of tiling array data from many experiments to normalize and model the behavior of each individual probe, and identifies TFBSs using a Hidden Markov Model (HMM). When applied to the data of p53 ChIP-chip experiments (Cawley et al., 2004), our method discovered many new high-confidence p53 targets, including all the regions verified by quantitative PCR. Using the de novo motif finding algorithm MDscan (Liu et al., 2002), we also recovered the p53 motif from our HMM-identified p53 target regions. Furthermore, we found substantial p53 motif enrichment in these regions compared with both the genomic background and the TFBSs identified by Cawley et al. (2004). Several of the newly identified p53 TFBSs are in the promoter regions of known genes or associated with previously characterized p53-responsive genes.
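
To make the HMM step concrete, the following is a minimal sketch of Viterbi decoding over normalized probe scores with two states. It is not the authors' model: the state names, the Gaussian emission means, and all start and transition probabilities below are hypothetical values chosen for illustration only.

```python
import math

# Hypothetical two-state model: every number here is an illustrative assumption.
STATES = ("background", "bound")
MEANS = {"background": 0.0, "bound": 2.0}  # assumed mean probe score per state
LOG_START = {"background": math.log(0.99), "bound": math.log(0.01)}
LOG_TRANS = {
    "background": {"background": math.log(0.95), "bound": math.log(0.05)},
    "bound": {"background": math.log(0.20), "bound": math.log(0.80)},
}


def log_emit(state, score, sd=1.0):
    """Log density of a probe score under a Gaussian centred on the state mean."""
    mean = MEANS[state]
    return -0.5 * math.log(2 * math.pi * sd * sd) - (score - mean) ** 2 / (2 * sd * sd)


def viterbi(scores):
    """Return the most likely state sequence for a list of probe scores."""
    # best[t][s] is the best log-probability of any path ending in state s at probe t.
    best = [{s: LOG_START[s] + log_emit(s, scores[0]) for s in STATES}]
    back = []
    for score in scores[1:]:
        column, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: best[-1][p] + LOG_TRANS[p][s])
            column[s] = best[-1][prev] + LOG_TRANS[prev][s] + log_emit(s, score)
            pointers[s] = prev
        best.append(column)
        back.append(pointers)
    # Trace back from the best final state.
    path = [max(STATES, key=lambda s: best[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path
```

With these assumed parameters, a run of consecutive high-scoring probes is labelled "bound", while a single isolated high probe stays "background". That smoothing across neighbouring probes is the main reason to prefer an HMM over a per-probe threshold.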

03:15 PM
04:00 PM
Ramana Davuluri - Identifying Estrogen Receptor α Target Genes Using Integrated Computational Genomics and Chromatin Immunoprecipitation Microarray

The key aspect in deciphering the complex puzzle of transcriptional regulatory networks is the identification of target genes of transcription factors (TFs). In general, TFs bind to specific sequence motifs present in the promoter regions of target genes, and participate in combinatorial interaction with TFs of other signaling networks (transcriptional modules). A recent technology called ChIP-on-chip, or chromatin immunoprecipitation followed by DNA microarray analysis, has proven to be an efficient means of mapping TF-promoter interactions. We will describe an integrative computational genomics approach to analyze the data generated from ChIP-on-chip experiments. The integrative approach involves TF binding site detection, comparative promoter analysis of orthologous genes (http://bioinformatics.med.ohio-state.edu/OMGProm; Palaniswamy et al. 2004, Bioinformatics), and the application of the Classification And Regression Tree (CART) data-mining method.


The estrogen receptor (ERα) regulates gene expression by either direct binding to estrogen response elements (EREs) or indirect tethering to other TFs on promoter targets. In order to identify these promoter sequences, we conducted a genome-wide screening with ChIP-on-chip. A set of 70 candidate ER loci was identified, and the corresponding promoter sequences were analyzed by statistical pattern recognition and comparative genomics approaches. We found mouse counterparts for 63 of these loci, and classified 42 (67%) as direct ER targets using a CART statistical model, which uses a position weight matrix, human-mouse sequence similarity scores, and the presence of other TF binding sites near the ERE as model parameters. The remaining genes were considered to be indirect targets. To validate this computational prediction, we conducted an additional ChIP-on-chip assay that identified acetylated chromatin components in active ER promoters. Of 27 loci upregulated in an ER-positive breast cancer cell line, 20 having mouse counterparts were correctly predicted by the CART model. The CART model identified four different modules of combinatorial ER control based on overrepresentation of other TF binding sites near the ERE in ER target promoters. One of the identified modules (ERE+AP1) is already known, and experimental validation is required for the other predicted modules. Further details about the computational method and the ER target gene database can be found at http://bioinformatics.med.ohio-state.edu/ERTargetDB/, and were described in Jin et al. 2004, Nucleic Acids Research, 32: 6627-6635 and Leu et al. 2004, Cancer Research, 64: 8184-8192.
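
As an illustration of the position weight matrix component mentioned above, here is a minimal sketch of log-odds scoring of a promoter sequence. The count matrix is a hypothetical toy motif, not the ERE matrix used in the study, and the pseudocount and uniform background frequency are likewise assumptions.

```python
import math

# Hypothetical toy count matrix (one dict per motif position); not a real ERE matrix.
TOY_COUNTS = [
    {"A": 1, "C": 1, "G": 7, "T": 1},
    {"A": 1, "C": 1, "G": 7, "T": 1},
    {"A": 1, "C": 1, "G": 1, "T": 7},
    {"A": 1, "C": 7, "G": 1, "T": 1},
    {"A": 7, "C": 1, "G": 1, "T": 1},
]


def log_odds_matrix(counts, pseudocount=1.0, background=0.25):
    """Convert per-position base counts into log2 odds against a uniform background."""
    matrix = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        matrix.append(
            {base: math.log2((column[base] + pseudocount) / total / background)
             for base in "ACGT"}
        )
    return matrix


def best_site(sequence, matrix):
    """Return (start, score) of the best-scoring window, or None if the sequence is too short.

    The sequence is assumed to contain only the letters A, C, G and T.
    """
    width = len(matrix)
    best = None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(matrix[i][base] for i, base in enumerate(window))
        if best is None or score > best[1]:
            best = (start, score)
    return best
```

A score of this kind would be only one input to a tree-based classifier such as CART; the conservation scores and neighbouring TF sites described in the abstract would enter as further features.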

Tuesday, February 22, 2005
Time Session
10:15 AM
11:00 AM
Andre Rzetsky - Analysis of Heterogeneous/Noisy Molecular Interaction Data

I will give an overview of our effort to automatically extract pathway information from a large number of full-text research articles (GeneWays system), automatically curate the extracted information, and combine the literature-derived information with sequence and experimental (such as yeast two-hybrid) data using a probabilistic approach.

11:30 AM
12:15 PM
Amy Keating - Combinatorial Associations of the Human bZIP Transcription Factors: protein microarray measurements and computational predictions

Sequencing of the human genome has revealed approximately 55 human bZIP transcription factors that can form homo- or heterodimers to regulate a wide variety of biological processes. The information necessary for dimerization specificity is encoded in the coiled-coil or "leucine-zipper" domains of these proteins. We have used protein microarrays to carry out a comprehensive analysis of the intrinsic interaction specificities of the bZIPs. By paying particular attention to issues such as purity, valency and oxidation state, we have obtained high quality interaction data, as judged by reproducibility, symmetry and agreement with solution studies. Our measurements of over 1,400 unique pairwise combinations show that bZIP coiled-coil interactions are sparse and highly-selective in vitro. The resulting data are valuable for understanding combinatorial regulation of transcription by the bZIPs. The array technology is likely to be valuable for analyzing other protein domain and peptide interactions as well.


The bZIP microarray data provide an excellent foundation for computational studies of protein sequence and structural features that are important for interaction specificity. A support vector machine (SVM) developed by Mona Singh (Princeton University Computer Science) trained on coiled-coil data from the literature does an excellent job predicting bZIP interaction preferences. We are working to combine machine-learning methods such as the SVM with physical modeling of protein structure. An integration of diverse approaches promises better performance, as well as an improved understanding of the underlying determinants of protein-protein interaction specificity. The SVM method for predicting bZIP coiled-coil interactions is available for interactive use at http://compbio.cs.princeton.edu/bzip/.


Joint work with Mona Singh.

02:00 PM
02:45 PM
Fiona Hyland - Automated Analysis on High Throughput Genotyping and Gene Expression Platforms, and Evaluating Haplotype Tagging SNPs Selected in One Human Population for Their Informativeness in Other Populations

High-throughput genomic technologies necessitate automated data processing and analysis through multiple stages: system design (e.g. annotation and probe design), primary analysis (e.g. image analysis, background subtraction, signal processing, outlier detection), and secondary analysis (e.g. experimental design, normalization, filtering, hypothesis testing, classification and prediction). A brief conceptual overview of data analysis on Applied Biosystems' genotyping and gene expression high-throughput platforms will be used to tie together common analytical themes.


It is widely hoped that the study of sequence variation in the human genome will provide a means of elucidating the genetic component of complex disease. The existence of substantial linkage disequilibrium (LD) in the human genome suggests that it should be possible to select a subset of single-nucleotide polymorphisms (SNPs) that optimally retains the overall informativeness of the entire set with respect to disease or trait association. This in turn would both reduce the costs of exhaustive genotyping and ameliorate the challenging problem of statistical inference from grossly overdetermined models. To confirm disease associations discovered in one ethnic group, it is desirable to know whether a set of haplotype tagging SNPs (tSNPs) selected in one population will be informative in other populations.


Tagging SNPs (tSNPs) were selected from about 20,000 SNPs closely spaced along 3 human chromosomes (chrs. 6, 21, and 22). A sample of 45 individuals from each of 4 human populations was genotyped to obtain information on the patterns of LD in each of the populations for the selection of tagging SNPs. We utilized haplotype r² as a metric of information to select minimum informative subsets of tSNPs. The number of tSNPs needed to maintain a haplotype r² above a critical threshold was computed separately for each population: more tSNPs were needed in African Americans than in Caucasians, Chinese or Japanese. The effect of SNP density was examined. When subsets of SNPs of various sizes were sampled, and the degree to which the subset tagged the 'hidden' SNPs was calculated, the hidden SNPs were more completely tagged in Caucasians and Asians than in African Americans. The average haplotype r² of the Caucasian tSNPs in Caucasians, and of the Caucasian tSNPs in African Americans, Chinese and Japanese, was computed, and vice versa. The average haplotype r² of Caucasian tSNPs used in Caucasians was very close to the haplotype r² of Asian tSNPs used in Caucasians, or African American tSNPs used in Caucasians; similarly, the average haplotype r² of African American tSNPs used in African Americans was very close to the haplotype r² of Asian tSNPs used in African Americans, or Caucasian tSNPs used in African Americans. These results indicate that tSNPs selected in one population will work reasonably well in other populations, at least with regard to common haplotypes. About 65% of tSNPs were found to be common across populations when selected without optimization for overlap.
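
The idea of selecting a small tagging subset can be sketched as follows. This is a simplified stand-in, not the method of the talk: it uses pairwise r² between biallelic SNPs on phased haplotypes rather than the haplotype r² metric described above, and the greedy selection rule, the function names, and the 0.8 threshold are illustrative assumptions.

```python
def ld_r2(haplotypes, i, j):
    """Pairwise r^2 between SNPs i and j; haplotypes are phased strings of '0'/'1'."""
    n = len(haplotypes)
    p_i = sum(h[i] == "1" for h in haplotypes) / n
    p_j = sum(h[j] == "1" for h in haplotypes) / n
    p_ij = sum(h[i] == "1" and h[j] == "1" for h in haplotypes) / n
    denom = p_i * (1 - p_i) * p_j * (1 - p_j)
    if denom == 0:  # at least one SNP is monomorphic, so LD is undefined
        return 0.0
    d = p_ij - p_i * p_j
    return d * d / denom


def greedy_tags(haplotypes, threshold=0.8):
    """Greedily pick SNPs until every polymorphic SNP has r^2 >= threshold with a tag."""
    n_snps = len(haplotypes[0])
    r2 = [[ld_r2(haplotypes, i, j) for j in range(n_snps)] for i in range(n_snps)]
    untagged = set(range(n_snps))
    tags = []
    while untagged:
        # Choose the SNP that covers the most untagged SNPs (lowest index on ties).
        best = max(
            range(n_snps),
            key=lambda s: (sum(r2[s][t] >= threshold for t in untagged), -s),
        )
        covered = {t for t in untagged if r2[best][t] >= threshold}
        if not covered:  # only monomorphic SNPs remain; they cannot be tagged
            break
        tags.append(best)
        untagged -= covered
    return sorted(tags)
```

Running the same selection on haplotypes from one population and then evaluating the chosen tags on another population's haplotypes mirrors, in miniature, the cross-population comparison described in the abstract.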

Wednesday, February 23, 2005
Time Session
09:00 AM
09:45 AM
Julia Brettschneider - Antibody Array Data Analysis and Comparison with Gene Expression

Antibody arrays are a fairly recent technology for high-throughput protein expression measurement. They typically screen a few hundred proteins simultaneously. We discuss image analysis and normalization for this platform. We further address the question of comparing protein measurements with RNA level measurements.


Our data come from human brain tissue. Protein expression was screened by BD Biosciences antibody arrays; RNA levels were measured by Affymetrix chips.

10:15 AM
11:00 AM
Martha Bulyk - Rapid Analysis of the DNA Binding Specificities of Transcription Factors with DNA Microarrays

We have recently developed a new DNA microarray-based technology, termed protein binding microarrays (PBMs), that allows rapid, high-throughput characterization of the in vitro DNA binding site sequence specificities of transcription factors in a single day. Using PBMs, we identified the DNA binding site sequence specificities of the yeast transcription factors Abf1, Rap1, and Mig1. Comparison of these proteins' in vitro binding sites versus their in vivo binding sites indicates that PBM-derived sequence specificities can accurately reflect in vivo DNA sequence specificities. In addition to previously identified targets, Abf1, Rap1, and Mig1 bound to 107, 90, and 75 putative new target intergenic regions, respectively, many of which were upstream of previously uncharacterized open reading frames (ORFs). Comparative sequence analysis indicates that many of these newly identified sites are highly conserved across five sequenced sensu stricto yeast species and thus are likely to be functional in vivo binding sites that potentially are utilized in a condition-specific manner. Similar PBM experiments will likely be useful in identifying novel cis regulatory elements and transcriptional regulatory networks in various genomes.

02:00 PM
02:45 PM
Joakim Lundeberg - Molecular Tools to Analyse and Elucidate Gene Function

In the post-genome era, biomedical investigators have been quick to develop powerful new technologies, such as DNA microarrays, high-throughput genotyping, and proteomics methods, to decipher gene function at a global scale. However, a number of bottlenecks must still be overcome before these technologies are fully useful, and this presentation addresses some of these issues. One example is the analysis of the abundant genetic variation in the human genome (such as single nucleotide polymorphisms, SNPs): its abundance in populations makes these variants ideal genetic markers for identifying genetic factors associated with complex diseases, but high-throughput methodology is required to obtain statistical power in the analysis. Another current example is transcript profiling using DNA microarrays, which facilitate the monitoring of thousands of genes in parallel. These DNA chips are in routine use in many research projects, but the data analysis is still debated. Furthermore, the relatively poor annotation of the human proteome has hindered a full exploration of many of these high-throughput methods, but a recent effort at our department, the Human Proteome Resource program, aims to describe the localisation of human proteins and will therefore be an important tool in future efforts to analyse gene function in an integrated manner.

03:15 PM
04:00 PM
Michael Uhler - Microarray Transfection Analysis of Transcriptional Regulation

Although a wide variety of bioinformatic tools have been described to characterize potential transcriptional regulatory mechanisms based on genomic sequence analysis and microarray hybridization studies, these regulatory pathways are most often experimentally verified using transient transfection methods. Current transfection methods are largely limited either by the large scale of existing methods or by the low level of efficiency for certain cell types. Our goals were to develop a microarray-based transfection method that could be optimized for different cell types and would be useful in reporter assays of transcriptional regulation. Here we describe a novel transfection method, termed STEP (Surface Transfection and Expression Protocol), which employs microarray-based DNA transfection of adherent cells in the functional analysis of transcriptional regulation. In STEP, recombinant proteins with biological activities engineered to enhance transfection are complexed with plasmid or linear expression vectors prior to spotting on microscope slides. The recombinant proteins of the STEP complex can be varied to increase the efficiency for different cell types. We demonstrate that STEP efficiently transfects both supercoiled plasmids and PCR-generated linear expression cassettes. A co-transfection assay using effector expression vectors encoding the cAMP-dependent protein kinase (PKA) as well as reporter vectors containing PKA-regulated promoters demonstrates that STEP transfection allows detection and quantitation of transcriptional regulation by this protein kinase. Since bioinformatic studies often result in the identification of many putative regulatory elements and signaling pathways, this approach should be of utility in high-throughput functional genomic studies of transcriptional regulation.

Thursday, February 24, 2005
Time Session
09:00 AM
09:45 AM
Keith Baggerly - Proteomic Profiling: Experimental Design and Processing Spectra

Over the past few years, several studies have used mass spectrometry based proteomic profiling to identify differences between diseased and healthy samples. These apparent successes have been heralded as providing gains for early diagnosis and/or monitoring. However, there have been problems with reproducibility, associated with issues of both experimental design and preprocessing.


In this talk, we provide a brief introduction to the most widely used "high-throughput" types of mass spectrometry, MALDI and SELDI. We discuss some of the issues with experimental design, highlighting one study where things went wrong and another where things went right. We then discuss how reproducibility can be enhanced by processing the spectra: e.g., using wavelets to denoise and correct for baseline, and using averaging to better localize peaks. Finally, time permitting, we try to test our processing algorithms by simulating spectra, letting us check our answers in cases when "truth" can be known.
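
The averaging step can be illustrated with a minimal sketch: average aligned spectra point by point, then look for local maxima in the mean spectrum. This covers only the averaging idea, not the wavelet denoising or baseline correction mentioned above, and the function names, the assumption that spectra share a common m/z grid, and the height threshold are all illustrative.

```python
def mean_spectrum(spectra):
    """Average several spectra point by point; all must share the same m/z grid."""
    return [sum(values) / len(values) for values in zip(*spectra)]


def find_peaks(intensities, min_height):
    """Return indices of local maxima whose intensity is at least min_height."""
    peaks = []
    for i in range(1, len(intensities) - 1):
        rises = intensities[i] > intensities[i - 1]
        falls_or_flat = intensities[i] >= intensities[i + 1]
        if intensities[i] >= min_height and rises and falls_or_flat:
            peaks.append(i)
    return peaks
```

Because noise that is independent across spectra tends to cancel in the mean while true peaks reinforce each other, peak locations found on the mean spectrum are generally more stable than those found on any single spectrum.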

10:15 AM
11:00 AM
Paul Spellman - A Cell Line System for Understanding Breast Cancer

We have developed a systematic approach to understanding both the biological causes of breast cancer and the mechanisms of therapeutic effects by applying genome-scale technologies to a panel of cell lines.


We have shown that the panel of approximately 50 cell lines captures most, if not all, of the variability found in tumors at the level of DNA structure and copy number, RNA expression, and protein expression. Key technologies that we are employing are reverse phase protein lysate arrays for measuring absolute protein abundances, and a set of technologies centered around the recently introduced Affymetrix HTA system, which gives us enormous throughput for expression analysis, DNA copy analysis, and SNP genotyping at substantially reduced costs.

11:30 AM
12:15 PM
Steve Horvath - Statistical Methods for the Analysis of Tissue Microarray Data

Tissue microarrays (TMAs) are a new high-throughput tool for the study of protein expression patterns in tissues and are increasingly used to evaluate the diagnostic and prognostic importance of tumor biomarkers. TMA data are rather challenging: covariates are either ordinal variables or highly skewed percentages. Since it is standard practice in the TMA community to use cut-off values for tumor marker expression values, it is natural to apply tree-based methods. We describe different supervised and unsupervised learning methods based on survival trees and random forests (Breiman 2001). We describe a novel strategy (random forest clustering) for tumor profiling based on tissue microarray data. Random forest clustering is attractive for tissue microarray and other immunohistochemistry data since it handles highly skewed tumor marker expressions well and weighs the contribution of each marker according to its relatedness with other tumor markers. The real data application is the first tumor class discovery analysis of renal cell carcinoma patients based on protein expression profiles.

Name Email Affiliation
Bader, Joel joel.bader@jhu.edu Department of Biomedical Engineering, Johns Hopkins University
Baggerly, Keith kabagg@mdanderson.org Department of Bioinformatics & Biostatistics, University of Texas M. D. Anderson Cancer Center
Best, Janet jbest@mbi.osu.edu Mathematics, The Ohio State University
Borisyuk, Alla borisyuk@mbi.osu.edu Mathematical Biosciences Institute, The Ohio State University
Brettschneider, Julia juliab@stat.berkeley.edu Department of Statistics & Biostatistics, University of California, Berkeley
Buechler, Steven buechler.1@nd.edu Department of Mathematics, University of Notre Dame
Bulyk, Martha mlbulyk@receptor.med.harvard.edu Department of Genetics, Harvard Medical School
Chen, Liang liang.chen@yale.edu Molecular, Cellular, & Developmental Biology, Yale University
Chen, Shu-Chuan Grace scchen@math.la.asu.edu Department of Mathematics and Statistics, Arizona State University
Costello, Catherine cecmsms@bu.edu Department of Biochemistry & Biophysics, Boston University
Craciun, Gheorghe craciun@math.wisc.edu Dept. of Mathematics, University of Wisconsin-Madison
Davuluri, Ramana ramana.davuluri@osumc.edu Molecular Virology, Immunology, & Medical Genetics, The Ohio State University
Dougherty, Daniel dpdoughe@mbi.osu.edu Mathematical Biosciences Institute, The Ohio State University
Ge, Yongchao Yongchao.Ge@mssm.edu Department of Biomathematical Sciences, Mount Sinai School of Medicine, CUNY
Goel, Pranay goelpra@helix.nih.gov NIDDK, Indian Institute of Science Education and Research
Goldstein, Darlene Darlene.Goldstein@epfl.ch Institute of Mathematics, École Polytechnique Fédérale de Lausanne (EPFL)
Guo, Yixin yixin@math.drexel.edu Department of Mathematics, The Ohio State University
Haddadian, Esmael haddadian.1@osu.edu Department of Biophysics, The Ohio State University
Hardin, Johanna jo.hardin@pomona.edu Department of Mathematics, Pomona College
Horvath, Steve SHorvath@mednet.ucla.edu Department of Biostatistics, University of California, San Diego
Hsu, Jason hsu.1@osu.edu Department of Statistics, The Ohio State University
Hu, Bei Department of Mathematics, University of Notre Dame
Hubbell, Earl Earl_Hubbell@affymetrix.com Department of Statistics, Affymetrix, Inc.
Huber, Wolfgang huber@ebi.ac.uk Wellcome Trust Genome Campus, European Bioinformatics Institute
Huebner, Marianne Huebner.Marianne@mayo.edu Department of Statistics and Probability, Michigan State University
Hyland, Fiona hylandfc@appliedbiosystems.com Department of Computational Genetics, Applied Biosystems
Javaid, Sarah javaid.5@osu.edu Department of Biophysics, The Ohio State University
Jin, Victor jin-2@medctr.osu.edu Biomedical Informatics, The Ohio State University
Jornsten, Rebecka rebecka@stat.rutgers.edu Department of Statistics, Rutgers University
Joseph Souriraj, Irene joseph-souriraj.1@osu.edu Biophysics Program, The Ohio State University
Keating, Amy keating@mit.edu Department of Biology, Massachusetts Institute of Technology
Lim, Sookkyung limsk@math.uc.edu Department of Mathematical Sciences, University of Cincinnati
Lin, Shili lin.328@osu.edu Department of Statistics, The Ohio State University
Liu, Xiaole (Shirley) xsliu@jimmy.harvard.edu Department of Biostatistics, Harvard University
Liyanarachchi, Sandya sandya.liyanarachchi@osumc.edu Molecular Virology, Immunology, & Medical Genetics, The Ohio State University
Lundeberg, Joakim joakim.lundeberg@biotech.kth.se Department of Biotechnology, Royal Institute of Technology (KTH)
McGehee, Richard mcgehee@math.umn.edu Department of Mathematics, University of Minnesota
Melfi, Vincent melfi@mbi.osu.edu Mathematics, Michigan State University
Mukherjee, Mitali mukherjee.21@osu.edu Department of Medicinal Chemistry, The Ohio State University
Nagaraja, Haikady Department of Statistics, The Ohio State University
Nettleton, Dan dnett@iastate.edu Department of Statistics, Iowa State University
Pohar, Twyla pohar-2@medctr.osu.edu Human Cancer Genetics Program - CCC, The Ohio State University
Pol, Diego dpol@mbi.osu.edu Independent Researcher, Museo Paleontologico E. Feruglio
Ramamoorthi, R. ramamoor@stt.msu.edu Department of Statistics and Probability, Michigan State University
Rassoul-Agha, Firas firas@math.ohio-state.edu Department of Mathematics, University of Utah
Rejniak, Katarzyna rejniak@mbi.osu.edu Mathematical Biosciences Institute, The Ohio State University
Rzetsky, Andre ar345@columbia.edu Medical Bioinformatics Unit, Columbia University
Saxena, Uma uma.saxena@kcl.ac.uk MRC Center for Developmental Neurobiology, King's College London
Schadt, Eric kristina_brysz@merck.com Research Genetics, Rosetta Inpharmatics
Singh, Mona msingh@cs.princeton.edu Computer Science & Integrative Genomics, Princeton University
Speed, Terence terry@stat.Berkeley.edu Genetics and Bioinformatics, University of California, Berkeley
Spellman, Paul PTSpellman@lbl.gov Life Sciences Division, Lawrence Berkeley National Laboratory
Stubna, Michael stubna@mbi.osu.edu Engineering Team Leader, Pulsar Informatics
Sun, Ning ns79@email.med.yale.edu Epidemiology and Biostatistics, Yale University
Sun, Hao sun.143@osu.edu Human Cancer Genetics Program - CCC, The Ohio State University
Terman, David terman@math.ohio-state.edu Mathematics Department, The Ohio State University
Therneau, Terry therneau.terry@mayo.edu Division of Biostatistics, Mayo Clinic
Tian, Jianjun Paul tianjj@mbi.osu.edu Mathematics, College of William and Mary
Uhler, Michael muhler@umich.edu Department of Biological Chemistry, University of Michigan
Vakalis, Ignatios ivakalis@capital.edu Mathematics & Computer Sc, Capital University
Verducci, Joseph verducci.1@osu.edu Department of Statistics, The Ohio State University
Wang, Chao wachao@cse.ohio-state.edu Computer Science and Engineering, The Ohio State University
Wang, Zailong zlwang@mbi.osu.edu Integrated Information Sciences, Novartis
Wechselberger, Martin wm@mbi.osu.edu Mathematical Biosciences Institute, The Ohio State University
Wei, Susan wei-2@medctr.osu.edu Human Cancer Genetics, The Ohio State University
Wright, Geraldine wright.572@osu.edu School of Biology, Newcastle University
Wu, Zhijun zhijun@iastate.edu Math, Bioinformatics, & Computational Biology, Iowa State University
Zhao, Hongyu hz27@email.med.yale.edu Epidemiology and Public Health, Yale University
Zhou, Jin jzhou@mbi.osu.edu Department of Mathematics, Northern Michigan University
Proteomic Profiling: Experimental Design and Processing Spectra

Over the past few years, several studies have used mass spectrometry based proteomic profiling to identify differences between diseased and healthy samples. These apparent successes have been heralded as providing gains for early diagnosis and/or monitoring. However, there have been problems with reproducibility, associated with issues of both experimental design and preprocessing.


In this talk, we provide a brief introduction to the most widely used "high-throughput" types of mass spectrometry, MALDI and SELDI. We discuss some of the issues with experimental design, highlighting one study where things went wrong and another where things went right. We then discuss how reproducibility can be enhanced by processing the spectra: e.g., using wavelets to denoise and correct for baseline, and using averaging to better localize peaks. Finally, time permitting, we test our processing algorithms by simulating spectra, which lets us check our answers in cases where the "truth" is known.
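
As a rough illustration of wavelet denoising, the sketch below applies a Haar transform to a toy spectrum, zeroes small detail coefficients (hard thresholding), and inverts the transform. The choice of the Haar wavelet, the threshold value, and the function names are assumptions made for illustration; the abstract does not specify which wavelet family or thresholding rule is used.

```python
import math

def haar_forward(signal):
    """Full Haar decomposition of a list whose length is a power of two.

    Returns (approx, details), where details holds one list of detail
    coefficients per level, ordered from finest to coarsest.
    """
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("signal length must be a power of two")
    approx = list(signal)
    details = []
    while len(approx) > 1:
        pairs = [(approx[i], approx[i + 1]) for i in range(0, len(approx), 2)]
        details.append([(a - b) / math.sqrt(2) for a, b in pairs])
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
    return approx, details

def haar_inverse(approx, details):
    """Invert haar_forward, rebuilding the signal from coarsest to finest."""
    approx = list(approx)
    for level in reversed(details):
        rebuilt = []
        for s, d in zip(approx, level):
            rebuilt.append((s + d) / math.sqrt(2))
            rebuilt.append((s - d) / math.sqrt(2))
        approx = rebuilt
    return approx

def denoise(signal, threshold):
    """Hard-threshold the detail coefficients, then reconstruct."""
    approx, details = haar_forward(signal)
    kept = [[d if abs(d) > threshold else 0.0 for d in level] for level in details]
    return haar_inverse(approx, kept)

# Toy spectrum (made-up intensities): one peak on a noisy flat background.
spectrum = [0.1, -0.1, 0.0, 0.1, 10.0, 10.1, 0.1, -0.1]
smoothed = denoise(spectrum, threshold=0.5)  # threshold is an arbitrary choice
```

With a threshold of zero the reconstruction returns the input unchanged, which is a convenient sanity check before tuning the threshold on simulated spectra.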

Antibody Array Data Analysis and Comparison with Gene Expression

Antibody arrays are a fairly recent technology for high-throughput protein expression measurement. They typically screen a few hundred proteins simultaneously. We discuss image analysis and normalization for this platform. We further address the question of comparing protein measurements with RNA level measurements.


Our data come from human brain tissue. Protein expression was screened with BD Biosciences antibody arrays, and RNA levels were measured with Affymetrix chips.

Rapid Analysis of the DNA Binding Specificities of Transcription Factors with DNA Microarrays

We have recently developed a new DNA microarray-based technology, termed protein binding microarrays (PBMs), that allows rapid, high-throughput characterization of the in vitro DNA binding site sequence specificities of transcription factors in a single day. Using PBMs, we identified the DNA binding site sequence specificities of the yeast transcription factors Abf1, Rap1, and Mig1. Comparison of these proteins' in vitro binding sites versus their in vivo binding sites indicates that PBM-derived sequence specificities can accurately reflect in vivo DNA sequence specificities. In addition to previously identified targets, Abf1, Rap1, and Mig1 bound to 107, 90, and 75 putative new target intergenic regions, respectively, many of which were upstream of previously uncharacterized open reading frames (ORFs). Comparative sequence analysis indicates that many of these newly identified sites are highly conserved across five sequenced sensu stricto yeast species and thus are likely to be functional in vivo binding sites that potentially are utilized in a condition-specific manner. Similar PBM experiments will likely be useful in identifying novel cis regulatory elements and transcriptional regulatory networks in various genomes.

Identifying Estrogen Receptor α Target Genes Using Integrated Computational Genomics and Chromatin Immunoprecipitation Microarray

The key aspect in deciphering the complex puzzle of transcriptional regulatory networks is the identification of target genes of transcription factors (TFs). In general, TFs bind to specific sequence motifs present in the promoter regions of target genes, and participate in combinatorial interactions with TFs of other signaling networks (transcriptional modules). A recent technology called ChIP-on-chip, or chromatin immunoprecipitation followed by DNA microarray analysis, has proven to be an efficient means of mapping TF-promoter interactions. We will describe an integrative computational genomics approach to analyze the data generated from ChIP-on-chip experiments. The integrative approach involves TF binding site detection, comparative promoter analysis of orthologous genes (http://bioinformatics.med.ohio-state.edu/OMGProm; Palaniswamy et al. 2004, Bioinformatics), and the application of the Classification And Regression Tree (CART) data-mining method.


The estrogen receptor (ERα) regulates gene expression by either direct binding to estrogen response elements (EREs) or indirect tethering to other TFs on promoter targets. In order to identify these promoter sequences, we conducted a genome-wide screen with ChIP-on-chip. A set of 70 candidate ER loci was identified, and the corresponding promoter sequences were analyzed by statistical pattern recognition and comparative genomics approaches. We found mouse counterparts for 63 of these loci, and classified 42 (67%) as direct ER targets using a CART statistical model, which uses a position weight matrix, human-mouse sequence similarity scores, and the presence of other TF binding sites near the ERE as model parameters. The remaining genes were considered to be indirect targets. To validate this computational prediction, we conducted an additional ChIP-on-chip assay that identified acetylated chromatin components in active ER promoters. Of 27 loci upregulated in an ER-positive breast cancer cell line, 20 having mouse counterparts were correctly predicted by the CART model. The CART model identified four different modules of combinatorial control of ER, based on over-representation of other TF binding sites near the ERE in ER target promoters. One of the identified modules (ERE+AP1) is already known, and experimental validation is required for the other predicted modules. Further details about the computational method and the ER target gene database can be found at http://bioinformatics.med.ohio-state.edu/ERTargetDB/, and were described in Jin et al. 2004, Nucleic Acids Research, 32: 6627-6635 and Leu et al. 2004, Cancer Research, 64: 8184-8192.
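
To make the position weight matrix ingredient concrete, the following sketch converts a small count matrix into log-odds scores and scans a sequence for its best-scoring window. The 4-bp count matrix, the pseudocount, the uniform background, and the function names are all hypothetical; the actual ERE matrix and score cutoffs used in the CART model are not given in this abstract.

```python
import math

# Hypothetical base counts for a 4-bp motif (made-up numbers, not a real ERE matrix).
COUNTS = [
    {"A": 0, "C": 0, "G": 1, "T": 9},
    {"A": 0, "C": 1, "G": 9, "T": 0},
    {"A": 9, "C": 0, "G": 0, "T": 1},
    {"A": 1, "C": 9, "G": 0, "T": 0},
]

def log_odds_pwm(counts, pseudocount=1.0, background=0.25):
    """Turn per-position base counts into log2 odds against a uniform background."""
    pwm = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        pwm.append({
            base: math.log2((column[base] + pseudocount) / total / background)
            for base in "ACGT"
        })
    return pwm

def best_hit(sequence, pwm):
    """Return (start, score) of the highest-scoring window.

    Assumes an uppercase sequence containing only A, C, G, and T.
    Returns None if the sequence is shorter than the motif.
    """
    width = len(pwm)
    best = None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(pwm[i][base] for i, base in enumerate(window))
        if best is None or score > best[1]:
            best = (start, score)
    return best

pwm = log_odds_pwm(COUNTS)
hit = best_hit("AAATGACAA", pwm)  # toy promoter fragment
```

A score of this kind could then serve as one input feature to a tree-based classifier, alongside conservation scores and counts of neighboring TF sites.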

Statistical Methods for the Analysis of Tissue Microarray Data

Tissue microarrays (TMAs) are a new high-throughput tool for the study of protein expression patterns in tissues and are increasingly used to evaluate the diagnostic and prognostic importance of tumor biomarkers. TMA data are rather challenging. Covariates are either ordinal variables or highly skewed percentages. Since it is standard practice in the TMA community to use cut-off values for tumor marker expression values, it is natural to apply tree-based methods. We describe different supervised and unsupervised learning methods based on survival trees and random forests (Breiman 2001). We describe a novel strategy (random forest clustering) for tumor profiling based on tissue microarray data. Random forest clustering is attractive for tissue microarray and other immunohistochemistry data since it handles highly skewed tumor marker expressions well and weighs the contribution of each marker according to its relatedness with other tumor markers. The real data application is the first tumor class discovery analysis of renal cell carcinoma patients based on protein expression profiles.

Automated Analysis on High Throughput Genotyping and Gene Expression Platforms, and Evaluating Haplotype Tagging SNPs Selected in One Human Population for Their Informativeness in Other Populations

High-throughput genomic technologies necessitate automated data processing and analysis through multiple stages: system design (e.g. annotation and probe design), primary analysis (e.g. image analysis, background subtraction, signal processing, outlier detection), and secondary analysis (e.g. experimental design, normalization, filtering, hypothesis testing, classification and prediction). A brief conceptual overview of data analysis on Applied Biosystems' high-throughput genotyping and gene expression platforms will be used to tie together common analytical themes.


It is widely hoped that the study of sequence variation in the human genome will provide a means of elucidating the genetic component of complex disease. The existence of substantial linkage disequilibrium (LD) in the human genome suggests that it should be possible to select a subset of single-nucleotide polymorphisms (SNPs) that optimally retains the overall informativeness of the entire set with respect to disease or trait association. This in turn would both reduce the costs of exhaustive genotyping and ameliorate the challenging problem of statistical inference from grossly over determined models. To confirm disease associations discovered in one ethnic group, it is desirable to know whether a set of haplotype tagging SNPs (tSNPs) selected in one population will be informative in other populations.


Tagging SNPs (tSNPs) were selected from about 20,000 SNPs closely spaced along 3 human chromosomes (chrs. 6, 21, and 22). A sample of 45 individuals from each of 4 human populations was genotyped to obtain information on the patterns of LD in each of the populations for the selection of tagging SNPs. We utilized haplotype r² as a metric of information to select minimum informative subsets of tSNPs. The number of tSNPs needed to maintain a haplotype r² above a critical threshold was computed separately for each population: more tSNPs were needed in African Americans than in Caucasians, Chinese, or Japanese. The effect of SNP density was examined. When subsets of SNPs of various sizes were sampled, and the degree to which the subset tagged the 'hidden' SNPs was calculated, the hidden SNPs were more completely tagged in Caucasians and Asians than in African Americans. The average haplotype r² of the Caucasian tSNPs in Caucasians, and of the Caucasian tSNPs in African Americans, Chinese, and Japanese was computed, and vice versa. The average haplotype r² of Caucasian tSNPs used in Caucasians was very close to the haplotype r² of Asian tSNPs used in Caucasians, or African American tSNPs used in Caucasians; similarly, the average haplotype r² of African American tSNPs used in African Americans was very close to the haplotype r² of Asian tSNPs used in African Americans, or Caucasian tSNPs used in African Americans. These results indicate that tSNPs selected in one population will work reasonably well in other populations, at least with regard to common haplotypes. About 65% of tSNPs were found to be common across populations when selected without optimization for overlap.
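
The idea of choosing a small set of SNPs that "covers" the rest can be sketched with a simpler quantity than the one used above. The code below computes the ordinary pairwise r² between two biallelic SNPs from phased 0/1 haplotypes, then greedily picks tags until every SNP has r² at or above a threshold with some tag. This is not the haplotype r² metric of the abstract, the greedy heuristic is not guaranteed to find a minimum set, and the threshold of 0.8, the SNP names, and the toy haplotypes are assumptions for illustration only.

```python
def pairwise_r2(hap_a, hap_b):
    """Pairwise r^2 between two biallelic SNPs.

    Each argument lists one allele (0 or 1) per phased haplotype.
    Returns 0.0 if either SNP is monomorphic in the sample.
    """
    n = len(hap_a)
    p_a = sum(hap_a) / n
    p_b = sum(hap_b) / n
    p_ab = sum(1 for a, b in zip(hap_a, hap_b) if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b  # linkage disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    return 0.0 if denom == 0 else d * d / denom

def greedy_tags(haplotypes, threshold=0.8):
    """Greedy tag selection on a dict mapping SNP name to its 0/1 allele list.

    Repeatedly picks the untagged SNP that covers the most untagged SNPs.
    """
    untagged = set(haplotypes)
    tags = []
    while untagged:
        def coverage(snp):
            return sum(
                pairwise_r2(haplotypes[snp], haplotypes[other]) >= threshold
                for other in untagged
            )
        best = max(sorted(untagged), key=coverage)
        tags.append(best)
        covered = {
            other for other in untagged
            if pairwise_r2(haplotypes[best], haplotypes[other]) >= threshold
        }
        untagged -= covered | {best}  # always remove the chosen tag itself
    return tags

# Toy phased haplotypes (made-up data): snp1 and snp2 are in perfect LD.
toy = {
    "snp1": [0, 0, 1, 1, 0, 1, 0, 1],
    "snp2": [0, 0, 1, 1, 0, 1, 0, 1],
    "snp3": [0, 1, 0, 1, 0, 1, 0, 1],
}
tags = greedy_tags(toy)
```

Running the same selection separately on haplotypes from different populations, and then scoring one population's tags against another population's haplotypes, mirrors the kind of cross-population comparison described above.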

Combinatorial Associations of the Human bZIP Transcription Factors: protein microarray measurements and computational predictions

Sequencing of the human genome has revealed approximately 55 human bZIP transcription factors that can form homo- or heterodimers to regulate a wide variety of biological processes. The information necessary for dimerization specificity is encoded in the coiled-coil or "leucine-zipper" domains of these proteins. We have used protein microarrays to carry out a comprehensive analysis of the intrinsic interaction specificities of the bZIPs. By paying particular attention to issues such as purity, valency, and oxidation state, we have obtained high-quality interaction data, as judged by reproducibility, symmetry, and agreement with solution studies. Our measurements of over 1,400 unique pairwise combinations show that bZIP coiled-coil interactions are sparse and highly selective in vitro. The resulting data are valuable for understanding combinatorial regulation of transcription by the bZIPs. The array technology is likely to be valuable for analyzing other protein domain and peptide interactions as well.


The bZIP microarray data provide an excellent foundation for computational studies of protein sequence and structural features that are important for interaction specificity. A support vector machine (SVM) developed by Mona Singh (Princeton University Computer Science) trained on coiled-coil data from the literature does an excellent job predicting bZIP interaction preferences. We are working to combine machine-learning methods such as the SVM with physical modeling of protein structure. An integration of diverse approaches promises better performance, as well as an improved understanding of the underlying determinants of protein-protein interaction specificity. The SVM method for predicting bZIP coiled-coil interactions is available for interactive use at http://compbio.cs.princeton.edu/bzip/.


Joint work with Mona Singh.

A Hidden Markov Model for Analyzing ChIP-chip Experiments on Genome Tiling Arrays and Its Application to p53 Binding Sequences

Motivation: Transcription factors (TFs) regulate gene expression by recognizing and binding to specific regulatory regions on the genome, which in higher eukaryotes can occur far away from the regulated genes. Recently Affymetrix developed the high-density oligonucleotide arrays that tile all the non-repetitive sequences of the human genome at 35-bp resolution. This new array platform allows for the unbiased mapping of in vivo TF binding sequences (TFBSs) using Chromatin ImmunoPrecipitation followed by microarray experiments (ChIP-chip). The massive data generated from these experiments pose great challenges for data analysis.


Results: We developed a fast, scalable, and sensitive method to extract TFBSs from ChIP-chip experiments on genome tiling arrays. Our method takes advantage of tiling array data from many experiments to normalize and model the behavior of each individual probe, and identifies TFBSs using a Hidden Markov Model (HMM). When applied to the data of p53 ChIP-chip experiments (Cawley et al., 2004), our method discovered many new high-confidence p53 targets, including all the regions verified by quantitative PCR. Using the de novo motif finding algorithm MDscan (Liu et al., 2002), we also recovered the p53 motif from our HMM-identified p53 target regions. Furthermore, we found substantial p53 motif enrichment in these regions compared with both the genomic background and the TFBSs identified by Cawley et al. (2004). Several of the newly identified p53 TFBSs are in the promoter regions of known genes or are associated with previously characterized p53-responsive genes.
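
A two-state HMM of the general kind mentioned above can be decoded with the Viterbi algorithm. The sketch below labels each probe along a tiled region as background (0) or bound (1), assuming Gaussian emissions shared by all probes. The emission means and standard deviations, the transition probability, the prior, and the function names are hypothetical; the method described in this abstract models each probe individually, which this simplified version does not attempt.

```python
import math

def gaussian_logpdf(x, mean, sd):
    """Log density of a normal distribution."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)

def viterbi(scores, means=(0.0, 2.0), sds=(1.0, 1.0), stay=0.9, prior_bound=0.1):
    """Most likely state path (0 = background, 1 = bound) for ordered probe scores.

    All parameter defaults are illustrative assumptions, not fitted values.
    """
    if not scores:
        return []
    states = (0, 1)
    log_trans = [
        [math.log(stay), math.log(1 - stay)],
        [math.log(1 - stay), math.log(stay)],
    ]
    log_init = [math.log(1 - prior_bound), math.log(prior_bound)]
    best = [log_init[s] + gaussian_logpdf(scores[0], means[s], sds[s]) for s in states]
    backpointers = []
    for x in scores[1:]:
        step, updated = [], []
        for s in states:
            # Best predecessor state for landing in state s at this probe.
            prev = max(states, key=lambda p: best[p] + log_trans[p][s])
            step.append(prev)
            updated.append(
                best[prev] + log_trans[prev][s] + gaussian_logpdf(x, means[s], sds[s])
            )
        backpointers.append(step)
        best = updated
    # Trace back from the best final state.
    state = max(states, key=lambda s: best[s])
    path = [state]
    for step in reversed(backpointers):
        state = step[state]
        path.append(state)
    path.reverse()
    return path

# Toy normalized probe scores (made-up values) with an enriched stretch in the middle.
probe_scores = [0.1, -0.2, 0.0, 2.5, 3.0, 2.8, 0.1, -0.1]
calls = viterbi(probe_scores)
```

Runs of consecutive probes labeled 1 would then be reported as candidate binding regions and passed on to motif finding.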

Molecular Tools to Analyse and Elucidate Gene Function

In the post-genome era, biomedical investigators have been quick to develop powerful new technologies, such as DNA microarrays, high-throughput genotyping, and proteomics methods, to decipher gene function at a global scale. However, a number of bottlenecks must still be overcome before these technologies are fully useful, and this presentation addresses some of these issues. One example is the analysis of the rich abundance of genetic variation in the human genome (such as single nucleotide polymorphisms, SNPs) in populations: this abundance makes such variants ideal genetic markers for identifying genetic factors associated with complex diseases, but high-throughput methodology is required to obtain statistical power in the analysis. Another current example is transcript profiling using DNA microarrays, which facilitate monitoring of thousands of genes in parallel. These DNA chips are in routine use in many research projects, but the data analysis is still debated. Furthermore, the relatively poor annotation of the human proteome has hindered a full exploration of many of these high-throughput methods, but a recent effort at our department, the Human Proteome Resource program, aims to describe the localisation of human proteins and will therefore be an important tool in future efforts to analyse gene function in an integrated manner.

Analysis of Heterogeneous/Noisy Molecular Interaction Data

I will give an overview of our effort to automatically extract pathway information from a large number of full-text research articles (the GeneWays system), to automatically curate the extracted information, and to combine the literature-derived information with sequence and experimental (such as yeast two-hybrid) data using a probabilistic approach.

Forward Genetics in Reverse: Integrating genotypic and expression data in a segregating mouse population to map a novel susceptibility locus with pleiotropic effects on obesity, bone density, and cholesterol traits

Forward genetics approaches to identify genes for complex traits such as common human diseases have met with limited success. Fine mapping of linkage regions associated with complex traits and validation of positional candidates are time-consuming and often hit-or-miss. Here we detail a hybrid procedure to map loci for complex traits that leverages the strengths of forward and reverse genetics approaches. By intersecting genotypic and expression data in a segregating mouse population, we demonstrate how clusters of expression quantitative trait loci (eQTL) linking to regions of the genome that control complex traits accurately reflect the underlying perturbation to the transcriptional network induced by DNA variations in genes that control the complex traits. By matching patterns of gene expression in a segregating population with gene expression responses induced by single gene perturbation experiments, we demonstrate how genes controlling clusters of expression and clinical QTL (QTL "hot spot" regions) can be directly mapped. The utility of this approach is demonstrated by mapping a novel susceptibility locus for a previously identified QTL in an F2 cross between strains C57BL/6J (B6) and DBA/2J (DBA), with pleiotropic effects on body fat, lipid levels, and bone density. Our results demonstrate that integrating microarray analysis with genetic and clinical data in segregating populations is a powerful approach for directly identifying genes underlying QTLs.

A Cell Line System for Understanding Breast Cancer

We have developed a systematic approach to understanding both the biological causes of breast cancer and the mechanisms of therapeutic effects by applying genome-scale technologies to a panel of cell lines.


We have shown that the panel of approximately 50 cell lines captures most, if not all, of the variability found in tumors at the level of DNA structure and copy number, RNA expression, and protein expression. Key technologies that we are employing are reverse phase protein lysate arrays for measuring absolute protein abundances, and a set of technologies centered around the recently introduced Affymetrix HTA system, which gives us enormous throughput for expression analysis, DNA copy analysis, and SNP genotyping at substantially reduced costs.

A Measurement Error Model for Inferring Transcriptional Regulatory Networks

Transcriptional regulatory networks are highly complex: they involve a large number of genes and gene products and their associations, unknown protein-DNA interaction mechanisms, and the transient activation of the networks by signal transduction pathways. Recent advances in genomic studies have provided many types of data related to transcriptional regulation, such as large-scale RNA expression measurements, in vivo DNA-protein binding data, protein-protein interactions, and genome sequences. However, each type of data contains only partial information on transcriptional regulation, and is often accompanied by large measurement errors. In this study, our goal is to develop a statistical framework that explicitly separates the mechanism model from the measurement error model, allows flexible integration of various data types, and assists the learning of mechanisms from data. Here, we will present a measurement error model to integrate RNA expression data and protein-DNA binding data. In this model, a linear system model is assumed to describe gene expression as the response to regulation by a set of proteins (transcription factors). Our simulation results showed that this data integration model may reduce the measurement errors in protein-DNA interaction data. We also applied the method to yeast cell cycle data, and our results will be discussed in this talk.

Microarray Transfection Analysis of Transcriptional Regulation

Although a wide variety of bioinformatic tools have been described to characterize potential transcriptional regulatory mechanisms based on genomic sequence analysis and microarray hybridization studies, these regulatory pathways are most often experimentally verified using transient transfection methods. Current transfection methods are largely limited either by the large scale of existing methods or by the low level of efficiency for certain cell types. Our goals were to develop a microarray-based transfection method that could be optimized for different cell types and would be useful in reporter assays of transcriptional regulation. Here we describe a novel transfection method, termed STEP (Surface Transfection and Expression Protocol), which employs microarray-based DNA transfection of adherent cells in the functional analysis of transcriptional regulation. In STEP, recombinant proteins with biological activities engineered to enhance transfection are complexed with plasmid or linear expression vectors prior to spotting on microscope slides. The recombinant proteins of the STEP complex can be varied to increase the efficiency for different cell types. We demonstrate that STEP efficiently transfects both supercoiled plasmids and PCR-generated linear expression cassettes. A co-transfection assay using effector expression vectors encoding the cAMP-dependent protein kinase (PKA) together with reporter vectors containing PKA-regulated promoters demonstrates that STEP transfection allows detection and quantitation of transcriptional regulation by this protein kinase. Since bioinformatic studies often result in the identification of many putative regulatory elements and signaling pathways, this approach should be of utility in high-throughput functional genomic studies of transcriptional regulation.