Over the past few years, several studies have used mass spectrometry-based proteomic profiling to identify differences between diseased and healthy samples. These apparent successes have been heralded as providing gains for early diagnosis and/or monitoring. However, there have been problems with reproducibility, associated with issues of both experimental design and preprocessing.
In this talk, we provide a brief introduction to the most widely used "high-throughput" types of mass spectrometry, MALDI and SELDI. We discuss some of the issues with experimental design, highlighting one study where things went wrong and another where things went right. We then discuss how reproducibility can be enhanced by processing the spectra: e.g., using wavelets to denoise and correct for baseline, and using averaging to better localize peaks. Finally, time permitting, we try to test our processing algorithms by simulating spectra, letting us check our answers in cases where "truth" can be known.
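The preprocessing steps above can be sketched in code. The toy example below simulates a spectrum with a drifting baseline and one peak, subtracts a rolling-minimum baseline estimate, and denoises by local averaging; this is a simplified stand-in for the wavelet-based approach described in the talk, and all signal parameters are invented for illustration.

```python
import numpy as np

def rolling_min_baseline(y, window):
    """Estimate a slowly varying baseline as a local minimum filter."""
    n = len(y)
    base = np.empty(n)
    half = window // 2
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        base[i] = y[lo:hi].min()
    return base

def moving_average(y, window):
    """Simple denoising by local averaging (a crude stand-in for wavelet shrinkage)."""
    kernel = np.ones(window) / window
    return np.convolve(y, kernel, mode="same")

rng = np.random.default_rng(0)
mz = np.linspace(0, 1, 2000)
baseline = 5.0 * np.exp(-3.0 * mz)                        # drifting baseline
peak = 10.0 * np.exp(-0.5 * ((mz - 0.4) / 0.005) ** 2)    # one synthetic peak
spectrum = baseline + peak + rng.normal(0.0, 0.3, mz.size)

corrected = spectrum - rolling_min_baseline(spectrum, window=200)
denoised = moving_average(corrected, window=9)
peak_index = int(np.argmax(denoised))
```

Real spectra would need more careful baseline and noise models; the point here is only the order of operations: baseline correction, then denoising, then peak localization.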
Antibody arrays are a fairly recent technology for high-throughput protein expression measurement. They typically screen a few hundred proteins simultaneously. We discuss image analysis and normalization for this platform. We further address the question of comparing protein measurements with RNA-level measurements.
Our data come from human brain tissue. Protein expression was screened by BD Biosciences antibody arrays; RNA levels were measured by Affymetrix chips.
We have recently developed a new DNA microarray-based technology, termed protein binding microarrays (PBMs), that allows rapid, high-throughput characterization of the in vitro DNA binding site sequence specificities of transcription factors in a single day. Using PBMs, we identified the DNA binding site sequence specificities of the yeast transcription factors Abf1, Rap1, and Mig1. Comparison of these proteins' in vitro binding sites versus their in vivo binding sites indicates that PBM-derived sequence specificities can accurately reflect in vivo DNA sequence specificities. In addition to previously identified targets, Abf1, Rap1, and Mig1 bound to 107, 90, and 75 putative new target intergenic regions, respectively, many of which were upstream of previously uncharacterized open reading frames (ORFs). Comparative sequence analysis indicates that many of these newly identified sites are highly conserved across five sequenced sensu stricto yeast species and thus are likely to be functional in vivo binding sites that potentially are utilized in a condition-specific manner. Similar PBM experiments will likely be useful in identifying novel cis regulatory elements and transcriptional regulatory networks in various genomes.
A key step in deciphering the complex puzzle of transcriptional regulatory networks is the identification of the target genes of transcription factors (TFs). In general, TFs bind to specific sequence motifs present in the promoter regions of target genes, and participate in combinatorial interactions with TFs of other signaling networks (transcriptional modules). A recent technology called ChIP-on-chip, or chromatin immunoprecipitation followed by DNA microarray analysis, has proven to be an efficient means of mapping TF-promoter interactions. We will describe an integrative computational genomics approach to analyzing the data generated from ChIP-on-chip experiments. The integrative approach involves TF binding site detection, comparative promoter analysis of orthologous genes (http://bioinformatics.med.ohio-state.edu/OMGProm; Palaniswamy et al. 2004, Bioinformatics), and the application of the Classification and Regression Tree (CART) data-mining method.
The estrogen receptor (ER) regulates gene expression by either direct binding to estrogen response elements (EREs) or indirect tethering to other TFs on promoter targets. In order to identify these promoter sequences, we conducted a genome-wide screen with ChIP-on-chip. A set of 70 candidate ER loci was identified, and the corresponding promoter sequences were analyzed by statistical pattern recognition and comparative genomics approaches. We found mouse counterparts for 63 of these loci, and classified 42 (67%) as direct ER targets using a CART statistical model, which uses position weight matrices, human-mouse sequence similarity scores, and the presence of other TF binding sites near the ERE as model parameters. The remaining genes were considered to be indirect targets. To validate this computational prediction, we conducted an additional ChIP-on-chip assay that identified acetylated chromatin components in active ER promoters. Of 27 loci upregulated in an ER-positive breast cancer cell line, the 20 having mouse counterparts were correctly predicted by the CART model. The CART model identified four different modules of combinatorial control of ER, based on overrepresentation of other TF binding sites near the ERE in ER target promoters. One of the identified modules (ERE+AP1) is already known; experimental validation is required for the other predicted modules. Further details about the computational method and the ER target gene database can be found at http://bioinformatics.med.ohio-state.edu/ERTargetDB/, and were described in Jin et al. 2004, Nucleic Acids Research, 32: 6627-6635 & Leu et al. 2004, Cancer Research, 64: 8184-8192.
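As a rough illustration of the kind of CART classification described above (not the authors' actual model), the sketch below trains a classification tree on three invented promoter features: an ERE position-weight-matrix score, a human-mouse similarity score, and a count of nearby cofactor binding sites. All feature distributions are fabricated for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
n = 200
# Hypothetical features per promoter (all invented): ERE PWM score,
# human-mouse sequence similarity, and count of cofactor sites near the ERE.
direct = np.column_stack([
    rng.normal(8.0, 1.0, n),    # strong ERE PWM score
    rng.normal(0.8, 0.1, n),    # high conservation
    rng.poisson(3, n),          # several cofactor sites nearby
])
indirect = np.column_stack([
    rng.normal(4.0, 1.0, n),
    rng.normal(0.5, 0.1, n),
    rng.poisson(1, n),
])
X = np.vstack([direct, indirect])
y = np.array([1] * n + [0] * n)  # 1 = direct ER target, 0 = indirect

# A shallow CART tree yields human-readable cut-offs on each feature.
cart = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
accuracy = cart.score(X, y)
```

Shallow trees are attractive here because their splits read directly as cut-off rules on PWM score and conservation, which mirrors how such models are usually reported.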
Tissue microarrays (TMAs) are a new high-throughput tool for the study of protein expression patterns in tissues and are increasingly used to evaluate the diagnostic and prognostic importance of tumor biomarkers. TMA data are challenging to analyze: covariates are either ordinal variables or highly skewed percentages. Since it is standard practice in the TMA community to use cut-off values for tumor marker expression, it is natural to apply tree-based methods. We describe different supervised and unsupervised learning methods based on survival trees and random forests (Breiman 2001). We describe a novel strategy (random forest clustering) for tumor profiling based on tissue microarray data. Random forest clustering is attractive for tissue microarray and other immunohistochemistry data since it handles highly skewed tumor marker expressions well and weighs the contribution of each marker according to its relatedness to other tumor markers. The real data application is the first tumor class discovery analysis of renal cell carcinoma patients based on protein expression profiles.
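A minimal sketch of the random-forest-clustering idea: real observations are contrasted against a synthetic reference set created by permuting each marker independently, a random forest is trained to separate the two, a proximity matrix is read off from shared tree leaves, and 1 - proximity is fed to hierarchical clustering. The marker data, group structure, and parameters below are all invented for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(2)
# Two hypothetical tumor groups with skewed (gamma-distributed) marker values.
group_a = rng.gamma(2.0, 1.0, size=(40, 5))
group_b = rng.gamma(2.0, 1.0, size=(40, 5)) + np.array([3, 3, 0, 0, 0])
X = np.vstack([group_a, group_b])

# Reference data: each marker permuted independently, destroying correlations.
X_ref = np.column_stack([rng.permutation(col) for col in X.T])
data = np.vstack([X, X_ref])
labels = np.array([1] * len(X) + [0] * len(X_ref))  # real vs. synthetic

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(data, labels)

# Proximity: fraction of trees in which two observations land in the same leaf.
leaves = forest.apply(X)                              # (n_samples, n_trees)
prox = np.mean(leaves[:, None, :] == leaves[None, :, :], axis=2)

# Cluster on the dissimilarity 1 - proximity.
dist = squareform(1.0 - prox, checks=False)
clusters = fcluster(linkage(dist, method="average"), t=2, criterion="maxclust")
```

Because proximities are defined by tree splits, monotone transformations of a skewed marker leave the clustering unchanged, which is the property that makes this approach attractive for immunohistochemistry data.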
High-throughput genomic technologies necessitate automated data processing and analysis through multiple stages: system design (e.g., annotation and probe design), primary analysis (e.g., image analysis, background subtraction, signal processing, outlier detection), and secondary analysis (experimental design, normalization, filtering, hypothesis testing, classification, and prediction). A brief conceptual overview of data analysis on Applied Biosystems' high-throughput genotyping and gene expression platforms will be used to tie together common analytical themes.
It is widely hoped that the study of sequence variation in the human genome will provide a means of elucidating the genetic component of complex disease. The existence of substantial linkage disequilibrium (LD) in the human genome suggests that it should be possible to select a subset of single-nucleotide polymorphisms (SNPs) that optimally retains the overall informativeness, with respect to disease or trait association, of the entire set. This in turn would both reduce the costs of exhaustive genotyping and ameliorate the challenging problem of statistical inference from grossly overdetermined models. To confirm disease associations discovered in one ethnic group, it is desirable to know whether a set of haplotype tagging SNPs (tSNPs) selected in one population will be informative in other populations.
Tagging SNPs (tSNPs) were selected from about 20,000 SNPs closely spaced along three human chromosomes (6, 21, and 22). A sample of 45 individuals from each of four human populations was genotyped to obtain information on the patterns of LD in each population for the selection of tagging SNPs. We used haplotype r2 as a metric of information to select minimal informative subsets of tSNPs. The number of tSNPs needed to maintain a haplotype r2 above a critical threshold was computed separately for each population: more tSNPs were needed in African Americans than in Caucasians, Chinese, or Japanese. The effect of SNP density was also examined: when subsets of SNPs of various sizes were sampled and the degree to which each subset tagged the 'hidden' SNPs was calculated, the hidden SNPs were more completely tagged in Caucasians and Asians than in African Americans. We also computed the average haplotype r2 of tSNPs selected in one population when applied to the others. The average haplotype r2 of Caucasian tSNPs used in Caucasians was very close to that of Asian or African American tSNPs used in Caucasians; similarly, the average haplotype r2 of African American tSNPs used in African Americans was very close to that of Asian or Caucasian tSNPs used in African Americans. These results indicate that tSNPs selected in one population will work reasonably well in other populations, at least with regard to common haplotypes. About 65% of tSNPs were found to be common across populations when selected without optimization for overlap.
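A toy version of tag-SNP selection, using pairwise r2 as a simpler stand-in for the haplotype r2 metric used in the study: tags are chosen greedily until every SNP exceeds an r2 threshold with at least one tag. The simulated LD-block genotypes and the threshold are assumptions for illustration.

```python
import numpy as np

def pairwise_r2(genotypes):
    """Squared Pearson correlation between all SNP pairs (columns = SNPs)."""
    return np.corrcoef(genotypes, rowvar=False) ** 2

def greedy_tag_snps(genotypes, threshold=0.8):
    """Greedily pick tags until every SNP has r^2 >= threshold with some tag."""
    r2 = pairwise_r2(genotypes)
    untagged = set(range(r2.shape[0]))
    tags = []
    while untagged:
        # Choose the SNP that tags the most currently untagged SNPs.
        best = max(untagged, key=lambda s: sum(r2[s, t] >= threshold for t in untagged))
        tags.append(best)
        untagged -= {t for t in untagged if r2[best, t] >= threshold}
    return tags

rng = np.random.default_rng(3)
# Simulate 45 individuals x 30 SNPs in three LD blocks: within a block, each
# SNP copies a shared "core" genotype (0/1/2) with high probability.
blocks = []
for _ in range(3):
    core = rng.integers(0, 3, size=(45, 1))
    noise = rng.integers(0, 3, size=(45, 10))
    blocks.append(np.where(rng.random((45, 10)) < 0.95, core, noise))
genotypes = np.hstack(blocks)

tags = greedy_tag_snps(genotypes, threshold=0.8)
```

In this block-structured toy, the greedy pass collapses each LD block to a handful of tags; the real study works with haplotype r2 and population-specific LD patterns rather than this pairwise shortcut.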
Sequencing of the human genome has revealed approximately 55 human bZIP transcription factors that can form homo- or heterodimers to regulate a wide variety of biological processes. The information necessary for dimerization specificity is encoded in the coiled-coil or "leucine-zipper" domains of these proteins. We have used protein microarrays to carry out a comprehensive analysis of the intrinsic interaction specificities of the bZIPs. By paying particular attention to issues such as purity, valency, and oxidation state, we have obtained high-quality interaction data, as judged by reproducibility, symmetry, and agreement with solution studies. Our measurements of over 1,400 unique pairwise combinations show that bZIP coiled-coil interactions are sparse and highly selective in vitro. The resulting data are valuable for understanding combinatorial regulation of transcription by the bZIPs. The array technology is likely to be valuable for analyzing other protein domain and peptide interactions as well.
The bZIP microarray data provide an excellent foundation for computational studies of protein sequence and structural features that are important for interaction specificity. A support vector machine (SVM) developed by Mona Singh (Princeton University Computer Science) trained on coiled-coil data from the literature does an excellent job predicting bZIP interaction preferences. We are working to combine machine-learning methods such as the SVM with physical modeling of protein structure. An integration of diverse approaches promises better performance, as well as an improved understanding of the underlying determinants of protein-protein interaction specificity. The SVM method for predicting bZIP coiled-coil interactions is available for interactive use at http://compbio.cs.princeton.edu/bzip/.
Joint work with Mona Singh.
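As a toy illustration of the SVM approach (not the actual predictor at the URL above), the sketch below trains a support vector classifier on two invented per-pair features meant to evoke charge complementarity at g-e' positions and hydrophobic core packing at a-d positions; all feature values are fabricated.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(6)
n = 150
# Hypothetical per-pair features (both invented): a charge-complementarity
# score and a core-packing score for a pair of leucine-zipper sequences.
interacting = np.column_stack([rng.normal(1.0, 0.5, n), rng.normal(1.0, 0.5, n)])
noninteracting = np.column_stack([rng.normal(-1.0, 0.5, n), rng.normal(-1.0, 0.5, n)])
X = np.vstack([interacting, noninteracting])
y = np.array([1] * n + [0] * n)  # 1 = pair interacts

svm = SVC(kernel="rbf", C=1.0).fit(X, y)
train_acc = svm.score(X, y)
```

A real predictor would derive many such features directly from aligned heptad positions of the two sequences; the point here is only the shape of the learning problem (pairwise features in, interact/no-interact label out).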
Motivation: Transcription factors (TFs) regulate gene expression by recognizing and binding to specific regulatory regions on the genome, which in higher eukaryotes can occur far away from the regulated genes. Recently, Affymetrix developed high-density oligonucleotide arrays that tile all the non-repetitive sequences of the human genome at 35-bp resolution. This new array platform allows unbiased mapping of in vivo TF binding sequences (TFBSs) using chromatin immunoprecipitation followed by microarray experiments (ChIP-chip). The massive data generated from these experiments pose great challenges for data analysis.
Results: We developed a fast, scalable, and sensitive method to extract TFBSs from ChIP-chip experiments on genome tiling arrays. Our method takes advantage of tiling array data from many experiments to normalize and model the behavior of each individual probe, and identifies TFBSs using a hidden Markov model (HMM). When applied to the data of p53 ChIP-chip experiments (Cawley et al., 2004), our method discovered many new high-confidence p53 targets, including all the regions verified by quantitative PCR. Using the de novo motif-finding algorithm MDscan (Liu et al., 2002), we also recovered the p53 motif from our HMM-identified p53 target regions. Furthermore, we found substantial p53 motif enrichment in these regions compared with both the genomic background and the TFBSs identified by Cawley et al. (2004). Several of the newly identified p53 TFBSs are in the promoter regions of known genes or are associated with previously characterized p53-responsive genes.
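A minimal sketch of the two-state HMM idea: probe-level log-ratios are modeled with Gaussian emissions under an "unbound" and a "bound" state, and the most likely state path is recovered by Viterbi decoding. The transition probabilities, state means, and simulated signal below are assumptions for illustration, not the parameters of the published method.

```python
import numpy as np

def viterbi_two_state(log_ratios, p_stay=0.99, mu=(0.0, 2.0), sigma=1.0):
    """Viterbi decoding of a two-state (0 = unbound, 1 = bound) Gaussian HMM."""
    n = len(log_ratios)
    log_trans = np.log(np.array([[p_stay, 1 - p_stay],
                                 [1 - p_stay, p_stay]]))
    # Gaussian log-emissions per state; constant terms cancel in the comparison.
    emit = np.array([-0.5 * ((log_ratios - m) / sigma) ** 2 for m in mu])
    score = np.zeros((2, n))
    back = np.zeros((2, n), dtype=int)
    score[:, 0] = np.log([0.9, 0.1]) + emit[:, 0]
    for t in range(1, n):
        for s in (0, 1):
            cand = score[:, t - 1] + log_trans[:, s]
            back[s, t] = int(np.argmax(cand))
            score[s, t] = cand[back[s, t]] + emit[s, t]
    # Backtrace the most probable state path.
    path = np.zeros(n, dtype=int)
    path[-1] = int(np.argmax(score[:, -1]))
    for t in range(n - 1, 0, -1):
        path[t - 1] = back[path[t], t]
    return path

rng = np.random.default_rng(4)
signal = rng.normal(0.0, 1.0, 500)       # background probe log-ratios
signal[200:240] += 2.0                   # a simulated 40-probe bound region
path = viterbi_two_state(signal)
```

The high self-transition probability is what lets the HMM smooth over noisy individual probes while still calling contiguous enriched regions, which is the key advantage over thresholding probes one at a time.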
Biomedical investigators have, in the post-genome era, been quick in their efforts to develop powerful new technologies such as DNA microarrays, high-throughput genotyping, and proteomics methods to decipher gene function on a global scale. But a number of bottlenecks must still be overcome before these technologies are fully useful, and this presentation addresses some of these issues. One example is the analysis of the rich abundance of genetic variations in the human genome (such as single nucleotide polymorphisms, SNPs) in populations; these variations make ideal genetic markers for identifying genetic factors associated with complex diseases, but high-throughput methodology is required to obtain statistical power in the analysis. Another current example is transcript profiling projects using DNA microarrays, which facilitate the monitoring of thousands of genes in parallel. These DNA chips are in routine use in many research projects, but the data analysis is still debated. Furthermore, the relatively poor annotation of the human proteome has hindered a full exploration of many of these high-throughput methods. A recent effort at our department, the Human Proteome Resource program, aims to describe the localisation of human proteins and will therefore be an important tool in future efforts to analyse gene function in an integrated manner.
I will give an overview of our effort to automatically extract pathway information from a large number of full-text research articles (the GeneWays system), to automatically curate the extracted information, and to combine the literature-derived information with sequence and experimental (such as yeast two-hybrid) data using a probabilistic approach.
Forward genetics approaches to identifying genes for complex traits such as common human diseases have met with limited success. Fine mapping of linkage regions associated with complex traits and validation of positional candidates are time-consuming and often hit-or-miss. Here we detail a hybrid procedure to map loci for complex traits that leverages the strengths of both forward and reverse genetics approaches. By intersecting genotypic and expression data in a segregating mouse population, we demonstrate how clusters of expression quantitative trait loci (eQTL) linking to regions of the genome controlling complex traits accurately reflect the underlying perturbation to the transcriptional network induced by DNA variations in the genes that control those traits. By matching patterns of gene expression in a segregating population with gene expression responses induced by single-gene perturbation experiments, we demonstrate how genes controlling clusters of expression and clinical QTL (QTL "hot spot" regions) can be directly mapped. The utility of this approach is demonstrated by mapping a novel susceptibility locus for a previously identified QTL in an F2 cross between strains C57BL/6J (B6) and DBA/2J (DBA), with pleiotropic effects on body fat, lipid levels, and bone density. Our results demonstrate that integrating microarray analysis with genetic and clinical data in segregating populations is a powerful approach for directly identifying the genes underlying QTLs.
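The genotype-expression intersection can be illustrated with a toy single-marker eQTL scan: simulate F2 genotypes, let one hypothetical marker drive a transcript, and scan the squared correlation of expression with each marker. All sample sizes and effect parameters below are invented.

```python
import numpy as np

rng = np.random.default_rng(7)
n_mice, n_markers = 100, 50
# F2 genotypes coded 0/1/2 at each marker (toy simulation, no real map).
genotypes = rng.integers(0, 3, size=(n_mice, n_markers))
# A transcript whose abundance is driven by marker 17 (hypothetical eQTL).
expression = 0.8 * genotypes[:, 17] + rng.normal(0.0, 0.5, n_mice)

# Single-marker scan: squared correlation of expression with each marker.
r2 = np.array([
    np.corrcoef(genotypes[:, m], expression)[0, 1] ** 2
    for m in range(n_markers)
])
peak_marker = int(np.argmax(r2))
```

In the approach above, many transcripts are scanned this way, and co-localized eQTL peaks ("hot spots") point at shared regulatory loci; a single transcript's scan is just the basic building block.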
We have developed a systematic approach to understanding both the biological causes of breast cancer and the mechanisms of therapeutic effects by applying genome-scale technologies to a panel of cell lines.
We have shown that the panel of approximately 50 cell lines captures most, if not all, of the variability found in tumors at the level of DNA structure and copy number, RNA expression, and protein expression. Key technologies that we are employing are reverse-phase protein lysate arrays for measuring absolute protein abundances, and a set of technologies centered on the recently introduced Affymetrix HTA system, which gives us enormous throughput for expression analysis, DNA copy analysis, and SNP genotyping at substantially reduced cost.
Transcriptional regulatory networks exhibit a high level of complexity, involving a large number of genes and gene products and their associations, unknown protein-DNA interaction mechanisms, and the transient activation of the networks by signal transduction pathways. Recent advances in genomic studies have provided many types of data related to transcriptional regulation, such as large-scale RNA expression measurements, in vivo DNA-protein binding data, protein-protein interactions, and genome sequences. However, each type of data contains only partial information on transcriptional regulation, and is often accompanied by large measurement errors. In this study, our goal is to develop a statistical framework that explicitly separates the mechanism model from the measurement error model, allows flexible integration of various data types, and assists the learning of mechanisms from data. Here, we will present a measurement error model that integrates RNA expression data and protein-DNA binding data. In this model, a linear system is assumed to describe gene expression as the response to regulation by a set of proteins (transcription factors). Our simulation results showed that this data integration model may reduce the measurement errors in protein-DNA interaction data. We also applied the method to yeast cell cycle data, and our results will be discussed in this talk.
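A minimal sketch of the linear-system component, with invented dimensions: expression is generated as a binding matrix times TF activities, binding is observed with measurement error, and activities are recovered by plain least squares. The talk's framework models the measurement error explicitly; this toy does not, and serves only to show the shape of the linear model.

```python
import numpy as np

rng = np.random.default_rng(5)
n_genes, n_tfs = 300, 5

# True (unobserved) binding strengths: sparse, nonnegative.
binding_true = rng.random((n_genes, n_tfs)) * (rng.random((n_genes, n_tfs)) < 0.3)
activity_true = rng.normal(0.0, 2.0, n_tfs)

# Observed data: expression from the linear model plus noise, and a binding
# matrix corrupted by measurement error (as in ChIP-style binding data).
expression = binding_true @ activity_true + rng.normal(0.0, 0.5, n_genes)
binding_obs = binding_true + rng.normal(0.0, 0.05, binding_true.shape)

# Least-squares estimate of TF activities from the observed (noisy) data.
activity_hat, *_ = np.linalg.lstsq(binding_obs, expression, rcond=None)
```

With larger binding-matrix noise, plain least squares becomes biased (an errors-in-variables problem), which is exactly the motivation for separating the mechanism model from the measurement error model as described above.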
Although a wide variety of bioinformatic tools have been described to characterize potential transcriptional regulatory mechanisms based on genomic sequence analysis and microarray hybridization studies, these regulatory pathways are most often experimentally verified using transient transfection methods. Current transfection methods are limited either by the large scale of existing methods or by low transfection efficiency for certain cell types. Our goals were to develop a microarray-based transfection method that could be optimized for different cell types and would be useful in reporter assays of transcriptional regulation. Here we describe a novel transfection method, termed STEP (Surface Transfection and Expression Protocol), which employs microarray-based DNA transfection of adherent cells in the functional analysis of transcriptional regulation. In STEP, recombinant proteins with biological activities engineered to enhance transfection are complexed with plasmid or linear expression vectors prior to spotting on microscope slides. The recombinant proteins of the STEP complex can be varied to increase efficiency for different cell types. We demonstrate that STEP efficiently transfects both supercoiled plasmids and PCR-generated linear expression cassettes. A co-transfection assay using effector expression vectors encoding the cAMP-dependent protein kinase (PKA), together with reporter vectors containing PKA-regulated promoters, demonstrates that STEP transfection allows detection and quantitation of transcriptional regulation by this protein kinase. Since bioinformatic studies often identify many putative regulatory elements and signaling pathways, this approach should be of utility in high-throughput functional genomic studies of transcriptional regulation.