Workshop 4 Abstracts and Lecture Materials:
Author: Keith Baggerly, MD Anderson Cancer Center
Title: Proteomic Profiling: Experimental Design and Processing Spectra
Over the past few years, several studies have used mass spectrometry
based proteomic profiling to identify differences between diseased
and healthy samples. These apparent successes have been heralded
as providing gains for early diagnosis and/or monitoring. However,
there have been problems with reproducibility, associated with issues
of both experimental design and preprocessing.
In this talk, we provide a brief introduction to the most widely
used "high-throughput" types of mass spectrometry, MALDI and SELDI.
We discuss some of the issues with experimental design, highlighting
one study where things went wrong and another where things went
right. We then discuss how reproducibility can be enhanced by processing
the spectra: e.g., using wavelets to denoise and correct for baseline,
and using averaging to better localize peaks. Finally, time permitting,
we try to test our processing algorithms by simulating spectra,
letting us check our answers in cases when "truth" can be known.
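As a rough illustration of the simulation idea in the abstract above, the following minimal sketch generates noisy spectra with one peak at a known position, averages them, and checks where the peak is found. It is not the speaker's pipeline: the Gaussian peak shape, the noise level, the number of spectra, and all function names are assumptions chosen for illustration, and wavelet denoising and baseline correction are left out entirely.

```python
import math
import random


def simulate_spectrum(rng, n_points=200, peak_index=80, peak_height=1.0,
                      peak_width=2.0, noise_sd=0.5):
    """Return one noisy spectrum with a single Gaussian peak at a known position.

    All default values are hypothetical; they are not taken from the talk.
    """
    return [
        peak_height * math.exp(-((i - peak_index) ** 2) / (2 * peak_width ** 2))
        + rng.gauss(0.0, noise_sd)
        for i in range(n_points)
    ]


def mean_spectrum(spectra):
    """Average the spectra point by point."""
    n = len(spectra)
    return [sum(values) / n for values in zip(*spectra)]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


rng = random.Random(0)  # fixed seed so the simulation is repeatable
true_peak = 80          # the known "truth" we can check against
spectra = [simulate_spectrum(rng, peak_index=true_peak) for _ in range(200)]

found_in_mean = argmax(mean_spectrum(spectra))  # peak located in the average
found_in_single = argmax(spectra[0])            # peak located in one raw spectrum
```

Because the true peak position is set by the simulation, `found_in_mean` and `found_in_single` can each be compared directly with `true_peak`. Averaging 200 spectra shrinks the noise standard deviation by a factor of about 14, which is why the mean spectrum is expected to localize the peak more reliably than any single spectrum.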
Author: Julia Brettschneider, Statistics and Biostatistics, University
of California
Title: Antibody array data analysis and comparison with gene expression
Streaming Video: Real Media
Antibody arrays are a fairly recent technology for high-throughput
protein expression measurement. They typically screen a few hundred
proteins simultaneously. We discuss image analysis and normalization
for this platform. We further address the question of comparing
protein measurements with RNA level measurements.
Our data come from human brain tissue. Protein expression was
screened with BD Biosciences antibody arrays; RNA levels were measured
with Affymetrix chips.
Author: Martha L. Bulyk, Medicine, Pathology, and Health Sciences
& Technology, Brigham & Women's Hospital and Harvard Medical
School
Title: Rapid Analysis of the DNA Binding Specificities of Transcription
Factors with DNA Microarrays
We have recently developed a new DNA microarray-based technology,
termed protein binding microarrays (PBMs), that allows rapid, high-throughput
characterization of the in vitro DNA binding site sequence specificities
of transcription factors in a single day. Using PBMs, we identified
the DNA binding site sequence specificities of the yeast transcription
factors Abf1, Rap1, and Mig1. Comparison of these proteins' in vitro
binding sites versus their in vivo binding sites indicates that
PBM-derived sequence specificities can accurately reflect in vivo
DNA sequence specificities. In addition to previously identified
targets, Abf1, Rap1, and Mig1 bound to 107, 90, and 75 putative
new target intergenic regions, respectively, many of which were
upstream of previously uncharacterized open reading frames (ORFs).
Comparative sequence analysis indicates that many of these newly
identified sites are highly conserved across five sequenced sensu
stricto yeast species and thus are likely to be functional in vivo
binding sites that potentially are utilized in a condition-specific
manner. Similar PBM experiments will likely be useful in identifying
novel cis regulatory elements and transcriptional regulatory networks
in various genomes.
Author: Ramana V. Davuluri, Human Cancer Genetics Program, Comprehensive
Cancer Center, Dept. of Molecular Virology, Immunology & Medical
Genetics, Ohio State University
Title: Identifying Estrogen Receptor α Target Genes Using Integrated
Computational Genomics and Chromatin Immunoprecipitation Microarray
Presentation materials: PDF
Streaming Video: Real Media
The key aspect in deciphering the complex puzzle of transcriptional
regulatory networks is the identification of target genes of transcription
factors (TFs). In general, TFs bind to specific sequence motifs
present in the promoter regions of target genes, and participate
in combinatorial interaction with TFs of other signaling networks
(transcriptional modules). A recent technology called ChIP-on-chip,
or chromatin immunoprecipitation followed by DNA microarray analysis,
has proven to be an efficient means of mapping TF-promoter interactions.
We will describe an integrative computational genomics approach
to analyze the data generated from ChIP-on-chip experiments. The
integrative approach involves TF binding site detection, comparative
promoter analysis of orthologous genes (http://bioinformatics.med.ohio-state.edu/OMGProm;
Palaniswamy et al. 2004, Bioinformatics) and the application of
the Classification And Regression Tree (CART) data-mining method.
The estrogen receptor α (ERα) regulates gene expression by either
direct binding to estrogen response elements (EREs) or indirect
tethering to other TFs on promoter targets. In order to identify
these promoter sequences, we conducted a genome-wide screening with
ChIP-on-chip. A set of 70 candidate ER loci was identified and
the corresponding promoter sequences were analyzed by statistical
pattern recognition and comparative genomics approaches. We found
mouse counterparts for 63 of these loci, and classified 42 (67%)
as direct ER targets using a CART statistical model, which uses the
position weight matrix, human-mouse sequence similarity scores, and the
presence of other TF binding sites near the ERE as model parameters.
The remaining genes were considered to be indirect targets. To validate
this computational prediction, we conducted an additional ChIP-on-chip
assay that identified acetylated chromatin components in active
ER promoters. Of 27 loci upregulated in an ER-positive breast cancer
cell line, 20 that had mouse counterparts were correctly predicted
by the CART model. The CART model identified four different modules of combinatorial
control of ER based on over-representation of other TF binding sites
near the ERE in ER target promoters. One of the identified modules (ERE+AP1)
is already known, and experimental validation is required for the other
predicted modules. Further details about the computational method
and the ER target gene database can be found at http://bioinformatics.med.ohio-state.edu/ERTargetDB/
and are described in Jin et al. 2004, Nucleic Acids Research, 32:
6627-6635, and Leu et al. 2004, Cancer Research, 64: 8184-8192.
Author: Steve Horvath, Biostatistics and Human Genetics, University
of California, Los Angeles
Title: Statistical Methods for the Analysis of Tissue Microarray
Data
Presentation materials: PPT
Streaming Video: Real Media
Tissue microarrays (TMAs) are a new high-throughput tool for the study
of protein expression patterns in tissues and are increasingly used
to evaluate the diagnostic and prognostic importance of tumor biomarkers.
TMA data are challenging to analyze: covariates are either ordinal variables
or highly skewed percentages. Since it is standard practice in the
TMA community to use cut-off values for tumor marker expression values,
it is natural to apply tree-based methods. We describe different supervised
and unsupervised learning methods based on survival trees and random
forests (Breiman 2001). We describe a novel strategy (random forest
clustering) for tumor profiling based on tissue microarray data. Random
forest clustering is attractive for tissue microarray and other immunohistochemistry
data since it handles highly skewed tumor marker expressions well
and weighs the contribution of each marker according to its relatedness
with other tumor markers. The real data application is the first tumor
class discovery analysis of renal cell carcinoma patients based on
protein expression profiles.
Author: Fiona Laird Hyland, Computational Genetics, Applied Biosystems
Title: Automated analysis on high throughput genotyping and gene
expression platforms, and evaluating haplotype tagging SNPs selected
in one human population for their informativeness in other populations
High-throughput genomic technologies necessitate automated data
processing and analysis through multiple stages: System design (e.g.
annotation and probe design), primary analysis (e.g. image analysis,
background subtraction, signal processing, outlier detection), and
secondary analysis (experimental design, normalization, filtering,
hypothesis testing, classification and prediction). A brief conceptual
overview of data analysis on Applied Biosystems' genotyping and
gene expression high throughput platforms will be used to tie together
common analytical themes.
It is widely hoped that the study of sequence variation in the
human genome will provide a means of elucidating the genetic component
of complex disease. The existence of substantial linkage disequilibrium
(LD) in the human genome suggests that it should be possible to
select a subset of single-nucleotide polymorphisms (SNPs) that optimally
retain the overall informativeness with respect to disease or trait
association of the entire set. This in turn would both reduce the
costs of exhaustive genotyping and ameliorate the challenging problem
of statistical inference from grossly overdetermined models. To
confirm disease associations discovered in one ethnic group, it
is desirable to know whether a set of haplotype tagging SNPs (tSNPs)
selected in one population will be informative in other populations.
Tagging SNPs (tSNPs) were selected from about 20,000 SNPs closely
spaced along the axis of 3 human chromosomes (chrs. 6, 21, & 22).
A sample of 45 individuals from each of 4 human populations was
genotyped to obtain information on the patterns of LD in each of
the populations for the selection of tagging SNPs. We utilized haplotype
r² as a metric of information to select minimum informative subsets
of tSNPs. The number of tSNPs needed to maintain a haplotype r²
above a critical threshold was computed separately for each population:
more tSNPs were needed in African Americans than in Caucasians,
Chinese or Japanese. The effect of SNP density was examined. When
subsets of SNPs of various sizes were sampled, and the degree to
which the subset tagged the 'hidden' SNPs was calculated, the hidden
SNPs were more completely tagged in Caucasians and Asians than in
African Americans. The average haplotype r² of the Caucasian tSNPs
in Caucasians, and of the Caucasian tSNPs in African Americans,
Chinese and Japanese was computed, and vice versa. The average haplotype
r² of Caucasian tSNPs used in Caucasians was very close to the haplotype
r² of Asian tSNPs used in Caucasians, or African American tSNPs
used in Caucasians; similarly, the average haplotype r² of African
American tSNPs used in African Americans was very close to the haplotype
r² of Asian tSNPs used in African Americans, or Caucasian tSNPs used
in African Americans. These results indicate that tSNPs selected
in one population will work reasonably well in other populations,
at least with regard to common haplotypes. About 65% of tSNPs were
found to be common across populations when selected without optimization
for overlap.
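To make the idea of "tagging" hidden SNPs concrete, here is a minimal sketch that uses the simpler pairwise r^2 between two biallelic SNPs, r^2 = D^2 / (pA(1-pA) pB(1-pB)) with D = pAB - pA pB. This is a stand-in, not the haplotype r^2 metric used in the abstract. The toy haplotypes, the 0.8 threshold, and the function names are assumptions made for illustration only.

```python
def pairwise_r2(haplotypes, i, j):
    """Pairwise r^2 between biallelic SNPs i and j.

    `haplotypes` is a list of phased haplotypes, each a sequence of 0/1 alleles.
    """
    n = len(haplotypes)
    p_a = sum(h[i] for h in haplotypes) / n
    p_b = sum(h[j] for h in haplotypes) / n
    p_ab = sum(1 for h in haplotypes if h[i] == 1 and h[j] == 1) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic in the sample")
    d = p_ab - p_a * p_b  # linkage disequilibrium coefficient D
    return d * d / denom


def tagged_fraction(haplotypes, tags, threshold=0.8):
    """Fraction of non-tag ('hidden') SNPs with r^2 >= threshold to some tag SNP.

    The threshold value is an assumption; the abstract does not state one.
    """
    n_snps = len(haplotypes[0])
    hidden = [s for s in range(n_snps) if s not in tags]
    if not hidden:
        return 1.0
    covered = sum(
        1 for s in hidden
        if any(pairwise_r2(haplotypes, s, t) >= threshold for t in tags)
    )
    return covered / len(hidden)


# Toy phased haplotypes (hypothetical data): SNP 1 always matches SNP 0,
# while SNP 2 varies independently of SNP 0.
toy_haplotypes = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
r2_linked = pairwise_r2(toy_haplotypes, 0, 1)
r2_unlinked = pairwise_r2(toy_haplotypes, 0, 2)
coverage = tagged_fraction(toy_haplotypes, tags=[0])
```

In this toy sample, SNP 0 tags SNP 1 perfectly but carries no information about SNP 2, so using SNP 0 as the only tag covers half of the hidden SNPs. Running `tagged_fraction` with tags chosen in one population on haplotypes from another is one simple way to mimic the cross-population comparison described above.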
Speaker: Amy E. Keating, Dept. of Biology, Massachusetts Institute
of Technology
Authors: Amy E. Keating, Dept. of Biology, Massachusetts Institute
of Technology; Mona Singh, Dept. of Computer Science and Lewis-Sigler
Institute for Integrative Genomics, Princeton University
Title: Combinatorial associations of the human bZIP transcription
factors: protein microarray measurements and computational predictions
Streaming Video: Real Media
Sequencing of the human genome has revealed approximately 55 human
bZIP transcription factors that can form homo- or heterodimers to
regulate a wide variety of biological processes. The information
necessary for dimerization specificity is encoded in the coiled-coil
or "leucine-zipper" domains of these proteins. We have used protein
microarrays to carry out a comprehensive analysis of the intrinsic
interaction specificities of the bZIPs. By paying particular attention
to issues such as purity, valency and oxidation state, we have obtained
high quality interaction data, as judged by reproducibility, symmetry
and agreement with solution studies. Our measurements of over 1,400
unique pairwise combinations show that bZIP coiled-coil interactions
are sparse and highly selective in vitro. The resulting data are
valuable for understanding combinatorial regulation of transcription
by the bZIPs. The array technology is likely to be valuable for
analyzing other protein domain and peptide interactions as well.
The bZIP microarray data provide an excellent foundation for computational
studies of protein sequence and structural features that are important
for interaction specificity. A support vector machine (SVM) developed
by Mona Singh (Princeton University Computer Science) trained on
coiled-coil data from the literature does an excellent job predicting
bZIP interaction preferences. We are working to combine machine-learning
methods such as the SVM with physical modeling of protein structure.
An integration of diverse approaches promises better performance,
as well as an improved understanding of the underlying determinants
of protein-protein interaction specificity. The SVM method for predicting
bZIP coiled-coil interactions is available for interactive use at
http://compbio.cs.princeton.edu/bzip/.
Author: Xiaole Shirley Liu, Biostatistics, HSPH / DFCI
Title: A Hidden Markov Model for Analyzing ChIP-chip Experiments
on Genome Tiling Arrays and its Application to p53 Binding Sequences
Streaming Video: Real Media
Motivation: Transcription factors (TFs) regulate gene expression
by recognizing and binding to specific regulatory regions on the
genome, which in higher eukaryotes can occur far away from the regulated
genes. Recently, Affymetrix developed high-density oligonucleotide
arrays that tile all the non-repetitive sequences of the human genome
at 35-bp resolution. This new array platform allows for the unbiased
mapping of in vivo TF binding sequences (TFBSs) using Chromatin
ImmunoPrecipitation followed by microarray experiments (ChIP-chip).
The massive data generated from these experiments pose great challenges
for data analysis.
Results: We developed a fast, scalable and sensitive method to
extract TFBSs from ChIP-chip experiments on genome tiling arrays.
Our method takes advantage of tiling array data from many experiments
to normalize and model the behavior of each individual probe, and
identifies TFBSs using a Hidden Markov Model (HMM). When applied
to the data of p53 ChIP-chip experiments (Cawley et al., 2004),
our method discovered many new high confidence p53 targets including
all the regions verified by quantitative PCR. Using the de novo
motif-finding algorithm MDscan (Liu et al., 2002), we also recovered the
p53 motif from our HMM-identified p53 target regions. Furthermore,
we found substantial p53 motif enrichment in these regions compared
with both the genomic background and the TFBSs identified by Cawley
et al. (2004). Several of the newly identified p53 TFBSs are in known
genes' promoter regions or associated with previously characterized
p53-responsive genes.
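The HMM step can be illustrated with a minimal two-state Viterbi decoder over probe scores ordered along the genome. This is a sketch under stated assumptions, not the method presented in the talk: the Gaussian emissions with a shared standard deviation, the state means, the transition and start probabilities, and the toy scores are all hypothetical, and the per-probe normalization across many experiments described above is omitted.

```python
import math


def viterbi(observations, means, sd, stay_prob, start_probs):
    """Most likely state path for an HMM with Gaussian emissions.

    State s emits values around means[s] with a shared standard deviation sd;
    each state keeps itself with probability stay_prob.
    """
    n_states = len(means)

    def log_emit(x, s):
        # Gaussian log-density up to a constant that is the same for all states
        return -((x - means[s]) ** 2) / (2 * sd ** 2)

    def log_trans(a, b):
        if a == b:
            return math.log(stay_prob)
        return math.log((1 - stay_prob) / (n_states - 1))

    # Best log-score of any path ending in each state after the first probe
    score = [math.log(start_probs[s]) + log_emit(observations[0], s)
             for s in range(n_states)]
    back = []  # back[t][b] = best predecessor of state b at step t + 1
    for x in observations[1:]:
        pointers, new_score = [], []
        for b in range(n_states):
            best_a = max(range(n_states),
                         key=lambda a: score[a] + log_trans(a, b))
            pointers.append(best_a)
            new_score.append(score[best_a] + log_trans(best_a, b) + log_emit(x, b))
        back.append(pointers)
        score = new_score

    # Trace back from the best final state
    state = max(range(n_states), key=score.__getitem__)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Toy probe scores (hypothetical): a short enriched stretch inside background.
probe_scores = [0.0, 0.0, 0.0, 3.0, 3.0, 3.0, 0.0, 0.0, 0.0]
# State 0 = background, state 1 = bound region.
path = viterbi(probe_scores, means=[0.0, 3.0], sd=1.0,
               stay_prob=0.9, start_probs=[0.5, 0.5])
```

With these toy values the decoder labels the three enriched probes as state 1 and the rest as state 0. A high `stay_prob` encodes the expectation that neighbouring probes on a tiling array tend to share the same state, which is the main reason to prefer an HMM over thresholding each probe independently.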
Author: Joakim Lundeberg, AlbaNova University Center, Royal Institute
of Technology, Department of Biotechnology
Title: Molecular tools to analyse and elucidate gene function
Streaming Video: Real Media
In the post-genome era, biomedical investigators have been quick
to develop powerful new technologies, such as DNA microarrays,
high-throughput genotyping, and proteomics methods, to decipher gene
function at a global scale. However, a number of bottlenecks must be
overcome before these technologies are fully useful, and this
presentation addresses some of these issues. One example is the analysis
of the abundant genetic variation in the human genome (such as single
nucleotide polymorphisms, SNPs). Their abundance in populations makes
SNPs ideal genetic markers for identifying genetic factors associated
with complex diseases, but high-throughput methodology is required to
obtain statistical power in the analysis. Another current example is
transcript profiling using DNA microarrays, which facilitate monitoring
of thousands of genes in parallel. These DNA chips are in routine
use in many research projects, but the data analysis is still debated.
Furthermore, the relatively poor annotation of the human proteome
has hindered full exploration of many of these high-throughput
methods. A recent effort at our department, the Human Proteome
Resource program, aims to describe the localisation of human proteins
and will therefore be an important tool in future efforts to analyse
gene function in an integrated manner.
Author: Andrey Rzhetsky, Department of Biomedical Informatics, Center
for Computational Biology and Bioinformatics (C2B2), and Columbia
Genome Center, Columbia University
Title: Analysis of heterogeneous/noisy molecular interaction data
I will give an overview of our effort to automatically extract
pathway information from a large number of full-text research articles
(GeneWays system), automatically curate the extracted information,
and to combine the literature-derived information with sequence
and experimental (such as yeast two-hybrid) data using a probabilistic
approach.
Author: Eric Schadt, Research Genetics, Rosetta Inpharmatics
Title: Forward Genetics in Reverse: Integrating genotypic and expression
data in a segregating mouse population to map a novel susceptibility
locus with pleiotropic effects on obesity, bone density, and cholesterol
traits
Streaming Video: Real Media
Forward genetics approaches to identify genes for complex traits
such as common human diseases have met with limited success. Fine
mapping of linkage regions associated with complex traits and validation
of positional candidates are time consuming and often hit-or-miss.
Here we detail a hybrid procedure to map loci for complex traits
that leverages the strengths of forward and reverse genetics
approaches. By intersecting genotypic and expression data in a segregating
mouse population, we demonstrate how clusters of expression quantitative
trait loci (eQTL) linking to regions of the genome controlling for
complex traits accurately reflect the underlying perturbation to
the transcriptional network induced by DNA variations in genes that
control for the complex traits. By matching patterns of gene expression
in a segregating population with gene expression responses induced
by single gene perturbation experiments, we demonstrate how genes
controlling for clusters of expression and clinical QTL (QTL "hot
spot" regions) can be directly mapped. The utility of this approach
is demonstrated by mapping a novel susceptibility locus for a previously
identified QTL in an F2 cross between strains C57BL/6J (B6) and
DBA/2J (DBA), with pleiotropic effects on body fat, lipid levels,
and bone density. Our results demonstrate that integrating microarray
analysis with genetic and clinical data in segregating populations
is a powerful approach for directly identifying genes underlying
QTLs.
Author: Paul Spellman, Computational Scientist, Lawrence Berkeley
National Laboratory
Title: A Cell Line System for Understanding Breast Cancer
Streaming Video: Real Media
We have developed a systematic approach to understanding both the
biological causes of breast cancer and the mechanisms of therapeutic
effects by applying genome scale technologies to a panel of cell
lines.
We have shown that the panel of approximately 50 cell lines captures
most, if not all, of the variability found in tumors at the level
of the DNA structure and copy number, RNA expression, and protein
expression. Key technologies that we are employing are reverse phase
protein lysate arrays for measuring absolute protein abundances,
and a set of technologies centered around the recently introduced
Affymetrix HTA system which gives us enormous throughput for expression
analysis, DNA copy analysis, and SNP genotyping at substantially
reduced costs.
Author: Ning Sun, Biostat, EPH, Yale School of Medicine
Title: A Measurement Error Model for Inferring Transcriptional Regulatory
Networks
Transcriptional regulatory networks are highly complex: they involve
a large number of genes and gene products and their associations,
unknown protein-DNA interaction mechanisms, and the transient activation
of the networks by signal transduction pathways. Recent advances
in genomic studies have provided many types of data related to transcription
regulation, such as large-scale RNA expression measurements, in
vivo DNA-protein binding data, protein-protein interactions and
genome sequences. However, each type of data contains only partial
information on transcriptional regulation, and is often accompanied
by large measurement errors. In this study, our goal is to develop
a statistical framework to explicitly separate the mechanism model
from the measurement error model, to allow a flexible framework
to integrate various data types, and to assist the learning of mechanisms
from data. Here, we will present a measurement error model to integrate
RNA expression data and protein-DNA binding data. In this model,
a linear system model was assumed to describe gene expression as
the response of regulation from a set of proteins (transcription
factors). Our simulation results showed that this data integration
model may reduce the measurement errors in protein-DNA interaction
data. We also applied the method to yeast cell cycle data, and our
results will be discussed in this talk.
Author: Michael Uhler, Biological Chemistry, University of Michigan
Title: Microarray Transfection Analysis of Transcriptional Regulation
Although a wide variety of bioinformatic tools have been described
to characterize potential transcriptional regulatory mechanisms based
on genomic sequence analysis and microarray hybridization studies, these
regulatory pathways are most often experimentally verified using transient
transfection methods. Current transfection methods are largely limited
either by the large scale of existing methods or by the low level of
efficiency for certain cell types. Our goals were to develop a microarray-based
transfection method that could be optimized for different cell types
and would be useful in reporter assays of transcriptional regulation.
Here we describe a novel transfection method, termed STEP (Surface
Transfection and Expression Protocol), which employs microarray-based
DNA transfection of adherent cells in the functional analysis of transcriptional
regulation. In STEP, recombinant proteins with biological activities
engineered to enhance transfection are complexed with plasmid or linear
expression vectors prior to spotting on microscope slides. The recombinant
proteins of the STEP complex can be varied to increase the efficiency
for different cell types. We demonstrate that STEP efficiently transfects
both supercoiled plasmids and PCR-generated linear expression
cassettes. A co-transfection assay using effector expression vectors
encoding the cAMP-dependent protein kinase (PKA) as well as reporter
vectors containing PKA-regulated promoters demonstrates that STEP
transfection allows detection and quantitation of transcriptional
regulation by this protein kinase. Since bioinformatic studies often
result in the identification of many putative regulatory elements
and signaling pathways, this approach should be of utility in high-throughput
functional genomic studies of transcriptional regulation.