Workshop 3: Computational Proteomics and Mass Spectrometry

(January 11, 2005 - January 14, 2005)

Organizers


Vineet Bafna
Computer Science & Engineering, University of California, San Diego
Tim Ting Chen
Biology, Computer Science, & Mathematics, University of Southern California

Proteomics is defined as the study of the total protein complement of a cell. This broad definition covers a lot of ground, including, but not limited to, protein identification and quantification in specific cellular environments, structural genomics and fold recognition, identification and characterization of functional domains, and finally, the networks defining the interactions of proteins with bio-molecules (proteins, DNA, etc.). With the sequencing of the genome, and subsequent identification of the parts list (the genes and their protein products), there is a renewed emphasis on studying the proteome.

In this workshop, we will focus on emerging technologies for probing the proteome, with two focal points. The first is the computational analysis of mass spectrometry data. Simply speaking, a mass spectrum is a collection of masses and (relative) intensities of charged molecules. The spectrum of mass fragments of a protein (or peptide) sequence forms a fingerprint that can be used for identification and relative quantification. Post-translational modifications can be measured using characteristic shifts in the spectrum. Various computational issues arise in the analysis of mass spectrometry data for protein identification and quantification.
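As a rough illustration of the fingerprint idea described above, the following sketch computes singly charged b- and y-ion masses for a short peptide and shows how a modification shifts part of the spectrum. The function name `fragment_masses`, the example peptide, and the choice of phosphorylation as the modification are assumptions made for illustration only; the residue masses are standard monoisotopic values.

```python
# Standard monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565    # H2O
PROTON = 1.007276    # charge carrier for a singly charged ion
PHOSPHO = 79.96633   # HPO3, the mass shift of phosphorylation


def fragment_masses(peptide, mods=None):
    """Return (b_ions, y_ions) as lists of singly charged m/z values.

    `mods` maps a 0-based residue position to a mass shift in daltons.
    This helper is hypothetical and only models simple backbone cleavage.
    """
    mods = mods or {}
    masses = [RESIDUE_MASS[aa] + mods.get(i, 0.0) for i, aa in enumerate(peptide)]
    b_ions, y_ions = [], []
    total = 0.0
    for m in masses[:-1]:            # prefixes: b1 .. b(n-1)
        total += m
        b_ions.append(total + PROTON)
    total = 0.0
    for m in reversed(masses[1:]):   # suffixes: y1 .. y(n-1)
        total += m
        y_ions.append(total + WATER + PROTON)
    return b_ions, y_ions


if __name__ == "__main__":
    plain_b, plain_y = fragment_masses("GAS")
    phos_b, phos_y = fragment_masses("GAS", mods={2: PHOSPHO})  # phosphoserine
    for name, plain, phos in (("b", plain_b, phos_b), ("y", plain_y, phos_y)):
        for i, (p, q) in enumerate(zip(plain, phos), start=1):
            print(f"{name}{i}: {p:.4f} -> {q:.4f} (shift {q - p:+.4f})")
```

In this toy case the modification sits on the C-terminal serine, so every y-ion moves by the same amount while the b-ions stay put. That pattern is the kind of characteristic shift that can reveal and localize a modification.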

The second focus is the analysis of protein function, with an emphasis on combining evidence from emerging high-throughput technologies. Many techniques have been developed to profile protein function directly and indirectly. For example, multiple alignments of evolutionarily conserved protein domains provide direct annotation of the functions of a protein; gene expression profiles can be used to cluster proteins with similar functions; protein interactions show how proteins interact with one another to carry out the necessary functions. In particular, a large number of protein interactions have been identified recently by several large-scale techniques such as mass spectrometry, gene knockout, and yeast two-hybrid assays. Together, these data provide us with a global view of the protein network inside the cell. Analysis of such networks is critical to understanding the biological system at the molecular level.

The workshop will aim to bring together the leading researchers in these areas to describe the state of the art, and also to present problems that will challenge the next generation of Bioinformatics researchers.

Accepted Speakers

Vineet Bafna
Computer Science & Engineering, University of California, San Diego
Nuno Bandeira
Computer Science & Engineering, University of California, San Diego
Tim Ting Chen
Biology, Computer Science, & Mathematics, University of Southern California
Nathan Edwards
Bioinformatics & Computational Biology, College of Business and Management
Peter Harrington
Chemistry and Biochemistry, Ohio University
Oliver Kohlbacher
Simulation of Biological Systems, WSI - University of Tubingen
Douglas Kohn
Edison Biotechnology Institute & Neuroscience, Ohio University
Bin Ma
Department of Computer Science, University of Western Ontario
Alexey Nesvizhsky
Institute for Systems Biology
Scott Patterson
Molecular Sciences, Amgen, Inc.
Knut Reinert
Arbeitsgruppe Algorithmische Bioinformatik, Institut fur Informatik
Rovshan Sadygov
Self-employed Bioinformatician
Benno Schwikowski
Department of Systems Biology, Institut Pasteur
Brian Searle
Proteome Software, Inc.
Bernhard Spengler
Institute of Inorganic & Analytical Chemistry, Justus Liebig University Giessen
Fengzhu Sun
Department of Biological Sciences, University of Southern California
Alfred Yergey
Mass Spectrometry and Metabolism, National Institutes of Health

Tuesday, January 11, 2005
Time Session
09:15 AM
07:00 PM
Scott Patterson - Proteomics in Drug Discovery and Development: Computational Opportunities Abound

Parallel protein measurements, aka proteomics, have the potential to provide information on biological systems, whether in isolation as cell culture systems, in tissues, or in a whole organism. Whereas parallel measures of transcript (mRNA) abundance can be multiplexed more easily through microarray analysis of even small quantities of sample following amplification using PCR, parallel measures of protein abundance are more difficult due to the heterogeneity of protein properties compared with nucleic acids, and the inability to amplify the signal. Despite these drawbacks, much useful data can be generated, although the interpretation of such data sets is challenging. This presentation will focus on the kinds of datasets that are generated from a range of proteomics approaches including mass spectrometry (both LC-MS and MS/MS data) for unbiased analyses and multiplexed protein assays for targeted analyses. How and where such assays are employed and the pros and cons of such approaches will also be discussed.

10:30 AM
11:15 AM
Douglas Kohn - Proteomic Analysis of the Brains from Mice that Lack a Growth Hormone Receptor

Growth hormone (GH) regulates cell growth and differentiation primarily by modulating gene expression and metabolism in target tissues. Targeted disruption of the gene encoding the growth hormone receptor and binding protein (GHR/BP-/-) functionally inactivates GH and generates long-lived, dwarf mice with elevated circulating GH and markedly reduced insulin-like growth factor-1 (IGF-1) levels (1, 2). Indeed, insulin/IGF-1 signaling has been shown to be a critical determinant of lifespan in several species. GHR/BP-/- mice also have decreased fasting insulin and glucose levels (3) and appear to resist complications due to streptozotocin (STZ)-induced diabetes (4). To determine the consequences of the GHR/BP-/- mutation on gene expression in the central nervous system (CNS), brain tissue was harvested from normal and gene-disrupted mice at different developmental stages (young, adult and aged) and proteins were isolated from distinct subcellular fractions (nucleus, cytoplasm, polysomes) using differential gradient ultracentrifugation. The proteins in each fraction were resolved by two-dimensional gel electrophoresis and stained with the fluorescent dye SYPRO Orange. The images were captured with a high-resolution CCD camera (Bio-Rad Versa-Doc 3000) or a laser-scanning device (Fuji FLA-3000G) and quantitatively analyzed with the PDQuest or Image Gauge software packages. Differentially expressed proteins were manually excised from the gels and identified by mass spectrometry. Of the hundreds of proteins resolved, several were differentially expressed in the brains of GHR/BP-/- mice relative to controls. The goal is to identify those proteins whose expression patterns are spatially and temporally correlated and establish functional protein networks that may delay or attenuate age-related tissue dysfunction or diabetic complications. This work was supported in part by the State of Ohio's Eminent Scholar Program, which includes a gift from Milton and Lawrence Goll.



  1. Zhou et al., (1997) Proc Natl Acad Sci USA 94:13215-13220

  2. Coschigano et al., (2003) Endocrinology 144:3799-3810

  3. Coschigano et al., (2001) Endocrinology 141:2608-2613

  4. Bellush et al., (2000) Endocrinology 141:163-168

01:30 PM
02:15 PM
Peter Harrington - Fuzzy Entropy Classification Systems and Their Application to Mass Spectrometry of the Proteome

Mass spectrometry is a burgeoning method for proteomic studies because it is a high-throughput method that offers low detection limits and high selectivity. Our work has focused on matrix-assisted laser desorption/ionization mass spectrometry (MALDI-MS). The signals obtained from mass spectrometry are intricate and can be influenced by the experimental design. Classification methods are useful for detecting biomarkers and making predictions based on mass spectral signals. MS of proteins from noninvasive samples has potential as a medical tool for early diagnosis of disease. Spectra of amniotic fluid from women who had normal deliveries, normal deliveries with inflamed uteri, and premature deliveries were used to build classification models.


Fuzzy classification systems are considered soft methods that are based on the variance of the data. Soft methods are advantageous because they avoid overfitting the data and the curse of dimensionality. Fuzzy Rule-Building Expert Systems (FuRES) (1) are useful because an inductive classification tree is obtained as a model that may be interpreted. Using principal component compression, FuRES is simple, fast, reliable, and applicable to MS data. When FuRES is coupled with Latin-partition methods, precision bounds may be obtained for evaluation of predictability.


(1) Harrington, P. B. Journal of Chemometrics 1991, 5, 467-486.


Joint work with Nancy E. Vieira and Alfred L. Yergey.

02:45 PM
03:30 PM
Benno Schwikowski - Looking at the Whole Instead of the Parts: Combining multidimensional LC-MS data at the signal level

The analysis of complex protein mixtures by LC-MS is one of the key technologies for systematic large-scale observation and modeling of cellular processes. Mass spectrometry itself is exquisitely sensitive, and reproducibility at the signal level is high. It is known, however, that a significant number of peptides - especially those with modifications - are missed in current computational analyses. We present a new approach to the computational interpretation of the experimental data that globally integrates all data of one or more experiments instead of interpreting spectra one by one. Instead of attempting to detect the presence of a protein or its fragments from individual signals (peaks) in a single mass spectrum, all data acquired across a whole experiment are first aligned into an (n+1)-dimensional space, where n is the number of dimensions used for the LC separation. This condenses all peaks generated by the same protein fragment throughout the experiment into a single dense signal, which allows a much better separation of signal and noise. The main computational challenge is to compensate for fluctuations in the separation process. We will present algorithms for the implementation of this approach, and demonstrate its usefulness using case studies in one and two dimensions. This is joint work with Amol Prakash, and the research groups of Ruedi Aebersold and Amanda Pavlovitch in Seattle.

Wednesday, January 12, 2005
Time Session
09:00 AM
09:45 AM
Oliver Kohlbacher - Peptides, Markers, Targets: Algorithms for the identification of differentially expressed proteins

In this talk we describe methods to identify the differential expression of peptides and propose strategies to avoid MS/MS identification of peptides of interest. The algorithms are embedded in the freely available software library OpenMS, which is currently under development at the Freie Universität Berlin and the Eberhard Karls Universität Tübingen. We give an overview of the capabilities and design principles of OpenMS and demonstrate its ease of use. Finally, we describe projects in which OpenMS will be or has already been deployed and thereby demonstrate its versatility.

10:15 AM
11:00 AM
Knut Reinert - Signal Processing and Data Reduction for Differential Proteomics with HPLC/MS

In this talk we describe methods to reduce the amount of data obtained by (multi)-dimensional HPLC/MS experiments. The algorithms are embedded in the freely available software library OpenMS, which is currently under development at the Freie Universität Berlin and the Eberhard Karls Universität Tübingen. We give an overview of the goals and problems in differential proteomics with HPLC and then describe in detail the implemented approaches for signal processing, peak detection and data reduction currently employed in OpenMS.

11:15 AM
12:15 PM
Bin Ma - SPIDER: Software for Protein Identification from Sequence Tags Containing De Novo Sequencing Error

To identify novel proteins using MS/MS, the best approach is to search a protein sequence database with the sequence tags obtained by de novo sequencing. However, de novo sequencing very often gives only partially correct sequence tags. The most common type of error found in the sequence tags is same-mass segment replacement, i.e. a segment of amino acids is replaced with another one with the same mass. Current database search software such as MS-BLAST cannot handle these errors in the sequence tags. We developed a new efficient algorithm to align sequence tags from de novo sequencing with database sequences to identify proteins. This talk introduces the algorithms and implementation details of the SPIDER software.
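To make the same-mass replacement error concrete, the sketch below lists single residues whose mass matches that of a two-residue segment within a tolerance. It is not the SPIDER algorithm; the function name `same_mass_replacements` and the tolerance values are assumptions chosen for illustration, and only the standard monoisotopic residue masses are taken as given.

```python
from itertools import combinations_with_replacement

# Standard monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def same_mass_replacements(tolerance=0.001):
    """Return (residue, pair, difference) triples where a single residue
    and a two-residue segment differ in mass by at most `tolerance` Da.

    Pairs are reported with their letters in alphabetical order, because
    the order of residues inside a segment does not change its mass.
    """
    hits = []
    for a, b in combinations_with_replacement(sorted(RESIDUE_MASS), 2):
        pair_mass = RESIDUE_MASS[a] + RESIDUE_MASS[b]
        for single, mass in RESIDUE_MASS.items():
            diff = abs(mass - pair_mass)
            if diff <= tolerance:
                hits.append((single, a + b, diff))
    return hits


if __name__ == "__main__":
    for single, pair, diff in same_mass_replacements(0.02):
        print(f"{single} ~ {pair} (difference {diff:.5f} Da)")
```

With a tight tolerance, pairs such as GG versus N and AG versus Q are indistinguishable by mass alone, which is why a de novo tag can contain the wrong segment while still agreeing with the spectrum. Loosening the tolerance, as on a lower-accuracy instrument, adds further ambiguous pairs.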

02:00 PM
02:45 PM
Alfred Yergey - Analysis of Variance Coupled with Principal Component Analysis for the Characterization of Potential Biomarkers

Many researchers have reported biomarkers from mass spectrometry (MS) experiments. However, in many cases no experimental design was reported and the biomarkers did not correspond to any reliable signals in the data. In many cases artificial intelligence is used after the data are collected rather than using natural intelligence in designing the experimental measurements. This paper proposes and demonstrates the power of a rational experimental design applied to preliminary experiments directed towards discovery of biomarkers in amniotic fluid.


Combining analysis of variance with principal component analysis (ANOVA/PCA) provides a powerful tool for the discovery of biomarkers in chemical measurements of biological systems. This approach encourages the use of experimental design to separate the variation of the experimental hypothesis from other potentially confounding sources of variation. When the factors of the experiment are greater than the residual error, the variable loadings of the principal components can be interpreted without requiring a supervised rotation and thus avoid the curse of dimensionality that occurs with underdetermined data. A series of spectral score plots is obtained for each experimental factor, which allows easy interpretation by scientists who may not be proficient in advanced mathematical calculations. A conservative statistical test is presented to evaluate the significance of the experimental factors. Potential biomarker peaks can be validated through a univariate resolution measure.


Joint work with Nancy E. Vieira, Roberto Romero, and Peter de B. Harrington.

03:00 PM
03:45 PM
Nuno Bandeira - Shotgun Protein Sequencing by MS/MS Spectra Assembly

The analysis of mass spectrometry data is still largely based on identification of single MS/MS spectra and does not attempt to make use of the extra information available in multiple MS/MS spectra from partially or completely overlapping peptides. Analysis of MS/MS spectra from multiple overlapping peptides opens up the possibility of assembling MS/MS spectra into entire proteins, similarly to the assembly of overlapping DNA reads into entire genomes. This presentation will focus on new methods to detect, score and interpret overlaps between uninterpreted MS/MS spectra in an attempt to sequence entire proteins rather than individual peptides. This approach not only extends the length of reconstructed amino acid sequences but also dramatically improves the quality of de-novo peptide sequencing. Results will be presented using data from an ESI/IonTrap mass spectrometer.

04:00 PM
04:45 PM
Nathan Edwards - Faster, More Sensitive Peptide Identification from Tandem Mass Spectra by Sequence Database Compression

Peptide identification from tandem mass spectra is an important enabling technology for high-throughput proteomics pipelines. The search engines that analyze these spectra, such as Mascot or SEQUEST, use amino-acid sequence databases, such as UniProt, to generate putative peptides to compare against each spectrum. We have developed a method by which the entire peptide content of such an amino-acid sequence database can be represented by a new, smaller, amino-acid sequence database. Existing search software can be sped up, without modification, by using this compressed sequence database. Further, since fewer peptides are scored against each spectrum, the statistical significance of the same peptide scores is improved, making the search more sensitive. With peptide identifications in hand, an exact sequence search in the original sequence database restores the protein context for each peptide. The effectiveness of this approach is demonstrated using Mascot and the UniProt family of amino-acid sequence databases.

Thursday, January 13, 2005
Time Session
09:00 AM
09:45 AM
Ari Frank - De Novo Peptide Sequencing via Probabilistic Network Modeling

We present a novel scoring method for de novo interpretation of peptides from tandem mass spectrometry data. Our scoring method uses a probabilistic network whose structure reflects the chemical and physical rules that govern the peptide fragmentation. We use a likelihood ratio hypothesis test to determine if the peaks observed in the mass spectrum are more likely to have been produced under our fragmentation model, than under a model that treats peaks as random events. We tested our de novo algorithm PepNovo on Ion-Trap data, and achieved results that are superior to popular de novo peptide sequencing algorithms. PepNovo can be accessed via the URL http://peptide.ucsd.edu/.


Joint work with Pavel Pevzner.

10:15 AM
11:00 AM
Vineet Bafna - InsPecT: Identification of post translationally modified peptides via interpretation of tandem Mass Spectra

Reliable identification of post-translational modifications is key to understanding various cellular processes. We describe a tool, InsPecT, to identify post-translational modifications using tandem mass spectrometry data. The tool is based upon novel algorithms for the following: (a) constructing tag-based filters using a novel de novo interpretation algorithm that works in the presence of modifications (the sequence tags help eliminate much of the database while retaining the true peptide); (b) a fast trie-based search for scanning the database with sequence tags; (c) a dynamic programming technique to identify candidate peptides with modifications without explicit enumeration of the modifications; (d) a scoring algorithm that is rapidly reconfigured for differing fragmentation propensities, and is independent of the length of the peptide; and (e) a novel quality score computation based on an optimization of complementary features for evaluating quality. The tool was tested on a number of real and simulated data sets. InsPecT can search for modified and unmodified peptides faster than other database search tools. We identified a large number of modified peptides, including several novel phospho-peptides in data sets provided by the Alliance for Cellular Signaling.


Joint work with Stephen Tanner, Hongjun Shu, Ari Frank, Marc Mumby, and Pavel Pevzner.

11:15 AM
12:15 PM
Alexey Nesvizhsky - Analysis and Statistical Validation of Shotgun Proteomics Datasets

The shotgun proteomics approach has been used increasingly for high-throughput analysis of complex protein samples. A major challenge lies in the consistent, objective and transparent analysis of the large amounts of data generated by such experiments and in their dissemination and publication. The first part of this presentation will focus on various statistical measures and approaches for estimating the confidence level of peptide identifications made by MS/MS database searching, including p-values, expectation values, reverse database searching, and Bayesian classification. A comparison will be made with methods developed for the analysis of other types of data such as microarray gene expression.
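Of the confidence measures listed above, reverse database searching is the simplest to sketch. The example below builds decoy sequences by reversing each protein and estimates the false discovery rate at a score threshold as the ratio of decoy hits to target hits. This is one common estimator, not necessarily the one presented in the talk; the function names, the `REV_` prefix, and the scores are hypothetical.

```python
def reverse_database(proteins):
    """Build decoy entries by reversing each target protein sequence.

    `proteins` maps a protein name to its amino-acid sequence. The
    "REV_" prefix is an arbitrary convention assumed for this sketch.
    """
    return {f"REV_{name}": seq[::-1] for name, seq in proteins.items()}


def estimate_fdr(matches, threshold):
    """Estimate the false discovery rate among target matches.

    `matches` is an iterable of (score, is_decoy) pairs from a search
    against the combined target and decoy database. Decoy hits above
    the threshold stand in for the unknown number of false target hits.
    """
    targets = sum(1 for score, is_decoy in matches if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in matches if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0


if __name__ == "__main__":
    # Invented scores, for illustration only.
    matches = [(9.1, False), (8.7, False), (8.2, True), (7.9, False),
               (7.5, False), (6.0, True), (5.5, False)]
    for threshold in (9.0, 7.0, 5.0):
        print(threshold, estimate_fdr(matches, threshold))
```

Raising the threshold trades identifications for confidence: fewer matches pass, but a smaller share of them is expected to be false.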


Identification of peptides from MS/MS spectra represents the first step in the computational analysis of shotgun proteomics data. Most often, the goal of the experiment is to infer what proteins are present in the original sample. A statistical model for assembling peptides into proteins and computing protein probabilities will be presented. Special attention will be paid to the problem of non-random grouping of peptides according to their corresponding proteins (the 'single hit' identification problem). Furthermore, limitations of shotgun proteomics with respect to the accurate characterization of protein isoforms and mature protein forms will be discussed. Similar to the shotgun DNA fragment sequence assembly problem, the presence of 'degenerate' peptides (peptides whose sequence is present in multiple proteins) makes it difficult to infer what proteins are present in the sample. An informatics approach for dealing with the cases of degenerate peptides and presenting protein identification results to the biologists analyzing the data will be described.

02:00 PM
02:45 PM
Brian Searle - Improving Sensitivity by Combining Results from Multiple Search Methodologies

Database-searching programs generally identify only a fraction of the spectra acquired in a standard LC/MS/MS study of digested proteins. By using a mass-based alignment algorithm of de novo sequencing results, OpenSea can sometimes perform better than this because it can also identify modified peptides. However, OpenSea is dependent on de novo sequencing algorithms that usually cannot derive accurate sequences from low quality MS/MS spectra. Conveniently, many database-searching programs are well suited for matching peptide sequences to low quality data. To leverage this dichotomy, we have developed an algorithm to probabilistically combine the results of multiple search engines, including SEQUEST, Mascot, X!Tandem, and OpenSea. We have found that we normally gain 5% to 20% more MS/MS spectrum identifications with each additional search engine we use, primarily due to increased confidence in low scoring matches. In addition, we use ranked-based clustering to mine information from the remaining spectra. First, we remove redundant results by clustering unmatched spectra to other spectra identified by the database-searching programs. Then we identify potentially interesting unmatched spectra by looking for spectral duplication and using high quality filters. These results are singled out for further modification discovery analysis or manual interpretation.

03:00 PM
03:45 PM
Tim Ting Chen - A New Scoring Function for Tandem Mass Spectrometry Database Search

04:00 PM
04:45 PM
Rovshan Sadygov - A Two-Dimensional Probability Model for Peptide Identification Using Tandem Mass Spectrometry and Protein Databases

The presentation focuses on a two-dimensional probability model for peptide identification using tandem mass spectra and amino acid sequence databases. Probability models are developed for the two parameters that most affect the quality of peptide identification: the number of product ion matches and the sum of the product ion abundances. Both models are derived from the direct comparison of an experimental tandem mass spectrum to amino acid sequences from the protein database. The probabilities obtained from each model are correlated and normalized to derive a single score, the significance of peptide identification.


The talk will also compare the approach with other database search algorithms.

Friday, January 14, 2005
Time Session
09:00 AM
09:45 AM
Fengzhu Sun - TBA

An enormous amount of biological data has been accumulated over the years, such as sequences, gene expressions, protein physical interactions, genetic interactions, protein complexes, protein localizations, etc. For given biological problems of interest, most data contribute some, but not all, of the information needed. By combining different data sources intelligently, we are able to obtain a more complete picture of the problems of interest.


We present two examples of data integration. One is the estimation of reliability of observed protein interaction data sets using gene expressions and protein localizations. The integration of the two data sources can give a more accurate estimation of the reliability. The other example is protein function prediction combining protein interactions, complexes, and features of individual proteins based on a Markov Random Field model. We further study the relationship between gene lethality, protein interaction networks, and protein function annotation.


Joint work with Ting Chen.

10:15 AM
11:00 AM
Bernhard Spengler - High-Accuracy Mass Spectrometry for Composition Based de Novo Sequencing (CBS) of Unknown Peptides

With the easy availability of ultraprecise mass spectrometric data, the accurate mass of biomolecules is becoming a physical quantity of high interest in bioanalytical methodology. MALDI/ESI FT-ICR mass spectrometry, especially when combined with convenient and versatile ion manipulating devices such as quadrupolar ion traps, now allows easy determination of the amino acid composition of medium-size unknown peptides when employing combinatorial calculations of parent and fragment ion masses. This new method, which in a second step allows reliable sequencing of completely unknown peptides ("Composition-Based Sequencing (CBS)" [1]), appears to open a wide new field of bioanalytical investigation in proteomics.


CBS appears to have some fundamental advantages over common de novo sequencing strategies, since it does not require prior knowledge of underlying fragmentation mechanisms or of peptide-specific ionization and fragmentation behavior. While classical strategies usually try to verify the presence of expected fragment ion signals, CBS instead interprets the observed accurate mass values of precursor and fragment ions with respect to possible amino acid combinations by means of combinatorial logic. The potential and limitations of the method will be discussed in the light of the expected evolution of high accuracy mass spectrometry in the coming years.


[1] B. Spengler, De Novo Sequencing, Peptide Composition Analysis and Composition-based Sequencing: A new Strategy Employing Accurate Mass Determination by Fourier Transform Ion Cyclotron Resonance Mass Spectrometry. J Am Soc Mass Spectrom, 15 (2004) 704-715.

01:00 PM
01:45 PM
Sebastian Boecker - Algorithms for Interpreting Mass Spectrometry Data

Mass Spectrometry (MS) is a technology very well suited for high-throughput data acquisition, due to its speed and accuracy. Simplified, a mass spectrometer's input is a molecular mixture, and its output a list of masses of the sample molecules. The most well-known application in biotechnology is protein identification with database lookup, but MS is also increasingly used to analyze DNA and other biomolecules.


The sample biomolecules of an MS experiment can often be represented by strings over a weighted alphabet. Clearly, the order of characters cannot be determined from the weight. Thus, the problem leads to the study of weighted strings and compomers: A string's compomer is an integer vector specifying the number of occurrences of each character. We are interested in efficient algorithms for determining all or some compomers with a given mass, the number of such compomers, and related questions.
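The compomer questions above can be illustrated on a toy integer-weighted alphabet. The sketch below counts the compomers with a given mass using a standard dynamic program and also enumerates them by direct recursion. It is a generic textbook approach, not the speakers' algorithm; the alphabet, its weights, and the function names are hypothetical.

```python
def count_compomers(weights, mass):
    """Count compomers (character multisets) whose total weight is `mass`.

    `weights` maps each character to a positive integer weight. Adding
    one character at a time ensures each multiset is counted once,
    regardless of the order of characters in the underlying string.
    """
    counts = [0] * (mass + 1)
    counts[0] = 1                      # the empty compomer has mass 0
    for w in weights.values():
        for m in range(w, mass + 1):
            counts[m] += counts[m - w]
    return counts[mass]


def enumerate_compomers(weights, mass):
    """List all compomers with total weight `mass` as character-count dicts."""
    chars = sorted(weights)

    def rec(i, remaining):
        if i == len(chars):
            if remaining == 0:
                yield ()
            return
        w = weights[chars[i]]
        for k in range(remaining // w + 1):   # k copies of chars[i]
            for rest in rec(i + 1, remaining - k * w):
                yield (k,) + rest

    return [dict(zip(chars, combo)) for combo in rec(0, mass)]


if __name__ == "__main__":
    toy_weights = {"a": 2, "b": 3, "c": 5}    # hypothetical alphabet
    print(count_compomers(toy_weights, 10))
    for compomer in enumerate_compomers(toy_weights, 10):
        print(compomer)
```

Counting takes time proportional to the alphabet size times the mass, whereas enumeration can be far more expensive because the number of compomers grows quickly with the mass. Real molecular masses are not integers, so a practical method has to handle discretization or measurement tolerance, which this sketch ignores.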


One of the pressing problems in the context of mass spectrometry is to transform the MS raw data into a list of peaks with masses and possibly other attributes. To do this, the noise in the raw spectrum has to be filtered out and in order to get the real mass of a peak, we have to deconvolute the spectrum with the isotopic distribution of each peak. We are working on robust methods for doing so, based on the stochastic concept of regression analysis.


Last, we want to identify biological molecules measured by MS. Given a list of candidates (e.g. a database), the measured spectrum has to be compared to the theoretically predicted spectrum of each sequence and a score has to be computed as a measure of quality of match. Here, it is important to combine the freedom to adjust matching scores to an application's peculiarities, with a rigid statistical analysis of score distributions. We present an approach that allows easy and fast estimation of p-values of such scores for two important applications, Peptide Mass Fingerprints and Tandem Mass Spectrometry.


We present algorithms and software developed in our research group for several of these questions, along with experimental data.


Joint work with Michael Kaltenbach and Zsuzsanna Lipták.

02:00 PM
02:45 PM
Frederic Schutz - Deriving Better Specificity Models for Trypsin to Improve Protein Identification by Tandem MS

Name Email Affiliation
Ackerman, William ackerman.72@osu.edu Obstetrics and Gynecology, The Ohio State University
Aggarwal, Divya diaggarw@indiana.edu School of Informatics, Indiana University
Bafna, Vineet vbafna@cs.ucsd.edu Computer Science & Engineering, University of California, San Diego
Bandeira, Nuno bandeira@cs.ucsd.edu Computer Science & Engineering, University of California, San Diego
Best, Janet jbest@mbi.osu.edu Mathematics, The Ohio State University
Boecker, Sebastian boecker@CeBiTec.uni-bielefeld.de A.G. Genominformatik, Universitaet Bielefeld
Borisyuk, Alla borisyuk@mbi.osu.edu Mathematical Biosciences Institute, The Ohio State University
Buechler, Steven buechler.1@nd.edu Department of Mathematics, University of Notre Dame
Calin, George george.calin@osumc.edu Molecular Virology, Immunology, & Medical Genetics, The Ohio State University
Chen, Ping pc256003@ohio.edu Chemistry and Biochemistry, Ohio University
Chen, Tim Ting tingchen@hto.usc.edu Biology, Computer Science, & Mathematics, University of Southern California
Craciun, Gheorghe craciun@math.wisc.edu Dept. of Mathematics, University of Wisconsin-Madison
Dougherty, Daniel dpdoughe@mbi.osu.edu Mathematical Biosciences Institute, The Ohio State University
Edwards, Nathan nedwards@umiacs.umd.edu Bioinformatics & Computational Biology, College of Business and Management
Franco, Albert albertfranco@hotmail.com Obstetrics and Gynecology, The Ohio State University
Frank, Ari arf@cs.ucsd.edu Computer Science & Engineering, University of California, San Diego
Goel, Pranay goelpra@helix.nih.gov NIDDK, Indian Institute of Science Education and Research
Gropl, Clemens gropl@inf.fu-berlin.de Algorithmic Bioinformatics, Free University Berlin
Guo, Yixin yixin@math.drexel.edu Department of Mathematics, The Ohio State University
Harrington, Peter peter.harrington@ohio.edu Chemistry and Biochemistry, Ohio University
Higgs, Rick higgs@lilly.com Genomic and Molecular Informatics, Lilly Research Laboratories
Hille, Charles (Russ) hille.1@osu.edu Molecular and Cellular Biochemistry, The Ohio State University
Hsu, Jason hsu.1@osu.edu Department of Statistics, The Ohio State University
Hu, Bei Department of Mathematics, University of Notre Dame
Javaid, Sarah javaid.5@osu.edu Department of Biophysics, The Ohio State University
Joy, Saju joy-1@medctr.osu.edu Obstetrics and Gynecology, The Ohio State University
Kohlbacher, Oliver oliver.kohlbacker@uni-tubingen.de Simulation of Biological Systems, WSI - University of Tubingen
Kohn, Douglas kohn@ohio.edu Edison Biotechnology Institute & Neuroscience, Ohio University
Lebo, Matthew lebo@usc.edu Molecular and Computational Biology, University of Southern California
Lim, Sookkyung limsk@math.uc.edu Department of Mathematical Sciences, University of Cincinnati
Lin, Shili lin.328@osu.edu Department of Statistics, The Ohio State University
Liu, Chang-gong chang-gong.liu@osumc.edu Molecular Virology, Immunology, & Medical Genetics, The Ohio State University
Lu, Bingwen bingwenl@usc.edu Department of Biological Sciences, University of Southern California
Ma, Bin bma@csd.uwo.ca Department of Computer Science, University of Western Ontario
Mehta, Sameep mehtas@cse.ohio-state.edu Computer Science and Engineering, The Ohio State University
Melfi, Vincent melfi@mbi.osu.edu Mathematics, Michigan State University
Michailidis, George gmichail@umich.edu Department of Statistics, University of Michigan
Mukherjee, Mitali mukherjee.21@osu.edu Department of Medicinal Chemistry, The Ohio State University
Nagaraja, Haikady Department of Statistics, The Ohio State University
Nesvizhskii, Alexey nesvi@systemsbiology.org Institute for Systems Biology
Patterson, Scott spatters@amgen.com Molecular Sciences, Amgen, Inc.
Pol, Diego dpol@mbi.osu.edu Independent Researcher, Museo Paleontologico E. Feruglio
Popescu, Liviu liviup@cs.cornell.edu Department of Computer Science, Cornell University
Rassoul-Agha, Firas firas@math.ohio-state.edu Mathematical Biosciences Institute, The Ohio State University
Reinert, Knut kasseckert@inf.fu-berlin.de Arbeitsgruppe Algorithmische Bioinformatik, Institut fur Informatik
Rejniak, Katarzyna rejniak@mbi.osu.edu Mathematical Biosciences Institute, The Ohio State University
Robinson, Mark m.robinson@utoronto.ca Electrical and Computer Engineering, University of Toronto
Sadygov, Rovshan rovshan.sadygov@thermo.com Self-employed Bioinformatician
Saltz, Joel Joel.Saltz@osumc.edu Department of Biomedical Informatics, The Ohio State University
Schutz, Frederic schutz@wehi.edu.au Genetics and Bioinformatics, Walter & Eliza Hall Institute of Medical Research
Schwikowski, Benno benno@pasteur.fr Department of Systems Biology, Institut Pasteur
Searle, Brian brian.searle@proteomesoftware.com Proteome Software, Inc.
Spengler, Bernhard Bernhard.spengler@uni.giessen.de Institute of Inorganic & Analytical Chemistry, Justus Liebig University Giessen
Stubna, Michael stubna@mbi.osu.edu Engineering Team Leader, Pulsar Informatics
Sun, Fengzhu fsun@hto.usc.edu Department of Biological Sciences, University of Southern California
Sun, Junfeng sun@stat.ohio-state.edu Department of Statistics, The Ohio State University
Swift, Dionne swift.dp@pg.com Biometrics and Statistical Sciences, Procter & Gamble Company
Tanner, Stephen stanner@ucsd.edu Department of Bioinformatics, University of California, San Diego
Terman, David terman@math.ohio-state.edu Mathematics Department, The Ohio State University
Tian, Jianjun Paul tianjj@mbi.osu.edu Mathematics, College of William and Mary
Tu, Zhidong zdtu@yahoo.com Department of Computational Biology, University of Southern California
Ucar, Duygu ucarduygu@gmail.com Computer Science and Engineering, The Ohio State University
Vakalis, Ignatios ivakalis@capital.edu Mathematics & Computer Sc, Capital University
Varbanov, Alex varbanov.ar@pg.com Department of Statistics, Procter & Gamble
Verducci, Joseph verducci.1@osu.edu Department of Statistics, The Ohio State University
Wang, Chao wachao@cse.ohio-state.edu Computer Science and Engineering, The Ohio State University
Wang, Zailong zlwang@mbi.osu.edu Integrated Information Sciences, Novartis
Wechselberger, Martin wm@mbi.osu.edu Mathematical Biosciences Institute, The Ohio State University
Wright, Geraldine wright.572@osu.edu School of Biology, Newcastle University
Wu, Zhijun zhijun@iastate.edu Math, Bioinformatics, & Computational Biology, Iowa State University
Yergey, Alfred aly@helix.nih.gov Mass Spectrometry and Metabolism, National Institutes of Health
Zhang, Yali zhang.387@math.ohio-state.edu Department of Biochemistry, The Ohio State University
Zhou, Jin jzhou@mbi.osu.edu Department of Mathematics, Northern Michigan University
Zhuge, Lei lzhuge@usc.edu Department of Computational Biology, University of Southern California
InsPecT: Identification of post translationally modified peptides via interpretation of tandem Mass Spectra

Reliable identification of post-translational modifications is key to understanding various cellular processes. We describe a tool, InsPecT, to identify post-translational modifications using tandem mass spectrometry data. The tool is based upon novel algorithms for the following: (a) constructing tag-based filters using a novel de novo interpretation algorithm that works in the presence of modifications, where the sequence tags help eliminate much of the database while retaining the true peptide; (b) a fast trie-based search for scanning the database with sequence tags; (c) a dynamic programming technique to identify candidate peptides with modifications without explicit enumeration of the modifications; (d) a scoring algorithm that is rapidly reconfigured for differing fragmentation propensities and is independent of the length of the peptide; and (e) a novel quality score computation based on an optimization of complementary features for evaluating quality. The tool was tested on a number of real and simulated data sets. InsPecT can search for modified and unmodified peptides faster than other database search tools. We identified a large number of modified peptides, including several novel phosphopeptides in data sets provided by the Alliance for Cellular Signaling.


Joint work with Stephen Tanner, Hongjun Shu, Ari Frank, Marc Mumby, and Pavel Pevzner.

Shotgun Protein Sequencing by MS/MS Spectra Assembly

The analysis of mass spectrometry data is still largely based on identification of single MS/MS spectra and does not attempt to make use of the extra information available in multiple MS/MS spectra from partially or completely overlapping peptides. Analysis of MS/MS spectra from multiple overlapping peptides opens up the possibility of assembling MS/MS spectra into entire proteins, similarly to the assembly of overlapping DNA reads into entire genomes. This presentation will focus on new methods to detect, score and interpret overlaps between uninterpreted MS/MS spectra in an attempt to sequence entire proteins rather than individual peptides. This approach not only extends the length of reconstructed amino acid sequences but also dramatically improves the quality of de-novo peptide sequencing. Results will be presented using data from an ESI/IonTrap mass spectrometer.

Algorithms for Interpreting Mass Spectrometry Data

Mass Spectrometry (MS) is a technology very well suited for high-throughput data acquisition, due to its speed and accuracy. Simplified, a mass spectrometer's input is a molecular mixture, and its output a list of masses of the sample molecules. The most well-known application in biotechnology is protein identification with database lookup, but MS is also increasingly used to analyze DNA and other biomolecules.


The sample biomolecules of an MS experiment can often be represented by strings over a weighted alphabet. Clearly, the order of characters cannot be determined from the weight. Thus, the problem leads to the study of weighted strings and compomers: A string's compomer is an integer vector specifying the number of occurrences of each character. We are interested in efficient algorithms for determining all or some compomers with a given mass, the number of such compomers, and related questions.
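
As a minimal sketch of the counting question above, assume character masses have been scaled to positive integers (the alphabet and masses used in any call are hypothetical, not taken from the talk). Under that assumption the number of compomers with a given mass follows from the classic coin-change dynamic program, and a small recursive search enumerates them. The function names are illustrative only.

```python
def count_compomers(masses, target):
    """Count non-negative integer vectors c with sum(c[i] * masses[i]) == target.

    masses: distinct positive integer character masses (assumed, scaled units).
    """
    ways = [0] * (target + 1)
    ways[0] = 1  # the empty compomer has mass 0
    # Handling one character at a time counts each compomer exactly once,
    # independent of the order of characters in the underlying string.
    for m in masses:
        for total in range(m, target + 1):
            ways[total] += ways[total - m]
    return ways[target]


def list_compomers(masses, target):
    """Enumerate all compomers (tuples of character counts) with the given mass."""
    results = []

    def extend(index, remaining, counts):
        if index == len(masses):
            if remaining == 0:
                results.append(tuple(counts))
            return
        # Try every feasible multiplicity of the current character.
        for k in range(remaining // masses[index] + 1):
            extend(index + 1, remaining - k * masses[index], counts + [k])

    extend(0, target, [])
    return results
```

For example, `count_compomers([2, 3, 5], 10)` counts the ways to reach mass 10 with three characters of mass 2, 3 and 5, and `list_compomers` returns the corresponding count vectors. Real masses are not integers, so a practical version would need a precision-dependent scaling or a tolerance window.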


One of the pressing problems in the context of mass spectrometry is to transform the MS raw data into a list of peaks with masses and possibly other attributes. To do this, the noise in the raw spectrum has to be filtered out and in order to get the real mass of a peak, we have to deconvolute the spectrum with the isotopic distribution of each peak. We are working on robust methods for doing so, based on the stochastic concept of regression analysis.


Last, we want to identify biological molecules measured by MS. Given a list of candidates (e.g. a database), the measured spectrum has to be compared to the theoretically predicted spectrum of each sequence and a score has to be computed as a measure of quality of match. Here, it is important to combine the freedom to adjust matching scores to an application's peculiarities, with a rigid statistical analysis of score distributions. We present an approach that allows easy and fast estimation of p-values of such scores for two important applications, Peptide Mass Fingerprints and Tandem Mass Spectrometry.


We present algorithms and software developed in our research group for several of these questions, along with experimental data.


Joint work with Michael Kaltenbach and Zsuzsanna Lipták.

A New Scoring Function for Tandem Mass Spectrometry Database Search

Faster, More Sensitive Peptide Identification from Tandem Mass Spectra by Sequence Database Compression

Peptide identification from tandem mass spectra is an important enabling technology for high-throughput proteomics pipelines. The search engines that analyze these spectra, such as Mascot or SEQUEST, use amino-acid sequence databases, such as UniProt, to generate putative peptides to compare against each spectrum. We have developed a method by which the entire peptide content of such an amino-acid sequence database can be represented by a new, smaller amino-acid sequence database. Existing search software can be sped up, without modification, by using this compressed sequence database. Further, since fewer peptides are scored against each spectrum, the statistical significance of the same peptide scores is improved, making the search more sensitive. With peptide identifications in hand, an exact sequence search in the original sequence database restores the protein context for each peptide. The effectiveness of this approach is demonstrated using Mascot and the UniProt family of amino-acid sequence databases.

De Novo Peptide Sequencing via Probabilistic Network Modeling

We present a novel scoring method for de novo interpretation of peptides from tandem mass spectrometry data. Our scoring method uses a probabilistic network whose structure reflects the chemical and physical rules that govern the peptide fragmentation. We use a likelihood ratio hypothesis test to determine if the peaks observed in the mass spectrum are more likely to have been produced under our fragmentation model, than under a model that treats peaks as random events. We tested our de novo algorithm PepNovo on Ion-Trap data, and achieved results that are superior to popular de novo peptide sequencing algorithms. PepNovo can be accessed via the URL http://peptide.ucsd.edu/.
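
The likelihood ratio idea can be illustrated with a deliberately simplified sketch. This is not PepNovo's probabilistic network: it assumes that each expected fragment peak is detected independently, with probability `p_frag` under the fragmentation model and `p_rand` under the random-peak model. Both default values and the function name are hypothetical.

```python
import math


def log_likelihood_ratio(observed, p_frag=0.7, p_rand=0.1):
    """Score a candidate peptide against a spectrum.

    observed: one boolean per expected fragment mass, True if a peak was found.
    p_frag:   assumed detection probability under the fragmentation model.
    p_rand:   assumed probability that a random peak falls at that mass.
    """
    score = 0.0
    for seen in observed:
        if seen:
            # A matched peak is evidence for the fragmentation model.
            score += math.log(p_frag / p_rand)
        else:
            # A missing peak is evidence against it.
            score += math.log((1.0 - p_frag) / (1.0 - p_rand))
    return score
```

A positive score means the observed peaks are better explained by the fragmentation model than by chance. A model like the one described in the abstract would replace the two constants with probabilities conditioned on the chemical context of each fragment.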


Joint work with Pavel Pevzner.

Fuzzy Entropy Classification Systems and Their Application to Mass Spectrometry of the Proteome

Mass spectrometry is a burgeoning method for proteomic studies because it is a high-throughput method that offers low detection limits and high selectivity. Our work has focused on matrix-assisted laser desorption/ionization mass spectrometry (MALDI-MS). The signals obtained from mass spectrometry are intricate and can be influenced by the experimental design. Classification methods are useful for detecting biomarkers and making predictions based on mass spectral signals. MS of proteins from noninvasive samples has potential as a medical tool for early diagnosis of disease. Spectra from studies of amniotic fluids from women who had normal, normal with inflamed uteri, and premature delivery were used for building classification models.


Fuzzy classification systems are considered soft methods that are based on the variance of the data. Soft methods are advantageous because they avoid overfitting the data and the curse of dimensionality. Fuzzy Rule-Building Expert Systems (FuRES) (1) are useful because an inductive classification tree is obtained as a model that may be interpreted. Using principal component compression, FuRES is simple, fast, reliable, and applicable to MS data. Coupled with Latin-partition methods, precision bounds may be obtained for evaluation of predictability.


(1) Harrington, P. B. Journal of Chemometrics 1991, 5, 467-486.


Joint work with Nancy E. Vieira and Alfred L. Yergey.

Peptides, Markers, Targets: Algorithms for the identification of differentially expressed proteins

In this talk we describe methods to identify the differential expression of peptides and propose strategies to avoid MS/MS identification of peptides of interest. The algorithms are embedded in the freely available software library OpenMS, which is currently under development at the Freie Universität Berlin and the Eberhard Karls Universität Tübingen. We give an overview of the capabilities and design principles of OpenMS and demonstrate its ease of use. Finally, we describe projects in which OpenMS will be or has already been deployed, and thereby demonstrate its versatility.

Proteomic Analysis of the Brains from Mice that Lack a Growth Hormone Receptor

Growth hormone (GH) regulates cell growth and differentiation primarily by modulating gene expression and metabolism in target tissues. Targeted disruption of the gene encoding the growth hormone receptor and binding protein (GHR/BP-/-) functionally inactivates GH and generates long-lived, dwarf mice with elevated circulating GH and markedly reduced insulin-like growth factor-1 (IGF-1) levels (1, 2). Indeed, insulin/IGF-1 signaling has been shown to be a critical determinant of lifespan in several species. GHR/BP-/- mice also have decreased fasting insulin and glucose levels (3) and appear to resist complications due to streptozotocin (STZ)-induced diabetes (4). To determine the consequences of the GHR/BP-/- mutation on gene expression in the central nervous system (CNS), brain tissue was harvested from normal and gene-disrupted mice at different developmental stages (young, adult and aged) and proteins were isolated from distinct subcellular fractions (nucleus, cytoplasm, polysomes) using differential gradient ultracentrifugation. The proteins in each fraction were resolved by two-dimensional gel electrophoresis and stained with the fluorescent dye SYPRO Orange. The images were captured with a high-resolution CCD camera (Bio-Rad Versa-Doc 3000) or a laser-scanning device (Fuji FLA-3000G) and quantitatively analyzed with the PDQuest or Image Gauge software packages. Differentially expressed proteins were manually excised from the gels and identified by mass spectrometry. Of the hundreds of proteins resolved, several were differentially expressed in the brains of GHR/BP-/- mice relative to controls. The goal is to identify those proteins whose expression patterns are spatially and temporally correlated and to establish functional protein networks that may delay or attenuate age-related tissue dysfunction or diabetic complications. This work was supported in part by the State of Ohio's Eminent Scholar Program, which includes a gift from Milton and Lawrence Goll.



  1. Zhou et al., (1997) Proc Natl Acad Sci USA 94:13215-13220

  2. Coschigano et al., (2003) Endocrinology 144:3799-3810

  3. Coschigano et al., (2001) Endocrinology 141:2608-2613

  4. Bellush et al., (2000) Endocrinology 141:163-168

SPIDER: Software for Protein Identification from Sequence Tags Containing De Novo Sequencing Error

For the identification of novel proteins using MS/MS, searching a protein sequence database with sequence tags obtained by de novo sequencing is the best approach. However, de novo sequencing very often yields only partially correct sequence tags. The most common type of error found in the sequence tags is same-mass segment replacement, i.e., a segment of amino acids is replaced with another one of the same mass. Current database search software such as MS-BLAST cannot handle these errors in the sequence tags. We developed a new efficient algorithm to align sequence tags from de novo sequencing with database sequences to identify proteins. This talk introduces the algorithms and implementation details of the SPIDER software.
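
To make the same-mass replacement error concrete, the sketch below checks whether two amino-acid segments are indistinguishable by mass. It is only an illustration of the error type, not the SPIDER alignment algorithm. The residue masses are standard monoisotopic values for a handful of amino acids; the function names and the default tolerance of 0.05 Da are assumptions.

```python
# Monoisotopic residue masses in daltons for a few amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "V": 99.06841,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
}


def segment_mass(segment):
    """Sum of residue masses for a segment given as a string of one-letter codes."""
    return sum(RESIDUE_MASS[aa] for aa in segment)


def same_mass(seg_a, seg_b, tolerance=0.05):
    """True if the two segments differ in mass by at most `tolerance` daltons."""
    return abs(segment_mass(seg_a) - segment_mass(seg_b)) <= tolerance
```

With these values, `same_mass("GG", "N")` and `same_mass("GA", "Q")` hold at any realistic tolerance, while "GA" and "K" are confused at 0.05 Da but separated at 0.01 Da. A de novo tag may therefore read "GG" where the database sequence has "N", which an exact or homology-only search would treat as a mismatch.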

Analysis and Statistical Validation of Shotgun Proteomics Datasets

The shotgun proteomics approach has been used increasingly for high-throughput analysis of complex protein samples. A major challenge lies in the consistent, objective and transparent analysis of the large amounts of data generated by such experiments and in their dissemination and publication. The first part of this presentation will focus on various statistical measures and approaches for estimating the confidence level of peptide identifications made by MS/MS database searching, including p-values, expectation values, reverse database searching, and Bayesian classification. A comparison will be made with methods developed for the analysis of other types of data such as microarray gene expression.


Identification of peptides from MS/MS spectra represents the first step in the computational analysis of shotgun proteomics data. Most often, the goal of the experiment is to infer what proteins are present in the original sample. A statistical model for assembling peptides into proteins and computing protein probabilities will be presented. Special attention will be paid to the problem of non-random grouping of peptides according to their corresponding proteins (the 'single hit' identification problem). Furthermore, limitations of shotgun proteomics with respect to the accurate characterization of protein isoforms and mature protein forms will be discussed. Similar to the shotgun DNA fragment sequence assembly problem, the presence of 'degenerate' peptides (peptides whose sequence is present in multiple proteins) makes it difficult to infer what proteins are present in the sample. An informatics approach for dealing with degenerate peptides and presenting protein identification results to the biologists analyzing the data will be described.
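
One simple way to turn peptide probabilities into protein probabilities is sketched below. It is an illustration under stated assumptions and not necessarily the model presented in the talk: peptide identifications are treated as independent, and each degenerate peptide's probability is split equally among the proteins that contain it. All names and numbers in any call are hypothetical.

```python
from collections import defaultdict


def protein_probabilities(peptide_probs, peptide_to_proteins):
    """Estimate the probability that each protein is present.

    peptide_probs:       dict mapping peptide -> probability the identification is correct.
    peptide_to_proteins: dict mapping peptide -> list of proteins containing it.
    """
    # Probability that none of a protein's peptides is a correct identification.
    miss = defaultdict(lambda: 1.0)
    for peptide, prob in peptide_probs.items():
        proteins = peptide_to_proteins[peptide]
        weight = 1.0 / len(proteins)  # equal split for degenerate peptides (assumption)
        for protein in proteins:
            miss[protein] *= 1.0 - weight * prob
    return {protein: 1.0 - m for protein, m in miss.items()}
```

A protein supported by two unique peptides with probabilities 0.9 and 0.8 plus half of a shared peptide at 0.6 receives 1 - 0.1 * 0.2 * 0.7 = 0.986, whereas a protein supported only by that shared peptide receives 0.3. The equal split is the crudest possible treatment of degenerate peptides; an iterative reweighting based on the current protein estimates would be a natural refinement.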

Proteomics in Drug Discovery and Development: Computational Opportunities Abound

Parallel protein measurements, also known as proteomics, have the potential to provide information on biological systems in isolation as cell culture systems, in tissues, or in an organism. Whereas parallel measures of transcript (mRNA) abundance can be multiplexed more easily through microarray analysis of even small quantities of sample following amplification using PCR, parallel measures of protein abundance are more difficult due to the heterogeneity of protein properties compared with nucleic acids and the inability to amplify the signal. Despite these drawbacks, much useful data can be generated, although the interpretation of such data sets is challenging. This presentation will focus on the kinds of data sets that are generated from a range of proteomics approaches, including mass spectrometry (both LC-MS and MS/MS data) for unbiased analyses and multiplexed protein assays for targeted analyses. How and where such assays are employed and the pros and cons of such approaches will also be discussed.

Signal Processing and Data Reduction for Differential Proteomics with HPLC/MS

In this talk we describe methods to reduce the amount of data obtained by (multi-)dimensional HPLC/MS experiments. The algorithms are embedded in the freely available software library OpenMS, which is currently under development at the Freie Universität Berlin and the Eberhard Karls Universität Tübingen. We give an overview of the goals and problems in differential proteomics with HPLC and then describe in detail the implemented approaches for signal processing, peak detection and data reduction currently employed in OpenMS.

A Two-Dimensional Probability Model for Peptide Identification Using Tandem Mass Spectrometry and Protein Databases

The presentation focuses on a two-dimensional probability model for peptide identification using tandem mass spectra and amino acid sequence databases. Probability models are developed for the two parameters that most affect the quality of peptide identification: the number of product ion matches and the sum of the product ion abundances. Both models are derived from the direct comparison of an experimental tandem mass spectrum to amino acid sequences from the protein database. The probabilities obtained from each model are correlated and normalized to derive a single score, the significance of peptide identification.


The talk will discuss the comparison of the approach to other database search algorithms.

Deriving Better Specificity Models for Trypsin to Improve Protein Identification by Tandem MS

Looking at the Whole Instead of the Parts: Combining multidimensional LC-MS data at the signal level

The analysis of complex protein mixtures by LC-MS is one of the key technologies for systematic large-scale observation and modeling of cellular processes. Mass spectrometry itself is exquisitely sensitive, and reproducibility at the signal level is high. It is known, however, that a significant number of peptides, especially those with modifications, are missed in current computational analyses. We present a new approach to the computational interpretation of the experimental data that globally integrates all data of one or more experiments instead of interpreting spectra one by one. Instead of attempting to detect the presence of a protein or its fragments from individual signals (peaks) in a single mass spectrum, all data acquired across a whole experiment are first aligned into an (n+1)-dimensional space, where n is the number of dimensions used for the LC separation. This condenses all peaks generated by the same protein fragment throughout the experiment into a single dense signal, which allows a much better separation of signal and noise. The main computational challenge is to compensate for fluctuations in the separation process. We will present algorithms for the implementation of this approach, and demonstrate its usefulness using case studies in one and two dimensions. This is joint work with Amol Prakash and the research groups of Ruedi Aebersold and Amanda Paulovich in Seattle.

Improving Sensitivity by Combining Results from Multiple Search Methodologies

Database-searching programs generally identify only a fraction of the spectra acquired in a standard LC/MS/MS study of digested proteins. By using a mass-based alignment algorithm on de novo sequencing results, OpenSea can sometimes perform better than this because it can also identify modified peptides. However, OpenSea is dependent on de novo sequencing algorithms that usually cannot derive accurate sequences from low-quality MS/MS spectra. Conveniently, many database-searching programs are well suited for matching peptide sequences to low-quality data. To exploit these complementary strengths, we have developed an algorithm to probabilistically combine the results of multiple search engines, including SEQUEST, Mascot, X!Tandem, and OpenSea. We have found that we normally gain 5% to 20% more MS/MS spectrum identifications with each additional search engine we use, primarily due to increased confidence in low-scoring matches. In addition, we use rank-based clustering to mine information from the remaining spectra. First, we remove redundant results by clustering unmatched spectra to other spectra identified by the database-searching programs. Then we identify potentially interesting unmatched spectra by looking for spectral duplication and using high-quality filters. These results are singled out for further modification discovery analysis or manual interpretation.

High-Accuracy Mass Spectrometry for Composition Based de Novo Sequencing (CBS) of Unknown Peptides

With the easy availability of ultraprecise mass spectrometric data, the accurate mass of biomolecules is becoming a physical quantity of high interest in bioanalytical methodology. MALDI/ESI FT-ICR mass spectrometry, especially when combined with convenient and versatile ion-manipulating devices such as quadrupolar ion traps, now makes it easy to determine the amino acid composition of medium-size unknown peptides by combinatorial calculations on parent and fragment ion masses. This new method, which in a second step allows completely unknown peptides to be sequenced reliably ("Composition-Based Sequencing (CBS)" [1]), appears to open a wide new field of bioanalytical investigation in proteomics.


CBS appears to have some fundamental advantages over common de novo sequencing strategies, since it does not require prior knowledge of underlying fragmentation mechanisms or of peptide-specific ionization and fragmentation behavior. While classical strategies usually try to verify the presence of expected fragment ion signals, CBS instead interprets the observed accurate mass values of precursor and fragment ions with respect to possible amino acid combinations by means of combinatorial logic. The potential and limitations of the method will be discussed in the light of the expected evolution of high-accuracy mass spectrometry in the coming years.


[1] B. Spengler, De Novo Sequencing, Peptide Composition Analysis and Composition-based Sequencing: A new Strategy Employing Accurate Mass Determination by Fourier Transform Ion Cyclotron Resonance Mass Spectrometry. J Am Soc Mass Spectrom, 15 (2004) 704-715.

TBA

An enormous amount of biological data has been accumulated over the years, such as sequences, gene expression, protein physical interactions, genetic interactions, protein complexes, and protein localizations. For a given biological problem of interest, most data sources contribute some, but not all, of the relevant information. By combining different data sources intelligently, we are able to obtain a more complete picture of the problem of interest.


We present two examples of data integration. One is the estimation of reliability of observed protein interaction data sets using gene expressions and protein localizations. The integration of the two data sources can give a more accurate estimation of the reliability. The other example is protein function prediction combining protein interactions, complexes, and features of individual proteins based on a Markov Random Field model. We further study the relationship between gene lethality, protein interaction networks, and protein function annotation.


Joint work with Ting Chen.

Analysis of Variance Coupled with Principal Component Analysis for the Characterization of Potential Biomarkers

Many researchers have reported biomarkers from mass spectrometry (MS) experiments. However, in many cases no experimental design was reported and the biomarkers did not correspond to any reliable signals in the data. In many cases artificial intelligence is used after the data are collected rather than using natural intelligence in designing the experimental measurements. This paper proposes and demonstrates the power of a rational experimental design applied to preliminary experiments directed towards discovery of biomarkers in amniotic fluid.


Combining analysis of variance with principal component analysis (ANOVA/PCA) provides a powerful tool for the discovery of biomarkers in chemical measurements of biological systems. This approach encourages the use of experimental design to separate the variation of the experimental hypothesis from other potentially confounding sources of variation. When the effects of the experimental factors are greater than the residual error, the variable loadings of the principal components can be interpreted without requiring a supervised rotation, which avoids the curse of dimensionality that occurs with underdetermined data. A series of spectral score plots is obtained for each experimental factor, allowing easy interpretation by scientists who may not be proficient in advanced mathematical calculations. A conservative statistical test is presented to evaluate the significance of the experimental factors. Potential biomarker peaks can be validated through a univariate resolution measure.


Joint work with Nancy E. Vieira, Roberto Romero, and Peter de B. Harrington.