Drug discovery is rapidly becoming an information-driven science. Harnessing large volumes of data for the rapid optimization of drug-like compounds requires new cheminformatics approaches that work effectively on a massive scale. This talk will highlight key algorithmic advances that expand, by several orders of magnitude, the number of compounds that can be assessed as potential drugs, and offer a preview of a new informatics platform being developed at J&J PRD for the effective delivery and visualization of structure-activity relationships. Particular emphasis will be placed on a novel self-organizing algorithm for extracting the intrinsic structure and dimensionality of large experimental observation spaces, and its application to some challenging problems in computational chemistry and biology.
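The general idea behind self-organizing dimensionality reduction can be illustrated with a minimal stochastic refinement sketch. This is an illustrative toy only, not the actual J&J PRD algorithm; the function names and parameters are hypothetical. Pairs of points are repeatedly sampled at random, and their low-dimensional images are nudged so that the embedded distance moves toward the observed distance:

```python
import numpy as np

def self_organizing_embed(D, dim=2, n_steps=50000, lr=1.0, seed=0):
    """Embed n points into `dim` dimensions so that pairwise embedded
    distances approximate the input distance matrix D (n x n).
    Minimal stochastic-refinement sketch, not a production algorithm."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    Y = rng.random((n, dim))                 # random initial coordinates
    for t in range(n_steps):
        i, j = rng.choice(n, size=2, replace=False)
        diff = Y[i] - Y[j]
        d = np.linalg.norm(diff) + 1e-9      # current embedded distance
        step = lr * (1 - t / n_steps)        # learning rate decays to zero
        # move the pair together or apart toward the target distance
        adjust = 0.5 * step * (D[i, j] - d) / d * diff
        Y[i] += adjust
        Y[j] -= adjust
    return Y

def stress(D, Y):
    """Sum of squared differences between target and embedded distances."""
    n = D.shape[0]
    s = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            s += (D[i, j] - np.linalg.norm(Y[i] - Y[j])) ** 2
    return s
```

On a small synthetic data set, the stress of the refined embedding falls far below that of the random starting configuration, which is the sense in which the map "self-organizes" around the intrinsic structure of the distances.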
We have examined the NCI public human tumor xenograft data to explore relationships between treatment modality, efficacy and toxicity. Efficacy endpoints of tumor weight reduction (TW) and survival time increase (ST) relative to tumor-bearing control mice were augmented by a toxicity measure, defined as the survival advantage of treated versus control animals (TX). These endpoints were used to define two independent therapeutic indices (TIs) as the ratio of efficacy (TW or ST) to toxicity (TX). Linear models predictive of xenograft endpoints were successfully constructed from variables describing treatment modality, chemical property descriptors, and in vitro cell growth inhibition in the NCI 60-cell assay. Cross-validation based on randomly chosen training subsets found these predictive correlations to be robust. Model-based sensitivity analysis found the chemical and growth-inhibition variables to be the best, and treatment modality the worst, indicators of xenograft endpoint; the poor predictive power of treatment alone suggests that treatment modality matters less to xenograft outcome for compounds with strongly similar chemical and biological features. ROC-based model validation found a 70% positive predictive value for distinguishing FDA-approved oncology agents from other xenograft-tested compounds. Additional model-based applications are provided that relate xenograft outcome to biological pathways and putative mechanisms of compound action. These results reveal a strong relationship between xenograft efficacy and pathways comprised of genes with highly correlated mRNA expression. Our analysis demonstrates that a combination of xenograft studies and in vitro preclinical testing offers an effective means to identify compound classes with superior efficacy and reduced toxicity.
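The therapeutic indices and the random-subset cross-validation described above can be sketched as follows. The numbers are synthetic stand-ins, not the NCI data, and the descriptor matrix and coefficients are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the endpoints (illustrative only):
# TW = tumor weight reduction, ST = survival time increase,
# TX = survival advantage of treated vs. control animals (toxicity).
n = 200
TX = rng.uniform(0.5, 2.0, n)
TW = rng.uniform(0.1, 5.0, n)
ST = rng.uniform(0.1, 5.0, n)
TI_tw = TW / TX                    # therapeutic index from tumor weight
TI_st = ST / TX                    # therapeutic index from survival time

# Linear model of an endpoint from descriptor variables, validated on
# randomly chosen training subsets as in the abstract.
X = rng.normal(size=(n, 5))                   # e.g. chemical + in vitro descriptors
beta = np.array([1.5, -2.0, 0.5, 0.0, 1.0])   # hypothetical true coefficients
y = X @ beta + 0.1 * rng.normal(size=n)

def cv_correlation(X, y, n_rounds=20, train_frac=0.7, seed=1):
    """Mean test-set correlation over random train/test splits."""
    rng = np.random.default_rng(seed)
    n = len(y)
    cors = []
    for _ in range(n_rounds):
        idx = rng.permutation(n)
        cut = int(train_frac * n)
        tr, te = idx[:cut], idx[cut:]
        design = np.column_stack([np.ones(len(tr)), X[tr]])
        coef, *_ = np.linalg.lstsq(design, y[tr], rcond=None)
        pred = np.column_stack([np.ones(len(te)), X[te]]) @ coef
        cors.append(np.corrcoef(pred, y[te])[0, 1])
    return float(np.mean(cors))
```

A model is judged robust, in this sketch, when the held-out correlation stays high across many random training subsets.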
Microarray technology is increasingly being utilized in different ways in genomic investigations of disease. Gene Logic has created a prototypic oncology database using multiple microarray platforms to investigate cross-platform correlations and develop methods for integrating different types of genomic information. A set of infiltrating ductal carcinomas of the breast and patient-matched morphologically normal breast samples has been evaluated by CGH arrays, three types of mRNA gene expression arrays, microRNA expression arrays, and SNP arrays. The initial step in this investigation was to assess aCGH results on this sample set in the context of known copy number gains and losses in breast cancer. Gene copy number gain at the 17q11-12 region (associated with erb-B2 amplification) was correlated with gene expression patterns as analyzed by three gene expression microarray platforms. Differences in miRNA expression between normal and cancer samples confirmed recent reports. High-level grouping of samples based on chromosomal aberration analysis combined with gene expression correlation may be a way to generate a candidate "biologically validated" biomarker gene set. Although multi-platform analysis of the same clinical sample is feasible, a new analytic approach must be defined before its value for clinical application can be exploited. This approach will likely require novel database structures for linking extremely large amounts of data, as well as innovative algorithms that join data of different types.
Genome-wide transcriptional analysis provides a comprehensive molecular representation of cellular activity, suggesting that mRNA expression profiling could serve as a practical universal functional bioassay. High-throughput high-density gene expression profiling solutions raise the possibility of capturing the consequences of small molecule and genetic perturbations at library and genome scale, respectively, and associating these disparate perturbagens with each other and external organic phenotypes to discover decisive functional connections between drugs, genes and diseases. The talk will describe our technology platform, analysis methods and interpretive tools, and will include examples illustrating how the expression profiles of a large collection of bioactive small molecules can be used to reveal signaling cascades, annotate complex phenotypes, predict adverse drug effects, and identify potential human therapeutics.
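One simple way to associate perturbagens through their expression profiles, offered here as a hedged sketch rather than the platform's actual scoring method, is to rank stored signatures by rank correlation with a query signature:

```python
import numpy as np

def rank_correlation(x, y):
    """Spearman-style rank correlation between two expression signatures."""
    rx = np.argsort(np.argsort(x))   # ranks of x
    ry = np.argsort(np.argsort(y))   # ranks of y
    return float(np.corrcoef(rx, ry)[0, 1])

def most_connected(query, signatures):
    """Rank perturbagen names by signature similarity to the query.
    `signatures` maps a perturbagen name to its expression signature."""
    return sorted(signatures,
                  key=lambda name: rank_correlation(query, signatures[name]),
                  reverse=True)
```

A strongly positive score suggests the query and the stored perturbagen drive similar transcriptional programs; a strongly negative score suggests opposing ones, which is the kind of "functional connection" between drugs, genes and diseases that the abstract describes.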
The linkage of chemical structures to biological activities is a major current challenge. Although the published literature contains many data points, they are typically inconsistently reported and inaccessible to large-scale analysis and data mining. In this talk, I outline the construction of a large-scale chemogenomics database, the issues in data preparation and clean-up, and some of the challenges in deploying such systems. Once constructed, however, these data provide an invaluable resource for the consistent and rapid identification of patterns in both target and compound space.
Similarity searching is one of the most widely used techniques for accessing databases of chemical structures and for supporting the lead-discovery stage of pharmaceutical research. Given a user-defined, bioactive reference structure, a similarity search involves comparing the reference structure with each of the database structures, computing the inter-molecular structural similarity in each case, and then ranking the database in order of decreasing similarity score. Many measures of structural similarity have been reported in the literature, but by far the most widely used are measures based on the comparison of 2D fingerprints: binary vectors denoting the presence of small substructural fragments in molecules. The Similar Property Principle states that molecules that are structurally similar will also exhibit similar properties; thus molecules that are structurally similar to a bioactive reference structure are also likely to exhibit the chosen activity.
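The fingerprint-based search just described can be sketched in a few lines. The fingerprints here are toy sets of "on" bit positions standing in for real 2D fragment fingerprints:

```python
def tanimoto(a, b):
    """Tanimoto coefficient between two binary fingerprints,
    each represented as the set of its 'on' bit positions."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def similarity_search(reference, database):
    """Rank database structures by decreasing Tanimoto similarity
    to the reference fingerprint. `database` maps name -> fingerprint."""
    scored = [(name, tanimoto(reference, fp)) for name, fp in database.items()]
    return sorted(scored, key=lambda t: t[1], reverse=True)
```

With a reference fingerprint {1, 2, 3, 4} and a database entry {1, 2, 5, 6}, the two fingerprints share 2 bits out of 6 set in either, giving a Tanimoto score of 1/3; an identical structure scores 1.0 and heads the ranked list.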
Systems for chemical similarity searching have been available for many years, but there is ongoing interest in techniques that could enhance their retrieval effectiveness. The paper summarises recent work in Sheffield to this end (see [1] for an overview of these studies). A detailed comparison of a large number of similarity coefficients demonstrates that the well-known Tanimoto coefficient remains the method of choice for the computation of fingerprint-based similarity, despite possessing some inherent biases related to the sizes of the molecules that are being sought. Group fusion involves combining the results of similarity searches based on multiple reference structures and a single similarity measure. We demonstrate the effectiveness of this approach to screening, and also describe an approximate form of group fusion, turbo similarity searching, that can be used when just a single reference structure is available.
1. Willett, P. "Similarity-based virtual screening using 2D fingerprints." Drug Discovery Today, 11, 2006, 1046-1053.
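Group fusion and its turbo variant can be sketched as follows. This is an illustrative toy using set-based Tanimoto fingerprints (redefined here so the snippet is self-contained), with MAX fusion as the combination rule:

```python
def tanimoto(a, b):
    """Tanimoto coefficient between two set-based binary fingerprints."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def group_fusion(references, database):
    """Score each database structure by its maximum similarity to any
    of the reference structures (MAX fusion), then rank by that score."""
    scored = [(name, max(tanimoto(ref, fp) for ref in references))
              for name, fp in database.items()]
    return sorted(scored, key=lambda t: t[1], reverse=True)

def turbo_similarity_search(reference, database, k=2):
    """Approximate group fusion from one reference: treat the reference's
    k nearest neighbours in the database as additional references."""
    ranked = sorted(database.items(),
                    key=lambda t: tanimoto(reference, t[1]), reverse=True)
    refs = [reference] + [fp for _, fp in ranked[:k]]
    return group_fusion(refs, database)
```

The turbo variant rests on the Similar Property Principle: the reference's nearest neighbours are themselves likely to be active, so fusing searches around them approximates a genuine multi-reference search at no extra experimental cost.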
Microarray, proteomics, metabolomics and related technologies all produce data sets in which there are many more predictor variables than observations (p ≫ n). We present relatively new methods for the analysis of such p ≫ n data sets.
There are typically correlations among the variables; indeed, with so many predictors they cannot all be independent of one another. These correlations can be exploited to improve the statistical analysis. New inference methods will be presented that combine statistical testing with non-negative matrix factorization. The methods will be demonstrated on microarray and metabolomic data sets. Papers and code for these methods can be found at www.niss.org/irMF.
The lecture will be divided into three parts: Introduction, Robust Singular Value Decomposition, and Non-negative Matrix Factorization. This is joint work with Paul Fogel, a statistical consultant in Paris, France, and Doug Hawkins, a statistician at the University of Minnesota.
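A minimal sketch of non-negative matrix factorization, using the standard Lee-Seung multiplicative updates for the Frobenius objective, illustrates the core decomposition; the irMF methods discussed in the lecture combine such a factorization with statistical testing, which this toy omits:

```python
import numpy as np

def nmf(V, k, n_iter=1000, seed=0):
    """Factor a non-negative matrix V (n x p) as W @ H, with
    W (n x k) and H (k x p) both non-negative, via multiplicative
    updates that monotonically reduce ||V - W @ H||_F."""
    rng = np.random.default_rng(seed)
    n, p = V.shape
    W = rng.random((n, k)) + 0.1     # positive random start
    H = rng.random((k, p)) + 0.1
    eps = 1e-9                       # guards against division by zero
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

For p ≫ n data, k is chosen much smaller than both n and p, so the many correlated predictors are summarized by a few non-negative factors; the non-negativity keeps the factors interpretable as additive parts of the observed profiles.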