The "top ten" are just the tail: A correlation method to uncover the rest of the elephant in high throughput biology data

Sarah Wheelan
Oncology Biostatistics and Bioinformatics, Johns Hopkins University

(February 27, 2012 2:30 PM - 3:30 PM)

Abstract

Biologists often obtain sequencing data to explore the details of a system with which they are extremely familiar; however, analysis techniques tend to exclude these experts and often rely on assumptions that may not be relevant to the experimental design. While biologists can manually explore their data using newer, high-capacity genome browsers, and can often suggest relevant hypotheses for statistical testing, fully informed and thorough data exploration is impossible to do by eye. We have created a biologically based and statistically grounded tool for determining the correlation of genome-wide data with other datasets or known biological features. It is intended to guide biological exploration of high-dimensional datasets and to act as a hypothesis generator, not to provide "answers". The software enables several biologically motivated approaches to these data; in fact, each analytical approach was inspired by our own work. Our models and statistics are implemented in an R package that efficiently calculates the spatial correlation between two sets of genomic intervals (data and/or annotated features), for use as a metric of functional interaction. The software is accessible from the command line, through a Tk interface, and through a Galaxy plugin, and is intended to help biologists and statisticians reach the significant features of high-dimensional datasets more quickly.
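
To make the idea of spatial correlation between two interval sets concrete, here is a minimal Python sketch. It is not the R package described in the abstract and does not reproduce its models or API. It assumes a single chromosome of known length, half-open intervals [start, end) no longer than the chromosome, a Jaccard overlap as the correlation measure, and a circular-shift permutation test as the null model. All function names are hypothetical.

```python
import random


def merge(intervals):
    """Merge overlapping or touching half-open intervals [start, end)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def covered(intervals):
    """Total number of bases covered by already-merged intervals."""
    return sum(e - s for s, e in intervals)


def jaccard(a, b):
    """Jaccard overlap: bases shared by a and b divided by bases in either."""
    a, b = merge(a), merge(b)
    i = j = inter = 0
    # Two-pointer sweep over both sorted, merged lists.
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            inter += hi - lo
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    union = covered(a) + covered(b) - inter
    return inter / union if union else 0.0


def circular_shift(intervals, offset, length):
    """Shift intervals by offset, wrapping around the chromosome end."""
    shifted = []
    for s, e in intervals:
        s2 = (s + offset) % length
        e2 = s2 + (e - s)
        if e2 <= length:
            shifted.append((s2, e2))
        else:
            # Split an interval that runs past the end of the chromosome.
            shifted.append((s2, length))
            shifted.append((0, e2 - length))
    return shifted


def permutation_p_value(a, b, length, n_perm=1000, seed=0):
    """Observed Jaccard overlap and an empirical one-sided p-value."""
    rng = random.Random(seed)
    observed = jaccard(a, b)
    hits = sum(
        jaccard(circular_shift(a, rng.randrange(length), length), b) >= observed
        for _ in range(n_perm)
    )
    # Add-one correction keeps the estimate away from exactly zero.
    return observed, (hits + 1) / (n_perm + 1)
```

For example, `permutation_p_value(peaks, genes, length=chromosome_length)` would return the observed overlap together with the fraction of random shifts that overlap at least as well, where `peaks`, `genes`, and `chromosome_length` are placeholders for the user's own data. A circular shift preserves the spacing of intervals within a set, which is one common way to build a null distribution for data of this kind; other measures and null models are equally plausible.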