Multi-Sample Data Spectroscopic Clustering of Large Datasets using Nystrom Extension

Tao Shi
Statistics, The Ohio State University

(April 8, 2010 11:30 AM - 12:18 PM)

Multi-Sample Data Spectroscopic Clustering of Large Datasets using Nystrom Extension

Abstract

Spectral clustering algorithms have shown promising results in statistics, bioinformatics, genetic study, machine learning, and other scientific fields. These spectral algorithms cluster observations (of size n) into groups by investigating eigenvectors of an affinity matrix or its corresponding Laplacian matrix, both of which are size of n by n. However, the computation involved in eigen-decompostion of an n by n matrix is expensive or even infeasible when the sample size is large. To overcome the computation hurdle, subsampling techniques, such as Nystrom extension, have been used in approximating eigenvectors of large matrices. In this talk, we discuss the statistical properties of such approximation and their influence on the accuracy of various spectral clustering algorithms. We found that the perturbation of spectrum due to subsampling could lead to large discrepancy among clustering results based on different subsamples. In order to provide accurate and stable clustering results for large datasets, we propose a method to combine multiple sub-samples using data spectroscopic clustering and the Nystrom extension. In addition, we propose a sparse approximation of the eigenvectors to further speed up the computation. Simulation and experiments on real data sets showed that this multi-sample approach is fast and accuracy.