Fast approximate spectral clustering is a general framework we proposed recently for the spectral clustering of large datasets. The aim is to preserve the virtue of spectral clustering yet being able to meet the scalability challenge of emerging data mining tasks. It grows out of our work on the perturbation analysis of spectral clustering [1] as well as extensive empirical studies. The main idea is to perform a distortion minimizing local transformation on the data to get a small number of representative points on which the cluster membership of all data points can be derived. There are three steps involved:

Distortion minimizing local transformation on the data to get a small set of representative points which preserve its structure.
Spectral clustering on the set of representative points.
Recovery of the cluster membership for all data points by their correspondence to the representative points.

Since the distortion minimizing local transformation can be computed with fast procedures while the expensive spectral clustering is only performed on the representative points, large speedup can be achieved. Our theory predicts that if the distortion resulted from the data "compression" is small, then algorithms implemented within our framework will incur very little loss in clustering "accuracy". As it turns out, only points near the class boundary may be impacted; but such points carry only a small fraction of the total number of points if the decision boundary is not terribly complicated so the loss in "accuracy" will be very small for methods under our framework.

Currently we have two concrete implementations for distortion minimizing local transformation, one based on K-means clustering and the other on random projection trees. These two are chosen for their favorable empirical computational efficiency and the simplicity of their implementation. Evidently other approaches can be adopted if the resulting distortion can be made small. For more details on the general framework, please refer to [1]. The linear computational complexity of these two implementations makes it highly desirable to apply such a framework to the spectral clustering of big data.

It is worthwhile to note that, our algorithmic framework and the associated perturbation analysis extends far beyond spectral clustering as a general recipe for the fast approximate computing of many computation-intensive algorithms, for instance, principal component analysis, kernel machine-based algorithms, and semi-supervised learning etc.

R implementation Download (UNIX/Linux, Windows)

This implementation is completely self-contained except a reference to standard R library 'MASS'.

Example datasets (from UC Irvine Machine Learning Repository) Download
You are welcome to send questions, comments, suggestions, or to report bugs to us. Thank you!

Citation

[1] D. Yan, L. Huang and M. I. Jordan. Fast approximate spectral clustering. 15th ACM Conference on Knowledge Discovery and Data Mining (SIGKDD), Paris, France, 907-916, 2009. [Long version].

[2] L. Huang, D. Yan, M. I. Jordan and N. Taft. Spectral clustering with perturbed data. NIPS 2008 (Spotlight), 705-712.

Slides [kdd2009 ] Video [kdd2009 ]