Cluster Forests (CF) is a cluster ensemble algorithm inspired by Random Forests (RF); it can be conveniently viewed as the clustering version of RF. Of crucial importance is the view of clustering as optimization over some cluster quality measure. This view effectively unifies two important problems in machine learning and pattern recognition, clustering and classification, so that methodologies developed for classification over the last several decades can be readily applied to clustering. A concrete implementation of this idea is feature selection under the kappa measure, which is the ratio of the between-cluster sum of squares to the within-cluster sum of squares.
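To make the kappa measure concrete, here is a minimal Python sketch that computes the ratio as described above. It assumes centroid-based sums of squares; the function name `kappa` and the toy data are illustrative and are not taken from the R implementation, which may define the quantities differently (for example through pairwise squared distances). With the ratio written this way, larger values indicate better-separated clusters.

```python
def kappa(points, labels):
    """Between-cluster sum of squares divided by within-cluster sum of squares.

    Assumption: both sums are measured around cluster centroids.
    """
    n = len(points)
    dim = len(points[0])
    # Centroid of the whole dataset.
    grand = [sum(p[d] for p in points) / n for d in range(dim)]
    # Group the points by cluster label.
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    between = 0.0
    within = 0.0
    for members in clusters.values():
        m = len(members)
        center = [sum(p[d] for p in members) / m for d in range(dim)]
        # Size-weighted squared distance from cluster centroid to grand centroid.
        between += m * sum((center[d] - grand[d]) ** 2 for d in range(dim))
        # Squared distances from each member to its cluster centroid.
        within += sum((p[d] - center[d]) ** 2 for p in members for d in range(dim))
    return between / within if within > 0 else float("inf")


# Toy data: two tight groups far apart along the first coordinate.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
good = kappa(points, [0, 0, 1, 1])  # split along the informative coordinate
bad = kappa(points, [0, 1, 0, 1])   # split along the uninformative coordinate
print(good, bad)
```

On this toy data, the split along the informative coordinate scores far higher than the split along the uninformative one, which is the behavior feature selection under kappa relies on.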

Geometrically, CF randomly probes a high-dimensional data cloud to obtain “good local clusterings” and then aggregates them via spectral clustering to obtain cluster assignments for the whole dataset. The search for a good local clustering is guided by the cluster quality measure kappa. CF progressively improves each local clustering in a fashion that resembles tree growth in RF or projection pursuit in the context of regression.
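The growth of a single local clustering can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes k-means as the base clustering algorithm, tries one randomly chosen candidate feature per step, and stops after a fixed number of consecutive failures. The helper names (`kmeans`, `project`, `grow_local_clustering`) and the parameter `max_failures` are hypothetical.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm; the first k points seed the centers."""
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def kappa(points, labels):
    """Between-cluster over within-cluster sum of squares (centroid-based)."""
    n, dim = len(points), len(points[0])
    grand = [sum(p[d] for p in points) / n for d in range(dim)]
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    between = within = 0.0
    for members in clusters.values():
        center = [sum(col) / len(members) for col in zip(*members)]
        between += len(members) * sq_dist(center, grand)
        within += sum(sq_dist(p, center) for p in members)
    return between / within if within > 0 else float("inf")


def project(data, features):
    """Restrict every data row to the selected feature indices."""
    return [[row[f] for f in features] for row in data]


def grow_local_clustering(data, k, rng, max_failures=3):
    """Greedily add features while kappa improves (simplified sketch)."""
    n_features = len(data[0])
    selected = [rng.randrange(n_features)]  # random starting feature
    labels = kmeans(project(data, selected), k)
    best = kappa(project(data, selected), labels)
    failures = 0
    while failures < max_failures and len(selected) < n_features:
        candidate = rng.choice([f for f in range(n_features) if f not in selected])
        trial = selected + [candidate]
        trial_labels = kmeans(project(data, trial), k)
        score = kappa(project(data, trial), trial_labels)
        if score > best:  # keep the feature only if the clustering improves
            selected, labels, best = trial, trial_labels, score
            failures = 0
        else:
            failures += 1
    return selected, labels, best


# Toy data: feature 0 separates two groups; features 1 and 2 are noise.
data = [
    [0.0, 0.3, 5.1], [10.0, 0.1, 5.0], [0.2, 0.2, 4.9],
    [10.1, 0.4, 5.2], [0.1, 0.0, 5.0], [9.9, 0.3, 5.1],
]
selected, labels, best = grow_local_clustering(data, 2, random.Random(0))
print(selected, labels, best)
```

Repeating this procedure from many random starting features would yield the collection of local clusterings that CF then aggregates.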

Nice features of CF:

- CF favors strong features and is noise-resistant;
- The cluster aggregation algorithm used by CF achieves an error rate that vanishes exponentially fast under a stochastic block model;
- Under a spherical Gaussian distributional assumption, CF is Bayes consistent.
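As an illustration of the input to the aggregation step, the sketch below builds a co-cluster affinity matrix from several local clusterings. It assumes aggregation starts from the fraction of clusterings in which two points share a cluster; the function name `co_cluster_matrix` is hypothetical, and any thresholding or scaling, as well as the spectral clustering step itself, is omitted.

```python
def co_cluster_matrix(clusterings):
    """Fraction of clusterings in which each pair of points shares a cluster."""
    n = len(clusterings[0])
    affinity = [[0.0] * n for _ in range(n)]
    for labels in clusterings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    affinity[i][j] += 1.0
    count = len(clusterings)
    return [[v / count for v in row] for row in affinity]


# Three local clusterings of four points.
clusterings = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],  # same partition as the first, with labels permuted
    [0, 1, 1, 1],  # a noisier local clustering
]
affinity = co_cluster_matrix(clusterings)
for row in affinity:
    print(row)
```

Because only co-membership is counted, the matrix is unaffected by label permutations across clusterings. A spectral clustering routine applied to such a matrix would then produce the final cluster assignments.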

Empirical studies on several real-world datasets under two different performance metrics show that CF compares favorably to its competitors.

R implementation Download (original version)

Nystromized version (2019/05), with Nystromized spectral clustering as the cluster aggregation engine

Multicore version (2019/09); the speedup depends on the number of cores. For example, a 2-3x speedup on a MacBook Air with 2 physical cores and 4 logical cores.

Example datasets (from UC Irvine Machine Learning Repository) Download
- Soybean
- SPECT
- Heart
- Wine
- WDBC
- Robot (lp5)
- Madelon

You are welcome to send questions, comments, suggestions, or to report bugs to us. Thank you!

Citation

[1] D. Yan, A. Chen and M. I. Jordan. Cluster Forests. Computational Statistics and Data Analysis, Vol 66, 178-192, 2013. arXiv:1104.2930.

[2] D. Yan, A. Chen and M. I. Jordan. On the Bayes consistency of Cluster Forests. 2018 (submitted).