Cluster Forests

Cluster Forests (CF) is a cluster ensemble algorithm inspired by Random Forests (RF); it can be conveniently viewed as the clustering version of RF. Of crucial importance is the view of clustering as optimization over some cluster quality measure. This view effectively unifies two important problems in machine learning and pattern recognition, clustering and classification, so methodologies that have been under heavy development for classification over the last several decades can be readily applied to clustering. A concrete implementation of this idea is feature selection under the kappa measure, which is the ratio of the between-cluster sum of squares to the within-cluster sum of squares.
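As a minimal sketch of the kappa measure as described above (between-cluster sum of squares divided by within-cluster sum of squares, so larger values mean better-separated clusters), the following plain-Python function may help. The function name `kappa`, its arguments, and the handling of a zero within-cluster sum are assumptions made for illustration, not the authors' implementation.

```python
def kappa(points, labels):
    """Between-cluster over within-cluster sum of squares (illustrative).

    points: list of equal-length numeric tuples
    labels: one cluster label per point
    """
    n = len(points)
    dim = len(points[0])
    # Grand mean of the whole dataset, one coordinate at a time.
    grand = [sum(p[j] for p in points) / n for j in range(dim)]

    # Group the points by their cluster label.
    groups = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, []).append(p)

    between = 0.0
    within = 0.0
    for members in groups.values():
        centroid = [sum(p[j] for p in members) / len(members) for j in range(dim)]
        # Size-weighted squared distance from the centroid to the grand mean.
        between += len(members) * sum((c - g) ** 2 for c, g in zip(centroid, grand))
        # Squared distances from each member to its own centroid.
        within += sum((x - c) ** 2 for p in members for x, c in zip(p, centroid))

    # Assumption: treat perfectly tight clusters as infinitely good.
    return between / within if within > 0 else float("inf")


points = [(0, 0), (0, 2), (10, 0), (10, 2)]
good = kappa(points, [0, 0, 1, 1])  # clusters split along the wide axis
bad = kappa(points, [0, 1, 0, 1])   # clusters split along the narrow axis
```

Here the labeling that separates the two distant groups scores far higher than the one that mixes them, which is the behavior a feature selection step guided by this ratio would rely on.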





Geometrically, CF randomly probes a high-dimensional data cloud to obtain “good local clusterings” and then aggregates them via spectral clustering to obtain cluster assignments for the whole dataset. The search for a good local clustering is guided by the cluster quality measure kappa. CF progressively improves each local clustering in a fashion that resembles tree growth in RF or projection pursuit in the context of regression.
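The probing and aggregation steps described above can be sketched roughly as follows. This is a simplified illustration under several assumptions, not the published algorithm: the base clusterer is a plain k-means, a candidate feature is kept only if it raises kappa (taken here as between-cluster over within-cluster sum of squares), and aggregation is a co-cluster frequency matrix. All function names and parameters (`grow_local_clustering`, `co_cluster_matrix`, `n_trees`, `max_tries`) are hypothetical. The final spectral clustering step is not implemented here.

```python
import random


def kmeans(rows, k, rng, iters=20):
    """Plain Lloyd's k-means on a list of tuples; returns one label per row."""
    centers = rng.sample(rows, k)
    labels = [0] * len(rows)
    for _ in range(iters):
        labels = [min(range(k),
                      key=lambda c: sum((x - y) ** 2 for x, y in zip(r, centers[c])))
                  for r in rows]
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:  # keep the old center if a cluster went empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def kappa(rows, labels):
    """Between-cluster over within-cluster sum of squares."""
    grand = [sum(col) / len(rows) for col in zip(*rows)]
    between = within = 0.0
    for lab in set(labels):
        members = [r for r, l in zip(rows, labels) if l == lab]
        cen = [sum(col) / len(members) for col in zip(*members)]
        between += len(members) * sum((c - g) ** 2 for c, g in zip(cen, grand))
        within += sum((x - c) ** 2 for r in members for x, c in zip(r, cen))
    return between / within if within > 0 else float("inf")


def project(data, feats):
    """Restrict every row to the selected feature indices."""
    return [tuple(row[j] for j in feats) for row in data]


def grow_local_clustering(data, k, rng, max_tries=10):
    """Start from one random feature and greedily add features that raise kappa."""
    n_feats = len(data[0])
    feats = rng.sample(range(n_feats), 1)
    labels = kmeans(project(data, feats), k, rng)
    best = kappa(project(data, feats), labels)
    for _ in range(max_tries):
        unused = [j for j in range(n_feats) if j not in feats]
        if not unused:
            break
        cand = feats + [rng.choice(unused)]
        cand_labels = kmeans(project(data, cand), k, rng)
        score = kappa(project(data, cand), cand_labels)
        if score > best:  # accept the candidate only if quality improves
            feats, labels, best = cand, cand_labels, score
    return labels


def co_cluster_matrix(data, k, n_trees=20, seed=0):
    """Fraction of local clusterings in which each pair of points shares a cluster."""
    rng = random.Random(seed)
    n = len(data)
    counts = [[0] * n for _ in range(n)]
    for _ in range(n_trees):
        labels = grow_local_clustering(data, k, rng)
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / n_trees for c in row] for row in counts]
```

The returned matrix can serve as a precomputed affinity for a spectral clustering routine (for example, scikit-learn's `SpectralClustering` with `affinity="precomputed"`) to produce the final assignments.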
Nice features of CF:
Empirical studies on several real-world datasets under two different performance metrics show that CF compares favorably to its competitors.




Citation

[1] D. Yan, A. Chen and M. I. Jordan. Cluster Forests. Computational Statistics and Data Analysis, Vol. 66, pp. 178-192, 2013. arXiv:1104.2930.
[2] D. Yan, A. Chen and M. I. Jordan. On the Bayes consistency of Cluster Forests. 2018 (submitted).