Intra-feature Random Forest Clustering

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10710)


Clustering algorithms are commonly used to find structure in data without being told explicitly what to look for. One key desideratum of a clustering algorithm is that the clusters it identifies from a given set of features generalize well to features that have not been measured. Yeung et al. (2001) introduce a Figure of Merit closely aligned with this desideratum and use it to evaluate clustering algorithms. Broadly, the Figure of Merit measures the within-cluster variance of features that were withheld from the clustering algorithm. Using this metric, Yeung et al. (2001) found no clustering algorithm that reliably outperformed k-means on a suite of real-world datasets. This paper presents a novel clustering algorithm, intra-feature random forest clustering (IRFC), that does outperform k-means on a variety of real-world datasets according to this metric. IRFC begins by training an ensemble of depth-limited decision trees to predict randomly selected features from the remaining features. It then aggregates the partitions implied by these trees and outputs the number of clusters specified by an input parameter.
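To make the evaluation metric concrete, here is a minimal sketch of a leave-one-feature-out Figure of Merit. It assumes the root-mean-square form commonly attributed to Yeung et al. (2001): for a held-out feature, take the square root of the mean squared deviation of each point from its cluster's mean on that feature, then sum over all held-out features. The function names, the exact normalization, and the toy clusterer are illustrative assumptions, not taken from the paper.

```python
import math
from collections import defaultdict


def figure_of_merit(held_out, labels):
    """RMS deviation of one held-out feature from its cluster means.

    held_out: values of the feature the clusterer did not see.
    labels:   cluster label assigned to each point.
    """
    groups = defaultdict(list)
    for value, label in zip(held_out, labels):
        groups[label].append(value)
    squared_dev = 0.0
    for values in groups.values():
        mean = sum(values) / len(values)
        squared_dev += sum((v - mean) ** 2 for v in values)
    return math.sqrt(squared_dev / len(held_out))


def aggregate_fom(rows, cluster_fn, k):
    """Hold out each feature in turn, cluster on the rest, sum the FOMs.

    cluster_fn(visible_rows, k) is any callable returning one label per row
    (assumed interface, so that k-means or any other method can be plugged in).
    """
    n_features = len(rows[0])
    total = 0.0
    for e in range(n_features):
        visible = [row[:e] + row[e + 1:] for row in rows]
        held_out = [row[e] for row in rows]
        total += figure_of_merit(held_out, cluster_fn(visible, k))
    return total


def toy_clusterer(rows, k):
    """Hypothetical stand-in clusterer: splits rows by the sign of their mean."""
    if k != 2:
        raise ValueError("toy_clusterer only supports k == 2")
    return [int(sum(row) / len(row) > 0) for row in rows]


def single_cluster(rows, k):
    """Baseline that puts every row in one cluster, ignoring k."""
    return [0 for _ in rows]


rows = [[-1.0, -1.2, -0.9], [-1.1, -0.8, -1.0],
        [1.0, 1.1, 0.9], [0.9, 1.0, 1.2]]
good = aggregate_fom(rows, toy_clusterer, k=2)
bad = aggregate_fom(rows, single_cluster, k=2)
print(good < bad)  # lower FOM means the clusters predict unseen features better
```

A lower aggregate value indicates that cluster membership, learned without a feature, still explains that feature well, which is the generalization property the abstract describes.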


Keywords: Cluster analysis, Random forest, Unsupervised learning, Ensemble, Figure of Merit


  1. Albaum, S.P., Hahne, H., Otto, A., Haußmann, U., Becher, D., Poetsch, A., Goesmann, A., Nattkemper, T.W.: A guide through the computational analysis of isotope-labeled mass spectrometry-based quantitative proteomics data: an application study. Proteome Sci. 9(1), 1 (2011)
  2. Becker, R.A., Caceres, R., Hanson, K., Loh, J.M., Urbanek, S., Varshavsky, A., Volinsky, C.: A tale of one city: using cellular network data for urban planning. IEEE Pervasive Comput. 10(4), 18–26 (2011)
  3. Ben-Hur, A., Elisseeff, A., Guyon, I.: A stability based method for discovering structure in clustered data. In: Pacific Symposium on Biocomputing, vol. 7, pp. 6–17, December 2001
  4. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
  5. Breiman, L.: Random Forests Manual v4.0. Technical report, UC Berkeley (2003)
  6. Chicco, G., Napoli, R., Piglione, F.: Application of clustering algorithms and self organising maps to classify electricity customers. In: Power Tech Conference Proceedings, 2003 IEEE Bologna, vol. 1, 7 pp. IEEE, June 2003
  7. Dudoit, S., Fridlyand, J.: A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biol. 3(7), 1 (2002)
  8. Ester, M., Kriegel, H.P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: KDD, vol. 96, no. 34, pp. 226–231, August 1996
  9. Harrigan, K.R.: An application of clustering for strategic group analysis. Strateg. Manag. J. 6(1), 55–73 (1985)
  10. Hilas, C.S., Mastorocostas, P.A.: An application of supervised and unsupervised learning approaches to telecommunications fraud detection. Knowl. Based Syst. 21(7), 721–726 (2008)
  11. Iliadis, L.S.: A decision support system applying an integrated fuzzy model for long-term forest fire risk estimation. Environ. Model. Softw. 20(5), 613–621 (2005)
  12. Kaufman, L., Rousseeuw, P.J.: Finding Groups in Data: An Introduction to Cluster Analysis, vol. 344. Wiley, Hoboken (2009)
  13. Krzanowski, W.J., Lai, Y.T.: A criterion for determining the number of groups in a data set using sum-of-squares clustering. Biometrics 44, 23–34 (1988)
  14. Li, A., Walling, J., Ahn, S., Kotliarov, Y., Su, Q., Quezado, M., Oberholtzer, J.C., Park, J., Zenklusen, J.C., Fine, H.A.: Unsupervised analysis of transcriptomic profiles reveals six glioma subtypes. Cancer Res. 69(5), 2091–2099 (2009)
  15. Masulli, F., Schenone, A.: A fuzzy clustering based segmentation system as support to diagnosis in medical imaging. Artif. Intell. Med. 16(2), 129–147 (1999)
  16. Monti, S., Tamayo, P., Mesirov, J., Golub, T.: Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data. Mach. Learn. 52(1–2), 91–118 (2003)
  17. Park, B.: Hybrid neuro-fuzzy application in short-term freeway traffic volume forecasting. Transp. Res. Rec. J. Transp. Res. Board 1802, 190–196 (2002)
  18. Pavlidis, N.G., Tasoulis, D.K., Vrahatis, M.N.: Financial forecasting through unsupervised clustering and evolutionary trained neural networks. In: The 2003 Congress on Evolutionary Computation (CEC 2003), vol. 4, pp. 2314–2321. IEEE, December 2003
  19. Pham, D.T.: Applications of unsupervised clustering algorithms to aircraft identification using high range resolution radar. In: Proceedings of the IEEE 1998 National Aerospace and Electronics Conference (NAECON 1998), pp. 228–235. IEEE, July 1998
  20. Singh, C., Kim, Y.: An efficient technique for reliability analysis of power systems including time dependent sources. IEEE Trans. Power Syst. 3(3), 1090–1096 (1988)
  21. Tibshirani, R., Walther, G., Hastie, T.: Estimating the number of clusters in a data set via the gap statistic. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 63(2), 411–423 (2001)
  22. Vega-Pons, S., Ruiz-Shulcloper, J.: A survey of clustering ensemble algorithms. Int. J. Pattern Recognit. Artif. Intell. 25(03), 337–372 (2011)
  23. Wang, C.H.: Apply robust segmentation to the service industry using kernel induced fuzzy clustering techniques. Expert Syst. Appl. 37(12), 8395–8400 (2010)
  24. Xu, R., Wunsch, D.: Survey of clustering algorithms. IEEE Trans. Neural Netw. 16(3), 645–678 (2005)
  25. Yeung, K.Y., Haynor, D.R., Ruzzo, W.L.: Validating clustering for gene expression data. Bioinformatics 17(4), 309–318 (2001)

Copyright information

© Springer International Publishing AG 2018

Authors and Affiliations

  1. Galvanize, San Francisco, USA
