Learning from Past Observations: Meta-Learning for Efficient Clustering Analyses

Fritz, Manuel; Tschechlov, Dennis; Schwarz, Holger

doi:10.1007/978-3-030-59065-9_28

Conference paper
First Online: 11 September 2020

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12393)

Abstract

Many clustering algorithms require the number of clusters as an input parameter prior to execution. Since the “best” number of clusters is most often unknown in advance, analysts typically execute clustering algorithms multiple times with varying parameters and subsequently choose the most promising result. Several methods for the automated estimation of suitable parameters have been proposed. Similar to the procedure of an analyst, these estimation methods draw on repetitive executions of a clustering algorithm with varying parameters. However, when working with voluminous datasets, each single execution tends to be very time-consuming. Especially in today’s Big Data era, such repetitive execution of a clustering algorithm is not feasible for efficient exploration. We propose a novel and efficient approach to accelerate the estimation of the number of clusters in datasets. Our approach relies on the idea of meta-learning and terminates each execution of the clustering algorithm as soon as an expected qualitative demand is met. We show that this new approach is generally applicable, i.e., it can be used with existing estimation methods. Our comprehensive evaluation reveals that our approach is able to speed up the estimation of the number of clusters by an order of magnitude, while still achieving accurate estimates.
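
As an illustration only, the following Python sketch mimics the procedure described in the abstract: a clustering algorithm is executed once per candidate number of clusters, each result is scored, and the best-scoring candidate is returned. It is not the authors' implementation. The use of k-means with a farthest-point initialisation, the silhouette coefficient as validity measure, and the fixed relative-improvement threshold `tol` are assumptions made for this example; in the paper, the point at which an execution may be terminated is derived via meta-learning rather than set by hand. All function names are hypothetical.

```python
import math


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_point_init(points, k):
    """Deterministic seeding (an assumption of this sketch): start at the
    first point, then repeatedly add the point farthest from all seeds."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, tol, max_iter=100):
    """Lloyd-style k-means that stops once the relative decrease of the
    sum of squared errors (SSE) is at most `tol` (tol=0.0: run until the
    SSE no longer decreases). Returns (labels, iterations)."""
    centroids = farthest_point_init(points, k)
    prev_sse, labels, iterations = None, [], 0
    for iterations in range(1, max_iter + 1):
        labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                  for p in points]
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
        sse = sum(sq_dist(p, centroids[l]) for p, l in zip(points, labels))
        if prev_sse is not None and prev_sse - sse <= tol * prev_sse:
            break  # improvement too small: terminate this execution early
        prev_sse = sse
    return labels, iterations


def silhouette(points, labels):
    """Mean silhouette coefficient; points in singleton clusters score 0."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    if len(clusters) < 2:
        return 0.0
    total = 0.0
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            continue
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != l)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def estimate_k(points, k_values, tol):
    """Run k-means once per candidate k and return the k with the highest
    silhouette coefficient, together with all scores."""
    scores = {k: silhouette(points, kmeans(points, k, tol)[0])
              for k in k_values}
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    # Three well-separated synthetic groups of nine points each.
    data = [(cx + dx, cy + dy)
            for cx, cy in [(0, 0), (10, 0), (0, 10)]
            for dx in (-0.5, 0, 0.5) for dy in (-0.5, 0, 0.5)]
    best_k, scores = estimate_k(data, range(2, 7), tol=0.05)
    print(best_k, scores)
```

In this sketch, a larger `tol` never increases the number of iterations of a single run, since the stopping condition is met at least as early. Whether the resulting estimate remains accurate depends on the data and on the chosen threshold.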

Notes

1. https://archive.ics.uci.edu/ml/datasets.html

Acknowledgements

This research was partially funded by the Ministry of Science of Baden-Württemberg, Germany, for the Doctoral Program ‘Services Computing’. Some work presented in this paper was performed in the project ‘INTERACT’ as part of the Software Campus program, which is funded by the German Federal Ministry of Education and Research (BMBF) under Grant No.: 01IS17051.

Author information

Authors and Affiliations

University of Stuttgart, Universitätsstraße 38, 70569, Stuttgart, Germany
Manuel Fritz, Dennis Tschechlov & Holger Schwarz

Corresponding author

Correspondence to Manuel Fritz.

Editor information

Editors and Affiliations

Department of Library and Information, Yonsei University, Seoul, Korea (Republic of)
Min Song
Drexel University, Philadelphia, PA, USA
Il-Yeol Song
Johannes Kepler University of Linz, Linz, Austria
Gabriele Kotsis
Software Competence Center Hagenberg (Au), Vienna, Wien, Austria
A Min Tjoa
Johannes Kepler University of Linz, Linz, Oberösterreich, Austria
Ismail Khalil

About this paper

Cite this paper

Fritz, M., Tschechlov, D., Schwarz, H. (2020). Learning from Past Observations: Meta-Learning for Efficient Clustering Analyses. In: Song, M., Song, IY., Kotsis, G., Tjoa, A.M., Khalil, I. (eds) Big Data Analytics and Knowledge Discovery. DaWaK 2020. Lecture Notes in Computer Science, vol 12393. Springer, Cham. https://doi.org/10.1007/978-3-030-59065-9_28

DOI: https://doi.org/10.1007/978-3-030-59065-9_28
Published: 11 September 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-59064-2
Online ISBN: 978-3-030-59065-9
eBook Packages: Computer Science, Computer Science (R0)
