Quality-driven early stopping for explorative cluster analysis for big data

Fritz, Manuel; Behringer, Michael; Schwarz, Holger

doi:10.1007/s00450-019-00401-0

Special Issue Paper
Published: 06 February 2019

Volume 34, pages 129–140, (2019)

SICS Software-Intensive Cyber-Physical Systems

Abstract

Data analysis has become a critical success factor for companies in all areas. Hence, it is necessary to quickly gain knowledge from available datasets, which is becoming especially challenging in times of big data. Typical data mining tasks like cluster analysis are very time-consuming even if they run in highly parallel environments like Spark clusters. To support data scientists in explorative data analysis processes, we need techniques to make data mining tasks even more efficient. To this end, we introduce a novel approach to stop clustering algorithms as early as possible while still achieving an adequate quality of the detected clusters. Our approach exploits the iterative nature of many clustering algorithms and uses a metric to decide after which iteration the mining task should stop. We present experimental results based on a Spark cluster using multiple huge datasets. The experiments show that our approach is able to accelerate the clustering by a factor of more than 800 by eliminating many iterations that yield only little gain in quality. This way, we are able to find a good balance between the time required for data analysis and the quality of the analysis results.
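
The general idea of stopping an iterative clustering algorithm once further iterations bring little quality gain can be illustrated with a minimal sketch. The abstract does not specify the metric or threshold the authors use, so everything below is an assumption made for illustration: plain Lloyd-style k-means in NumPy (not Spark), the sum of squared errors (SSE) as the quality measure, and a hypothetical parameter `min_rel_gain` as the stopping threshold. It is not the authors' method.

```python
import numpy as np


def kmeans_early_stop(points, k, max_iter=100, min_rel_gain=0.01, seed=0):
    """Lloyd-style k-means that stops once an iteration lowers the SSE
    by no more than `min_rel_gain` relative to the previous SSE.

    `min_rel_gain` and the SSE-based rule are illustrative assumptions,
    not the metric from the paper.
    """
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Use k distinct input points as initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    prev_sse = None
    for iteration in range(1, max_iter + 1):
        # Assignment step: squared distance of each point to each centroid.
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        sse = float(dists[np.arange(len(points)), labels].sum())
        # Quality check: stop when the relative SSE gain is too small.
        if prev_sse is not None and prev_sse - sse <= min_rel_gain * prev_sse:
            break
        prev_sse = sse
        # Update step: move each centroid to the mean of its members;
        # a centroid without members keeps its previous position.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids, labels, iteration, sse


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    offsets = np.array([[0.0, 0.0], [8.0, 8.0], [0.0, 8.0]])
    data = np.vstack([rng.normal(size=(200, 2)) + o for o in offsets])
    _, _, n_iter, sse = kmeans_early_stop(data, k=3, min_rel_gain=0.01)
    print(f"stopped after {n_iter} iterations, SSE = {sse:.2f}")
```

Setting `min_rel_gain=0.0` in this sketch makes the loop run until the SSE stops decreasing, which gives a baseline for comparing the number of iterations saved against the quality lost.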

Acknowledgements

This research was partially funded by the Ministry of Science of Baden-Württemberg, Germany, for the Doctoral Program ‘Services Computing’. Some work presented in this paper was performed within the project ‘INTERACT’ as part of the Software Campus program. This project is funded by the German Federal Ministry of Education and Research (BMBF), Grant No. 01IS17051. Finally, we thank Dennis Tschechlov for his implementation work.

Author information

Authors and Affiliations

Institute for Parallel and Distributed Systems, University of Stuttgart, Universitätsstr. 38, 70569, Stuttgart, Germany
Manuel Fritz, Michael Behringer & Holger Schwarz

Corresponding author

Correspondence to Manuel Fritz.

About this article

Cite this article

Fritz, M., Behringer, M. & Schwarz, H. Quality-driven early stopping for explorative cluster analysis for big data. SICS Softw.-Intensiv. Cyber-Phys. Syst. 34, 129–140 (2019). https://doi.org/10.1007/s00450-019-00401-0

Published: 06 February 2019
Issue Date: 01 June 2019
DOI: https://doi.org/10.1007/s00450-019-00401-0
