Quality-driven early stopping for explorative cluster analysis for big data
Data analysis has become a critical success factor for companies in all areas. Hence, it is necessary to quickly gain knowledge from available datasets, which is especially challenging in times of big data. Typical data mining tasks like cluster analysis are very time-consuming even if they run in highly parallel environments like Spark clusters. To support data scientists in explorative data analysis processes, we need techniques that make data mining tasks more efficient. To this end, we introduce a novel approach to stop clustering algorithms as early as possible while still achieving an adequate quality of the detected clusters. Our approach exploits the iterative nature of many clustering algorithms and uses a metric to decide after which iteration the mining task should stop. We present experimental results based on a Spark cluster using multiple huge datasets. The experiments show that our approach is able to accelerate the clustering by a factor of more than 800 in the best case by skipping many iterations that provide only little gain in quality. This way, we are able to find a good balance between the time required for data analysis and the quality of the analysis results.
Keywords: Clustering, Big data, Early stop, Convergence, Regression
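The general idea of stopping an iterative clustering algorithm once further iterations yield little quality gain can be illustrated with a minimal sketch. The abstract does not specify the metric the authors use, so the code below assumes a simple stand-in: the relative decrease of the sum of squared errors (SSE) between consecutive iterations of Lloyd's k-means. The function name `kmeans_early_stop` and the parameter `min_rel_gain` are hypothetical and chosen for illustration only; this is not the paper's implementation and it runs on a single machine rather than on Spark.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_early_stop(points, k, min_rel_gain=0.01, max_iter=100, seed=0):
    """Lloyd's k-means with a quality-driven stop (illustrative assumption).

    Stops as soon as the relative SSE improvement of one iteration
    falls below `min_rel_gain` (a hypothetical threshold).
    Returns the centroids and the SSE recorded in each iteration.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # simple random seeding
    history = []
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        sse = 0.0
        for p in points:
            dists = [sq_dist(p, c) for c in centroids]
            j = dists.index(min(dists))
            clusters[j].append(p)
            sse += dists[j]
        history.append(sse)

        # Early-stop check: was the last iteration worth its cost?
        if len(history) > 1:
            prev = history[-2]
            if prev == 0 or (prev - sse) / prev < min_rel_gain:
                break

        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(x) / len(c) for x in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, history


if __name__ == "__main__":
    rng = random.Random(1)
    data = [
        (rng.gauss(cx, 1.0), rng.gauss(cy, 1.0))
        for cx, cy in [(0, 0), (8, 8), (0, 8)]
        for _ in range(200)
    ]
    _, sse_history = kmeans_early_stop(data, k=3, min_rel_gain=0.01)
    print(f"stopped after {len(sse_history)} iterations")
```

In this sketch, a larger `min_rel_gain` stops earlier and trades cluster quality for runtime, while setting it to `0.0` disables the early stop so the loop runs until `max_iter`.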
This research was partially funded by the Ministry of Science of Baden-Württemberg, Germany, for the Doctoral Program ‘Services Computing’. Some work presented in this paper was performed within the project ‘INTERACT’ as part of the Software Campus program. This project is funded by the German Federal Ministry of Education and Research (BMBF), Grant No. 01IS17051. Finally, we thank Dennis Tschechlov for his implementation work.