Abstract
We describe a novel, systematic approach to efficiently parallelizing data mining algorithms: starting with the representation of an algorithm as a sequential composition of functions, we formally transform it into a parallel form using higher-order functions for specifying parallelism. We implement the approach as an extension of the industrial-strength Java-based library Xelopes, and we illustrate its use by developing a multi-threaded Java program for the popular naive Bayes classification algorithm. In comparison with the popular MapReduce programming model, the resulting programs support not only data-parallel but also task-parallel implementations, as well as a combination of both. Our experiments demonstrate efficient parallelization and good scalability on multi-core processors.
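To make the idea of a higher-order function that specifies parallelism more concrete, here is a minimal Java sketch. It is not the Xelopes extension described in the article and does not use any Xelopes API. The class name, the method names `countChunk`, `mergeCounts` and `parallel`, and the string-keyed count table are all hypothetical, chosen only for illustration. The sketch assumes categorical attributes with the class label in the last column, and it covers only the counting step of naive Bayes training in a data-parallel form.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.BinaryOperator;
import java.util.function.Function;

// Hypothetical illustration; none of these names come from the article or from Xelopes.
final class ParallelNaiveBayesSketch {

    // Sequential step: count class labels and (label, attribute, value) triples.
    // Assumption: every row is categorical and its last column is the class label.
    public static Map<String, Integer> countChunk(List<String[]> rows) {
        Map<String, Integer> counts = new HashMap<>();
        for (String[] row : rows) {
            String label = row[row.length - 1];
            counts.merge(label, 1, Integer::sum);
            for (int a = 0; a < row.length - 1; a++) {
                counts.merge(label + "|" + a + "=" + row[a], 1, Integer::sum);
            }
        }
        return counts;
    }

    // Combines two partial count tables by adding counts key by key.
    public static Map<String, Integer> mergeCounts(Map<String, Integer> left,
                                                   Map<String, Integer> right) {
        Map<String, Integer> merged = new HashMap<>(left);
        right.forEach((key, value) -> merged.merge(key, value, Integer::sum));
        return merged;
    }

    // Higher-order function: applies `step` to disjoint chunks of `data` on a
    // thread pool and folds the partial results with `combine`.
    // Returns null for empty input.
    public static <T, M> M parallel(Function<List<T>, M> step, BinaryOperator<M> combine,
                                    List<T> data, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            int chunkSize = Math.max(1, (data.size() + threads - 1) / threads);
            List<Future<M>> futures = new ArrayList<>();
            for (int start = 0; start < data.size(); start += chunkSize) {
                List<T> chunk = data.subList(start, Math.min(data.size(), start + chunkSize));
                Callable<M> task = () -> step.apply(chunk);
                futures.add(pool.submit(task));
            }
            M result = null;
            for (Future<M> future : futures) {
                M partial = future.get();
                result = (result == null) ? partial : combine.apply(result, partial);
            }
            return result;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("parallel step failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
                new String[] {"sunny", "hot", "no"},
                new String[] {"rainy", "mild", "yes"},
                new String[] {"sunny", "mild", "yes"},
                new String[] {"rainy", "hot", "no"});
        Map<String, Integer> sequential = countChunk(rows);
        Map<String, Integer> inParallel =
                parallel(ParallelNaiveBayesSketch::countChunk,
                         ParallelNaiveBayesSketch::mergeCounts, rows, 2);
        // Both count tables should agree, because counting is additive over chunks.
        System.out.println(sequential.equals(inParallel));
    }
}
```

The sequential function is passed unchanged to `parallel`, which is one way to keep the algorithm's definition separate from the decision about how to run it. A task-parallel variant could, under the same assumptions, submit different functions (rather than the same function on different chunks) to the pool.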
Acknowledgements
This work was supported by the Ministry of Education and Science of the Russian Federation in the framework of the state order “Organization of Scientific Research,” task 2.6113.2017/6.7, and by the German Ministry of Education and Research (BMBF) in the framework of the HPC2SE project at the University of Muenster.
Cite this article
Kholod, I., Shorov, A., Titkov, E. et al. A formally based parallelization of data mining algorithms for multi-core systems. J Supercomput 75, 7909–7920 (2019). https://doi.org/10.1007/s11227-018-2473-8
DOI: https://doi.org/10.1007/s11227-018-2473-8