A formally based parallelization of data mining algorithms for multi-core systems

Kholod, Ivan; Shorov, Andrey; Titkov, Evgenii; Gorlatch, Sergei

doi:10.1007/s11227-018-2473-8

Published: 07 July 2018

Volume 75, pages 7909–7920, (2019)
The Journal of Supercomputing

Ivan Kholod ORCID: orcid.org/0000-0002-7255-5035¹,
Andrey Shorov¹,
Evgenii Titkov¹ &
…
Sergei Gorlatch²

Abstract

We describe a novel, systematic approach to efficiently parallelizing data mining algorithms: starting with the representation of an algorithm as a sequential composition of functions, we formally transform it into a parallel form using higher-order functions for specifying parallelism. We implement the approach as an extension of the industrial-strength Java-based library Xelopes, and we illustrate its use by developing a multi-threaded Java program for the popular naive Bayes classification algorithm. In comparison with the popular MapReduce programming model, our resulting programs enable not only data-parallel, but also task-parallel implementation and a combination of both. Our experiments demonstrate an efficient parallelization and good scalability on multi-core processors.
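The idea of writing an algorithm as a composition of functions and then parallelizing one step with a higher-order function can be illustrated with a minimal sketch. It is not the Xelopes API and not the authors' implementation: the class `ParallelCompositionSketch` and the methods `split`, `parallel`, `countChunk`, `mergeCounts`, and `train` are hypothetical names chosen for illustration, and the sketch assumes categorical attributes with the class label in the last column. It shows only the data-parallel case for the counting phase of naive Bayes; task parallelism is not covered.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.BinaryOperator;
import java.util.function.Function;

// Hypothetical illustration; none of these names come from Xelopes or the paper.
class ParallelCompositionSketch {

    /** Splits the data into at most `parts` contiguous chunks (parts must be > 0). */
    static <T> List<List<T>> split(List<T> data, int parts) {
        List<List<T>> chunks = new ArrayList<>();
        int size = (data.size() + parts - 1) / parts;
        for (int i = 0; i < data.size(); i += size) {
            chunks.add(data.subList(i, Math.min(i + size, data.size())));
        }
        return chunks;
    }

    /** Higher-order function: runs f on every chunk in a thread pool, then folds the results with merge. */
    static <D, M> M parallel(List<D> chunks, Function<D, M> f,
                             BinaryOperator<M> merge, M empty, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<M>> futures = new ArrayList<>();
            for (D chunk : chunks) {
                Callable<M> task = () -> f.apply(chunk);
                futures.add(pool.submit(task));
            }
            M result = empty;
            for (Future<M> future : futures) {
                result = merge.apply(result, future.get());
            }
            return result;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    /** Sequential step: counts class labels and (attribute, value, class) co-occurrences. */
    static Map<String, Integer> countChunk(List<String[]> rows) {
        Map<String, Integer> counts = new HashMap<>();
        for (String[] row : rows) {
            String label = row[row.length - 1];
            counts.merge("class=" + label, 1, Integer::sum);
            for (int a = 0; a < row.length - 1; a++) {
                counts.merge(a + "=" + row[a] + "|" + label, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Merges two partial count tables by summing counts per key; inputs are not modified. */
    static Map<String, Integer> mergeCounts(Map<String, Integer> x, Map<String, Integer> y) {
        Map<String, Integer> out = new HashMap<>(x);
        y.forEach((key, value) -> out.merge(key, value, Integer::sum));
        return out;
    }

    /** Parallel form of the counting phase: split, count each chunk concurrently, merge. */
    static Map<String, Integer> train(List<String[]> data, int threads) {
        return parallel(split(data, threads),
                ParallelCompositionSketch::countChunk,
                ParallelCompositionSketch::mergeCounts,
                new HashMap<String, Integer>(), threads);
    }
}
```

Because counting is additive, `train(data, n)` should return the same table as `countChunk(data)` for any positive `n`; that equivalence is what allows the sequential step to be replaced by its parallel form in this sketch.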

Acknowledgements

This work was supported by the Ministry of Education and Science of the Russian Federation in the framework of the state order “Organization of Scientific Research,” task 2.6113.2017/6.7, and by the German Ministry of Education and Research (BMBF) in the framework of the HPC2SE project at the University of Muenster.

Author information

Authors and Affiliations

Saint Petersburg Electrotechnical University “LETI”, Saint Petersburg, Russia
Ivan Kholod, Andrey Shorov & Evgenii Titkov
University of Muenster, Münster, Germany
Sergei Gorlatch

Corresponding author

Correspondence to Ivan Kholod.

About this article

Cite this article

Kholod, I., Shorov, A., Titkov, E. et al. A formally based parallelization of data mining algorithms for multi-core systems. J Supercomput 75, 7909–7920 (2019). https://doi.org/10.1007/s11227-018-2473-8

Published: 07 July 2018
Issue Date: December 2019
DOI: https://doi.org/10.1007/s11227-018-2473-8
