A formally based parallelization of data mining algorithms for multi-core systems

The Journal of Supercomputing

Abstract

We describe a novel, systematic approach to efficiently parallelizing data mining algorithms: starting with the representation of an algorithm as a sequential composition of functions, we formally transform it into a parallel form using higher-order functions for specifying parallelism. We implement the approach as an extension of the industrial-strength Java-based library Xelopes, and we illustrate its use by developing a multi-threaded Java program for the popular naive Bayes classification algorithm. In comparison with the popular MapReduce programming model, our resulting programs enable not only data-parallel, but also task-parallel implementation and a combination of both. Our experiments demonstrate an efficient parallelization and good scalability on multi-core processors.
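
The idea of expressing an algorithm as a composition of functions and then parallelizing it with a higher-order function can be illustrated with a minimal sketch. This is not the Xelopes API and not the authors' implementation: the class `ParallelCountsSketch`, the `Row` type, and the functions `countChunk`, `merge`, `parallel`, and `train` are hypothetical names chosen for illustration, and only the data-parallel case is shown, using standard Java streams. The sketch covers the counting step of naive Bayes training, assuming categorical attributes: each data chunk is counted independently and the partial count tables are then merged.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;
import java.util.function.Function;

class ParallelCountsSketch {

    // Hypothetical training record: a class label plus categorical attribute values.
    static final class Row {
        final String label;
        final String[] attrs;
        Row(String label, String... attrs) { this.label = label; this.attrs = attrs; }
    }

    // Sequential building block: count class and (attribute value, class) occurrences in one chunk.
    static Map<String, Integer> countChunk(List<Row> chunk) {
        Map<String, Integer> counts = new HashMap<>();
        for (Row r : chunk) {
            counts.merge("class|" + r.label, 1, Integer::sum);
            for (int i = 0; i < r.attrs.length; i++) {
                counts.merge(i + "=" + r.attrs[i] + "|" + r.label, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Combine two partial count tables into a new table without mutating either input.
    static Map<String, Integer> merge(Map<String, Integer> a, Map<String, Integer> b) {
        Map<String, Integer> out = new HashMap<>(a);
        b.forEach((k, v) -> out.merge(k, v, Integer::sum));
        return out;
    }

    // Higher-order function: apply f to every chunk in parallel, then reduce with combine.
    // combine must be associative and identity must be neutral for it.
    static <T, R> R parallel(List<T> chunks, Function<T, R> f, R identity, BinaryOperator<R> combine) {
        return chunks.parallelStream().map(f).reduce(identity, combine);
    }

    // Data-parallel counting step: the composition "count each chunk, then merge".
    static Map<String, Integer> train(List<List<Row>> chunks) {
        Map<String, Integer> empty = new HashMap<>();
        return parallel(chunks, ParallelCountsSketch::countChunk, empty, ParallelCountsSketch::merge);
    }

    public static void main(String[] args) {
        List<List<Row>> chunks = Arrays.asList(
            Arrays.asList(new Row("no", "sunny"), new Row("yes", "rainy")),
            Arrays.asList(new Row("yes", "sunny"), new Row("yes", "sunny")));
        Map<String, Integer> counts = train(chunks);
        System.out.println("class|yes -> " + counts.get("class|yes"));
        System.out.println("0=sunny|yes -> " + counts.get("0=sunny|yes"));
    }
}
```

Because `merge` returns a fresh map and is associative, the runtime may combine the partial results in any grouping, which is what makes the parallel form equivalent to the sequential one in this sketch.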

Acknowledgements

This work was supported by the Ministry of Education and Science of the Russian Federation in the framework of the state order “Organization of Scientific Research,” task 2.6113.2017/6.7, and by the German Ministry of Education and Research (BMBF) in the framework of the HPC2SE project at the University of Muenster.

Author information

Corresponding author

Correspondence to Ivan Kholod.

About this article

Cite this article

Kholod, I., Shorov, A., Titkov, E. et al. A formally based parallelization of data mining algorithms for multi-core systems. J Supercomput 75, 7909–7920 (2019). https://doi.org/10.1007/s11227-018-2473-8

  • DOI: https://doi.org/10.1007/s11227-018-2473-8