Knowledge and Information Systems, Volume 50, Issue 1, pp 1–26

A highly scalable parallel algorithm for maximally informative k-itemset mining

  • Saber Salah
  • Reza Akbarinia
  • Florent Masseglia
Regular paper

Abstract

The discovery of informative itemsets is a fundamental building block in data analytics and information retrieval. While the problem has been widely studied, only a few solutions scale. This is particularly the case when (1) the data set is massive, calling for large-scale distribution, and/or (2) the length k of the informative itemsets to be discovered is high. In this paper, we address the problem of parallel mining of maximally informative k-itemsets (miki) based on joint entropy. We propose PHIKS (Parallel Highly Informative \(\underline{K}\)-ItemSet), a highly scalable, parallel miki mining algorithm. PHIKS makes the mining of large-scale databases (up to terabytes of data) succinct and effective: its mining process consists of only two efficient parallel jobs. With PHIKS, we provide a set of significant optimizations for computing the joint entropies of miki of different sizes, which drastically reduces the execution time, the communication cost and the energy consumption on a distributed computational platform. PHIKS has been extensively evaluated on massive real-world data sets. Our experimental results confirm the effectiveness of our proposal through the significant scale-up obtained with high itemset lengths and over very large databases.
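To make the objective concrete: a maximally informative k-itemset is the size-k set of items whose joint entropy over the database is highest. The sketch below is not the PHIKS algorithm (which distributes this computation over two parallel jobs); it is a minimal sequential illustration of the joint-entropy criterion being optimized, with an exhaustive search standing in for the scalable mining process. All names (`joint_entropy`, `miki`, the toy database) are illustrative.

```python
import math
from itertools import combinations
from collections import Counter

def joint_entropy(transactions, itemset):
    """Joint entropy H(itemset): project each transaction onto the itemset
    as a tuple of presence bits, then compute the Shannon entropy of the
    empirical distribution of those projections."""
    counts = Counter(
        tuple(item in t for item in itemset) for t in transactions
    )
    n = len(transactions)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def miki(transactions, items, k):
    """Exhaustive baseline: return the size-k itemset with the highest
    joint entropy (the maximally informative k-itemset)."""
    return max(combinations(sorted(items), k),
               key=lambda s: joint_entropy(transactions, s))

# Toy database: 4 transactions over items a, b, c.
db = [{"a", "b"}, {"a", "c"}, {"b", "c"}, {"a"}]
best = miki(db, {"a", "b", "c"}, 2)  # ('b', 'c'): its 4 projections are all distinct
```

The exhaustive search enumerates all \(\binom{n}{k}\) candidates, which is exactly the combinatorial explosion that makes a scalable, distributed approach necessary for high k and terabyte-scale data.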

Keywords

Joint entropy · Informative itemsets · Massive distribution · MapReduce · Spark · Hadoop · Big data

Notes

Acknowledgments

Experiments presented in this paper were carried out using the Grid’5000 experimental testbed, which is being developed under the INRIA ALADDIN development action with support from CNRS, RENATER and several universities as well as other funding bodies (see https://www.grid5000.fr).

Copyright information

© Springer-Verlag London 2016

Authors and Affiliations

  • Saber Salah (1)
  • Reza Akbarinia (1)
  • Florent Masseglia (1)

  1. Inria and Lirmm, Zenith Team, University of Montpellier 2, Montpellier Cedex 5, France