A highly scalable parallel algorithm for maximally informative k-itemset mining

Salah, Saber; Akbarinia, Reza; Masseglia, Florent

doi:10.1007/s10115-016-0931-2

Regular paper
Published: 22 March 2016

Volume 50, pages 1–26 (2017)

Knowledge and Information Systems

Saber Salah¹,
Reza Akbarinia¹ &
Florent Masseglia¹

463 Accesses
9 Citations

Abstract

The discovery of informative itemsets is a fundamental building block in data analytics and information retrieval. While the problem has been widely studied, only a few solutions scale. This is particularly the case when (1) the data set is massive, calling for large-scale distribution, and/or (2) the length k of the informative itemset to be discovered is high. In this paper, we address the problem of parallel mining of maximally informative k-itemsets (miki) based on joint entropy. We propose PHIKS (Parallel Highly Informative K-ItemSet), a highly scalable, parallel miki mining algorithm. PHIKS renders the mining process of large-scale databases (up to terabytes of data) succinct and effective. Its mining process is made up of only two efficient parallel jobs. With PHIKS, we provide a set of significant optimizations for calculating the joint entropies of miki of different sizes, which drastically reduce the execution time, the communication cost and the energy consumption on a distributed computing platform. PHIKS has been extensively evaluated using massive real-world data sets. Our experimental results confirm the effectiveness of our proposal through the significant scale-up obtained with long itemsets and over very large databases.
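To make the notion of a maximally informative k-itemset concrete, the following is a minimal single-machine sketch, not the PHIKS algorithm itself. It assumes each item is treated as a binary presence/absence feature over the transactions and that the miki is the k-itemset with the highest joint entropy. The function names `joint_entropy` and `brute_force_miki` and the toy data are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations
from math import log2


def joint_entropy(transactions, itemset):
    """Joint entropy (in bits) of the presence/absence features in `itemset`."""
    items = tuple(itemset)
    n = len(transactions)
    # Project every transaction onto the itemset: one True/False flag per item.
    projections = Counter(tuple(item in t for item in items) for t in transactions)
    # Shannon entropy of the empirical distribution of the projections.
    return -sum((c / n) * log2(c / n) for c in projections.values())


def brute_force_miki(transactions, k):
    """Exhaustively return one k-itemset with maximal joint entropy.

    Illustration only: this enumerates every k-combination, which is exactly
    the cost that scalable approaches aim to avoid.
    """
    universe = sorted(set().union(*transactions))
    return max(combinations(universe, k),
               key=lambda s: joint_entropy(transactions, s))


# Toy database (assumed data): item "c" occurs everywhere, so it carries
# no information, while "a" and "b" split the transactions evenly.
toy = [{"a", "b", "c"}, {"a", "c"}, {"b", "c"}, {"c"}]
best = brute_force_miki(toy, 2)
```

On this toy database the pair of items "a" and "b" separates all four transactions, so its joint entropy reaches the 2-bit maximum for k = 2, whereas any pair containing "c" reaches only 1 bit.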



Acknowledgments

Experiments presented in this paper were carried out using the Grid’5000 experimental testbed, being developed under the INRIA ALADDIN development action with support from CNRS, RENATER and several universities as well as other funding bodies (see https://www.grid5000.fr).

Author information

Authors and Affiliations

Inria and Lirmm, Zenith Team, University of Montpellier 2, Bâtiment 5, CC 05 018 Campus St Priest - 860 rue St Priest, 34095, Montpellier Cedex 5, France
Saber Salah, Reza Akbarinia & Florent Masseglia

Corresponding author

Correspondence to Saber Salah.

About this article

Cite this article

Salah, S., Akbarinia, R. & Masseglia, F. A highly scalable parallel algorithm for maximally informative k-itemset mining. Knowl Inf Syst 50, 1–26 (2017). https://doi.org/10.1007/s10115-016-0931-2

Received: 13 November 2015
Revised: 18 January 2016
Accepted: 03 March 2016
Published: 22 March 2016
Issue Date: January 2017
DOI: https://doi.org/10.1007/s10115-016-0931-2
