Big Data Frequent Pattern Mining

Anastasiu, David C.; Iverson, Jeremy; Smith, Shaden; Karypis, George

doi:10.1007/978-3-319-07821-2_10

Abstract

Frequent pattern mining is an essential data mining task, with a goal of discovering knowledge in the form of repeated patterns. Many efficient pattern mining algorithms have been discovered in the last two decades, yet most do not scale to the type of data we are presented with today, the so-called “Big Data”. Scalable parallel algorithms hold the key to solving the problem in this context. In this chapter, we review recent advances in parallel frequent pattern mining, analyzing them through the Big Data lens. We identify three areas as challenges to designing parallel frequent pattern mining algorithms: memory scalability, work partitioning, and load balancing. With these challenges as a frame of reference, we extract and describe key algorithmic design patterns from the wealth of research conducted in this domain.

Author information

Authors and Affiliations

Department of Computer Science and Engineering, University of Minnesota, Minneapolis, USA
David C. Anastasiu, Jeremy Iverson, Shaden Smith & George Karypis

Corresponding author

Correspondence to David C. Anastasiu.

Editor information

Editors and Affiliations

IBM, Yorktown Heights, New York, USA
Charu C. Aggarwal
University of Illinois at Urbana-Champaign, Urbana, Illinois, USA
Jiawei Han

About this chapter

Cite this chapter

Anastasiu, D., Iverson, J., Smith, S., Karypis, G. (2014). Big Data Frequent Pattern Mining. In: Aggarwal, C., Han, J. (eds) Frequent Pattern Mining. Springer, Cham. https://doi.org/10.1007/978-3-319-07821-2_10

DOI: https://doi.org/10.1007/978-3-319-07821-2_10
Published: 30 August 2014
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-07820-5
Online ISBN: 978-3-319-07821-2
eBook Packages: Computer Science, Computer Science (R0)
