Abstract
Frequent pattern mining is an essential data mining task whose goal is to discover knowledge in the form of repeated patterns. Many efficient pattern mining algorithms have been developed in the last two decades, yet most do not scale to the type of data we are presented with today, the so-called “Big Data”. Scalable parallel algorithms hold the key to solving the problem in this context. In this chapter, we review recent advances in parallel frequent pattern mining, analyzing them through the Big Data lens. We identify three areas as challenges to designing parallel frequent pattern mining algorithms: memory scalability, work partitioning, and load balancing. With these challenges as a frame of reference, we extract and describe key algorithmic design patterns from the wealth of research conducted in this domain.
© 2014 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Anastasiu, D., Iverson, J., Smith, S., Karypis, G. (2014). Big Data Frequent Pattern Mining. In: Aggarwal, C., Han, J. (eds) Frequent Pattern Mining. Springer, Cham. https://doi.org/10.1007/978-3-319-07821-2_10
DOI: https://doi.org/10.1007/978-3-319-07821-2_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-07820-5
Online ISBN: 978-3-319-07821-2
eBook Packages: Computer Science (R0)