Big Data Frequent Pattern Mining

Abstract

Frequent pattern mining is an essential data mining task, with a goal of discovering knowledge in the form of repeated patterns. Many efficient pattern mining algorithms have been discovered in the last two decades, yet most do not scale to the type of data we are presented with today, the so-called “Big Data”. Scalable parallel algorithms hold the key to solving the problem in this context. In this chapter, we review recent advances in parallel frequent pattern mining, analyzing them through the Big Data lens. We identify three areas as challenges to designing parallel frequent pattern mining algorithms: memory scalability, work partitioning, and load balancing. With these challenges as a frame of reference, we extract and describe key algorithmic design patterns from the wealth of research conducted in this domain.

Author information

Corresponding author

Correspondence to David C. Anastasiu.

Copyright information

© 2014 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Anastasiu, D., Iverson, J., Smith, S., Karypis, G. (2014). Big Data Frequent Pattern Mining. In: Aggarwal, C., Han, J. (eds) Frequent Pattern Mining. Springer, Cham. https://doi.org/10.1007/978-3-319-07821-2_10

  • DOI: https://doi.org/10.1007/978-3-319-07821-2_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-07820-5

  • Online ISBN: 978-3-319-07821-2

  • eBook Packages: Computer Science (R0)
