A novel mapreduce algorithm for distributed mining of sequential patterns using co-occurrence information

Applied Intelligence

Abstract

The Sequential Pattern Mining (SPM) problem has been widely studied and extended in several directions. With the tremendous growth in dataset sizes, traditional algorithms do not scale. To address scalability, a few researchers have recently developed distributed algorithms based on MapReduce. However, the existing MapReduce algorithms require multiple rounds of MapReduce, which increases communication and scheduling overhead. They also do not address the handling of long sequences. They generate a huge number of candidate sequences that do not appear in the input database, which enlarges the search space and increases the number of candidates whose support must be counted. Our algorithm is a two-phase MapReduce algorithm that generates only promising candidate sequences by applying pruning strategies. This reduces the search space and makes support computation more efficient. We make use of item co-occurrence information, and the proposed Sequence Index List (SIL) data structure helps compute support quickly. The experimental results show that the proposed algorithm outperforms the existing MapReduce algorithms for the SPM problem.
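
To make the role of co-occurrence information concrete, the following is a minimal single-machine sketch of co-occurrence-based candidate pruning. It is not the authors' MapReduce algorithm and does not implement the SIL data structure, whose details are not given in the abstract. It assumes sequences of single items, an absolute support threshold `minsup`, and hypothetical helper names (`build_cooccurrence`, `passes_cooccurrence`, `mine_length3`). The idea shown is only this: if item b follows item a in fewer than `minsup` sequences, no pattern containing a before b can be frequent, so such candidates can be discarded before any support counting.

```python
from itertools import combinations, product


def build_cooccurrence(db):
    """For each ordered pair (a, b), count the sequences in which b occurs after a."""
    counts = {}
    for seq in db:
        seen_before = set()   # items seen so far in this sequence
        pairs = set()         # ordered pairs found in this sequence (counted once)
        for item in seq:
            for prev in seen_before:
                pairs.add((prev, item))
            seen_before.add(item)
        for pair in pairs:
            counts[pair] = counts.get(pair, 0) + 1
    return counts


def passes_cooccurrence(candidate, cooc, minsup):
    """Keep a candidate only if every ordered pair of its positions co-occurs often enough."""
    return all(
        cooc.get((candidate[i], candidate[j]), 0) >= minsup
        for i, j in combinations(range(len(candidate)), 2)
    )


def is_subsequence(pattern, seq):
    """True if pattern appears in seq in order, not necessarily contiguously."""
    it = iter(seq)
    return all(item in it for item in pattern)


def support(pattern, db):
    """Number of sequences in db that contain pattern as a subsequence."""
    return sum(is_subsequence(pattern, seq) for seq in db)


def mine_length3(db, minsup):
    """Return frequent length-3 patterns and how many candidates needed support counting."""
    cooc = build_cooccurrence(db)
    items = sorted({item for seq in db for item in seq})
    frequent, counted = {}, 0
    for cand in product(items, repeat=3):
        if not passes_cooccurrence(cand, cooc, minsup):
            continue  # pruned without scanning the database
        counted += 1
        sup = support(cand, db)
        if sup >= minsup:
            frequent[cand] = sup
    return frequent, counted


# Toy database (illustrative data, not from the paper).
db = [
    ["a", "b", "c", "d"],
    ["a", "c", "b", "d"],
    ["b", "a", "d"],
    ["a", "b", "d", "c"],
]
frequent, counted = mine_length3(db, minsup=3)
print(frequent, counted)
```

On this toy database there are 64 possible length-3 candidates over four items, but only one survives the co-occurrence check and has its support counted. In a distributed setting the co-occurrence counts would presumably be gathered in an early MapReduce phase and reused in a later one; that mapping onto phases is an assumption here, not something the abstract specifies.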

Author information

Correspondence to Sumalatha Saleti.

Cite this article

Saleti, S., Subramanyam, R.B.V. A novel mapreduce algorithm for distributed mining of sequential patterns using co-occurrence information. Appl Intell 49, 150–171 (2019). https://doi.org/10.1007/s10489-018-1259-2

  • DOI: https://doi.org/10.1007/s10489-018-1259-2
