
Data Mining and Knowledge Discovery, Volume 29, Issue 3, pp 732–764

On measuring similarity for sequences of itemsets

  • Elias Egho
  • Chedy Raïssi
  • Toon Calders
  • Nicolas Jay
  • Amedeo Napoli

Abstract

Computing the similarity between sequences is a very important challenge for many different data mining tasks. There is a plethora of similarity measures for sequences in the literature, most of them being designed for sequences of items. In this work, we study the problem of measuring the similarity between sequences of itemsets. We focus on the notion of common subsequences as a way to measure similarity between a pair of sequences composed of a list of itemsets. We present new combinatorial results for efficiently counting distinct and common subsequences. These theoretical results are the cornerstone of an effective dynamic programming approach to deal with this problem. In addition, we propose an approximate method to speed up the computation process for long sequences. We have applied our method to various data sets: healthcare trajectories, online handwritten characters and synthetic data. Our results confirm that our measure of similarity produces competitive scores and indicate that our method is relevant for large scale sequential data analysis.
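The idea of comparing two sequences of itemsets through their common subsequences can be illustrated with a small brute-force sketch. This is not the dynamic programming method the abstract refers to: it simply enumerates every distinct subsequence, which is only feasible for very short inputs. The function names and the normalisation (number of common subsequences divided by the larger of the two distinct-subsequence counts) are assumptions made for illustration, not details taken from the text above.

```python
from itertools import combinations


def nonempty_subsets(itemset):
    """Yield every non-empty subset of an itemset as a frozenset."""
    items = sorted(itemset)
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def distinct_subsequences(seq):
    """Return the set of all distinct subsequences of a sequence of itemsets.

    A subsequence keeps the order of the chosen itemsets and replaces each
    chosen itemset by one of its non-empty subsets. The empty subsequence
    () is included. Exponential in the input size: illustration only.
    """
    result = {()}
    for itemset in seq:
        # Extend every subsequence found so far with a subset of this itemset.
        extended = {
            prefix + (subset,)
            for prefix in result
            for subset in nonempty_subsets(itemset)
        }
        result |= extended
    return result


def similarity(s, t):
    """Hypothetical normalised score in [0, 1] based on common subsequences."""
    subs_s = distinct_subsequences(s)
    subs_t = distinct_subsequences(t)
    common = subs_s & subs_t
    return len(common) / max(len(subs_s), len(subs_t))


if __name__ == "__main__":
    s = [{"a", "b"}, {"c"}]
    t = [{"a"}, {"c"}]
    print(len(distinct_subsequences(s)))  # 8
    print(len(distinct_subsequences(t)))  # 4
    print(similarity(s, t))               # 0.5
```

In this toy example every subsequence of the second sequence also occurs in the first, so the score is 4/8. Because the enumeration grows exponentially with sequence length and itemset size, an efficient counting scheme such as the one the abstract describes is needed for realistic data.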

Copyright information

© The Author(s) 2014

Authors and Affiliations

  • Elias Egho (1, corresponding author)
  • Chedy Raïssi (2)
  • Toon Calders (3)
  • Nicolas Jay (1)
  • Amedeo Napoli (1)

  1. LORIA, Vandoeuvre-les-Nancy, France
  2. INRIA Nancy Grand Est, Nancy, France
  3. Université Libre de Bruxelles, Brussels, Belgium
