On measuring similarity for sequences of itemsets

Data Mining and Knowledge Discovery

Abstract

Computing the similarity between sequences is a very important challenge for many different data mining tasks. There is a plethora of similarity measures for sequences in the literature, most of them being designed for sequences of items. In this work, we study the problem of measuring the similarity between sequences of itemsets. We focus on the notion of common subsequences as a way to measure similarity between a pair of sequences composed of a list of itemsets. We present new combinatorial results for efficiently counting distinct and common subsequences. These theoretical results are the cornerstone of an effective dynamic programming approach to deal with this problem. In addition, we propose an approximate method to speed up the computation process for long sequences. We have applied our method to various data sets: healthcare trajectories, online handwritten characters and synthetic data. Our results confirm that our measure of similarity produces competitive scores and indicate that our method is relevant for large scale sequential data analysis.

Notes

  1. A subsequence is a sequence that can be derived from another sequence by deleting items without changing the order of itemsets. The notion of subsequence will be further developed in Sect. 3.

  2. Programme de Médicalisation des Systèmes d’Information.

  3. Moselle is one of the 101 departments of France.

  4. http://apps.who.int/classifications/apps/icd/icd10online/.

  5. http://www.ameli.fr/accueil-de-la-ccam/index.php.

  6. http://archive.ics.uci.edu/ml/datasets/Online+Handwritten+Assamese+Characters+Dataset.
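The notion of subsequence from note 1 can be illustrated with a minimal sketch. It assumes that a sequence is a list of Python sets and that a candidate is a subsequence when each of its itemsets is contained in a distinct itemset of the sequence, in the same order; the function name `is_subsequence` is hypothetical and not taken from the article.

```python
def is_subsequence(candidate, sequence):
    """Greedy check that `candidate` can be obtained from `sequence`
    by deleting items (and possibly whole itemsets) while keeping order."""
    pos = 0
    for itemset in candidate:
        itemset = frozenset(itemset)
        # Advance to the earliest remaining itemset that contains `itemset`.
        while pos < len(sequence) and not itemset <= frozenset(sequence[pos]):
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next candidate itemset must match strictly later
    return True
```

Matching each itemset at the earliest possible position is sufficient here, because an earlier match never rules out a later one.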

References

  • Berndt DJ, Clifford J (1994) Using dynamic time warping to find patterns in time series. In: KDD Workshop. Seattle, Association for the Advancement of Artificial Intelligence, pp 359–370

  • Chothia C, Gerstein M (1997) Protein evolution. How far can sequences diverge? Nature 385(6617):579–581

  • Elzinga C, Rahmann S, Wang H (2008) Algorithms for subsequence combinatorics. Theor Comput Sci 409(3):394–404

  • Faloutsos C, Ranganathan M, Manolopoulos Y (1994) Fast subsequence matching in time-series databases. In: Proceedings of the 1994 ACM SIGMOD international conference on management of data, SIGMOD ’94, New York, ACM, pp 419–429

  • Gao X, Xiao B, Tao D, Li X (2010) A survey of graph edit distance. Pattern Anal Appl 13(1):113–129

  • Herranz J, Nin J, Sole M (2011) Optimal symbol alignment distance: a new distance for sequences of symbols. IEEE Trans Knowl Data Eng 23:1541–1554

  • Hirschberg DS (1975) A linear space algorithm for computing maximal common subsequences. Commun ACM 18(6):341–343

  • Zaki M, Sequeira K (2002) Admit: anomaly-based data mining for intrusions. In: 8th ACM SIGKDD international conference on knowledge discovery and data mining. New York, ACM, pp 386–395

  • Keogh E (2002) Exact indexing of dynamic time warping. In: Proceedings of the 28th international conference on very large data bases. VLDB ’02, Hong Kong, Morgan Kaufmann, pp 406–417. VLDB Endowment.

  • Leslie C, Eskin E, Stafford-Noble W (2002) The spectrum kernel: a string kernel for SVM protein classification. Pac Symp Biocomput 575(50):564–575

  • Levenshtein VI (1966) Binary codes capable of correcting deletions, insertions, and reversals. Soviet Phys Doklady 10(8):707–710

  • Linial N, Nisan N (1990) Approximate inclusion–exclusion. Combinatorica 10(4):349–365

  • Manning CD, Raghavan P, Schütze H (2008) Introduction to Information Retrieval. New York, Cambridge University Press. ISBN 0521865719, 9780521865715

  • Muzaffar F, Mohsin B, Naz F, Jawed F (2005) DSP implementation of voice recognition using dynamic time warping algorithm. Karachi, IEEE Xplore, pp 1–7

  • Myers JL, Well AD (2003) Research design and statistical analysis. Lawrence Erlbaum Associates, Mahwah

  • Oncina J, Sebban M (2006) Learning stochastic edit distance: application in handwritten character recognition. Pattern Recognit 39(9):1575–1587

  • R Core Team (2012) R: a language and environment for statistical computing. Vienna, R Foundation for Statistical Computing

  • Sander C, Schneider R (1991) Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins 9(1):56–68

  • Serrà J, Kantz H, Serra X, Andrzejak RG (2012) Predictability of music descriptor time series and its application to cover song detection. IEEE Trans Audio Speech Lang Process 20(2):514–525

  • Vlachos M, Hadjieleftheriou M, Gunopulos D, Keogh EJ (2003) Indexing multi-dimensional time-series with support for multiple distance measures. In: Getoor L, Senator TE, Domingos P, Faloutsos C (eds) Proceedings of SIGKDD. Washington DC, ACM, pp 216–225

  • Wang H, Lin Z (2007) A novel algorithm for counting all common subsequences. In: Proceedings of the 2007 IEEE international conference on granular computing, GRC ’07. Washington DC, IEEE Computer Society, p 502

  • Wodak SJ, Janin J (2002) Structural basis of macromolecular recognition. Adv Protein Chem 61:9–73

  • Xiong T, Wang S, Jiang Q, Huang JZ (2011) A new Markov model for clustering categorical sequences. In: Proceedings of the 2011 IEEE 11th international conference on data mining, ICDM ’11. Washington DC, IEEE Computer Society, pp 854–863

  • Yan X, Han J, Afshar R (2003) Clospan: mining closed sequential patterns in large datasets. In: SDM, pp 166–177

  • Yang Q, Zhang HH. Web-log mining for predictive web caching. IEEE Trans Knowl Data Eng 15(4):1050–1053. ISSN 1041–4347

Author information

Corresponding author

Correspondence to Elias Egho.

Additional information

Responsible editors: Hendrik Blockeel, Kristian Kersting, Siegfried Nijssen and Filip Železný.

Appendix

1.1 Proof of Lemma 1

Let \(T=\left\langle T_1,\ldots ,T_m\right\rangle \) be a sequence that is counted multiple times; i.e., \(T\in (\varphi (S)\circ \mathcal {P}_{\ge 1}(Y))\cap \varphi (S)\). Clearly \(T_m\in \mathcal {P}_{\ge 1}(Y)\), as otherwise \(T\) would not be in \(\varphi (S)\circ \mathcal {P}_{\ge 1}(Y)\). Let \(\ell =\max \{j \mid T_m \subseteq S[j]\}\). Since \(T\in \varphi (S)\), such an \(\ell \) must exist. Then \(\ell \in L(S,Y)\), since \(\ell \) is the largest index for which \(S[\ell ]\cap Y\) includes \(T_m\). Moreover, \(T^{m-1}\in \varphi (S^{\ell -1})\), because any embedding of \(T\) in \(S\) matches \(T_m\) at a position \(j\le \ell \). Therefore, \(T\in \varphi (S^{\ell -1})\circ \mathcal {P}_{\ge 1}(S[\ell ]\cap Y)\) for some \(\ell \in L(S,Y)\). \(\Box \)

1.2 Proof of Theorem 1

The proof is a direct application of the inclusion–exclusion principle to compute the cardinality of the union given in Lemma 1:

$$\begin{aligned} R(S,Y)&= \left| \bigcup _{\ell \in L} \left\{ \varphi (S^{\ell -1})\circ \mathcal {P}_{\ge 1}(S[\ell ]\cap Y) \right\} \right| \\&= \sum _{K\subseteq L}(-1)^{|K|+1} \left| \bigcap _{\ell \in K} \left\{ \varphi (S^{\ell -1})\circ \mathcal {P}_{\ge 1}(S[\ell ]\cap Y) \right\} \right| \end{aligned}$$

The proof is completed by the following two observations:

$$\begin{aligned} set _K:=\bigcap _{\ell \in K} \left\{ \varphi (S^{\ell -1})\circ \mathcal {P}_{\ge 1}(S[\ell ]\cap Y) \right\} =\varphi (S^{\min (K)-1})\circ \mathcal {P}_{\ge 1}(\left( \cap _{k\in K}S[k]\right) \cap Y) \end{aligned}$$

Indeed, any sequence \(T\) of length \(m\) in \( set _K\) has \(T^{m-1}\in \varphi (S^{\min (K)-1})\) and \(T_m\in \mathcal {P}_{\ge 1}(S[k]\cap Y)\) for all \(k\in K\). The second observation is that

$$\begin{aligned} \left| \varphi (S^{\min (K)-1})\circ \mathcal {P}_{\ge 1}(\left( \cap _{k\in K}S[k]\right) \cap Y)\right| =\phi (S^{\min (K)-1})\cdot \left( 2^{\left| \left( \cap _{k\in K}S[k]\right) \cap Y\right| }-1\right) \end{aligned}$$

\(\Box \)
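The counting scheme behind Theorem 1 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It assumes the recurrence \(\phi (S\circ Y)=2^{|Y|}\cdot \phi (S)-R(S,Y)\) with \(\phi (\langle \rangle )=1\) (the empty sequence is counted), and, because the exact definition of \(L(S,Y)\) is not restated in this appendix, it takes \(L\) to be every position whose itemset shares at least one item with \(Y\). That choice leaves the union of Lemma 1 unchanged but may enumerate more subsets than necessary. The function names are hypothetical.

```python
from itertools import combinations


def count_distinct_subsequences(seq):
    """Number of distinct subsequences (empty one included) of a list of itemsets."""
    seq = [frozenset(x) for x in seq]
    phi = [1]  # phi[i] = count for the prefix made of the first i itemsets
    for i, y in enumerate(seq):
        # Assumption: L = all earlier positions sharing an item with y (0-based).
        positions = [l for l in range(i) if seq[l] & y]
        repeated = 0
        for size in range(1, len(positions) + 1):
            for subset in combinations(positions, size):
                common = y.intersection(*(seq[k] for k in subset))
                if common:
                    # subset[0] is min(K); phi[subset[0]] is the count for S^{min(K)-1}.
                    repeated += (-1) ** (size + 1) * phi[subset[0]] * (2 ** len(common) - 1)
        phi.append(2 ** len(y) * phi[i] - repeated)
    return phi[-1]


def brute_force_count(seq):
    """Reference count obtained by enumerating every subsequence explicitly."""
    seen = {()}
    for itemset in seq:
        items = sorted(itemset)
        subsets = [frozenset(c) for n in range(1, len(items) + 1)
                   for c in combinations(items, n)]
        seen |= {prefix + (s,) for prefix in seen for s in subsets}
    return len(seen)
```

The inner loops run over all non-empty subsets of the selected positions, so the cost is exponential in their number; the brute-force helper is only practical for very short sequences and is included for checking.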

1.3 Proof of Lemma 2

Let \(Z=\left\langle Z_1,\ldots ,Z_m\right\rangle \) be a new subsequence that is added to \(\varphi (S ,T)\) after concatenating sequence S with itemset Y; i.e., \(Z \in \varphi (S \circ Y ,T) \setminus \varphi (S,T)\). Clearly \(Z_m\in \mathcal {P}_{\ge 1}(Y)\) as otherwise \(Z\) would not have been added to \(\varphi (S\circ Y,T)\). Let \(\ell ^{'}=\max \{j|Z_m \subseteq T[j]\}\). Since \(Z \in \varphi (S \circ Y,T)\), with \(Z \preceq T\), then such \(\ell ^{'}\in L(T,Y)\) must exist. \(\ell ^{'}\) is the largest index for which \(T[\ell ^{'}]\cap Y\) includes \(Z_m\). Therefore, \(Z\in \varphi (S,T^{\ell ^{'}-1})\circ (\mathcal {P}_{\ge 1}(T[\ell ^{'}]\cap Y))\) for a \(\ell ^{'} \in L(T,Y)\).

Let \(W=\left\langle W_1,\ldots ,W_m\right\rangle \) be a sequence that is counted multiple times; i.e., \(W \in (\varphi (S,T^{\ell ^{'}-1})\circ \mathcal {P}_{\ge 1}(T[\ell ^{'}] \cap Y))\cap \varphi (S,T)\) where \(\ell ^{'}\in L(T,Y)\). Clearly \(W_m\in \mathcal {P}_{\ge 1}(T[\ell ^{'}] \cap Y)\), as otherwise \(W\) would not be in \(\varphi (S,T^{\ell ^{'}-1})\circ \mathcal {P}_{\ge 1}(T[\ell ^{'}] \cap Y)\). Let \(\ell =\max \{j \mid W_m \subseteq S[j]\}\). Since \(W\in \varphi (S,T)\), such an \(\ell \in L(S,Y)\) must exist, and \(\ell \) is the largest index for which \(S[\ell ]\cap Y\) includes \(W_m\). Therefore, \(W \in \varphi (S^{\ell -1},T^{\ell ^{'} -1})\circ (\mathcal {P}_{\ge 1}(S[\ell ] \cap T[\ell ^{'}] \cap Y))\) for some \(\ell \in L(S,Y)\) and \(\ell ^{'} \in L(T,Y)\).   \(\square \)

1.4 Proof of Theorem 2

  1. Case 1:

    No items in \(Y\) appear in any itemset of \(S\) and \(T\). In this case, the set of all distinct common subsequences of \(S \circ Y\) and \(T\) is exactly the set of all distinct common subsequences of \(S\) and \(T\). Hence, \(\phi (S \circ Y, T)=\phi (S, T)\).

  2. Case 2:

    If at least one item in \(Y\) appears in either one of the sequences \(S\) or \(T\) (or both), then \(\varphi (S \circ Y, T)\) is expressed as the union of the set of all distinct common subsequences of \(S\) and \(T\) with the set of added sequences \(\mathcal {A}\) minus the set of repeated sequences \(\mathcal {R}\). Formally,

    $$\begin{aligned} \varphi (S \circ Y, T) = \varphi (S, T) \cup \left( \mathcal {A} \backslash \mathcal {R}\right) \end{aligned}$$
    (8)

    with

    $$\begin{aligned} \mathcal {A}=\left\{ \displaystyle \bigcup _{\ell ^{'} \in L(T,Y)}\varphi (S,T^{\ell ^{'}-1}) \circ \mathcal {P}_{\ge 1}(T[\ell ^{'}] \cap Y)\right\} \end{aligned}$$
    $$\begin{aligned} \mathcal {R}=\left\{ \displaystyle \bigcup _{\ell \in L(S,Y)} \left\{ \displaystyle \bigcup _{\ell ^{'} \in L(T,Y)} \varphi (S^{\ell -1},T^{ \ell ^{'}-1}) \circ \mathcal {P}_{\ge 1}(S[\ell ] \cap T[\ell ^{'}] \cap Y) \right\} \right\} \end{aligned}$$

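    Notice that \(\mathcal {R}\subseteq \mathcal {A}\) contains exactly the added sequences that already belong to \(\varphi (S, T)\), so \(\varphi (S, T)\) and \(\mathcal {A} \backslash \mathcal {R}\) are disjoint and the cardinality of \(\varphi (S \circ Y, T)\) can be simply expressed as \( |\varphi (S \circ Y, T)| = |\varphi (S, T)| + |\mathcal {A}| - |\mathcal {R}| \). Using the inclusion–exclusion principle, \(|\mathcal {A}|\), denoted \(A(S,T,Y)\), can be written as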

    $$\begin{aligned} A(S,T,Y)&= \left| \left\{ \displaystyle \bigcup _{\ell \in L(T,Y) } \varphi (S,T^{\ell -1})\circ \mathcal {P}_{\ge 1}(T[\ell ] \cap Y) \right\} \right| \nonumber \\&= \sum _{K\subseteq L(T,Y)} (-1)^{|K|+1} \left| \bigcap _{\ell \in K} \left\{ \varphi (S,T^{\ell -1})\circ \mathcal {P}_{\ge 1}(T[\ell ] \cap Y)\right\} \right| \end{aligned}$$
    (9)

    The computation of \(A(S,T,Y)\) is completed by the following two observations:

    $$\begin{aligned} set _K&:= \bigcap _{\ell \in K} \left\{ \varphi (S,T^{\ell -1})\circ \mathcal {P}_{\ge 1}(T[\ell ] \cap Y)\right\} \\&= \varphi (S,T^{\min (K)-1})\circ \mathcal {P}_{\ge 1}(\left( \cap _{k\in K}T[k]\right) \cap Y) \end{aligned}$$

    And, the second observation:

    $$\begin{aligned} \left| set _K\right| =\phi (S,T^{\min (K)-1})\cdot \left( 2^{\left| \left( \cap _{k\in K}T[k]\right) \cap Y\right| }-1\right) \end{aligned}$$

    \(A(S,T,Y)\) can be written as,

    $$\begin{aligned} A(S,T,Y) = \displaystyle \sum _{K\subseteq L(T,Y)} (-1)^{|K|+1} \left( \phi (S,T^{\min (K)-1})\cdot \left( 2^{|\left( \bigcap _{j\in K}T[j]\right) \cap Y|} - 1 \right) \right) \end{aligned}$$

    The same inclusion–exclusion reasoning applies to the cardinality of \(\mathcal {R}\), denoted \(R(S,T,Y)\)

    $$\begin{aligned} R(S,T,Y)&=\left| \left\{ \displaystyle \bigcup _{\ell \in L(S,Y)} \left\{ \displaystyle \bigcup _{\ell ^{'} \in L(T,Y)} \varphi (S^{\ell -1},T^{ \ell ^{'}-1})\circ \mathcal {P}_{\ge 1}(S[\ell ] \cap T[\ell ^{'}] \cap Y) \right\} \right\} \right| \\&=\sum _{K\subseteq L(S,Y)} (-1)^{|K|+1}\left( \sum _{K^{'}\subseteq L(T,Y)} (-1)^{|K^{'}|+1}\left| \bigcap _{\ell \in K}\bigcap _{\ell ^{'}\in K^{'}} \left\{ \varphi (S^{\ell -1},T^{\ell ^{'}-1})\circ \mathcal {P}_{\ge 1}(S[\ell ]\cap T[\ell ^{'}]\cap Y) \right\} \right| \right) \end{aligned}$$

    The final result follows after noticing that,

    $$\begin{aligned} set _{K,K^{'}}&= \bigcap _{\ell \in K}\bigcap _{\ell ^{'}\in K^{'}} \varphi (S^{\ell -1},T^{\ell ^{'}-1})\circ \mathcal {P}_{\ge 1}(S[\ell ]\cap T[\ell ^{'}]\cap Y) \\ set _{K,K^{'}}&= \varphi (S^{\min (K)-1},T^{\min (K^{'})-1})\circ \mathcal {P}_{\ge 1}(\left( \cap _{k\in K}S[k]\right) \cap \left( \cap _{k^{'}\in K^{'}}T[k^{'}]\right) \cap Y) \end{aligned}$$

    \(R(S,T,Y)\) can be written as,

    $$\begin{aligned} R(S,T,Y) = \displaystyle \sum _{K\subseteq L(S,Y)} (-1)^{|K|+1} \left( \sum _{K^{'}\subseteq L(T,Y)} (-1)^{|K^{'}|+1} \cdot f(K,K^{'}) \right) \end{aligned}$$

    where:

    $$\begin{aligned} f(K,K^{'})=\phi (S^{\min (K)-1},T^{\min (K^{'})-1}) \cdot \left( 2^{| \left( \bigcap _{j\in K}S[j] \right) \cap \left( \bigcap _{j^{'}\in K^{'}}T[{j^{'}]} \right) \cap Y|} - 1 \right) \end{aligned}$$

\(\square \)
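The recurrence of Theorem 2 lends itself to a table-filling sketch. The Python below is an illustration under stated assumptions rather than the authors' implementation: `phi[i][j]` stores \(\phi (S^{i},T^{j})\) with the empty sequence counted, \(L(\cdot ,Y)\) is taken to be every position whose itemset shares an item with \(Y\) (which leaves the unions \(\mathcal {A}\) and \(\mathcal {R}\) unchanged, but may enumerate more subsets than the article's definition), and all function names are hypothetical.

```python
from itertools import combinations


def nonempty_subsets(indices):
    """All non-empty subsets of a sorted index list, as sorted tuples."""
    for size in range(1, len(indices) + 1):
        yield from combinations(indices, size)


def count_common_subsequences(s, t):
    """Number of distinct common subsequences (empty one included)."""
    s = [frozenset(x) for x in s]
    t = [frozenset(x) for x in t]
    n, m = len(s), len(t)
    phi = [[1] * (m + 1) for _ in range(n + 1)]  # phi[i][j] = phi(S^i, T^j)
    for i in range(1, n + 1):
        y = s[i - 1]  # itemset appended to the prefix S^{i-1}
        for j in range(1, m + 1):
            ls = [l for l in range(i - 1) if s[l] & y]  # assumed L(S^{i-1}, Y)
            lt = [l for l in range(j) if t[l] & y]      # assumed L(T^j, Y)
            added = 0      # A(S, T, Y)
            for kt in nonempty_subsets(lt):
                common = y.intersection(*(t[k] for k in kt))
                if common:
                    added += (-1) ** (len(kt) + 1) * phi[i - 1][kt[0]] * (2 ** len(common) - 1)
            repeated = 0   # R(S, T, Y)
            for ks in nonempty_subsets(ls):
                in_s = y.intersection(*(s[k] for k in ks))
                if not in_s:
                    continue
                for kt in nonempty_subsets(lt):
                    common = in_s.intersection(*(t[k] for k in kt))
                    if common:
                        sign = (-1) ** (len(ks) + len(kt))
                        repeated += sign * phi[ks[0]][kt[0]] * (2 ** len(common) - 1)
            phi[i][j] = phi[i - 1][j] + added - repeated
    return phi[n][m]


def all_subsequences(seq):
    """Explicit set of subsequences, for checking small inputs only."""
    seen = {()}
    for itemset in seq:
        items = sorted(itemset)
        subsets = [frozenset(c) for k in range(1, len(items) + 1)
                   for c in combinations(items, k)]
        seen |= {prefix + (x,) for prefix in seen for x in subsets}
    return seen
```

For small inputs the result can be compared with `len(all_subsequences(s) & all_subsequences(t))`. As with the single-sequence case, the enumeration of subsets is exponential in the number of selected positions, which is the cost the approximate method is meant to reduce.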

1.5 Details of Linial–Nisan approximation for Example 7

To compute the Linial–Nisan approximation, note that \(N=|L|=9\) and \(K=\lceil \sqrt{N} \rceil =3\). The vector of Linial–Nisan coefficients is defined as \(\overrightarrow{\alpha }=(\alpha _1^{3,9} , \alpha _2^{3,9},\alpha _3^{3,9}) = \overrightarrow{t} \cdot \mathcal {M}^{-1}\), where \(\mathcal {M}\) is the matrix whose \((i,j)\) entry is \(\binom{j}{i}\). The entries of the inverse matrix are \(\mathcal {M}^{-1}(i,j)=(-1)^{i+j}\binom{j}{i}\). In our example,

$$\begin{aligned} \mathcal {M}^{-1}&= \begin{pmatrix} 1 &{} -2 &{} 3 \\ 0 &{} 1 &{} -3 \\ 0 &{} 0 &{} 1 \\ \end{pmatrix}\\ \end{aligned}$$

\(\overrightarrow{t}=(q_{K,N}(1),q_{K,N}(2),\dots ,q_{K,N}(K))\) is the vector of linearly transformed Chebyshev polynomials and is computed using the polynomial \(T_K(x)\) as follows:

$$\begin{aligned} q_{3,9}(1)&= 1- \frac{T_{3}(\frac{2-(9+1)}{9-1})}{T_{3}(\frac{-(9+1)}{9-1})} =1-\frac{T_{3}(-1)}{T_{3}(-\frac{10}{8})}\approx 1- \frac{-1}{-4.06}\approx 0.75\\ q_{3,9}(2)&= 1- \frac{T_{3}(\frac{4-(9+1)}{9-1})}{T_{3}(\frac{-(9+1)}{9-1})}= 1-\frac{T_{3}(-\frac{6}{8})}{T_{3}(-\frac{10}{8})}\approx 1- \frac{0.56}{-4.06}\approx 1.13\\ q_{3,9}(3)&= 1- \frac{T_{3}(\frac{6-(9+1)}{9-1})}{T_{3}(\frac{-(9+1)}{9-1})}= 1-\frac{T_{3}(-\frac{4}{8})}{T_{3}(-\frac{10}{8})}\approx 1- \frac{1}{-4.06}\approx 1.24\\ \end{aligned}$$

This vector is necessary to solve the system of linear equations:

$$\begin{aligned} \overrightarrow{\alpha }&= (\alpha _1^{3,9} , \alpha _2^{3,9},\alpha _3^{3,9})=\overrightarrow{t} \cdot \mathcal {M}^{-1}\\&= \begin{pmatrix} 0.75 &{} 1.13 &{} 1.24 \\ \end{pmatrix}.\begin{pmatrix} 1 &{} -2 &{} 3 \\ 0 &{} 1 &{} -3 \\ 0 &{} 0 &{} 1 \\ \end{pmatrix}\\&= \begin{pmatrix} 0.75 &{} -0.36 &{} 0.1 \\ \end{pmatrix} \end{aligned}$$

A solution to the system above is given by \(\alpha _1^{3,9}=0.75;~\alpha _2^{3,9}=-0.36;~\alpha _3^{3,9}=0.1\). The real numbers \(\alpha _k^{3,9}\) can now be used to approximate the inclusion–exclusion formula in the correction term as follows:

$$\begin{aligned} R_{LN}(S^{9},S[10])&= \displaystyle \sum _{k=1}^{3} \alpha _k^{3,9} \displaystyle \sum _{\begin{array}{c} O \subseteq L(S,Y) \\ |O|=k \end{array}} \phi _{LN}(S^{\min (O)-1})\cdot \left( 2^{|\left( \bigcap _{j\in O}S[j]\right) \cap S[10]|} - 1 \right) \\&= 9.2981279 \times 10^{26} \end{aligned}$$

Notice here that the formula contains only \(\sum _{i=1}^{3}\binom{9}{i}\) terms, which is already a significant computational gain in comparison with the \(\sum _{i=1}^{9}\binom{9}{i}\) terms in the classical approach. Finally, the approximated number of distinct subsequences for sequence \(S\) is \(\phi _{LN}(S^{10})=2^{|\{a,b,c,d,e,f,g,h,i,j,k\}|} \cdot \phi _{LN}(S^{9})-R_{LN}(S^{9},S[10])=2.5244956 \times 10^{30}\).
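The coefficient computation above can be reproduced with a short Python sketch. It follows the formulas as they appear in this example, namely \(q_{K,N}(x)=1-T_K\left( \frac{2x-(N+1)}{N-1}\right) /T_K\left( \frac{-(N+1)}{N-1}\right) \) and \(\mathcal {M}^{-1}(i,j)=(-1)^{i+j}\binom{j}{i}\); the function names are hypothetical, and the code assumes \(N>1\).

```python
from math import ceil, comb, sqrt


def chebyshev(n, x):
    """Chebyshev polynomial of the first kind T_n(x), via the three-term recurrence."""
    prev, cur = 1.0, x
    if n == 0:
        return prev
    for _ in range(n - 1):
        prev, cur = cur, 2 * x * cur - prev
    return cur


def q(k_max, n, x):
    """Linearly transformed Chebyshev polynomial q_{K,N}(x) as used in the example."""
    denominator = chebyshev(k_max, -(n + 1) / (n - 1))
    return 1 - chebyshev(k_max, (2 * x - (n + 1)) / (n - 1)) / denominator


def linial_nisan_coefficients(n):
    """Coefficients alpha_1..alpha_K with K = ceil(sqrt(N)), computed as t * M^{-1}."""
    k_max = ceil(sqrt(n))
    t = [q(k_max, n, k) for k in range(1, k_max + 1)]
    # alpha_j = sum_i t_i * (-1)^(i+j) * C(j, i); comb(j, i) is 0 when i > j.
    return [
        sum(t[i - 1] * (-1) ** (i + j) * comb(j, i) for i in range(1, k_max + 1))
        for j in range(1, k_max + 1)
    ]
```

For \(N=9\), `linial_nisan_coefficients(9)` returns approximately \((0.754, -0.369, 0.092)\), which agrees with the values \((0.75, -0.36, 0.1)\) above up to the rounding used in the text.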

Cite this article

Egho, E., Raïssi, C., Calders, T. et al. On measuring similarity for sequences of itemsets. Data Min Knowl Disc 29, 732–764 (2015). https://doi.org/10.1007/s10618-014-0362-1
