
Sequence graph transform (SGT): a feature embedding function for sequence data mining

Published in: Data Mining and Knowledge Discovery

Abstract

Sequence feature embedding is a challenging task due to the unstructuredness of sequences: arbitrary strings of arbitrary length. Existing methods are efficient at extracting short-term dependencies but typically suffer from high computation costs for long-term dependencies. We propose Sequence Graph Transform (SGT), a feature embedding function that can extract a varying amount of short- to long-term dependencies without increasing the computation. SGT's properties are analytically proved for interpretation under normal and uniform distribution assumptions. SGT features yield significantly superior results in sequence clustering and classification, with higher accuracy and lower computation than existing methods, including the state-of-the-art sequence/string kernels and LSTM.


Notes

  1. Unless otherwise mentioned, the analyses are done on a 2.2 GHz quad-core Intel Core i7 machine with 16 GB of 1600 MHz DDR3 memory.

  2. https://www.uniprot.org.

  3. https://www.ll.mit.edu/ideval/data/1998data.html.

  4. http://archive.ics.uci.edu/ml/datasets/msnbc.com+anonymous+web+data.

  5. The configuration details of r4.xlarge are available at https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/memory-optimized-instances.html.

  6. The protein sequence of A0A2T0PYE0 is, MSAAADRPTVEISTDFYSLDALMALVDEPPRLALAPEVAERIDAGARYVERIAPQDRHIY GINTGFGPLCETRIPADRMSELQHKHLVSHACGVGEPVPERVSRLAMLVKLLTFRAGYSG ISLEAVQRVLDLWNADVIPVVPKKGTVGASGDLAPLAHLALPLIGLGKVRVDGRITDAGA VLEAMGWKPLRLKPKEGLALTNGVQYINALALDSVLRSERLIKAADLIAGLSIQGFSCAD TFYQPILHATSLHPERSAVAGNLVRLLDGSNHHTLPQGNAAREDPYSFRCAPQVHAAVRQ TCGFARDIVGRECNSVSDNPLFFPEHDQVILGGNLHGESTAFALDFLAIAMSELANISER RTYQLLSGQHGLPDFLAPEPGVDSGLMIPQYTSAALVNENKVLATPASIDTIPTSQLQED HVSMGGTSAYKLWTILDNCEYVLAVELMTAVQAIDLNQGLRPSPATRGVVAEFRQEVGFL REDRLQADDIEKSRRYLRGRLRTWAKDLD.

  7. Essentially, SGT-based search could be used for problems where distribution-based methods, such as Markov or hidden Markov models, are used, as opposed to alignment-based methods.


Author information


Corresponding author

Correspondence to Chitta Ranjan.

Additional information

Responsible editor: Bart Goethals.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Mean and variance of \(\psi _{uv}\)

Consider an arbitrary sequence s, i.e., an ordered list of symbols drawn from a finite set \({\mathcal {V}}\). SGT embedding works by finding the dependencies between every pair of symbols \((u,v); u, v \in {\mathcal {V}}\).

To denote the various (u, v) pairs in s, we use the term mth neighboring pair: an mth neighboring pair for (u, v) has \(m-1\) other u's in between. A first neighbor is thus an immediate (u, v) neighboring pair, the second neighbor has one other u in between, and so on (see Fig. 4 for an illustration). The immediate neighbor mentioned in the assumption in Sect. 2.4.1 is the same as the first neighbor defined here.

Based on the sequence pattern assumption in Sect. 2.4.1, and assuming u and v occur uniformly in the sequence with probability p, the expected number of first-neighbor (u, v) pairs is \(M=pL\). Consequently, the expected number of mth neighboring (u, v) pairs is \((M-m+1)\): \((M-1)\) second neighbors, \((M-2)\) third neighbors, and so on, down to one instance of the Mth neighbor. The gap distance for an mth neighbor is \(Z_{1}=X;Z_{m}=X+\sum _{i=2}^{m}Y_{i},m=2,\ldots ,M\).
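These counts can be sanity-checked on an idealized sequence containing M immediate (u, v) occurrences; a minimal sketch (the alternating "uvuv…" layout is an illustrative assumption matching the first-neighbor pattern):

```python
from collections import Counter

def uv_neighbor_counts(M):
    # Idealized sequence with M first-neighbor (u, v) occurrences.
    seq = "uv" * M
    u_pos = [i for i, c in enumerate(seq) if c == "u"]
    v_pos = [i for i, c in enumerate(seq) if c == "v"]
    # For each u-before-v pair, the neighbor order m is one plus the
    # number of other u's strictly between the two positions.
    orders = Counter()
    for l in u_pos:
        for m in v_pos:
            if l < m:
                orders[sum(1 for p in u_pos if l < p < m) + 1] += 1
    return orders
```

For M = 5 this gives \(M-m+1\) pairs of each order m (5, 4, 3, 2, 1) and \(M(M+1)/2 = 15\) pairs in total, as derived above.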

Further, the total number of (u, v) pair instances is \(\sum _{m=1}^{M}m=\frac{M(M+1)}{2}\) (\(=|\varLambda _{uv}|\), by definition). Define the set of gap distances for all possible (u, v) pairs as \({\mathcal {Z}}=\{Z_{m}^{i},i=1,\ldots ,(M-m+1);m=1,\ldots ,M\}\). Since \(Z_{m}\sim N(\mu _{\alpha }+(m-1)\mu _{\beta },\sigma _{\alpha }^{2}+(m-1)\sigma _{\beta }^{2})\), \(\phi _{\kappa }(Z_{m})\) follows a lognormal distribution. Thus,

$$\begin{aligned} E[\phi _{\kappa }(Z_{m})]= & {} e^{-{\tilde{\mu }}_{\alpha }-(m-1){\tilde{\mu }}_{\beta }} \end{aligned}$$
(13)
$$\begin{aligned} \text {var}[\phi _{\kappa }(Z_{m})]= & {} e^{-2{\tilde{\mu }}_{\alpha }'-2(m-1){\tilde{\mu }}_{\beta }'}-e^{-2{\tilde{\mu }}_{\alpha }-2(m-1){\tilde{\mu }}_{\beta }} \end{aligned}$$
(14)

where,

$$\begin{aligned} {\tilde{\mu }}_{\alpha }=\kappa \mu _{\alpha }-\frac{\kappa ^{2}}{2}\sigma _{\alpha }^{2}&;&{\tilde{\mu }}'_{\alpha }=\kappa \mu _{\alpha }-\kappa ^{2}\sigma _{\alpha }^{2}\nonumber \\ {\tilde{\mu }}_{\beta }=\kappa \mu _{\beta }-\frac{\kappa ^{2}}{2}\sigma _{\beta }^{2}&;&{\tilde{\mu }}'_{\beta }=\kappa \mu _{\beta }-\kappa ^{2}\sigma _{\beta }^{2} \end{aligned}$$
(15)

The feature \(\psi _{uv}\) in Eq. (3a) can then be expressed, and further derived using Sect. A.1, as

$$\begin{aligned} E[\psi _{uv}]= & {} \frac{\sum _{Z\in {\mathcal {Z}}}E[\phi _{\kappa }(Z)]}{M(M+1)/2}\nonumber \\= & {} \frac{\sum _{m=1}^{M}(M-(m-1))e^{-{\tilde{\mu }}_{\alpha }-(m-1){\tilde{\mu }}_{\beta }}}{M(M+1)/2}\nonumber \\= & {} \frac{2}{pL+1}\frac{e^{-{\tilde{\mu }}_{\alpha }}}{\underbrace{\left| \left( 1-e^{-{\tilde{\mu }}_{\beta }}\right) \left[ 1-\frac{1-e^{-pL{\tilde{\mu }}_{\beta }}}{pL(e^{{\tilde{\mu }}_{\beta }}-1)}\right] \right| }_{\gamma }} \end{aligned}$$
(16)

This yields the expectation expression in Eq. (4). The variance is

$$\begin{aligned} \text {var}(\psi _{uv})= & {} \left( \frac{1}{pL(pL+1)/2}\right) ^{2}\left[ \left\{ \frac{e^{-2{\tilde{\mu }}'_{\alpha }}}{1-e^{-2{\tilde{\mu }}'_{\beta }}}\left( pL\right. \right. \right. \\&\left. \left. -e^{-2{\tilde{\mu }}'_{\beta }}\left( \frac{1-e^{-2pL{\tilde{\mu }}'_{\beta }}}{1-e^{-2{\tilde{\mu }}'_{\beta }}}\right) \right) \right\} \\&-\underbrace{\left. \left\{ \frac{e^{-2{\tilde{\mu }}{}_{\alpha }}}{1-e^{-2{\tilde{\mu }}{}_{\beta }}}\left( pL-e^{-2{\tilde{\mu }}_{\beta }}\left( \frac{1-e^{-2pL{\tilde{\mu }}{}_{\beta }}}{1-e^{-2{\tilde{\mu }}{}_{\beta }}}\right) \right) \right\} \right] }_{\pi }\\ \text {var}(\psi _{uv})= & {} {\left\{ \begin{array}{ll} \left( \frac{1}{pL(pL+1)/2}\right) ^{2}\pi ; &{} \text {length sensitive}\\ \left( \frac{1}{p(pL+1)/2}\right) ^{2}\pi ; &{} \text {length insensitive} \end{array}\right. } \end{aligned}$$

1.1 Arithmetico-geometric series

The sum of a series whose kth term, for \(k\ge 1\), can be expressed as

$$\begin{aligned} t_{k}= & {} \left( a+(k-1)d\right) br^{k-1} \end{aligned}$$

is called an arithmetico-geometric series because each term combines an arithmetic component \((a+(k-1)d)\), with initial term a and common difference d, and a geometric component \(br^{k-1}\), with initial value b and common ratio r.

Let the sum of the first n terms be denoted as

$$\begin{aligned} S_{n}= & {} \sum _{k=1}^{n}\left( a+(k-1)d\right) br^{k-1} \end{aligned}$$
(17)

Without loss of generality, we can assume \(b=1\) when deriving the expression for \(S_{n}\) (the sum for any other value of b is obtained by multiplying the expression for \(S_{n}\) by b). Expanding Eq. (17),

$$\begin{aligned} S_{n}= & {} a+(a+d)r+\ldots +(a+(n-1)d)r^{n-1} \end{aligned}$$
(18)

Now multiplying \(S_{n}\) with r,

$$\begin{aligned} rS_{n}= & {} ar+(a+d)r^{2}+\ldots +(a+(n-1)d)r^{n} \end{aligned}$$
(19)

Subtracting Eq. (19) from Eq. (18) when \(|r|<1\) (and the former from the latter otherwise), we get

$$\begin{aligned} (1-r)S_{n}= & {} a+d\left( r+r^{2}+\ldots +r^{n-1}\right) -\left( a+(n-1)d\right) r^{n}\nonumber \\= & {} a+\frac{dr(1-r^{n-1})}{1-r}-\left( a+(n-1)d\right) r^{n} \end{aligned}$$

Therefore,

$$\begin{aligned} S_{n}= & {} \left| \frac{1}{1-r}\left[ a+\frac{dr(1-r^{n-1})}{1-r}-\left( a+(n-1)d\right) r^{n}\right] \right| \end{aligned}$$
(20)

or, for any value of b,

$$\begin{aligned} S_{n}= & {} b\left| \frac{1}{1-r}\left[ a+\frac{dr(1-r^{n-1})}{1-r}-\left( a+(n-1)d\right) r^{n}\right] \right| \end{aligned}$$
(21)
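The closed form in Eq. (21) can be checked numerically against a direct evaluation of Eq. (17); a minimal sketch (function names are ours):

```python
def agp_sum_closed(a, d, b, r, n):
    # Eq. (21): S_n = b * | 1/(1-r) [ a + d r (1 - r^(n-1))/(1-r)
    #                                 - (a + (n-1) d) r^n ] |
    return b * abs((a + d * r * (1 - r**(n - 1)) / (1 - r)
                    - (a + (n - 1) * d) * r**n) / (1 - r))

def agp_sum_brute(a, d, b, r, n):
    # Eq. (17): direct term-by-term summation.
    return sum((a + (k - 1) * d) * b * r**(k - 1) for k in range(1, n + 1))
```

The two agree for any \(r \ne 1\), including both \(|r|<1\) and \(|r|>1\); for \(n=1\) the closed form reduces to ab, the first term.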

Independence of \({\mathbf {W}}^{(\kappa )}\) with respect to \(\kappa \)

Proof

We have \({\mathbf {W}}^{(\kappa )} = [W^{(\kappa )}_{uv}],\,u,v\in {\mathcal {V}}\) where

$$\begin{aligned} W^{(\kappa )}_{uv} = \sum _{\forall (l,m) \in \varLambda _{uv}(s)}e^{-\kappa |m-l|} \end{aligned}$$
(22)
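Eq. (22) can be implemented directly; a minimal sketch, assuming \(\varLambda _{uv}(s)\) is the set of position pairs (l, m) with s[l] = u, s[m] = v, and l < m (the forward-direction convention; the paper's Sect. 2 fixes the exact definition):

```python
import math
from itertools import product

def sgt_W(seq, kappa):
    # W^(kappa)_{uv} = sum over (l, m) in Lambda_uv(s) of exp(-kappa * |m - l|),
    # with Lambda_uv(s) taken as all pairs where u at position l precedes v at m.
    positions = {}
    for i, sym in enumerate(seq):
        positions.setdefault(sym, []).append(i)
    alphabet = sorted(positions)
    return {(u, v): sum(math.exp(-kappa * (m - l))
                        for l in positions[u] for m in positions[v] if l < m)
            for u, v in product(alphabet, repeat=2)}
```

For example, with seq = "BAAB" and kappa = 1, the entry for ("A", "B") collects the pairs (1, 3) and (2, 3), giving \(e^{-2}+e^{-1}\).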

To prove the independence of \({\mathbf {W}}^{(\kappa )}\) with respect to \(\kappa \), we need to show \({\mathbf {w}}_{u\cdot }^{(\kappa )} \perp \!\!\! \perp {\mathbf {w}}_{u\cdot }^{(\kappa + \delta )}\) where \({\mathbf {w}}_{u\cdot }\) is a column in \({\mathbf {W}}\).

Without loss of generality, assume \(|\varLambda _{uv}(s)| = 1\) and replace \(|m-l|\) with \(x_i,\, x_i>0\), in Eq. (22). The column \({\mathbf {w}}_{u\cdot }\) for \(\kappa \) and \(\kappa + \delta \) will then be \({\mathbf {w}}_{u\cdot }^{(\kappa )} = [e^{-\kappa x_i}]\) and \({\mathbf {w}}_{u\cdot }^{(\kappa + \delta )} = [e^{-(\kappa + \delta ) x_i}]\), where \(i = 1, \ldots , |{\mathcal {V}}|\).

\({\mathbf {w}}_{u\cdot }^{(\kappa )}\) and \({\mathbf {w}}_{u\cdot }^{(\kappa +\delta )}\) will be dependent iff there is a nontrivial solution for a, b in Eq. (23).

$$\begin{aligned} a{\mathbf {w}}_{u\cdot }^{(\kappa )} - b{\mathbf {w}}_{u\cdot }^{(\kappa + \delta )} = 0 \end{aligned}$$
(23)

Solve this for a and b by taking any element i of \({\mathbf {w}}_{u\cdot }\):

$$\begin{aligned}&a w_{ui}^{(\kappa )} - b w_{ui}^{(\kappa +\delta )}&=0\\ \implies&a e^{-\kappa x_i} - b e^{-(\kappa +\delta ) x_i}&=0\\ \implies&e^{-\kappa x_i}(a - b e^{-\delta x_i})&=0\\ \end{aligned}$$

Since \(e^{-\kappa x_i}\ne 0\), we have \(a = b e^{-\delta x_i}\). Plugging this in the equation for another element j in \({\mathbf {w}}_{u\cdot }\) we get,

$$\begin{aligned}&a w_{uj}^{(\kappa )} - b w_{uj}^{(\kappa +\delta )}&=0\\ \implies&b e^{-\delta x_i} e^{-\kappa x_j} - b e^{-(\kappa +\delta ) x_j}&=0\\ \implies&b e^{-\kappa x_j}(e^{-\delta x_i} - e^{-\delta x_j})&=0\\ \end{aligned}$$

Since \(e^{-\delta x_i} \ne e^{-\delta x_j}\) for all \(i \ne j\) (the \(x_i\) are distinct), and \(e^{-\kappa x_j} \ne 0\), we have \(b=0\). Consequently, \(a=0\).

Therefore, Eq. (23) has only the trivial solution \(a=b=0\). Thus, \({\mathbf {w}}_{u\cdot }^{(\kappa )}\) is independent with respect to \(\kappa \), implying the independence of \({\mathbf {W}}^{(\kappa )}\). \(\square \)
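The conclusion can also be illustrated numerically: for distinct \(x_i\), every 2×2 minor of the two columns is nonzero, so no nontrivial (a, b) solves Eq. (23). A small sketch (the values of \(x_i\), \(\kappa\), and \(\delta\) are arbitrary illustrative choices):

```python
import math

def dependent(v1, v2, tol=1e-12):
    # Two vectors are linearly dependent iff all 2x2 minors vanish;
    # checking consecutive minors suffices here because every entry
    # of the exponential columns is strictly positive.
    return all(abs(v1[i] * v2[i + 1] - v1[i + 1] * v2[i]) < tol
               for i in range(len(v1) - 1))

xs = [1.0, 2.0, 3.5]                                 # distinct gap distances x_i > 0
kappa, delta = 0.8, 0.4
w_k  = [math.exp(-kappa * x) for x in xs]            # w^(kappa)
w_kd = [math.exp(-(kappa + delta) * x) for x in xs]  # w^(kappa + delta)
```

Here `dependent(w_k, w_kd)` is False, whereas a scalar multiple of `w_k` is (trivially) dependent on it.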

Proof for symbol clustering

We have \(\frac{\partial \varDelta }{\partial \kappa }=\frac{\partial }{\partial \kappa }E[\phi _{\kappa }(X)-\phi _{\kappa }(Y)]=E[\frac{\partial }{\partial \kappa }\phi _{\kappa }(X)-\frac{\partial }{\partial \kappa }\phi _{\kappa }(Y)]\). For \(E[X]<E[Y]\), we want \(\frac{\partial \varDelta }{\partial \kappa }>0\), i.e., \(\frac{\partial }{\partial \kappa }\phi _{\kappa }(X)>\frac{\partial }{\partial \kappa }\phi _{\kappa }(Y)\). This holds if \(\frac{\partial ^{2}}{\partial d\partial \kappa }\phi _{\kappa }(d)>0\), that is, if the slope \(\frac{\partial }{\partial \kappa }\phi _{\kappa }(d)\) is increasing in d. For an exponential expression for \(\phi \) (Eq. 1), this condition holds if \(\kappa d>1\). Hence, under these conditions, the separation increases as the tuning parameter \(\kappa \) increases.

Sequence simulation

In this section, we explain the sequence generation for Exp 1–2 in Sect. 4.

A sequence is generated by randomly selecting strings from the motif set and placing them in random order between arbitrary strings; these interspersed arbitrary symbols are the noise in the sequence. Figure 17 shows an example of a sequence generated from a set of seed motifs; about 50% of the sequence is noise (arbitrary strings). Note that, due to the random motif selection, a simulated sequence does not necessarily contain every seed motif.

1.1 Generating sequence clusters

Suppose we have to generate K sequence clusters. We first randomly simulate K motif sets. Within each set, the motifs have random lengths (between 2 and 8 in our simulations), and the size of each set is also chosen randomly (between 6 and 11 in this paper). Sequences are then generated from each motif set as described above.
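The generation procedure above can be sketched as follows (the motif-length range 2-8 and set-size range 6-11 follow the text; the alphabet choice and noise-gap placement are illustrative assumptions):

```python
import random
import string

def make_motif_set(rng):
    # A set of 6-11 seed motifs, each 2-8 symbols long.
    return [''.join(rng.choices(string.ascii_uppercase, k=rng.randint(2, 8)))
            for _ in range(rng.randint(6, 11))]

def make_sequence(motifs, rng, n_motifs=10, noise_frac=0.5):
    # Place randomly selected motifs in random order, separated by
    # arbitrary (noise) symbols so ~noise_frac of the sequence is noise.
    chosen = rng.choices(motifs, k=n_motifs)
    gap = round(sum(len(m) for m in chosen) * noise_frac
                / ((1 - noise_frac) * n_motifs))
    parts = []
    for m in chosen:
        parts.append(''.join(rng.choices(string.ascii_uppercase, k=gap)))
        parts.append(m)
    return ''.join(parts)

def make_clusters(K, n_per_cluster, seed=0):
    # K motif sets -> K clusters of sequences generated from each set.
    rng = random.Random(seed)
    motif_sets = [make_motif_set(rng) for _ in range(K)]
    return motif_sets, [[make_sequence(ms, rng) for _ in range(n_per_cluster)]
                        for ms in motif_sets]
```

Each simulated sequence then contains some, but not necessarily all, of its cluster's seed motifs as contiguous substrings.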

Fig. 17 A simulated sequence from seed motifs

Fig. 18 Motif sets of overlapping clusters

1.2 What are overlapping clusters?

In our experiments, we must test the efficacy of clustering methods when the clusters are difficult to separate, which is the case when the clusters overlap. For traditional multidimensional data in Euclidean space, overlap means the cluster centroids are close. In our problem, overlapping clusters share some common seed motifs; in other words, the intersection of the clusters' motif sets is nonempty. Thus, a 0% overlap means the intersection of the motif sets is empty, and a 100% overlap means the intersection equals the union. Figure 18 shows an example of overlapping motif sets for two clusters.
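Under this definition, the overlap of two clusters reduces to a Jaccard-style ratio on their motif sets; a minimal sketch (the function name is ours):

```python
def overlap_percent(motifs_a, motifs_b):
    # 0% -> disjoint motif sets; 100% -> intersection equals union.
    a, b = set(motifs_a), set(motifs_b)
    return 100.0 * len(a & b) / len(a | b)
```

For example, two sets sharing one of three distinct motifs overlap by 100/3 ≈ 33.3%.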

Code repository and data sets

The source code and data sets are available on GitHub at https://github.com/cran2367/sgt. A Python package is also available at https://pypi.org/project/sgt/.

The Python package can be installed from PyPI using pip.


About this article

Cite this article

Ranjan, C., Ebrahimi, S. & Paynabar, K. Sequence graph transform (SGT): a feature embedding function for sequence data mining. Data Min Knowl Disc 36, 668–708 (2022). https://doi.org/10.1007/s10618-021-00813-0

