Abstract
Sequence feature embedding is a challenging task because sequences are unstructured—arbitrary strings of arbitrary length. Existing methods extract short-term dependencies efficiently but typically suffer from computational issues when extracting long-term dependencies. We propose Sequence Graph Transform (SGT), a feature embedding function that can extract a varying amount of short- to long-term dependencies without increasing the computation. SGT's properties are analytically proved for interpretation under normal and uniform distribution assumptions. SGT features yield significantly superior results in sequence clustering and classification, with higher accuracy and lower computation than existing methods, including state-of-the-art sequence/string kernels and LSTM.
Notes
Unless otherwise mentioned, the analyses were done on a 2.2 GHz Quad-Core Intel Core i7 machine with 16 GB 1600 MHz DDR3 memory.
The configuration details of r4.xlarge are available at https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/memory-optimized-instances.html.
The protein sequence of A0A2T0PYE0 is, MSAAADRPTVEISTDFYSLDALMALVDEPPRLALAPEVAERIDAGARYVERIAPQDRHIY GINTGFGPLCETRIPADRMSELQHKHLVSHACGVGEPVPERVSRLAMLVKLLTFRAGYSG ISLEAVQRVLDLWNADVIPVVPKKGTVGASGDLAPLAHLALPLIGLGKVRVDGRITDAGA VLEAMGWKPLRLKPKEGLALTNGVQYINALALDSVLRSERLIKAADLIAGLSIQGFSCAD TFYQPILHATSLHPERSAVAGNLVRLLDGSNHHTLPQGNAAREDPYSFRCAPQVHAAVRQ TCGFARDIVGRECNSVSDNPLFFPEHDQVILGGNLHGESTAFALDFLAIAMSELANISER RTYQLLSGQHGLPDFLAPEPGVDSGLMIPQYTSAALVNENKVLATPASIDTIPTSQLQED HVSMGGTSAYKLWTILDNCEYVLAVELMTAVQAIDLNQGLRPSPATRGVVAEFRQEVGFL REDRLQADDIEKSRRYLRGRLRTWAKDLD.
Essentially, SGT-based search could be used for problems where distribution-based methods, such as Markov or hidden Markov models, are used, as opposed to alignment-based methods.
Responsible editor: Bart Goethals.
Appendices
Mean and variance of \(\psi _{uv}\)
Consider an arbitrary sequence s, where the sequence has an ordered list of symbols. These symbols belong to a finite set \({\mathcal {V}}\). SGT embedding works by finding the dependencies between every pair of symbols \((u,v); u, v \in {\mathcal {V}}\).
To conveniently denote the various (u, v) pairs in s, we use the term mth neighboring pair, where an mth neighboring pair (u, v) has \(m-1\) other u's in between. A first neighbor is thus an immediate (u, v) neighboring pair, a second neighbor has one other u in between, and so on (see Fig. 4 for illustration). The immediate neighbor mentioned in the assumption in Sect. 2.4.1 is the same as the first neighbor defined here.
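The definition above can be made concrete with a short sketch. The helper `neighbor_pairs` below is purely illustrative (it is not part of the SGT implementation): it returns the position pairs of all mth-neighbor (u, v) pairs, i.e., those with exactly \(m-1\) other u's in between.

```python
def neighbor_pairs(seq, u, v, m):
    """Positions (i, j) of mth-neighbor (u, v) pairs in seq:
    seq[i] == u, seq[j] == v, j > i, with exactly m-1 other u's between them."""
    u_pos = [i for i, s in enumerate(seq) if s == u]
    v_pos = [j for j, s in enumerate(seq) if s == v]
    pairs = []
    for i in u_pos:
        for j in v_pos:
            if j > i:
                # count u's strictly between positions i and j
                between = sum(1 for k in u_pos if i < k < j)
                if between == m - 1:
                    pairs.append((i, j))
    return pairs
```

For example, in the sequence "ABAB", the first-neighbor pairs for (A, B) are at positions (0, 1) and (2, 3), while (0, 3) is a second-neighbor pair because one other A lies in between.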
Based on the sequence pattern assumption in Sect. 2.4.1, and assuming u, v occur uniformly in the sequence with probability p, the expected number of first-neighbor (u, v) pairs is \(M=pL\). Consequently, it is easy to show that the expected number of mth neighboring (u, v) pairs is \((M-m+1)\): there are \((M-1)\) second-neighbor pairs, \((M-2)\) third-neighbor pairs, and so on, down to one instance of the Mth neighbor. The gap distance for an mth neighbor is given as \(Z_{1}=X;\,Z_{m}=X+\sum _{i=2}^{m}Y_{i},\,m=2,\ldots ,M\).
Moreover, the total number of (u, v) pair instances is \(\sum _{m=1}^{M}m=\frac{M(M+1)}{2}\) (\(=|\varLambda _{uv}|\), by definition). Define the set containing the distances of all possible (u, v) pairs as \({\mathcal {Z}}=\{Z_{m}^{i},i=1,\ldots ,(M-m+1);m=1,\ldots ,M\}\). Since \(Z_{m}\sim N(\mu _{\alpha }+(m-1)\mu _{\beta },\sigma _{\alpha }^{2}+(m-1)\sigma _{\beta }^{2})\), \(\phi _{\kappa }(Z_{m})\) follows a lognormal distribution. Thus,
where,
Further, the feature \(\psi _{uv}\) in Eq. (3a) can be expressed, and derived using Sect. A.1, as,
This yields the expectation expression in Eq. (4). Similarly, the variances are
1.1 Arithmetico-geometric series
A series whose kth term, for \(k\ge 1\), can be expressed as,
is called an arithmetico-geometric series because each term combines an arithmetic part \((a+(k-1)d)\), with initial term a and common difference d, and a geometric part \(br^{k-1}\), with initial value b and common ratio r.
Suppose the sum of the series till n terms is denoted as,
Without loss of generality, we can assume \(b=1\) when deriving the expression for \(S_{n}\) (the sum for any other value of b is obtained by multiplying the expression for \(S_{n}\) by b). Expanding Eq. (17),
Now multiplying \(S_{n}\) with r,
Subtracting Eq. (19) from Eq. (18) when \(|r|<1\) (and Eq. (18) from Eq. (19) otherwise), we get,
Therefore,
or, for any value of b,
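The closed-form sum derived above can be checked numerically against direct summation. The sketch below uses the standard closed form of the arithmetico-geometric sum for \(r \ne 1\); the function names are illustrative.

```python
def ag_sum_direct(a, d, b, r, n):
    # direct sum of the first n terms (a + (k-1)*d) * b * r**(k-1)
    return sum((a + (k - 1) * d) * b * r ** (k - 1) for k in range(1, n + 1))

def ag_sum_closed(a, d, b, r, n):
    # standard closed form, valid for r != 1
    geo = (1 - r ** n) / (1 - r)
    arith = r * (1 - n * r ** (n - 1) + (n - 1) * r ** n) / (1 - r) ** 2
    return b * (a * geo + d * arith)
```

Both functions agree to floating-point precision for any \(r \ne 1\), including both \(|r|<1\) and \(|r|>1\).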
Independence of \({\mathbf {W}}^{(\kappa )}\) with respect to \(\kappa \)
Proof
We have \({\mathbf {W}}^{(\kappa )} = [W^{(\kappa )}_{uv}],\,u,v\in {\mathcal {V}}\) where
To prove the independence of \({\mathbf {W}}^{(\kappa )}\) with respect to \(\kappa \), we need to show \({\mathbf {w}}_{u\cdot }^{(\kappa )} \perp \!\!\! \perp {\mathbf {w}}_{u\cdot }^{(\kappa + \delta )}\) where \({\mathbf {w}}_{u\cdot }\) is a column in \({\mathbf {W}}\).
Without loss of generality, assume \(\varLambda _{uv}(s) = 1\) and replace \(|m-l|\) with \(x, x>0\), in Eq. (22); the column \({\mathbf {w}}_{u\cdot }\) for \(\kappa \) and \(\kappa + \delta \) will be \({\mathbf {w}}_{u\cdot }^{(\kappa )} = [e^{-\kappa x_i}]\) and \({\mathbf {w}}_{u\cdot }^{(\kappa + \delta )} = [e^{-(\kappa + \delta ) x_i}]\), where \(i = 1, \ldots , |{\mathcal {V}}|\).
\({\mathbf {w}}_{u\cdot }^{(\kappa )}\) and \({\mathbf {w}}_{u\cdot }^{(\kappa +\delta )}\) will be dependent iff there is a nontrivial solution for a, b in Eq. (23).
Solving this for a and b by taking any element i in \({\mathbf {w}}_{u\cdot }\).
Since \(e^{-\kappa x_i}\ne 0\), we have \(a = b e^{-\delta x_i}\). Plugging this in the equation for another element j in \({\mathbf {w}}_{u\cdot }\) we get,
Since \(e^{-\delta x_i} \ne e^{-\delta x_j}\) for all \(i \ne j\), and \(e^{-\kappa x_j} \ne 0\), we have \(b=0\) and, consequently, \(a=0\).
Therefore, Eq. (23) has only the trivial solution \(a=b=0\). Thus, \({\mathbf {w}}_{u\cdot }^{(\kappa )}\) is linearly independent across \(\kappa \), implying the independence of \({\mathbf {W}}^{(\kappa )}\). \(\square \)
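The linear independence shown above can be verified numerically for arbitrary gap values \(x_i\); the values chosen below are illustrative.

```python
import numpy as np

# arbitrary positive gap values x_i (illustrative choice)
x = np.array([1.0, 2.0, 3.0])
kappa, delta = 0.5, 0.3

w_k = np.exp(-kappa * x)              # w^(kappa)
w_kd = np.exp(-(kappa + delta) * x)   # w^(kappa + delta)

# the two columns are linearly independent iff the stacked matrix has rank 2
rank = int(np.linalg.matrix_rank(np.stack([w_k, w_kd])))
```

The rank is 2, confirming that no nontrivial combination \(a\,{\mathbf{w}}^{(\kappa)} + b\,{\mathbf{w}}^{(\kappa+\delta)} = 0\) exists for distinct \(x_i\).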
Proof for symbol clustering
We have \(\frac{\partial \varDelta }{\partial \kappa }=\frac{\partial }{\partial \kappa }E[\phi _{\kappa }(X)-\phi _{\kappa }(Y)]=E[\frac{\partial }{\partial \kappa }\phi _{\kappa }(X)-\frac{\partial }{\partial \kappa }\phi _{\kappa }(Y)]\). For \(E[X]<E[Y]\), we want \(\frac{\partial \varDelta }{\partial \kappa }>0\), and in turn, \(\frac{\partial }{\partial \kappa }\phi _{\kappa }(X)>\frac{\partial }{\partial \kappa }\phi _{\kappa }(Y)\). This holds if \(\frac{\partial ^{2}}{\partial d\,\partial \kappa }\phi _{\kappa }(d)>0\), that is, if the slope \(\frac{\partial }{\partial \kappa }\phi _{\kappa }(d)\) increases with d. For an exponential expression for \(\phi \) (Eq. 1), this condition holds if \(\kappa d>1\). Hence, under these conditions, the separation increases as we increase the tuning parameter \(\kappa \).
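The sign condition on the mixed partial can be checked with a finite-difference sketch, assuming the exponential form \(\phi_{\kappa}(d)=e^{-\kappa d}\) of Eq. (1); analytically the mixed partial is \(e^{-\kappa d}(\kappa d - 1)\), positive exactly when \(\kappa d > 1\).

```python
import math

def phi(kappa, d):
    # exponential form of phi assumed from Eq. (1)
    return math.exp(-kappa * d)

def mixed_partial(kappa, d, h=1e-5):
    # central finite-difference estimate of d^2 phi / (dd dkappa)
    return (phi(kappa + h, d + h) - phi(kappa + h, d - h)
            - phi(kappa - h, d + h) + phi(kappa - h, d - h)) / (4 * h * h)
```

For instance, `mixed_partial(2.0, 1.0)` is positive (\(\kappa d = 2 > 1\)) while `mixed_partial(0.5, 1.0)` is negative (\(\kappa d = 0.5 < 1\)), matching the analytic expression.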
Sequence simulation
In this section, we explain the sequence generation for Exp 1–2 in Sect. 4.
A sequence is generated by repeatedly drawing a random string from the motif set and placing it between arbitrary strings. These interspersed arbitrary symbols are the noise in a sequence. Figure 17 shows an example of a sequence generated from a set of seed motifs; about 50% of the sequence is noise (arbitrary strings). Note that, due to the random motif selection, a simulated sequence does not necessarily contain all seed motifs.
1.1 Generating sequence clusters
Suppose we have to generate K sequence clusters. We first randomly simulate K sets of motifs. In each set, the motifs are of random lengths (between 2 and 8 in our simulations). The size of a set is also randomly chosen (between 6 and 11 in this paper). Sequences are then generated from each motif set as described above.
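The generation procedure above can be sketched as follows. This is an illustrative reimplementation, not the paper's exact simulation code; parameters such as the number of motif placements per sequence and the noise-segment lengths are assumptions.

```python
import random
import string

def make_motifs(n_motifs, min_len=2, max_len=8, alphabet=string.ascii_uppercase):
    """Randomly simulate a set of seed motifs of random lengths."""
    return [''.join(random.choices(alphabet, k=random.randint(min_len, max_len)))
            for _ in range(n_motifs)]

def simulate_sequence(motifs, n_picks=10, max_noise=4, alphabet=string.ascii_uppercase):
    """Interleave randomly chosen seed motifs with arbitrary (noise) strings."""
    parts = []
    for _ in range(n_picks):
        parts.append(''.join(random.choices(alphabet, k=random.randint(1, max_noise))))
        parts.append(random.choice(motifs))
    return ''.join(parts)

def simulate_clusters(K, n_seq=20, min_set=6, max_set=11):
    """Generate K sequence clusters, each from its own randomly sized motif set."""
    clusters = []
    for _ in range(K):
        motifs = make_motifs(random.randint(min_set, max_set))
        clusters.append([simulate_sequence(motifs) for _ in range(n_seq)])
    return clusters
```

The motif lengths (2 to 8) and set sizes (6 to 11) mirror the ranges used in the paper's simulations.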
1.2 What are overlapping clusters?
In our experiments, we test the efficacy of clustering methods when the clusters are difficult to separate, i.e., when the clusters overlap. For traditional multidimensional data in Euclidean space, this means the centroids of the clusters are close. In our problem, overlapping clusters share some common seed motifs; that is, the intersection of the clusters' motif sets is nonempty. Thus, 0% overlap means the intersection of the motif sets is empty, and 100% overlap means the intersection equals the union. Figure 18 shows an example of overlapping motif sets of two clusters.
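The overlap measure described above can be written as a one-liner; `motif_overlap` is an illustrative helper computing the intersection-over-union of two motif sets as a percentage.

```python
def motif_overlap(set_a, set_b):
    """Percent overlap between two clusters' motif sets:
    0.0 -> disjoint sets, 100.0 -> intersection equals union."""
    a, b = set(set_a), set(set_b)
    return 100.0 * len(a & b) / len(a | b)
```

For example, disjoint motif sets give 0% overlap, identical sets give 100%, and two sets of size two sharing one motif give \(100 \cdot 1/3 \approx 33.3\%\).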
Code repository and data sets
The source code and data sets are available on GitHub at https://github.com/cran2367/sgt. A Python package is also provided at https://pypi.org/project/sgt/.
The Python package can be installed as follows.
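Based on the PyPI page linked above, the package name is `sgt`, so a standard pip installation should work:

```shell
pip install sgt
```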
Cite this article
Ranjan, C., Ebrahimi, S. & Paynabar, K. Sequence graph transform (SGT): a feature embedding function for sequence data mining. Data Min Knowl Disc 36, 668–708 (2022). https://doi.org/10.1007/s10618-021-00813-0