
Learning sequential classifiers from long and noisy discrete-event sequences efficiently

Published in: Data Mining and Knowledge Discovery

Abstract

A variety of applications, such as information extraction, intrusion detection, and protein fold recognition, can be expressed as sequences of discrete events or elements (rather than unordered sets of features); that is, there is an order dependence among the elements composing each data instance. These applications may be modeled as classification problems, in which case the classifier should exploit sequential interactions among the elements so that the ordering relationship among them is properly captured. Dominant approaches to this problem include: (i) learning Hidden Markov Models, (ii) exploiting frequent sequences extracted from the data, and (iii) computing string kernels. Such approaches, however, are computationally hard and vulnerable to noise, especially if the data shows long-range dependencies (i.e., long subsequences are necessary to model the data). In this paper we provide simple algorithms that build highly effective sequential classifiers. Our algorithms enumerate approximately contiguous subsequences from the training set on a demand-driven basis, exploiting a lightweight and flexible subsequence matching function and an innovative subsequence enumeration strategy called pattern silhouettes. This makes our learning algorithms fast and the corresponding classifiers robust to noisy data. Our empirical results on a variety of datasets indicate that the best trade-off between accuracy and learning time is usually obtained by limiting the length of the subsequences to \(\log {n}\), which leads to an \(O(n\log {n})\) learning cost (where \(n\) is the length of the sequence being classified). Finally, we show that our classifiers are usually faster than existing solutions (sometimes by orders of magnitude), while also providing significant accuracy improvements in most of the evaluated cases.
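The core idea in the abstract, scoring a sequence by the contiguous subsequences it contains, with subsequence length capped at \(\log {n}\), can be sketched roughly as follows. This is a minimal illustration only: it uses exact subsequence matching and a simple class-frequency vote, whereas the paper's actual method uses an approximate matching function and pattern-silhouette enumeration, neither of which is reproduced here. All function names below are hypothetical.

```python
import math
from collections import defaultdict

def subsequences_up_to_log(seq, base=2):
    """Yield every contiguous subsequence of length at most log_base(n).

    Capping the length at log(n) bounds the number of enumerated
    subsequences by O(n log n), as suggested in the abstract.
    """
    n = len(seq)
    max_len = max(1, int(math.log(n, base))) if n > 1 else 1
    for i in range(n):
        for length in range(1, min(max_len, n - i) + 1):
            yield tuple(seq[i:i + length])

def train(labeled_seqs):
    """Count, per subsequence, how many training sequences of each
    class contain it (document frequency, not raw occurrence count)."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq, label in labeled_seqs:
        for sub in set(subsequences_up_to_log(seq)):
            counts[sub][label] += 1
    return counts

def classify(seq, counts, labels):
    """Score each class by the class-conditional frequency of the
    test sequence's subsequences, then pick the best-scoring class."""
    scores = {c: 0.0 for c in labels}
    for sub in set(subsequences_up_to_log(seq)):
        if sub in counts:
            total = sum(counts[sub].values())
            for c, k in counts[sub].items():
                scores[c] += k / total
    return max(scores, key=scores.get)
```

For example, after training on a few labeled strings (`train([("aaab", "A"), ("aaba", "A"), ("bbba", "B"), ("bbab", "B")])`), a query such as `"aaaa"` is scored against the subsequences it shares with each class. The longest subsequences considered for a length-4 sequence are of length \(\lfloor\log_2 4\rfloor = 2\).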


Notes

  1. A similar trend is observed for the SC-SC algorithm.

  2. The reason we include the LAC algorithm (which does not exploit adjacency information) as a baseline is to evaluate the possible benefits of exploiting adjacency information while producing the classifier.


Author information

Correspondence to Adriano Veloso.

Additional information

Responsible editors: Joao Gama, Indre Zliobaite and Alipio Jorge.


Cite this article

Dafé, G., Veloso, A., Zaki, M. et al. Learning sequential classifiers from long and noisy discrete-event sequences efficiently. Data Min Knowl Disc 29, 1685–1708 (2015). https://doi.org/10.1007/s10618-014-0391-9
