Skip to main content
Log in

An optimized sequential pattern matching methodology for sequence classification

  • Regular Paper
  • Published:
Knowledge and Information Systems Aims and scope Submit manuscript

Abstract

In this paper we present a novel methodology for sequence classification, based on sequential pattern mining and optimization algorithms. The proposed methodology automatically generates a sequence classification model, based on a two stage process. In the first stage, a sequential pattern mining algorithm is applied to a set of sequences and the sequential patterns are extracted. Then, the score of every pattern with respect to each sequence is calculated using a scoring function and the score of each class under consideration is estimated by summing the specific pattern scores. Each score is updated, multiplied by a weight and the output of the first stage is the classification confusion matrix of the sequences. In the second stage an optimization technique, aims to finding a set of weights which minimize an objective function, defined using the classification confusion matrix. The set of the extracted sequential patterns and the optimal weights of the classes comprise the sequence classification model. Extensive evaluation of the methodology was carried out in the protein classification domain, by varying the number of training and test sequences, the number of patterns and the number of classes. The methodology is compared with other similar sequence classification approaches. The proposed methodology exhibits several advantages, such as automated weight assignment to classes using optimization techniques and knowledge discovery in the domain of application.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Similar content being viewed by others

References

  1. Fayyad UM, Piatetsky-Shapiro G, Smyth P (1996) From data mining to knowledge discovery: an overview. In: Advances in knowledge discovery and data mining. AAAI Press/MIT Press, Cambridge, pp 1–36

  2. Han J, Kamber M (2000) Data mining: concepts and techniques. Morgan Kaufmann, Menlo Park

  3. Rabiner L (1989) A tutorial on hidden Markov models and selected application in speech recognition. Proc IEEE 77: 257–286

    Article  Google Scholar 

  4. Loewenstern D, Berman H, Hirsh H (1998) Maximum a posteriori classification of DNA structure from sequence information. In: Proceedings of Pacific symposium in Biotech, pp 667–668

  5. Lesh N, Zaki MJ, Ogihara M (1999) Mining features for sequence classification. In: 5th ACM SIGKDD international conference on knowledge discovery and data mining (KDD). San Diego, pp 342–346

  6. Lesh N, Zaki MJ, Ogihara M (2000) Scalable feature mining for sequential data. IEEE Intell Syst 15(2): 48–56

    Article  Google Scholar 

  7. Shin-Mu Tseng V, Lee C-H (2005) CBS: a new classification method by using sequential patterns. In: Proceedings of SIAM international data mining conference. California, USA

  8. Agrawal R, Srikant R (1994) Fast algorithms for mining association rules. In: Proceedings of the 20th international conference on very large databases. Santiago, Chile

  9. Agrawal R, Srikant R (1995) Mining sequential patterns. In: Proceedings of the 11th international conference on data engineering, Taiwan, pp 3–14

  10. Bayardo Jr RJ (1997) Brute-force mining of high-confidence classification rules. In: Proceedings of the third international conference on knowledge discovery and data mining, pp 123–126

  11. Quinlan JR (1992) C4.5: Programs for machine learning, Morgan Kaufman, Menlo Park

  12. Zaki MJ (1998) Efficient enumeration of frequent sequences. In: 7th International conference on information and knowledge management, Washington DC, pp 68–75

  13. Liu, B., Hsu, W., Ma, Y.: Integrating Classification and Association Rule Mining. In: Proceedings of fourth international conference on knowledge discovery and data mining. AAAI Press, Menlo Park, pp 80–86 (1998)

  14. Pei J, Han J, Mortazavi-Asl B, Wang J, Pinto H, Chen Q, Dayal U (2004) Mining sequential patterns by pattern-growth: the prefixspan approach. IEEE Trans Knowl Data Eng 16: 1424–1440

    Article  Google Scholar 

  15. Ayres J, Gehrke J, Yiu T, Flannick J (2002) Sequential pattern mining using bitmaps. In: Proceedings of the 8th ACM SIGKDD international conference on knowledge discovery and data mining, Canada, pp 429–435

  16. Chen T-Z, Hsu S-C (2007) Mining frequent tree-like patterns in large databases. Data Knowl Eng 62(1): 65–83

    Article  Google Scholar 

  17. Zaki MJ (2000) Sequence mining in categorical domains: incorporating constraints. In: Proceedings of the 9th international conference on information and knowledge management, USA, pp 422–429

  18. Srikant, R., Agrawal, R.: Mining sequential patterns: generalizations and performance improvements. In: Proceedings 5th International Conference Extending Database Technology, EDBT. Springer, Heidelberg, vol 1057, pp 3–17 (1996)

  19. Garofalakis M, Rastogi R, Shim K (1999) SPIRIT: sequential pattern mining with regular expression constraint. In: Proceedings of the 25th international conference on very large databases, pp 223–234

  20. Lin M-Y, Lee S-Y (2005) Efficient mining of sequential patterns with time constraints by delimited pattern growth. Knowl Inf Syst 7: 499–514

    Article  Google Scholar 

  21. Lagarias JC, Reeds JA, Wright MH, Wright PE (1998) Convergence properties of the Nelder-Mead simplex method in low dimensions. SIAM J. Optim. 9(1): 112–147

    Article  MATH  MathSciNet  Google Scholar 

  22. Exarchos TP, Papaloukas C, Lampros C, Fotiadis DI (2007) Mining sequential patterns for protein fold recognition. J. Biomed. Inf. (in press)

  23. Exarchos TP, Papaloukas C, Lampros C, Fotiadis DI (2006) Protein classification using sequential pattern mining. In: Proceedings of IEEE engineering in medicine and biology conference, New York, USA, pp 5814–5817

  24. Wang K, Hu Y, Hu Yu J (2004) Scalable sequential pattern mining for biological sequences. In: Proceedings of the 13th ACM conference on information and knowledge management, USA, pp 178–187

  25. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE (2000) The protein data bank. Nucleic Acids Res 28: 235–242

    Article  Google Scholar 

  26. Murzin AG, Brenner SE, Hubbard T, Chothia C (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247: 536–540

    Google Scholar 

  27. Hughey R, Krogh A (1996) Hidden Markov models for sequence analysis: extension and analysis of the basic method. CABIOS 12(2): 95–107

    Google Scholar 

  28. Karplus K, Karchin R, Shackelford G, Hughey R (2005) Calibrating evalues for hidden Markov models using reverse-sequence null models. Bioinformatics 21: 4107–4115

    Article  Google Scholar 

  29. Baum LE (1972) An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes. Inequalities 3: 1–8

    Google Scholar 

  30. Moult J, Fidelis K, Rost B, Hubbard T, Tramontano A (2005) Critical assessment of methods of protein structure prediction (CASP)-round 6. Proteins 61(Suppl 7): 3–7

    Article  Google Scholar 

  31. Kum, H.-C., Chang, JH., Wang, W.: Intelligent sequential mining via alignment: optimization techniques for very large DB. In: Advances in knowledge discovery and data mining. Lecture Notes in Computer Science, vol 4426, pp 587–597 (2007)

  32. Rigoutsos I, Floratos A (1998) Combinatorial pattern discovery in biological sequences: the TEIRESIAS algorithm. Bioinformatics 14(1): 55–67

    Article  Google Scholar 

  33. Tzvetkov P, Yan X, Han J (2005) TSP: mining top-k closed sequential patterns. Knowl Inf Syst 7: 438–457

    Article  Google Scholar 

  34. Luo C, Chung SM (2007) A scalable algorithm for mining maximal frequent sequences using a sample. Knowl Inf Syst (in press)

  35. Ji X, Bailey J, Dong G (2007) Mining minimal distinguishing subsequence patterns with gap constraints. Knowl Inf Syst 11(3): 259–286

    Article  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Dimitrios I. Fotiadis.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Exarchos, T.P., Tsipouras, M.G., Papaloukas, C. et al. An optimized sequential pattern matching methodology for sequence classification. Knowl Inf Syst 19, 249–264 (2009). https://doi.org/10.1007/s10115-008-0146-2

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10115-008-0146-2

Keywords

Navigation