VOGUE: A Novel Variable Order-Gap State Machine for Modeling Sequences

  • Bouchra Bouqata
  • Christopher D. Carothers
  • Boleslaw K. Szymanski
  • Mohammed J. Zaki
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4213)

Abstract

We present VOGUE, a new state machine that combines two separate techniques for modeling long-range dependencies in sequential data: data mining and data modeling. VOGUE relies on a novel Variable-Gap Sequence mining method (VGS) to mine frequent patterns with different lengths and different gaps between elements, and then uses the mined sequences to build the state machine. We applied VOGUE to protein sequence classification on real data from the PROSITE protein families. We show that VOGUE yields significantly better scores than higher-order Hidden Markov Models. Moreover, we show that VOGUE’s classification sensitivity exceeds that of HMMER, a state-of-the-art method for protein classification.
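
To make the idea of mining patterns with gaps concrete, the following is a minimal sketch rather than the VGS algorithm itself, whose details the abstract does not give. It assumes patterns of length two only, and it counts support as the number of sequences containing a pattern. The function name `mine_gapped_pairs` and the parameters `max_gap` and `min_support` are hypothetical names chosen for illustration.

```python
from collections import Counter


def mine_gapped_pairs(sequences, max_gap, min_support):
    """Return frequent (first, second, gap) patterns.

    A pattern (a, b, g) means symbol a is followed by symbol b with
    exactly g other symbols in between. Support is the number of input
    sequences that contain the pattern at least once (an assumption of
    this sketch, not a detail taken from the paper).
    """
    support = Counter()
    for seq in sequences:
        seen = set()  # patterns found in this sequence, counted once
        for i, first in enumerate(seq):
            for gap in range(max_gap + 1):
                j = i + gap + 1
                if j >= len(seq):
                    break
                seen.add((first, seq[j], gap))
        support.update(seen)
    # Keep only patterns that meet the minimum support threshold.
    return {p: c for p, c in support.items() if c >= min_support}


frequent = mine_gapped_pairs(["ABC", "AXC", "ABD"], max_gap=1, min_support=2)
print(frequent)
```

In this toy input, "A followed directly by B" and "A followed by C with one symbol in between" each occur in two of the three sequences, so they are the only patterns kept. Gapped patterns of this kind are what a state machine could then be built from.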

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Bouchra Bouqata (1)
  • Christopher D. Carothers (1)
  • Boleslaw K. Szymanski (1)
  • Mohammed J. Zaki (1)

  1. CS Department, Rensselaer Polytechnic Institute, Troy, USA
