Data Mining and Knowledge Discovery, Volume 19, Issue 2, pp 227–244

Taxonomy-driven lumping for sequence mining

  • Francesco Bonchi
  • Carlos Castillo
  • Debora Donato
  • Aristides Gionis


Given a taxonomy of events and a dataset of sequences of these events, we study the problem of finding efficient and effective ways to produce a compact representation of the sequences. We model sequences with Markov models whose states correspond to nodes in the provided taxonomy, and each state represents the events in the subtree under the corresponding node. By lumping observed events to states that correspond to internal nodes in the taxonomy, we allow more compact models that are easier to understand and visualize, at the expense of a decrease in the data likelihood. We formally define and characterize our problem, and we propose a scalable search method for finding a good trade-off between two conflicting goals: maximizing the data likelihood, and minimizing the model complexity. We implement these ideas in Taxomo, a taxonomy-driven modeler, which we apply in two different domains, query-log mining and mining of moving-object trajectories. The empirical evaluation confirms the feasibility and usefulness of our approach.
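
The trade-off described above can be illustrated with a minimal Python sketch. It is not the Taxomo implementation: the toy taxonomy, the function names `lump`, `log_likelihood`, and `score`, the frequency-based emission probabilities, and the quadratic penalty on the number of states are all assumptions made for illustration. The sketch fits a first-order Markov chain whose states are a chosen set of taxonomy nodes and evaluates how likely the original leaf-level sequences are under that model.

```python
import math
from collections import Counter

# Hypothetical toy taxonomy, stored as child -> parent (the root has no entry).
PARENT = {
    "espresso": "coffee", "latte": "coffee",
    "green": "tea", "black": "tea",
    "coffee": "drink", "tea": "drink",
}

def lump(event, states):
    """Map a leaf event to the closest node on its path to the root that is a state."""
    node = event
    while node not in states:
        node = PARENT[node]  # raises KeyError if `states` does not cover the event
    return node

def log_likelihood(sequences, states):
    """Log-likelihood of leaf-level sequences under a Markov chain over `states`.

    Assumptions: transitions are maximum-likelihood estimates from the lumped
    sequences; a state emits each of its leaf events with probability equal to
    the event's relative frequency within that state; start-state
    probabilities are omitted for brevity.
    """
    trans, out, emit, occ = Counter(), Counter(), Counter(), Counter()
    for seq in sequences:
        lumped = [lump(e, states) for e in seq]
        for e, s in zip(seq, lumped):
            emit[(s, e)] += 1
            occ[s] += 1
        for a, b in zip(lumped, lumped[1:]):
            trans[(a, b)] += 1
            out[a] += 1
    ll = sum(n * math.log(n / out[a]) for (a, _), n in trans.items())
    ll += sum(n * math.log(n / occ[s]) for (s, _), n in emit.items())
    return ll

def score(sequences, states, penalty):
    """Assumed objective: likelihood minus a penalty on the transition-matrix size."""
    return log_likelihood(sequences, states) - penalty * len(states) ** 2

sequences = [
    ["espresso", "green", "latte", "black"],
    ["latte", "green", "espresso", "black"],
    ["espresso", "black"],
]
leaf_states = {"espresso", "latte", "green", "black"}
lumped_states = {"coffee", "tea"}

for name, states in [("leaves", leaf_states), ("lumped", lumped_states)]:
    print(name, log_likelihood(sequences, states), score(sequences, states, 0.5))
```

On this toy data the leaf-level model attains the higher likelihood, while the lumped model has fewer states; which one scores better depends on the penalty weight, which is a free parameter of this sketch.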


Keywords: Data mining, Sequence analysis, Markov models, Query-log analysis, Spatial-data analysis





Copyright information

© Springer Science+Business Media, LLC 2009

Authors and Affiliations

  • Francesco Bonchi (1)
  • Carlos Castillo (1)
  • Debora Donato (1)
  • Aristides Gionis (1)
  1. Yahoo! Research, Barcelona, Spain
