Taxonomy-driven lumping for sequence mining

Data Mining and Knowledge Discovery

Abstract

Given a taxonomy of events and a dataset of sequences of these events, we study how to produce a compact representation of the sequences efficiently and effectively. We model sequences with Markov models whose states correspond to nodes in the provided taxonomy; each state represents the events in the subtree under the corresponding node. By lumping observed events into states that correspond to internal nodes of the taxonomy, we obtain more compact models that are easier to understand and visualize, at the expense of a decrease in the data likelihood. We formally define and characterize the problem, and we propose a scalable search method for finding a good trade-off between two conflicting goals: maximizing the data likelihood and minimizing the model complexity. We implement these ideas in Taxomo, a taxonomy-driven modeler, which we apply in two different domains: query-log mining and mining of moving-object trajectories. The empirical evaluation confirms the feasibility and usefulness of our approach.
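
The trade-off described in the abstract can be illustrated with a small sketch. It is not the Taxomo implementation: the toy taxonomy, the function names `lump` and `fit_and_score`, the use of per-state emission frequencies, and the BIC-style penalty are all assumptions made here for illustration. The sketch fits a first-order Markov chain over the states of a chosen taxonomy cut and reports the log-likelihood, a crude parameter count, and a penalized score.

```python
import math
from collections import Counter

def lump(event, parent, cut):
    """Walk up the taxonomy from `event` until a node in `cut` is reached."""
    node = event
    while node not in cut:
        node = parent[node]  # raises KeyError if the cut does not cover the event
    return node

def fit_and_score(sequences, parent, cut):
    """Maximum-likelihood first-order Markov chain over lumped states.

    Returns (log_likelihood, n_params, penalized_score). The penalty is a
    BIC-style term chosen for this sketch, not taken from the article.
    """
    trans = Counter()   # (state, next_state) -> count
    out = Counter()     # state -> number of outgoing transitions
    emit = Counter()    # (state, event) -> count
    visits = Counter()  # state -> number of events emitted
    for seq in sequences:
        states = [lump(e, parent, cut) for e in seq]
        for s, e in zip(states, seq):
            emit[(s, e)] += 1
            visits[s] += 1
        for s, t in zip(states, states[1:]):
            trans[(s, t)] += 1
            out[s] += 1
    # Log-likelihood of transitions plus log-likelihood of events given states.
    ll = sum(c * math.log(c / out[s]) for (s, _), c in trans.items())
    ll += sum(c * math.log(c / visits[s]) for (s, _), c in emit.items())
    # Crude complexity measure: number of non-zero transition and emission entries.
    n_params = len(trans) + len(emit)
    n_obs = sum(visits.values())
    score = ll - 0.5 * n_params * math.log(n_obs)
    return ll, n_params, score

# Hypothetical toy taxonomy (child -> parent) and event sequences.
parent = {"cat": "animal", "dog": "animal", "rose": "plant", "oak": "plant",
          "animal": "root", "plant": "root"}
sequences = [["cat", "dog", "rose"], ["dog", "cat", "oak"], ["rose", "oak", "cat"]]
fine_cut = {"cat", "dog", "rose", "oak"}   # one state per leaf event
coarse_cut = {"animal", "plant"}           # events lumped to internal nodes

for name, cut in [("fine", fine_cut), ("coarse", coarse_cut)]:
    ll, k, score = fit_and_score(sequences, parent, cut)
    print(name, round(ll, 3), k, round(score, 3))
```

On this toy data the coarse cut uses fewer parameters but attains a lower log-likelihood than the fine cut, which is the tension the search method has to balance.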

Author information

Corresponding author

Correspondence to Aristides Gionis.

Additional information

Responsible editors: Aleksander Kołcz, Wray Buntine, Marko Grobelnik, Dunja Mladenic, and John Shawe-Taylor.

About this article

Cite this article

Bonchi, F., Castillo, C., Donato, D. et al. Taxonomy-driven lumping for sequence mining. Data Min Knowl Disc 19, 227–244 (2009). https://doi.org/10.1007/s10618-009-0141-6

  • DOI: https://doi.org/10.1007/s10618-009-0141-6
