Taxonomy-driven lumping for sequence mining

Bonchi, Francesco; Castillo, Carlos; Donato, Debora; Gionis, Aristides

doi:10.1007/s10618-009-0141-6

Published: 21 July 2009

Data Mining and Knowledge Discovery, volume 19, pages 227–244 (2009)


Abstract

Given a taxonomy of events and a dataset of sequences of these events, we study the problem of finding efficient and effective ways to produce a compact representation of the sequences. We model sequences with Markov models whose states correspond to nodes in the provided taxonomy, and each state represents the events in the subtree under the corresponding node. By lumping observed events to states that correspond to internal nodes in the taxonomy, we allow more compact models that are easier to understand and visualize, at the expense of a decrease in the data likelihood. We formally define and characterize our problem, and we propose a scalable search method for finding a good trade-off between two conflicting goals: maximizing the data likelihood, and minimizing the model complexity. We implement these ideas in Taxomo, a taxonomy-driven modeler, which we apply in two different domains, query-log mining and mining of moving-object trajectories. The empirical evaluation confirms the feasibility and usefulness of our approach.
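
The trade-off described in the abstract can be illustrated with a minimal sketch. It is not the Taxomo implementation: the function names, the toy query data, the omission of an initial-state term, and the use of a BIC-style penalty as the complexity measure are all assumptions made for illustration. Each event is mapped to a taxonomy node (its state), the model is a first-order Markov chain over states, and each state emits its member events by relative frequency.

```python
import math
from collections import Counter


def lumped_log_likelihood(sequences, lump):
    """Log-likelihood of event sequences under a first-order Markov chain
    over lumped states, where each state emits its member events with
    their relative frequency. `lump` maps event -> state (hypothetical
    representation of a cut through the taxonomy). The initial-state
    probability is omitted for brevity."""
    trans = Counter()        # (state_a, state_b) -> transition count
    out = Counter()          # state_a -> outgoing transition count
    emit = Counter()         # event -> occurrence count
    state_total = Counter()  # state -> occurrences of any member event
    for seq in sequences:
        states = [lump[e] for e in seq]
        for e, s in zip(seq, states):
            emit[e] += 1
            state_total[s] += 1
        for a, b in zip(states, states[1:]):
            trans[(a, b)] += 1
            out[a] += 1
    ll = 0.0
    for (a, _b), c in trans.items():
        ll += c * math.log(c / out[a])                 # transition terms
    for e, c in emit.items():
        ll += c * math.log(c / state_total[lump[e]])   # emission terms
    return ll


def num_parameters(lump):
    """Free parameters: k*(k-1) transitions plus (events - k) emissions."""
    k = len(set(lump.values()))
    return k * (k - 1) + (len(lump) - k)


def bic_score(sequences, lump):
    """BIC-style score (lower is better); an assumed complexity penalty."""
    n = sum(len(s) for s in sequences)
    return (-2.0 * lumped_log_likelihood(sequences, lump)
            + num_parameters(lump) * math.log(n))


# Toy query-log data over a two-level taxonomy (made up for illustration).
sequences = [
    ["pizza", "pasta", "flight", "hotel"],
    ["pasta", "pizza", "hotel", "flight"],
    ["pizza", "flight", "hotel", "pasta"],
]
fine = {e: e for e in ["pizza", "pasta", "flight", "hotel"]}   # leaves
coarse = {"pizza": "food", "pasta": "food",
          "flight": "travel", "hotel": "travel"}               # internal nodes

for name, lump in [("fine", fine), ("coarse", coarse)]:
    print(name, lumped_log_likelihood(sequences, lump),
          num_parameters(lump), bic_score(sequences, lump))
```

In this sketch the coarse model has fewer parameters but can never reach a higher likelihood than the fine one, which is the tension between the two goals that the search method has to balance.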

Author information

Authors and Affiliations

Yahoo! Research, Diagonal 177, Barcelona, 080018, Spain
Francesco Bonchi, Carlos Castillo, Debora Donato & Aristides Gionis

Corresponding author

Correspondence to Aristides Gionis.

Additional information

Responsible editors: Aleksander Kołcz, Wray Buntine, Marko Grobelnik, Dunja Mladenic, and John Shawe-Taylor.

About this article

Cite this article

Bonchi, F., Castillo, C., Donato, D. et al. Taxonomy-driven lumping for sequence mining. Data Min Knowl Disc 19, 227–244 (2009). https://doi.org/10.1007/s10618-009-0141-6

Received: 12 June 2009
Accepted: 24 June 2009
Published: 21 July 2009
Issue Date: October 2009
DOI: https://doi.org/10.1007/s10618-009-0141-6
