Taxonomy-driven lumping for sequence mining
Abstract
Given a taxonomy of events and a dataset of sequences of these events, we study the problem of finding efficient and effective ways to produce a compact representation of the sequences. We model sequences with Markov models whose states correspond to nodes in the provided taxonomy, and each state represents the events in the subtree under the corresponding node. By lumping observed events to states that correspond to internal nodes in the taxonomy, we allow more compact models that are easier to understand and visualize, at the expense of a decrease in the data likelihood. We formally define and characterize our problem, and we propose a scalable search method for finding a good trade-off between two conflicting goals: maximizing the data likelihood, and minimizing the model complexity. We implement these ideas in Taxomo, a taxonomy-driven modeler, which we apply in two different domains, query-log mining and mining of moving-object trajectories. The empirical evaluation confirms the feasibility and usefulness of our approach.
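The trade-off described above can be illustrated with a minimal sketch. It is not the Taxomo implementation: the toy taxonomy, the function names (`lump`, `log_likelihood`, `score`), the within-state emission model, and the penalty term are all assumptions made for illustration. The sketch lumps each observed event to the taxonomy node chosen as its state, fits a first-order Markov chain over those states by maximum likelihood, and compares the data log-likelihood of a fine-grained and a coarse model.

```python
import math
from collections import Counter

# Hypothetical toy taxonomy (child -> parent); not taken from the paper.
PARENT = {
    "root": None,
    "food": "root", "travel": "root",
    "pizza": "food", "sushi": "food",
    "flights": "travel", "hotels": "travel",
}

def lump(event, cut):
    """Map an event to its ancestor-or-self that belongs to the cut."""
    node = event
    while node is not None:
        if node in cut:
            return node
        node = PARENT[node]
    raise ValueError(f"{event!r} is not covered by the cut")

def log_likelihood(sequences, cut):
    """Log-likelihood of non-empty event sequences under a first-order
    Markov chain over the states in `cut`, using maximum-likelihood
    estimates. Assumption: a state emits the events of its subtree with
    their empirical within-state frequencies."""
    start, trans, out, emit, visits = (Counter() for _ in range(5))
    for seq in sequences:
        states = [lump(e, cut) for e in seq]
        start[states[0]] += 1
        for s, e in zip(states, seq):
            emit[(s, e)] += 1
            visits[s] += 1
        for a, b in zip(states, states[1:]):
            trans[(a, b)] += 1
            out[a] += 1
    n = len(sequences)
    ll = sum(c * math.log(c / n) for c in start.values())
    ll += sum(c * math.log(c / out[a]) for (a, _), c in trans.items())
    ll += sum(c * math.log(c / visits[s]) for (s, _), c in emit.items())
    return ll

def score(sequences, cut, penalty=1.0):
    """Assumed trade-off: log-likelihood minus a penalty per transition
    parameter (the squared number of states)."""
    return log_likelihood(sequences, cut) - penalty * len(cut) ** 2

SEQUENCES = [
    ["pizza", "sushi", "flights"],
    ["sushi", "pizza", "hotels"],
    ["flights", "hotels", "pizza"],
]
LEAF_CUT = {"pizza", "sushi", "flights", "hotels"}   # one state per event
COARSE_CUT = {"food", "travel"}                      # events lumped to parents

if __name__ == "__main__":
    for name, cut in [("leaf", LEAF_CUT), ("coarse", COARSE_CUT)]:
        print(name, len(cut), log_likelihood(SEQUENCES, cut), score(SEQUENCES, cut))
```

In this sketch the coarse model has fewer states, and its log-likelihood cannot exceed that of the leaf-level model, because the lumped chain is a constrained special case of the chain over individual events. A search procedure would then look for the cut with the best penalized score; the penalty used here is only one plausible choice.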
Keywords
Data mining, Sequence analysis, Markov models, Query-log analysis, Spatial-data analysis