Abstract
We propose mixtures of hidden Markov models (HMMs) for modelling clickstreams of web surfers. In this approach, the page categorization is learned from the data, without the need for a (possibly cumbersome) manual categorization. We provide an EM algorithm for training a mixture of HMMs and show that additional static user data can easily be incorporated, which may improve the labelling of users. Furthermore, we use prior knowledge to improve generalization and avoid numerical problems. We use parameter tying to decrease the danger of overfitting and to reduce computational overhead. We put a flat prior on the parameters so that transitions between page categories that occur rarely or not at all in the data nonetheless retain a nonzero probability. We demonstrate the usefulness of our approach in applications to artificial data and real-world web logs. We train a mixture of HMMs on artificial navigation patterns and show that the correct model is learned. Moreover, we show that the use of static 'satellite data' may improve the labelling of shorter navigation patterns. When applying a mixture of HMMs to real-world web logs from a large Dutch commercial web site, we demonstrate that sensible page categorizations are learned.
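The abstract gives no formulas, so the following is only a minimal sketch of two of the ingredients it names, under stated assumptions. It computes, for one clickstream, the posterior probability of each mixture component (the per-user part of an EM E-step), and it re-estimates a transition matrix with additive pseudo-counts so that unseen transitions stay nonzero. The function names, the toy parameters, and the use of additive pseudo-counts as a stand-in for the prior are assumptions made for illustration, not the authors' implementation. Hidden states play the role of page categories and observations are page indices.

```python
import math

def forward_loglik(obs, start, trans, emit):
    """Log-likelihood of one clickstream under one HMM (scaled forward pass).

    Assumes every observed page has nonzero probability under the model.
    """
    n = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n)]
    loglik = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate through the transition matrix, then weight by emission.
            alpha = [
                sum(alpha[r] * trans[r][s] for r in range(n)) * emit[s][obs[t]]
                for s in range(n)
            ]
        scale = sum(alpha)          # rescale to avoid numerical underflow
        loglik += math.log(scale)
        alpha = [a / scale for a in alpha]
    return loglik

def responsibilities(obs, weights, hmms):
    """Posterior probability of each mixture component for one user."""
    logs = [math.log(w) + forward_loglik(obs, *h) for w, h in zip(weights, hmms)]
    top = max(logs)                 # subtract the maximum before exponentiating
    unnorm = [math.exp(v - top) for v in logs]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def smoothed_transitions(expected_counts, pseudo_count=1.0):
    """Row-normalize expected transition counts after adding a pseudo-count.

    Hypothetical stand-in for the prior mentioned in the abstract: every
    transition, including one never observed, keeps a nonzero probability.
    """
    rows = []
    for row in expected_counts:
        total = sum(row) + pseudo_count * len(row)
        rows.append([(c + pseudo_count) / total for c in row])
    return rows

# Toy example: two components, two hidden page categories, three pages.
trans = [[0.9, 0.1], [0.1, 0.9]]
hmm_a = ([0.5, 0.5], trans, [[0.8, 0.1, 0.1], [0.6, 0.2, 0.2]])  # favours page 0
hmm_b = ([0.5, 0.5], trans, [[0.1, 0.1, 0.8], [0.2, 0.2, 0.6]])  # favours page 2

resp = responsibilities([0, 0, 0, 0], [0.5, 0.5], [hmm_a, hmm_b])
smoothed = smoothed_transitions([[5.0, 0.0], [0.0, 3.0]])
```

In a full EM loop the responsibilities would weight each user's expected state and transition counts before the smoothed re-estimation step. Static user data, if available, could enter as an extra factor in `responsibilities`, though the abstract does not say how the authors do this.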
Copyright information
© 2003 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ypma, A., Heskes, T. (2003). Automatic Categorization of Web Pages and User Clustering with Mixtures of Hidden Markov Models. In: Zaïane, O.R., Srivastava, J., Spiliopoulou, M., Masand, B. (eds) WEBKDD 2002 - Mining Web Data for Discovering Usage Patterns and Profiles. WebKDD 2002. Lecture Notes in Computer Science, vol 2703. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-39663-5_3
DOI: https://doi.org/10.1007/978-3-540-39663-5_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-20304-9
Online ISBN: 978-3-540-39663-5
eBook Packages: Springer Book Archive