Model-Based Clustering and Visualization of Navigation Patterns on a Web Site

Abstract

We present a new methodology for exploring and analyzing navigation patterns on a web site. The patterns that can be analyzed consist of sequences of URL categories traversed by users. In our approach, we first partition site users into clusters such that users with similar navigation paths through the site are placed into the same cluster. Then, for each cluster, we display the paths of the users within that cluster. The clustering approach we employ is model-based (as opposed to distance-based) and partitions users according to the order in which they request web pages. In particular, we cluster users by learning a mixture of first-order Markov models using the Expectation-Maximization algorithm. The runtime of our algorithm scales linearly with the number of clusters and with the size of the data, and our implementation easily handles hundreds of thousands of user sessions in memory. In the paper, we describe the details of our method and a visualization tool based on it called WebCANVAS. We illustrate the use of our approach on user-traffic data from msnbc.com.
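
To make the clustering step concrete, the following is a minimal sketch of fitting a mixture of first-order Markov models with Expectation-Maximization. It is a generic textbook formulation, not the authors' implementation: the function names (`normalize`, `m_step`, `e_step`, `fit_markov_mixture`), the pseudo-count `alpha`, the random initialization, and the fixed iteration count are all assumptions made for illustration. Each session is assumed to be a non-empty list of integer URL-category indices.

```python
import math
import random


def normalize(values):
    """Scale non-negative values so they sum to 1."""
    total = sum(values)
    return [v / total for v in values]


def m_step(sessions, resp, n_states, n_clusters, alpha=0.01):
    """Re-estimate parameters from responsibility-weighted counts.

    alpha is a small pseudo-count (an assumption of this sketch) that keeps
    every probability strictly positive.
    """
    weights = [alpha] * n_clusters
    init = [[alpha] * n_states for _ in range(n_clusters)]
    trans = [[[alpha] * n_states for _ in range(n_states)]
             for _ in range(n_clusters)]
    for seq, r in zip(sessions, resp):
        for k in range(n_clusters):
            weights[k] += r[k]
            init[k][seq[0]] += r[k]              # first category requested
            for a, b in zip(seq, seq[1:]):       # consecutive requests a -> b
                trans[k][a][b] += r[k]
    return (normalize(weights),
            [normalize(row) for row in init],
            [[normalize(row) for row in t] for t in trans])


def e_step(sessions, weights, init, trans):
    """Compute each session's posterior cluster membership and the total
    log-likelihood of the data under the current parameters."""
    resp, loglik = [], 0.0
    for seq in sessions:
        logp = []
        for k in range(len(weights)):
            lp = math.log(weights[k]) + math.log(init[k][seq[0]])
            lp += sum(math.log(trans[k][a][b]) for a, b in zip(seq, seq[1:]))
            logp.append(lp)
        top = max(logp)                          # log-sum-exp for stability
        norm = top + math.log(sum(math.exp(lp - top) for lp in logp))
        resp.append([math.exp(lp - norm) for lp in logp])
        loglik += norm
    return resp, loglik


def fit_markov_mixture(sessions, n_states, n_clusters, n_iter=30,
                       alpha=0.01, seed=0):
    """Alternate M- and E-steps, starting from random soft assignments."""
    rng = random.Random(seed)
    resp = [normalize([rng.random() + 1e-3 for _ in range(n_clusters)])
            for _ in sessions]
    for _ in range(n_iter):
        weights, init, trans = m_step(sessions, resp, n_states,
                                      n_clusters, alpha)
        resp, _ = e_step(sessions, weights, init, trans)
    return weights, init, trans, resp


if __name__ == "__main__":
    # Toy sessions over two categories: "alternating" versus "sticky" users.
    toy = [[0, 1, 0, 1, 0, 1], [1, 0, 1, 0, 1, 0],
           [0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1]] * 5
    weights, init, trans, resp = fit_markov_mixture(toy, n_states=2,
                                                    n_clusters=2)
    for seq, r in zip(toy[:4], resp[:4]):
        print(seq, [round(x, 3) for x in r])
```

In this sketch, each iteration visits every transition of every session once per cluster, so the per-iteration cost grows linearly in both the number of clusters and the total length of the sessions. Like any EM procedure started from a random point, it can settle in a local optimum, so in practice one would likely try several seeds.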

Author information

Corresponding author

Correspondence to David Heckerman.

About this article

Cite this article

Cadez, I., Heckerman, D., Meek, C. et al. Model-Based Clustering and Visualization of Navigation Patterns on a Web Site. Data Mining and Knowledge Discovery 7, 399–424 (2003). https://doi.org/10.1023/A:1024992613384

Keywords

  • model-based clustering
  • sequence clustering
  • data visualization
  • Internet
  • web