Data Preparation for Mining World Wide Web Browsing Patterns

Abstract

The World Wide Web (WWW) continues to grow at an astounding rate in both the sheer volume of traffic and the size and complexity of Web sites. The complexity of tasks such as Web site design, Web server design, and simply navigating through a Web site has increased along with this growth. An important input to these design tasks is the analysis of how a Web site is being used. Usage analysis includes straightforward statistics, such as page access frequency, as well as more sophisticated forms of analysis, such as finding the common traversal paths through a Web site. Web Usage Mining is the application of data mining techniques to usage logs of large Web data repositories in order to produce results that can be used in the design tasks mentioned above. However, several preprocessing tasks must be performed before data mining algorithms can be applied to the data collected from server logs. This paper presents several data preparation techniques for identifying unique users and user sessions. In addition, a method for dividing user sessions into semantically meaningful transactions is defined and successfully tested against two other methods. Transactions identified by the proposed methods are used to discover association rules from real-world data using the WEBMINER system [15].

References

  1. R. Agrawal, R. Srikant. Fast algorithms for mining association rules. In: Proc. of the 20th VLDB Conference, Santiago, Chile, 1994, pp. 487-499.
  2. T. Bray, J. Paoli, C. M. Sperberg-McQueen. Extensible markup language (XML) 1.0 W3C recommendation. Technical report, W3C, 1998.
  3. M. Balabanovic, Y. Shoham. Learning information retrieval agents: Experiments with automated Web browsing. In: On-line Working Notes of the AAAI Spring Symposium Series on Information Gathering from Distributed, Heterogeneous Environments, 1995.
  4. R. Cooley, B. Mobasher, J. Srivastava. Web mining: Information and pattern discovery on the World Wide Web. In: International Conference on Tools with Artificial Intelligence, Newport Beach, CA, 1997, pp. 558-567.
  5. L. Catledge, J. Pitkow. Characterizing browsing behaviors on the World Wide Web. Computer Networks and ISDN Systems 27(6), 1995.
  6. M.S. Chen, J.S. Park, P.S. Yu. Data mining for path traversal patterns in a Web environment. In: Proc. 16th International Conference on Distributed Computing Systems, 1996, pp. 385-392.
  7. S. Elo-Dean, M. Viveros. Data mining the IBM official 1996 Olympics Web site. Technical report, IBM T.J. Watson Research Center, 1997.
  8. Software Inc. Webtrends. http://www.webtrends.com, 1995.
  9. J. Gray, A. Bosworth, A. Layman, H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals. In: IEEE 12th International Conference on Data Engineering, 1996, pp. 152-159.
  10. Open Market Inc. Open Market Web reporter. http://www.openmarket.com, 1996.
  11. T. Joachims, D. Freitag, T. Mitchell. Webwatcher: A tour guide for the World Wide Web. In: Proc. 15th International Conference on Artificial Intelligence, Nagoya, Japan, 1997, pp. 770-775.
  12. L. Kaufman, P.J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley and Sons, 1990.
  13. H. Lieberman. Letizia: An agent that assists Web browsing. In: Proc. 1995 International Joint Conference on Artificial Intelligence, Montreal, Canada, 1995.
  14. A. Luotonen. The common log file format. http://www.w3.org/pub/WWW/, 1995.
  15. B. Mobasher, N. Jain, E. Han, J. Srivastava. Web Mining: Pattern discovery from World Wide Web transactions. Technical Report TR 96-050, Department of Computer Science, University of Minnesota, Minneapolis, 1996.
  16. H. Mannila, H. Toivonen. Discovering generalized episodes using minimal occurrences. In: Proc. Second International Conference on Knowledge Discovery and Data Mining, Portland, Oregon, 1996, pp. 146-151.
  17. H. Mannila, H. Toivonen, A. I. Verkamo. Discovering frequent episodes in sequences. In: Proc. First International Conference on Knowledge Discovery and Data Mining, Montreal, Quebec, 1995, pp. 210-215.
  18. net.Genesis. net.analysis desktop. http://www.netgen.com, 1996.
  19. R. Ng, J. Han. Efficient and effective clustering method for spatial data mining. In: Proc. 20th VLDB Conference, Santiago, Chile, 1994, pp. 144-155.
  20. D.S.W. Ngu, X. Wu. Sitehelper: A localized agent that helps incremental exploration of the World Wide Web. In: 6th International World Wide Web Conference, Santa Clara, CA, 1997, pp. 691-700.
  21. J. Pitkow. In search of reliable usage data on the WWW. In: Sixth International World Wide Web Conference, Santa Clara, CA, 1997, pp. 451-463.
  22. M. Pazzani, L. Nguyen, S. Mantik. Learning from hotlists and coldlists: Towards a WWW information filtering and seeking agent. In: IEEE 1995 International Conference on Tools with Artificial Intelligence, 1995.
  23. P. Pirolli, J. Pitkow, R. Rao. Silk from a sow’s ear: Extracting usable structures from the Web. In: Proc. 1996 Conference on Human Factors in Computing Systems (CHI-96), Vancouver, British Columbia, Canada, 1996.
  24. Global Reach Internet Productions. GRIP. http://www.global-reach.com, 1997.
  25. J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann: San Mateo, CA, 1993.
  26. R. Srikant, R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. In: Proc. Fifth International Conference on Extending Database Technology, Avignon, France, 1996.
  27. S. Schechter, M. Krishnan, M. D. Smith. Using path profiles to predict HTTP requests. In: 7th International World Wide Web Conference, Brisbane, Australia, 1998.
  28. C. Shahabi, A. Zarkesh, J. Adibi, V. Shah. Knowledge discovery from users Web-page navigation. In: Workshop on Research Issues in Data Engineering, Birmingham, England, 1997.
  29. T. Yan, M. Jacobsen, H. Garcia-Molina, U. Dayal. From user access patterns to dynamic hypertext linking. In: Fifth International World Wide Web Conference, Paris, France, 1996.
  30. O. R. Zaiane, M. Xin, J. Han. Discovering Web access patterns and trends by applying OLAP and data mining technology on Web logs, 1998.

Author information

Corresponding author

Correspondence to Robert Cooley.

Additional information

Supported by NSF grant EHR-9554517

About this article

Cite this article

Cooley, R., Mobasher, B. & Srivastava, J. Data Preparation for Mining World Wide Web Browsing Patterns. Knowledge and Information Systems 1, 5–32 (1999). https://doi.org/10.1007/BF03325089

Keywords

  • Data mining
  • World Wide Web
  • Association rules
  • Sequential patterns
  • Path analysis