Journal of Intelligent Information Systems, Volume 28, Issue 1, pp 79–104

A process of knowledge discovery from web log data: Systematization and critical review

  • Zidrina Pabarskaite
  • Aistis Raudys

Abstract

This paper presents a comprehensive survey of web log/usage mining based on over 100 research papers. This is the first survey dedicated exclusively to web log/usage mining. The paper identifies several web log mining sub-topics, including specific ones such as data cleaning and user and session identification. Each sub-topic is explained, its strengths and weaknesses are discussed, and possible solutions are presented. The paper describes examples of web log mining and lists some major web log mining software packages.

Keywords

Web log mining, Web usage, Personalisation, Survey, Data pre-processing

Copyright information

© Springer Science+Business Media, LLC 2006

Authors and Affiliations

  1. Institute of Mathematics and Informatics, Vilnius, Lithuania
