Cut-and-Pick Transactions for Proxy Log Mining
Web logs collected by proxy servers, referred to as proxy logs or proxy traces, contain information about Web document accesses by many users against many Web sites. This "many-to-many" characteristic poses a challenge to Web log mining techniques because individual access transactions are difficult to identify: in a proxy log, user transactions are not clearly bounded and are sometimes interleaved with each other as well as with noise. Most previous work has used simplistic measures, such as a fixed time interval, to determine transaction boundaries, and has not addressed the problem of interleaved and noisy transactions. In this paper, we show that this simplistic view can lead to poor performance in building models to predict future access patterns. We present a more advanced cut-and-pick method for determining the access transactions from proxy logs, by deciding on more reasonable transaction boundaries and by removing noisy accesses. Our method takes advantage of the observation that, in most transactions, the same user typically visits multiple related Web sites that form clusters. These clusters can be discovered by our algorithm based on the connectivity among Web sites. Using real-world proxy logs, we show experimentally that the cut-and-pick method produces more accurate transactions, which in turn yield Web-access prediction models with higher accuracy.
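To make the idea concrete, the following is a minimal sketch of the two steps the abstract names: a "cut" by fixed time interval (the baseline used in previous work) followed by a "pick" that separates interleaved accesses by site cluster and drops unclustered accesses as noise. It is not the algorithm from the paper. The function names, the `(timestamp, site)` tuple layout, the `max_gap` parameter, and the use of connected components over a given list of site-to-site links as the notion of "connectivity" are all assumptions made for illustration.

```python
from collections import defaultdict


def cut_by_time_gap(accesses, max_gap):
    """Baseline cut: split one user's time-ordered (timestamp, site) accesses
    whenever the gap between consecutive accesses exceeds max_gap seconds."""
    sessions, current, last_ts = [], [], None
    for ts, site in accesses:
        if last_ts is not None and ts - last_ts > max_gap:
            sessions.append(current)
            current = []
        current.append((ts, site))
        last_ts = ts
    if current:
        sessions.append(current)
    return sessions


def site_clusters(edges):
    """Assumed clustering: connected components of an undirected site graph,
    computed with union-find. Returns a {site: cluster_id} mapping."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)
    return {site: find(site) for site in list(parent)}


def pick_by_cluster(session, cluster_of):
    """Pick: group a session's accesses by site cluster so that interleaved
    transactions separate; accesses to sites in no cluster are treated as noise."""
    picked = defaultdict(list)
    for ts, site in session:
        if site in cluster_of:
            picked[cluster_of[site]].append((ts, site))
    return list(picked.values())
```

For example, given the hypothetical accesses `[(0, "a.com"), (5, "x.com"), (10, "b.com"), (12, "ad.net"), (15, "y.com")]` and links `a.com`–`b.com` and `x.com`–`y.com`, the time cut alone yields a single session, while the pick step splits it into one transaction per cluster and discards the `ad.net` access.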