Cut-and-Pick Transactions for Proxy Log Mining

  • Wenwu Lou
  • Guimei Liu
  • Hongjun Lu
  • Qiang Yang
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2287)


Web logs collected by proxy servers, referred to as proxy logs or proxy traces, contain information about Web document accesses by many users against many Web sites. This “many-to-many” characteristic poses a challenge to Web log mining techniques because individual access transactions are difficult to identify: in a proxy log, user transactions are not clearly bounded and are sometimes interleaved with each other as well as with noise. Most previous work has relied on simplistic measures, such as a fixed time interval, to determine transaction boundaries, and has not addressed the problem of interleaved and noisy transactions. In this paper, we show that this simplistic view can lead to poor performance in building models to predict future access patterns. We present a more advanced cut-and-pick method for determining the access transactions from proxy logs, which decides on more reasonable transaction boundaries and removes noisy accesses. Our method exploits the observation that, in most transactions, the same user typically visits multiple related Web sites that form clusters. These clusters can be discovered by our algorithm based on the connectivity among Web sites. Using real-world proxy logs, we show experimentally that the cut-and-pick method produces more accurate transactions, which in turn yield Web-access prediction models with higher accuracy.
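To make the contrast in the abstract concrete, the following Python sketch compares a fixed-time-interval baseline with a simple cluster-based "pick" step. It is an illustration under stated assumptions, not the authors' algorithm: the abstract does not specify how connectivity is measured or how clusters are formed, so here site clusters are taken to be connected components of a hypothetical site-link graph, and every function name, the 1800-second gap, and the `min_len` noise threshold are invented for the example.

```python
from collections import defaultdict

def fixed_interval_transactions(accesses, max_gap=1800):
    """Baseline: per user, cut the access stream whenever the gap between
    consecutive accesses exceeds max_gap seconds (assumed threshold)."""
    by_user = defaultdict(list)
    for user, ts, site in sorted(accesses, key=lambda a: (a[0], a[1])):
        by_user[user].append((ts, site))
    transactions = []
    for user, items in by_user.items():
        current = [items[0]]
        for prev, cur in zip(items, items[1:]):
            if cur[0] - prev[0] > max_gap:
                transactions.append((user, [s for _, s in current]))
                current = []
            current.append(cur)
        transactions.append((user, [s for _, s in current]))
    return transactions

def site_clusters(links):
    """Assumed clustering: connected components of a site-link graph,
    computed with union-find. Returns a site -> cluster-id mapping."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in links:
        parent[find(a)] = find(b)
    return {s: find(s) for s in list(parent)}

def pick_by_cluster(transactions, cluster_of, min_len=2):
    """Split each time-cut transaction by site cluster and drop groups
    shorter than min_len, treating them as noise (assumed rule)."""
    picked = []
    for user, sites in transactions:
        groups = defaultdict(list)
        for s in sites:
            groups[cluster_of.get(s, s)].append(s)
        for group in groups.values():
            if len(group) >= min_len:
                picked.append((user, group))
    return picked

# Hypothetical log entries: (user, timestamp in seconds, site).
accesses = [
    ("u1", 0, "news.a"), ("u1", 10, "shop.c"), ("u1", 20, "ads.x"),
    ("u1", 30, "news.b"), ("u1", 40, "shop.d"),
]
links = [("news.a", "news.b"), ("shop.c", "shop.d")]

baseline = fixed_interval_transactions(accesses)
picked = pick_by_cluster(baseline, site_clusters(links))
```

In this toy input the baseline returns one transaction that mixes two interleaved browsing activities and an unrelated access, while the cluster-based step separates the two activities and discards the isolated access.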


Association Rule · Page Reference · User Transaction · Rule Length · User Access Pattern
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2002

Authors and Affiliations

  • Wenwu Lou (1)
  • Guimei Liu (1)
  • Hongjun Lu (1)
  • Qiang Yang (1)
  1. Department of Computer Science, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong
