Evaluation of Web Robot Discovery Techniques: A Benchmarking Study

  • Nick Geens
  • Johan Huysmans
  • Jan Vanthienen
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4065)


This paper describes part of a web usage mining study conducted on log files obtained from a Belgian e-commerce company. The log files show that numerous web robots are active on the site, and most of these robots exhibit crawling behavior that differs radically from the browsing behavior of human visitors. Because the owners of the e-shop want information about the paths that human visitors follow through the site, it is crucial to remove these robotic visits from the log files.

Several existing methods for web robot discovery are evaluated and compared, but none of them yields satisfactory results. A new technique is therefore developed that identifies web robots successfully and reliably.
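
To make the log-cleaning task concrete, the sketch below shows one simple baseline heuristic, and it is not the new technique the paper develops (the abstract does not describe it). It assumes that the log is in Apache Combined Log Format, that a client can be identified by its (host, user agent) pair, and that a client is a robot if it requests /robots.txt or if its user agent contains a telltale keyword. The keyword list and all function names are hypothetical and chosen only for illustration.

```python
import re

# Assumed input: Apache Combined Log Format, i.e.
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Hypothetical keyword list for illustration; not taken from the paper.
ROBOT_AGENT_HINTS = ("bot", "crawler", "spider", "slurp")


def parse_line(line):
    """Parse one log line into a dict, or return None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def find_robot_clients(entries):
    """Return the (host, agent) pairs flagged by either heuristic."""
    robots = set()
    for entry in entries:
        agent = entry["agent"].lower()
        asked_for_robots_txt = entry["path"] == "/robots.txt"
        agent_looks_robotic = any(hint in agent for hint in ROBOT_AGENT_HINTS)
        if asked_for_robots_txt or agent_looks_robotic:
            robots.add((entry["host"], entry["agent"]))
    return robots


def remove_robot_requests(lines):
    """Drop every request made by a client flagged as a robot."""
    entries = [e for e in map(parse_line, lines) if e is not None]
    robots = find_robot_clients(entries)
    return [e for e in entries if (e["host"], e["agent"]) not in robots]
```

A heuristic of this kind misses robots that neither request /robots.txt nor announce themselves in the user agent, which is consistent with the abstract's remark that existing methods did not give satisfactory results.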


Keywords: User Agent; Benchmark Study; Subsequent Request; Human Visitor; Perfect Precision





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Nick Geens (1)
  • Johan Huysmans (1)
  • Jan Vanthienen (1)
  1. Department of Decision Sciences and Information Management, Katholieke Universiteit Leuven, Leuven, Belgium