Web Robot Detection - Preprocessing Web Logfiles for Robot Detection

  • Christian Bomhardt
  • Wolfgang Gaul
  • Lars Schmidt-Thieme
Part of the Studies in Classification, Data Analysis, and Knowledge Organization book series (STUDIES CLASS)


Web usage mining has to face the problem that parts of the underlying logfiles are created by robots. While cooperative robots identify themselves and obey the instructions of server owners not to access parts or all of the pages on the server, malignant robots may camouflage themselves and have to be detected by web robot scanning devices. We describe the methodology of robot detection and show that highly accurate tools can be applied to decide whether session data was generated by a robot or a human user.
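
To make the distinction between cooperative and camouflaged robots concrete, the following is a minimal sketch, not the authors' tool. It assumes logfiles in the Apache combined log format, treats all requests sharing a host and user agent as one session (a simplification), and flags a session as a cooperative robot if its user agent matches an agent list or if it requested /robots.txt. The names KNOWN_ROBOT_AGENTS, parse_line, group_sessions and looks_like_cooperative_robot are hypothetical, and the agent substrings are illustrative only.

```python
import re
from collections import defaultdict

# Apache combined log format:
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Hypothetical agent list (assumption): substrings of self-identifying robots.
KNOWN_ROBOT_AGENTS = ("googlebot", "slurp", "crawler", "spider")


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not parse."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def group_sessions(lines):
    """Group parsed requests by (host, user agent); a crude stand-in for sessionizing."""
    sessions = defaultdict(list)
    for line in lines:
        record = parse_line(line)
        if record is not None:
            sessions[(record["host"], record["agent"])].append(record)
    return dict(sessions)


def looks_like_cooperative_robot(requests):
    """Flag a session whose agent is on the list or that fetched /robots.txt."""
    agent = requests[0]["agent"].lower()
    if any(name in agent for name in KNOWN_ROBOT_AGENTS):
        return True
    return any(r["path"] == "/robots.txt" for r in requests)
```

A check like this only catches robots that announce themselves. Camouflaged robots send browser-like user agents and skip /robots.txt, which is why the abstract argues for detection based on session data.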


Keywords: User Agent, User Session, Agent List, Robot Agent, Cooperative Robot
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin · Heidelberg 2005

Authors and Affiliations

  • Christian Bomhardt (1)
  • Wolfgang Gaul (1)
  • Lars Schmidt-Thieme (2)
  1. Institut für Entscheidungstheorie und Unternehmensforschung, University of Karlsruhe, Germany
  2. Institute for Computer Science, University of Freiburg, Germany
