Abstract
Web usage mining must contend with the fact that parts of the underlying logfiles are created by robots. While cooperative robots identify themselves and obey server owners' instructions not to access some or all of the pages on a server, malignant robots may camouflage themselves and have to be detected by web robot scanning devices. We describe the methodology of robot detection and show that highly accurate tools can be applied to decide whether session data was generated by a robot or a human user.
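As a rough illustration of the cooperative case only, the following Python sketch flags log entries from clients that identify themselves as robots or that request /robots.txt. It is not the detection method of the paper. The Apache combined log format, the grouping by host and user agent, the keyword list ROBOT_AGENT_HINTS, and all function names are assumptions made for this example. Camouflaged robots would pass through such a filter unnoticed.

```python
import re
from collections import defaultdict

# Assumed input: Apache combined log format, i.e.
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Hypothetical keyword list; a real deployment would need a maintained one.
ROBOT_AGENT_HINTS = ("bot", "crawler", "spider")


def parse_line(line):
    """Return the named fields of one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def flag_cooperative_robots(lines):
    """Map each (host, user agent) pair to True if it looks like a cooperative robot.

    A pair is flagged when the user agent contains a robot keyword or when
    the client requested /robots.txt at least once. Both are heuristics.
    """
    requested_robots_txt = defaultdict(bool)
    for line in lines:
        record = parse_line(line)
        if record is None:
            continue  # skip malformed lines
        key = (record["host"], record["agent"])
        if record["path"] == "/robots.txt":
            requested_robots_txt[key] = True
        else:
            requested_robots_txt[key] |= False  # make sure the key exists

    flagged = {}
    for (host, agent), saw_robots_txt in requested_robots_txt.items():
        self_identified = any(hint in agent.lower() for hint in ROBOT_AGENT_HINTS)
        flagged[(host, agent)] = self_identified or saw_robots_txt
    return flagged
```

Grouping by host and user agent is a crude stand-in for proper session reconstruction, and the keyword test only catches robots that choose to announce themselves.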
Copyright information
© 2005 Springer-Verlag Berlin · Heidelberg
About this paper
Cite this paper
Bomhardt, C., Gaul, W., Schmidt-Thieme, L. (2005). Web Robot Detection - Preprocessing Web Logfiles for Robot Detection. In: Bock, HH., et al. New Developments in Classification and Data Analysis. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-27373-5_14
DOI: https://doi.org/10.1007/3-540-27373-5_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23809-6
Online ISBN: 978-3-540-27373-8
eBook Packages: Mathematics and Statistics (R0)