Data Mining and Knowledge Discovery, Volume 22, Issue 1–2, pp 183–210

Web robot detection techniques: overview and limitations

  • Derek Doran
  • Swapna S. Gokhale

Abstract

Most modern Web robots that crawl the Internet to support value-added services and technologies possess sophisticated data collection and analysis capabilities. Some of these robots, however, may be ill-behaved or malicious, and hence, may impose a significant strain on a Web server. It is thus necessary to detect Web robots in order to block undesirable ones from accessing the server. Such detection is also essential to ensure that the robot traffic is considered appropriately in the performance and capacity planning of Web servers. Despite a variety of Web robot detection techniques, there is no consensus regarding a single technique, or even a specific “type” of technique, that performs well in practice. Therefore, to aid in the development of a practically applicable robot detection technique, this survey presents a critical analysis and comparison of the prevalent detection approaches. We propose a framework to classify the existing detection techniques into four categories based on their underlying detection philosophy. We compare the different classes to gain insights into those characteristics that make up an effective robot detection scheme. Finally, we discuss why the contemporary techniques fail to offer a general solution to the robot detection problem and propose a set of key ingredients necessary for strong Web robot detection.

Keywords

Web Crawler; Web Robot; WWW; Web Robot Detection; Web User Classification

Copyright information

© The Author(s) 2010

Authors and Affiliations

  1. Department of Computer Science and Engineering, University of Connecticut, Storrs, USA
