Information Retrieval, Volume 8, Issue 3, pp 417–447

A General Evaluation Framework for Topical Crawlers


Abstract

Topical crawlers are becoming important tools to support applications such as specialized Web portals, online searching, and competitive intelligence. As the Web mining field matures, the disparate crawling strategies proposed in the literature will have to be evaluated and compared on common tasks through well-defined performance measures. This paper presents a general framework to evaluate topical crawlers. We identify a class of tasks that model crawling applications of different nature and difficulty. We then introduce a set of performance measures for fair comparative evaluations of crawlers along several dimensions including generalized notions of precision, recall, and efficiency that are appropriate and practical for the Web. The framework relies on independent relevance judgements compiled by human editors and available from public directories. Two sources of evidence are proposed to assess crawled pages, capturing different relevance criteria. Finally we introduce a set of topic characterizations to analyze the variability in crawling effectiveness across topics. The proposed evaluation framework synthesizes a number of methodologies in the topical crawlers literature and many lessons learned from several studies conducted by our group. The general framework is described in detail and then illustrated in practice by a case study that evaluates four public crawling algorithms. We found that the proposed framework is effective at evaluating, comparing, differentiating and interpreting the performance of the four crawlers. For example, we found the IS crawler to be most sensitive to the popularity of topics.
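As a rough illustration of how precision and recall might be computed for a crawl, the sketch below compares the set of pages a crawler has fetched against a set of pages judged relevant, for instance by directory editors. The helper name `crawl_precision_recall`, the example URLs, and the plain set-based definitions are assumptions made for illustration only; the generalized measures proposed in the paper are not spelled out in this abstract and may differ.

```python
def crawl_precision_recall(crawled, targets):
    """Set-based precision and recall of a crawl (hypothetical helper).

    crawled: iterable of URLs fetched by the crawler.
    targets: iterable of URLs judged relevant to the topic.
    """
    crawled_set = set(crawled)
    target_set = set(targets)
    hits = crawled_set & target_set  # relevant pages the crawler found

    # Guard against empty sets so the ratios are always defined.
    precision = len(hits) / len(crawled_set) if crawled_set else 0.0
    recall = len(hits) / len(target_set) if target_set else 0.0
    return precision, recall


# Made-up example data, not taken from the paper.
crawled = [
    "http://example.org/a",
    "http://example.org/b",
    "http://example.org/c",
    "http://example.org/d",
]
targets = [
    "http://example.org/a",
    "http://example.org/c",
    "http://example.org/z",
]

precision, recall = crawl_precision_recall(crawled, targets)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the four crawled pages are in the target set, and two of the three targets were found. In practice a crawl would be scored at several points during its run, so that crawlers can be compared as a function of pages fetched rather than only at the end.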

Keywords

Web crawlers, evaluation, tasks, topics, precision, recall, efficiency



Copyright information

© Springer Science + Business Media, Inc. 2005

Authors and Affiliations

  1. School of Library & Information Science and Department of Management Sciences, The University of Iowa, Iowa City, USA
  2. School of Informatics and Department of Computer Science, Indiana University, Bloomington, USA
  3. School of Accounting and Information Systems, University of Utah, Salt Lake City, USA
