Abstract
Topical crawlers are becoming important tools to support applications such as specialized Web portals, online searching, and competitive intelligence. As the Web mining field matures, the disparate crawling strategies proposed in the literature will have to be evaluated and compared on common tasks through well-defined performance measures. This paper presents a general framework to evaluate topical crawlers. We identify a class of tasks that model crawling applications of differing nature and difficulty. We then introduce a set of performance measures for fair comparative evaluations of crawlers along several dimensions, including generalized notions of precision, recall, and efficiency that are appropriate and practical for the Web. The framework relies on independent relevance judgements compiled by human editors and available from public directories. Two sources of evidence are proposed to assess crawled pages, capturing different relevance criteria. Finally, we introduce a set of topic characterizations to analyze the variability in crawling effectiveness across topics. The proposed evaluation framework synthesizes a number of methodologies in the topical crawlers literature and many lessons learned from several studies conducted by our group. The general framework is described in detail and then illustrated in practice by a case study that evaluates four public crawling algorithms. We found that the proposed framework is effective at evaluating, comparing, differentiating, and interpreting the performance of the four crawlers. For example, we found the IS crawler to be most sensitive to the popularity of topics.
Additional information
Partially supported by National Science Foundation CAREER grant No. IIS-0133124/0348940.
Cite this article
Srinivasan, P., Menczer, F. & Pant, G. A General Evaluation Framework for Topical Crawlers. Inf Retrieval 8, 417–447 (2005). https://doi.org/10.1007/s10791-005-6993-5
DOI: https://doi.org/10.1007/s10791-005-6993-5