A General Evaluation Framework for Topical Crawlers

Srinivasan, P.; Menczer, F.; Pant, G.

doi:10.1007/s10791-005-6993-5

Published: January 2005

Information Retrieval, Volume 8, pages 417–447 (2005)

P. Srinivasan¹,
F. Menczer² &
G. Pant³

Abstract

Topical crawlers are becoming important tools to support applications such as specialized Web portals, online searching, and competitive intelligence. As the Web mining field matures, the disparate crawling strategies proposed in the literature will have to be evaluated and compared on common tasks through well-defined performance measures. This paper presents a general framework to evaluate topical crawlers. We identify a class of tasks that model crawling applications of different nature and difficulty. We then introduce a set of performance measures for fair comparative evaluations of crawlers along several dimensions, including generalized notions of precision, recall, and efficiency that are appropriate and practical for the Web. The framework relies on independent relevance judgements compiled by human editors and available from public directories. Two sources of evidence are proposed to assess crawled pages, capturing different relevance criteria. Finally, we introduce a set of topic characterizations to analyze the variability in crawling effectiveness across topics. The proposed evaluation framework synthesizes a number of methodologies in the topical crawlers literature and many lessons learned from several studies conducted by our group. The general framework is described in detail and then illustrated in practice by a case study that evaluates four public crawling algorithms. We found that the proposed framework is effective at evaluating, comparing, differentiating and interpreting the performance of the four crawlers. For example, we found the IS crawler to be most sensitive to the popularity of topics.
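
The abstract names precision and recall as evaluation dimensions but does not define them here. As a minimal sketch only, assuming the simplest set-based reading (a crawl is scored against a fixed set of editor-judged target pages, such as those listed under a directory topic), the two measures could be computed as follows. The function name, the URL lists, and the set-overlap definitions are illustrative assumptions, not the generalized measures developed in the paper.

```python
def crawl_precision_recall(crawled, targets):
    """Set-based precision and recall of a crawl against known target pages.

    Assumption: relevance is binary and defined purely by membership in
    `targets`; the paper's generalized measures may differ from this.
    """
    crawled_set = set(crawled)
    target_set = set(targets)
    hits = crawled_set & target_set  # crawled pages that are known targets

    # Guard against empty inputs so the ratios are always defined.
    precision = len(hits) / len(crawled_set) if crawled_set else 0.0
    recall = len(hits) / len(target_set) if target_set else 0.0
    return precision, recall


# Hypothetical example: four crawled pages, three editor-listed targets.
crawled_pages = ["http://a.example", "http://b.example",
                 "http://c.example", "http://d.example"]
target_pages = ["http://b.example", "http://d.example", "http://e.example"]

precision, recall = crawl_precision_recall(crawled_pages, target_pages)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Under this reading, recall rewards a crawler for finding the pages human editors already judged relevant, while precision penalizes it for fetching pages outside that set.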

Author information

Authors and Affiliations

School of Library & Information Science and Department of Management Sciences, The University of Iowa, Iowa City, IA, 52242, USA
P. Srinivasan
School of Informatics and Department of Computer Science, Indiana University, Bloomington, IN, 47408, USA
F. Menczer
School of Accounting and Information Systems, University of Utah, Salt Lake City, UT, 84112, USA
G. Pant

Corresponding author

Correspondence to P. Srinivasan.

Additional information

Partially supported by National Science Foundation CAREER grant No. IIS-0133124/0348940.

About this article

Cite this article

Srinivasan, P., Menczer, F. & Pant, G. A General Evaluation Framework for Topical Crawlers. Inf Retrieval 8, 417–447 (2005). https://doi.org/10.1007/s10791-005-6993-5

Issue Date: January 2005
DOI: https://doi.org/10.1007/s10791-005-6993-5
