World Wide Web, Volume 12, Issue 3, pp 285–319

A Genre-Aware Approach to Focused Crawling

  • Guilherme T. de Assis
  • Alberto H. F. Laender
  • Marcos André Gonçalves
  • Altigran S. da Silva


The main goal of a focused crawler is to crawl Web pages that are relevant to a specific topic or user interest, which makes such crawlers important for a wide variety of applications. In general, they work by trying to find and crawl all kinds of pages deemed related to an implicitly declared topic. However, users are often not interested in just any document about a topic; they may want only documents of a given type or genre on that topic to be retrieved. In this article, we describe an approach to focused crawling that exploits not only content-related information but also genre information present in Web pages to guide the crawling process. The approach is designed for situations in which the topic of interest can be expressed by two sets of terms, the first describing genre aspects of the desired pages and the second related to their subject or content, and it therefore requires no training or preprocessing of any kind. The effectiveness, efficiency and scalability of the proposed approach are demonstrated by a set of experiments involving the crawling of pages related to syllabi of computer science courses, job offers in the computer science field and sale offers of computer equipment. These experiments show that focused crawlers constructed according to our genre-aware approach achieve F1 levels above 88%, while requiring the analysis of no more than 65% of the visited pages in order to find 90% of the relevant pages. In addition, we experimentally analyze the impact of term selection on our approach and evaluate a proposed strategy for the semi-automatic generation of such terms. This analysis shows that a small set of terms selected by an expert, or a set of terms specified by a typical user familiar with the topic, is usually enough to produce good results, and that the semi-automatic strategy is very effective in supporting the selection of the sets of terms required to guide a crawling process.
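
To make the two-term-set idea concrete, the following is a minimal sketch of how a page might be scored against a set of genre terms and a set of content terms. The abstract does not state how the two kinds of evidence are measured or combined, so everything here is an assumption for illustration: the use of cosine similarity over raw term counts, the weighted mean, and the names `term_vector`, `cosine`, `page_score` and `genre_weight` are all hypothetical and are not taken from the article.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Lowercased bag-of-words counts for a piece of text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def page_score(page_text, genre_terms, content_terms, genre_weight=0.5):
    """Assumed combination: weighted mean of genre and content similarity."""
    page = term_vector(page_text)
    genre_sim = cosine(page, term_vector(" ".join(genre_terms)))
    content_sim = cosine(page, term_vector(" ".join(content_terms)))
    return genre_weight * genre_sim + (1.0 - genre_weight) * content_sim


# Illustrative term sets for "syllabi of computer science courses".
genre_terms = ["syllabus", "prerequisites", "grading", "textbook"]
content_terms = ["databases", "algorithms", "compilers"]

syllabus_page = ("Syllabus: databases course. Prerequisites: algorithms. "
                 "Grading policy and textbook list.")
news_page = "Local team wins the championship after a dramatic final."

syllabus_score = page_score(syllabus_page, genre_terms, content_terms)
news_score = page_score(news_page, genre_terms, content_terms)
```

In this sketch a crawler would treat a page as relevant when its score exceeds some chosen threshold and would prioritise links found on high-scoring pages; both the threshold and the link-ordering policy are left open here because the abstract does not specify them.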


web crawling, focused crawling, document genre exploitation





Copyright information

© Springer Science+Business Media, LLC 2009

Authors and Affiliations

  • Guilherme T. de Assis (1)
  • Alberto H. F. Laender (1)
  • Marcos André Gonçalves (1)
  • Altigran S. da Silva (2)

  1. Computer Science Department, Federal University of Minas Gerais, Belo Horizonte, Brazil
  2. Computer Science Department, Federal University of Amazonas, Manaus, Brazil
