Exploiting Genre in Focused Crawling

  • Guilherme T. de Assis
  • Alberto H. F. Laender
  • Marcos André Gonçalves
  • Altigran S. da Silva
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4726)

Abstract

In this paper, we propose a novel approach to focused crawling that exploits genre and content-related information present in Web pages to guide the crawling process. We demonstrate the effectiveness, efficiency and scalability of this approach through a set of experiments that crawl pages related to syllabi (genre) of computer science courses (content). The results show that focused crawlers built according to our approach achieve F1 levels above 92% (an average gain of 178% over traditional focused crawlers) and need to analyze no more than 60% of the visited pages to find 90% of the relevant pages (an average gain of 82% over traditional focused crawlers).
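
The two figures quoted in the abstract, F1 and the share of visited pages needed to reach a given recall, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names, the URL-set representation, and the reading of "visited pages" as an ordered list of crawled URLs are all assumptions made for illustration.

```python
import math
from typing import Optional, Sequence, Set, Tuple


def precision_recall_f1(retrieved: Set[str], relevant: Set[str]) -> Tuple[float, float, float]:
    """Standard precision, recall and F1 over sets of page URLs."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    return precision, recall, 2 * precision * recall / (precision + recall)


def fraction_visited_for_recall(
    visit_order: Sequence[str], relevant: Set[str], target: float = 0.9
) -> Optional[float]:
    """Share of the visited pages analysed before `target` recall is reached.

    Hypothetical helper: `visit_order` is assumed to be the sequence of URLs
    in the order the crawler analysed them. Returns None if the target recall
    is never reached.
    """
    if not relevant or not visit_order:
        return None
    needed = math.ceil(target * len(relevant))
    found = 0
    for position, url in enumerate(visit_order, start=1):
        if url in relevant:
            found += 1
            if found >= needed:
                return position / len(visit_order)
    return None
```

Under this reading, a crawler that orders its frontier well finds most relevant pages early, so the second function returns a smaller fraction for it than for a crawler that visits the same pages in a less informed order.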

Keywords

Web crawling, Focused crawling, SVM classifiers

Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Guilherme T. de Assis (1)
  • Alberto H. F. Laender (1)
  • Marcos André Gonçalves (1)
  • Altigran S. da Silva (2)

  1. Computer Science Department, Federal University of Minas Gerais, 31270-901 Belo Horizonte, MG, Brazil
  2. Computer Science Department, Federal University of Amazonas, 69077-000 Manaus, AM, Brazil
