Focused Crawling Using Latent Semantic Indexing – An Application for Vertical Search Engines

  • George Almpanidis
  • Constantine Kotropoulos
  • Ioannis Pitas
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3652)


Vertical search engines and web portals are gaining ground over general-purpose engines because of their limited size and their high precision within the domain they cover. The number of vertical portals has increased rapidly in recent years, which makes the importance of a topic-driven (focused) crawler evident. In this paper, we develop a latent semantic indexing classifier that combines link analysis with text content in order to retrieve and index domain-specific web documents. We compare its efficiency with that of other well-known web information retrieval techniques. Our implementation presents a different approach to focused crawling and aims to overcome the size limitations of the initial training data while maintaining a high recall/precision ratio.
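
As a rough illustration of the latent semantic indexing idea named in the abstract, the Python sketch below scores candidate pages against a small topic-specific training set in a reduced latent space. It is not the authors' classifier: it uses text content only and omits the link-analysis component, and the toy corpus, the function names, and the choice of k = 2 are all assumptions made for the example.

```python
import numpy as np

def build_term_doc_matrix(docs, vocab):
    """Rows are terms, columns are documents; entries are raw term counts."""
    index = {term: i for i, term in enumerate(vocab)}
    matrix = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(docs):
        for token in doc.lower().split():
            if token in index:  # out-of-vocabulary tokens are ignored
                matrix[index[token], j] += 1.0
    return matrix

def fit_lsi(matrix, k):
    """Truncated SVD: keep only the k largest singular triplets."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k], s[:k], vt[:k, :]

def fold_in(term_vector, u_k, s_k):
    """Project a new term-count vector into the k-dimensional latent space."""
    return (term_vector @ u_k) / s_k

def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

# Hypothetical on-topic training pages (assumed data, not from the paper).
training_docs = [
    "crawler fetches web pages links",
    "search engine index web pages",
    "crawler follows links index",
]
vocab = sorted({token for doc in training_docs for token in doc.split()})

term_doc = build_term_doc_matrix(training_docs, vocab)
u_k, s_k, vt_k = fit_lsi(term_doc, k=2)

# Each column of vt_k holds one training document's latent coordinates;
# their mean serves here as a simple topic centroid.
topic_centroid = vt_k.mean(axis=1)

# Hypothetical candidate pages waiting in the crawl frontier.
candidates = {
    "on_topic": "web crawler index links",
    "off_topic": "pasta recipe tomato basil",
}
scores = {}
for name, text in candidates.items():
    vector = build_term_doc_matrix([text], vocab)[:, 0]
    scores[name] = cosine(fold_in(vector, u_k, s_k), topic_centroid)

# Higher scores would be crawled first in a focused crawler.
for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(name, round(score, 3))
```

In this sketch a page that shares no vocabulary with the training set projects to the zero vector and scores 0.0, so it falls to the bottom of the frontier; a real system would need a much larger vocabulary and term weighting to behave sensibly.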



Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • George Almpanidis (1)
  • Constantine Kotropoulos (1)
  • Ioannis Pitas (1)
  1. Department of Informatics, Aristotle University of Thessaloniki, Thessaloniki, Greece
