
Topical Crawling for Business Intelligence

  • Conference paper
Research and Advanced Technology for Digital Libraries (ECDL 2003)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2769)

Abstract

The Web provides us with a vast resource for business intelligence. However, the large size of the Web and its dynamic nature make the task of foraging for appropriate information challenging. General-purpose search engines and business portals may be used to gather some basic intelligence. Topical crawlers, driven by richer contexts, can then build on this basic intelligence to facilitate in-depth and up-to-date research. In this paper we investigate the use of topical crawlers in creating a small document collection that helps locate relevant business entities. The problem of locating business entities arises when an organization looks for competitors, partners, or acquisitions. We formalize the problem, create a test bed, introduce metrics to measure the performance of crawlers, and compare the results of four different crawlers. Our results underscore the importance of identifying good hubs and exploiting link contexts based on tag trees for accelerating the crawl and improving the overall results.
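To make the idea of a topical crawler concrete, the following is a minimal sketch of a generic best-first crawl, not a reconstruction of any of the four crawlers compared in the paper. It assumes a toy in-memory "web" instead of real HTTP fetching, and it assumes each link already comes with a short context string (in the paper's setting such contexts would be derived from the HTML tag tree; here they are simply given). All names (`TOY_WEB`, `best_first_crawl`, `tf_vector`, `cosine`) are hypothetical and chosen for illustration only.

```python
import heapq
import math
import re
from collections import Counter


def tf_vector(text):
    """Lowercase the text and count word occurrences (no stemming, no stop list)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def best_first_crawl(web, seeds, topic, max_pages):
    """Visit up to max_pages pages, always expanding the highest-scored unvisited link.

    web maps url -> (page_text, [(link_context_text, target_url), ...]).
    This is an assumed toy representation, not a real fetcher.
    """
    topic_vec = tf_vector(topic)
    # heapq is a min-heap, so scores are negated; the counter breaks ties in insertion order.
    frontier = [(-1.0, i, url) for i, url in enumerate(seeds)]
    heapq.heapify(frontier)
    counter = len(seeds)
    visited = []
    seen = set(seeds)
    while frontier and len(visited) < max_pages:
        _, _, url = heapq.heappop(frontier)
        if url not in web:
            continue  # dangling link in the toy web
        visited.append(url)
        _, links = web[url]
        for context, target in links:
            if target in seen:
                continue
            seen.add(target)
            # Score the link by how similar its context is to the topic description.
            score = cosine(topic_vec, tf_vector(context))
            heapq.heappush(frontier, (-score, counter, target))
            counter += 1
    return visited


# Hypothetical toy web: a portal page, one topical hub, and some off-topic pages.
TOY_WEB = {
    "seed": ("portal", [
        ("biotech companies directory", "hub"),
        ("sports news and scores", "sports"),
    ]),
    "hub": ("list of biotech firms", [
        ("Acme biotech company profile", "acme"),
        ("weather forecast", "weather"),
    ]),
    "sports": ("scores", []),
    "acme": ("Acme Biotech home page", []),
    "weather": ("forecast", []),
}

if __name__ == "__main__":
    print(best_first_crawl(TOY_WEB, ["seed"], "biotech companies", max_pages=3))
```

With a budget of three pages, the sketch reaches the hub page and then a business entity page before any off-topic page, which is the kind of behaviour the abstract attributes to good hubs and informative link contexts. A real crawler would also need fetching, politeness, URL normalisation, and a principled way to extract link context, none of which is modelled here.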

Copyright information

© 2003 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Pant, G., Menczer, F. (2003). Topical Crawling for Business Intelligence. In: Koch, T., Sølvberg, I.T. (eds) Research and Advanced Technology for Digital Libraries. ECDL 2003. Lecture Notes in Computer Science, vol 2769. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-45175-4_22

  • DOI: https://doi.org/10.1007/978-3-540-45175-4_22

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-40726-3

  • Online ISBN: 978-3-540-45175-4

  • eBook Packages: Springer Book Archive
