Topical Crawling for Business Intelligence

Pant, Gautam; Menczer, Filippo

doi:10.1007/978-3-540-45175-4_22

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2769)

Included in the following conference series:

International Conference on Theory and Practice of Digital Libraries

Abstract

The Web is a vast resource for business intelligence, but its size and dynamic nature make foraging for relevant information challenging. General-purpose search engines and business portals can be used to gather some basic intelligence. Topical crawlers, driven by richer contexts, can then build on that basic intelligence to support in-depth, up-to-date research. In this paper we investigate the use of topical crawlers to create a small document collection that helps locate relevant business entities, a problem an organization encounters when it looks for competitors, partners, or acquisitions. We formalize the problem, create a test bed, introduce metrics to measure the performance of crawlers, and compare the results of four different crawlers. Our results underscore the importance of identifying good hubs and exploiting link contexts based on tag trees for accelerating the crawl and improving the overall results.
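
As an illustration of the general idea of a topical crawler that exploits link contexts, the following is a minimal sketch and not the method evaluated in the paper. It assumes a best-first frontier in which each discovered URL is scored by the cosine similarity between a topic description and the text surrounding the link. The names `Frontier`, `tokenize`, and `cosine` are hypothetical, and how the link context is extracted (for example, from an HTML tag tree) is left out.

```python
import heapq
import math
from collections import Counter


def tokenize(text: str) -> Counter:
    """Turn text into a bag-of-words term-frequency vector (whitespace split)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class Frontier:
    """Best-first frontier: the highest-scoring unseen URL is popped first.

    Assumption: the score of a URL is the similarity between the topic
    and the link context supplied by the caller.
    """

    def __init__(self, topic: str):
        self.topic = tokenize(topic)
        self.heap = []     # entries are (-score, url); heapq is a min-heap
        self.seen = set()  # URLs already queued, to avoid duplicates

    def add(self, url: str, link_context: str) -> None:
        if url in self.seen:
            return
        self.seen.add(url)
        score = cosine(self.topic, tokenize(link_context))
        heapq.heappush(self.heap, (-score, url))

    def pop(self) -> str:
        return heapq.heappop(self.heap)[1]


if __name__ == "__main__":
    frontier = Frontier("biotechnology drug discovery")
    frontier.add("http://a.example/", "weather forecast today")
    frontier.add("http://b.example/", "drug discovery partners in biotechnology")
    print(frontier.pop())  # the link whose context best matches the topic
```

A crawler built on this sketch would repeatedly pop a URL, fetch the page, and add its outgoing links together with their contexts; the choice of similarity measure and context window is a design decision the sketch does not settle.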

Author information

Authors and Affiliations

Department of Management Sciences, The University of Iowa, Iowa City, IA, 52242, USA
Gautam Pant & Filippo Menczer

Editor information

Editors and Affiliations

NetLab, Knowledge Technologies Group, Lund University Libraries, P.O. Box 134, 22100, Lund, Sweden
Traugott Koch
Dept. of Computer and Information Science, Norwegian University of Science and Technology,
Ingeborg Torvik Sølvberg

About this paper

Cite this paper

Pant, G., Menczer, F. (2003). Topical Crawling for Business Intelligence. In: Koch, T., Sølvberg, I.T. (eds) Research and Advanced Technology for Digital Libraries. ECDL 2003. Lecture Notes in Computer Science, vol 2769. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-45175-4_22

DOI: https://doi.org/10.1007/978-3-540-45175-4_22
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-40726-3
Online ISBN: 978-3-540-45175-4
eBook Packages: Springer Book Archive