On Leveraging User Access Patterns for Topic Specific Crawling

Aggarwal, Charu C.

doi:10.1023/B:DAMI.0000031633.76754.d3

Published: September 2004

Volume 9, pages 123–145, (2004)

Data Mining and Knowledge Discovery

Charu C. Aggarwal¹

Abstract

In recent years, there has been considerable research on constructing crawlers that find resources satisfying specific conditions called predicates. Such a predicate could be a keyword query, a topical query, or some arbitrary constraint on the internal structure of the web page. Several techniques, such as focussed crawling and intelligent crawling, have recently been proposed for topic specific resource discovery. All these crawlers are linkage based, since they rely on hyperlink behavior to perform resource discovery. Recent studies have shown that the topical correlations in hyperlinks are quite noisy and may not always show the consistency necessary for a reliable resource discovery process. In this paper, we approach the problem of resource discovery from an entirely different perspective: we mine the significant browsing patterns of world wide web users in order to model the likelihood that a web page belongs to a specified predicate. This user behavior can be mined from the freely available traces of large public domain proxies on the world wide web. For example, proxy caches such as Squid are hierarchical proxies which make their logs publicly available. As we shall see in this paper, such traces are a rich source of information which can be mined in order to find the users that are most relevant to the topic of a given crawl. We refer to this technique as collaborative crawling because it mines the collective user experiences in order to find topical resources. Such a strategy turns out to be extremely effective because the topical consistency in world wide web browsing patterns is very high compared to the noisy linkage information. In addition, the user-centered crawling system can be combined with linkage based systems to create an overall system which works more effectively than a system based purely on either user behavior or hyperlinks.
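The idea in the abstract can be illustrated with a small sketch. It is not the algorithm from the paper, whose details are not given here; it is one simple way to turn "users relevant to the topic" into a crawl priority. It assumes the proxy trace has already been parsed into (user, URL) pairs, and the names `score_candidates`, `is_relevant` and `crawled`, as well as the Laplace smoothing and the averaging rule, are hypothetical choices made for illustration only.

```python
from collections import defaultdict


def score_candidates(log, is_relevant, crawled):
    """Rank uncrawled URLs by the topical relevance of the users who visited them.

    log         -- iterable of (user_id, url) pairs taken from a proxy trace
    is_relevant -- predicate on a URL that has already been crawled (assumed given)
    crawled     -- set of URLs whose predicate value is already known
    """
    log = list(log)  # the log is scanned twice below

    # Pass 1: count, per user, how many known pages they visited and how
    # many of those satisfied the predicate.
    hits = defaultdict(int)
    total = defaultdict(int)
    for user, url in log:
        if url in crawled:
            total[user] += 1
            if is_relevant(url):
                hits[user] += 1

    # Illustrative choice: Laplace-smoothed fraction of relevant accesses.
    user_relevance = {u: (hits[u] + 1) / (total[u] + 2) for u in total}

    # Pass 2: collect the users who visited each not-yet-crawled URL.
    visitors = defaultdict(set)
    for user, url in log:
        if url not in crawled:
            visitors[url].add(user)

    # Illustrative choice: score a candidate by the mean relevance of its
    # visitors; users with no known history are ignored.
    scores = {}
    for url, users in visitors.items():
        known = [user_relevance[u] for u in users if u in user_relevance]
        if known:
            scores[url] = sum(known) / len(known)
    return scores


# Toy trace: pages a, b, c are already crawled; a and b satisfy the predicate.
toy_log = [("u1", "a"), ("u1", "b"), ("u1", "x"),
           ("u2", "c"), ("u2", "y"),
           ("u3", "a"), ("u3", "y")]
toy_scores = score_candidates(toy_log, lambda url: url in {"a", "b"}, {"a", "b", "c"})
```

In this toy trace, page `x` was visited only by a user whose known history is entirely on topic, so it ranks above `y`, which was also visited by an off-topic user. A crawler built along these lines would fetch the highest scoring candidates first and update the user statistics as new pages are tested against the predicate.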

Author information

Authors and Affiliations

IBM T. J. Watson Research Center, Yorktown Heights, NY, 10598, USA
Charu C. Aggarwal

About this article

Cite this article

Aggarwal, C.C. On Leveraging User Access Patterns for Topic Specific Crawling. Data Mining and Knowledge Discovery 9, 123–145 (2004). https://doi.org/10.1023/B:DAMI.0000031633.76754.d3

Issue Date: September 2004
DOI: https://doi.org/10.1023/B:DAMI.0000031633.76754.d3
