Algorithmic Aspects of Information Retrieval on the Web

Broder, Andrei; Henzinger, Monika

doi:10.1007/978-1-4615-0005-6_1

Part of the book series: Massive Computing (MACO, volume 4)

Abstract

The Web explosion offers a bonanza of novel problems. In particular, information retrieval in the Web context requires methods and ideas that have not been addressed in the classic information retrieval literature. This chapter will survey emerging techniques for information retrieval in the Web context and discuss some of the pertinent open problems.

Author information

Authors and Affiliations

Alta Vista Company, San Mateo, California, USA
Andrei Broder
Google Incorporated, Mountain View, California, USA
Monika Henzinger

Editor information

Editors and Affiliations

AT&T Labs Research, USA
James Abello and Mauricio G. C. Resende
University of Florida, USA
Panos M. Pardalos

About this chapter

Cite this chapter

Broder, A., Henzinger, M. (2002). Algorithmic Aspects of Information Retrieval on the Web. In: Abello, J., Pardalos, P.M., Resende, M.G.C. (eds) Handbook of Massive Data Sets. Massive Computing, vol 4. Springer, Boston, MA. https://doi.org/10.1007/978-1-4615-0005-6_1

DOI: https://doi.org/10.1007/978-1-4615-0005-6_1
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4613-4882-5
Online ISBN: 978-1-4615-0005-6
eBook Packages: Springer Book Archive
