Algorithmic Aspects of Information Retrieval on the Web

  • Chapter
Handbook of Massive Data Sets

Part of the book series: Massive Computing (MACO, volume 4)

Abstract

The Web explosion offers a bonanza of novel problems. In particular, information retrieval in the Web context requires methods and ideas that have not been addressed in the classic information retrieval literature. This chapter will survey emerging techniques for information retrieval in the Web context and discuss some of the pertinent open problems.

Copyright information

© 2002 Springer Science+Business Media Dordrecht

About this chapter

Cite this chapter

Broder, A., Henzinger, M. (2002). Algorithmic Aspects of Information Retrieval on the Web. In: Abello, J., Pardalos, P.M., Resende, M.G.C. (eds) Handbook of Massive Data Sets. Massive Computing, vol 4. Springer, Boston, MA. https://doi.org/10.1007/978-1-4615-0005-6_1

  • DOI: https://doi.org/10.1007/978-1-4615-0005-6_1

  • Publisher Name: Springer, Boston, MA

  • Print ISBN: 978-1-4613-4882-5

  • Online ISBN: 978-1-4615-0005-6

  • eBook Packages: Springer Book Archive
