Abstract
The Web explosion offers a bonanza of novel problems. In particular, information retrieval in the Web context requires methods and ideas that have not been addressed in the classic information retrieval literature. This chapter will survey emerging techniques for information retrieval in the Web context and discuss some of the pertinent open problems.
© 2002 Springer Science+Business Media Dordrecht
About this chapter
Cite this chapter
Broder, A., Henzinger, M. (2002). Algorithmic Aspects of Information Retrieval on the Web. In: Abello, J., Pardalos, P.M., Resende, M.G.C. (eds) Handbook of Massive Data Sets. Massive Computing, vol 4. Springer, Boston, MA. https://doi.org/10.1007/978-1-4615-0005-6_1
DOI: https://doi.org/10.1007/978-1-4615-0005-6_1
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4613-4882-5
Online ISBN: 978-1-4615-0005-6
eBook Packages: Springer Book Archive