Abstract

In this chapter we discuss the challenges in the design and deployment of search engines, systems that respond to keyword-based queries by returning results that include pointers to Web pages. The defining requirement of a search engine is scalability: it must manage billions of indexed pages dispersed across the Web. We provide an architectural view of the elements of a search engine, focusing on its two core components, the crawling and indexing subsystems.

Notes

  1. www.worldwidewebsize.com.

  2. http://web.archive.org/web/19990209043945/google.stanford.edu/googlehardware.html.

  3. http://www.ysearchblog.com/archives/000172.html, accessible through the Web archive at http://web.archive.org/.

  4. http://googleblog.blogspot.it/2009/01/powering-google-search.html.

  5. The link information extracted from an HTML page can also be passed on to the indexer for use in ranking functions based on link analysis (see Chap. 7).

  6. The total runtime complexity of iterative duplicate detection is $O(n^2)$ for a given candidate set $C_T$ that contains candidates of type $T$ [265].

  7. PageRank is a page relevance measure that will be described in Chap. 7.

  8. Large websites might be penalized because of the politeness constraint, which might force their resources to be fetched last.

  9. http://www.robotstxt.org/meta.html.

  10. http://www.sitemaps.org/.

  11. Dynamic indexing is also referred to as online indexing, to emphasize the availability of the system upon document arrival and indexing.

  12. http://www.comscore.com/Insights/Press_Releases/2012/11/comScore_Releases_October_2012_U.S._Search_Engine_Rankings, visited 6th December 2012.

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Ceri, S., Bozzon, A., Brambilla, M., Della Valle, E., Fraternali, P., Quarteroni, S. (2013). Search Engines. In: Web Information Retrieval. Data-Centric Systems and Applications. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-39314-3_6

  • DOI: https://doi.org/10.1007/978-3-642-39314-3_6

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-39313-6

  • Online ISBN: 978-3-642-39314-3

  • eBook Packages: Computer Science (R0)
