Search Engines

Ceri, Stefano; Bozzon, Alessandro; Brambilla, Marco; Della Valle, Emanuele; Fraternali, Piero; Quarteroni, Silvia

doi:10.1007/978-3-642-39314-3_6

Part of the book series: Data-Centric Systems and Applications (DCSA)

Abstract

In this chapter we discuss the challenges in the design and deployment of search engines, systems that respond to keyword-based queries by extracting results which include pointers to Web pages. The main aspect of search engines is their ability to scale and manage billions of indexed pages dispersed on the Web. We provide an architectural view of the main elements of search engines, and we focus on their main components, namely the crawling and indexing subsystems.
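As a rough illustration of how a keyword query can be answered with pointers to Web pages, the sketch below builds a toy in-memory inverted index. It is only a sketch under simplifying assumptions: the function names `build_index` and `search`, the whitespace tokenization, the conjunctive (AND) query semantics, and the example URLs are all hypothetical and are not taken from the chapter. A real engine would distribute the index and rank the results.

```python
from collections import defaultdict


def build_index(pages):
    """Map each lowercase term to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in text.lower().split():
            index[term].add(url)
    return index


def search(index, query):
    """Return the sorted URLs that contain every keyword in the query."""
    terms = query.lower().split()
    if not terms:
        return []
    # Start from the first term's postings and intersect with the rest.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return sorted(result)


# Hypothetical pages standing in for crawled Web content.
pages = {
    "http://example.org/a": "web search engines crawl pages",
    "http://example.org/b": "indexing makes search fast",
}
index = build_index(pages)
print(search(index, "search"))
print(search(index, "search pages"))
```

The index is built once and then consulted per query, which loosely mirrors the separation between the indexing subsystem and query answering that the abstract describes.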

Notes

1. www.worldwidewebsize.com.
2. http://web.archive.org/web/19990209043945/google.stanford.edu/googlehardware.html.
3. http://www.ysearchblog.com/archives/000172.html, accessible through the Web archive at http://web.archive.org/.
4. http://googleblog.blogspot.it/2009/01/powering-google-search.html.
5. The link information extracted from an HTML page can also be passed on to the indexer for use in ranking functions based on link analysis (see Chap. 7).
6. The total runtime complexity of iterative duplicate detection is $O(n^2)$ for a given candidate set $C^T$ that contains candidates of type T [265].
7. PageRank is a page relevance measure that will be described in Chap. 7.
8. Large websites might be penalized because of the politeness constraint, which might force their resources to be fetched last.
9. http://www.robotstxt.org/meta.html.
10. http://www.sitemaps.org/.
11. Dynamic indexing is also referred to as online indexing, to emphasize the availability of the system upon document arrival and indexing.
12. http://www.comscore.com/Insights/Press_Releases/2012/11/comScore_Releases_October_2012_U.S._Search_Engine_Rankings, visited 6th December 2012.
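The effect of the politeness constraint mentioned in note 8 can be sketched with a small simulated crawl schedule. This is a hedged illustration, not the chapter's algorithm: the function name `polite_schedule`, the fixed per-host `delay`, the zero-cost fetches on a simulated clock, and the example hosts are all assumptions made for the example.

```python
import heapq
from collections import defaultdict, deque
from urllib.parse import urlparse


def polite_schedule(urls, delay=1.0):
    """Return (time, url) pairs such that the same host is never
    contacted more often than once every `delay` simulated seconds."""
    # One FIFO queue of pending URLs per host.
    queues = defaultdict(deque)
    for url in urls:
        queues[urlparse(url).netloc].append(url)
    # Heap of (earliest allowed fetch time, host); every host is ready at t=0.
    ready = [(0.0, host) for host in queues]
    heapq.heapify(ready)
    clock = 0.0
    schedule = []
    while ready:
        allowed_at, host = heapq.heappop(ready)
        clock = max(clock, allowed_at)  # wait (in simulation) until allowed
        schedule.append((clock, queues[host].popleft()))
        if queues[host]:
            # The host still has pages: it may be contacted again after `delay`.
            heapq.heappush(ready, (clock + delay, host))
    return schedule


# Hypothetical URLs: one large site and one small site.
urls = [
    "http://big.example/1",
    "http://big.example/2",
    "http://big.example/3",
    "http://small.example/1",
]
for when, url in polite_schedule(urls):
    print(when, url)
```

In this simulation the small site is finished at time 0, while the remaining pages of the larger site are pushed to later time slots, which is the penalty the note refers to.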

References

A. Arasu, J. Cho, H. Garcia-Molina, A. Paepcke, S. Raghavan, Searching the web. ACM Trans. Internet Technol. 1(1), 2–43 (2001)
C. Badue, B. Ribeiro-Neto, R. Baeza-Yates, N. Ziviani, Distributed query processing using partitioned inverted files, in Proceedings of the Eighth International Symposium on String Processing and Information Retrieval, SPIRE 2001, Nov (2001), pp. 10–20
R.A. Baeza-Yates, B. Ribeiro-Neto, Modern Information Retrieval (Addison-Wesley, Boston, 1999)
R. Baeza-Yates, F. Saint-Jean, A three level search engine index based in query log distribution, in String Processing and Information Retrieval, ed. by M. Nascimento, E. Moura, A. Oliveira. Lecture Notes in Computer Science, vol. 2857 (Springer, Berlin, 2003), pp. 56–65
R. Baeza-Yates, C. Castillo, M. Marin, A. Rodriguez, Crawling a country: better strategies than breadth-first for web page ordering, in Special Interest Tracks and Posters of the 14th International Conference on World Wide Web. WWW’05 (ACM, New York, 2005), pp. 864–872
R.A. Baeza-Yates, C. Castillo, F. Junqueira, V. Plachouras, F. Silvestri, Challenges on distributed web retrieval, in Proceedings of the 23rd International Conference on Data Engineering, ICDE 2007, The Marmara Hotel, Istanbul, Turkey, ed. by R. Chirkova, A. Dogac, M.T. Özsu, T.K. Sellis (IEEE Press, New York, 2007), pp. 6–20
R. Baeza-Yates, A. Gionis, F. Junqueira, V. Murdock, V. Plachouras, F. Silvestri, The impact of caching on search engines, in Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR’07 (ACM, New York, 2007), pp. 183–190
R. Baeza-Yates, A. Gionis, F.P. Junqueira, V. Murdock, V. Plachouras, F. Silvestri, Design trade-offs for search engine caching. ACM Trans. Web 2(4), 20 (2008)
R. Baeza-Yates, A. Gionis, F. Junqueira, V. Plachouras, L. Telloli, On the feasibility of multi-site web search engines, in Proceedings of the 18th ACM Conference on Information and Knowledge Management. CIKM’09 (ACM, New York, 2009), pp. 425–434
L.A. Barroso, J. Dean, U. Hölzle, Web search for a planet: the Google cluster architecture. IEEE MICRO 23(2), 22–28 (2003)
R. Blanco, E. Bortnikov, F. Junqueira, R. Lempel, L. Telloli, H. Zaragoza, Caching search engine results over incremental indices, in Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR’10 (ACM, New York, 2010), pp. 82–89
S. Brin, L. Page, The anatomy of a large-scale hypertextual web search engine. Comput. Netw. ISDN Syst. 30(1–7), 107–117 (1998)
A.Z. Broder, S.C. Glassman, M.S. Manasse, G. Zweig, Syntactic clustering of the web. Comput. Netw. ISDN Syst. 29(8–13), 1157–1166 (1997)
B.B. Cambazoglu, F.P. Junqueira, V. Plachouras, S. Banachowski, B. Cui, S. Lim, B. Bridge, A refreshing perspective of search engine caching, in Proceedings of the 19th International Conference on World Wide Web. WWW’10 (ACM, New York, 2010), pp. 181–190
C. Castillo, M. Marin, A. Rodriguez, R. Baeza-Yates, Scheduling algorithms for web crawling, in Proceedings of the WebMedia & LA-Web 2004 Joint Conference 10th Brazilian Symposium on Multimedia and the Web 2nd Latin American Web Congress. LA-WEBMEDIA’04 (IEEE Comput. Soc., Washington, 2004), pp. 10–17
J. Cho, H. Garcia-Molina, L. Page, Efficient crawling through URL ordering, in Proceedings of the Seventh International Conference on World Wide Web 7, WWW7 (Elsevier, Amsterdam, 1998), pp. 161–172
D. Cutting, J. Pedersen, Optimization for dynamic inverted index maintenance, in Proceedings of the 13th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR’90 (ACM, New York, 1990), pp. 405–411
T. Fagni, R. Perego, F. Silvestri, S. Orlando, Boosting the performance of web search engines: caching and prefetching query results by exploiting historical usage data. ACM Trans. Inf. Syst. 24(1), 51–78 (2006)
A. Heydon, M. Najork, Mercator: a scalable, extensible web crawler. World Wide Web J. 2(4), 219–229 (1999)
J.C. Klensin, Role of the domain name system (DNS), Internet RFC 3467, Feb 2003
M. Koster, A method for web robots control, Internet Draft draft-koster-robots-00, Dec (1996)
R. Lempel, S. Moran, Predictive caching and prefetching of query results in search engines, in Proceedings of the 12th International Conference on World Wide Web. WWW’03 (ACM, New York, 2003), pp. 19–28
N. Lester, A. Moffat, J. Zobel, Fast on-line index construction by geometric partitioning, in Proceedings of the 14th ACM International Conference on Information and Knowledge Management. CIKM’05 (ACM, New York, 2005), pp. 776–783
C.D. Manning, P. Raghavan, H. Schütze, Introduction to Information Retrieval (Cambridge University Press, Cambridge, 2008). Online edition
E.P. Markatos, On caching search engine query results, in Computer Communications (2000)
S. Melnik, S. Raghavan, B. Yang, H. Garcia-Molina, Building a distributed full-text index for the web. ACM Trans. Inf. Syst. 19(3), 217–241 (2001)
A. Moffat, W. Webber, J. Zobel, R. Baeza-Yates, A pipelined architecture for distributed text query evaluation. Inf. Retr. 10(3), 205–231 (2007)
M. Najork, J.L. Wiener, Breadth-first crawling yields high-quality pages, in Proceedings of the 10th International Conference on World Wide Web. WWW’01 (ACM, New York, 2001), pp. 114–118
F. Naumann, M. Herschel, An Introduction to Duplicate Detection. Synthesis Lectures on Data Management (Morgan & Claypool, San Rafael, 2010)
B. Ribeiro-Neto, E.S. Moura, M.S. Neubert, N. Ziviani, Efficient distributed algorithms to build inverted files, in Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR’99 (ACM, New York, 1999), pp. 105–112
P.C. Saraiva, E. Silva de Moura, N. Ziviani, W. Meira, R. Fonseca, B. Ribeiro-Neto, Rank-preserving two-level caching for scalable search engines, in Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR’01 (ACM, New York, 2001), pp. 51–58
Y. Sun, Z. Zhuang, C.L. Giles, A large-scale study of robots.txt, in WWW, ed. by C.L. Williamson, M.E. Zurko, P.F. Patel-Schneider, P.J. Shenoy (ACM, New York, 2007), pp. 1123–1124
A. Tomasic, H. García-Molina, K. Shoens, Incremental updates of inverted lists for text document retrieval. SIGMOD Rec. 23(2), 289–300 (1994)
Y. Xie, D. O’Hallaron, Locality in search engine queries and its implications for caching, in IEEE Infocom 2002 (2002), pp. 1238–1247

Author information

Authors and Affiliations

Dipartimento di Elettronica e Informazione, Politecnico di Milano, Milan, Italy
Stefano Ceri, Alessandro Bozzon, Marco Brambilla, Emanuele Della Valle, Piero Fraternali & Silvia Quarteroni

About this chapter

Cite this chapter

Ceri, S., Bozzon, A., Brambilla, M., Della Valle, E., Fraternali, P., Quarteroni, S. (2013). Search Engines. In: Web Information Retrieval. Data-Centric Systems and Applications. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-39314-3_6

DOI: https://doi.org/10.1007/978-3-642-39314-3_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-39313-6
Online ISBN: 978-3-642-39314-3
eBook Packages: Computer Science (R0)
