Abstract
In this chapter we discuss the challenges in the design and deployment of search engines, systems that answer keyword-based queries by returning results that point to Web pages. A defining requirement of a search engine is the ability to scale to billions of indexed pages dispersed across the Web. We give an architectural overview of a search engine's main elements and then focus on its core components, namely the crawling and indexing subsystems.
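To make the two subsystems named in the abstract concrete, the following is a minimal sketch of a crawler feeding an indexer that answers keyword queries. It is an illustration under strong assumptions, not the chapter's architecture: the "Web" is a hypothetical in-memory dictionary (`TOY_WEB`), the crawl order is plain breadth-first, and the function names `crawl`, `build_index`, and `search` are invented for this example. Real systems add politeness delays, duplicate detection, ranking, and distribution across machines.

```python
from collections import defaultdict, deque

# Hypothetical in-memory "Web": URL -> (page text, outgoing links).
TOY_WEB = {
    "http://a.example/": ("web search engines crawl pages",
                          ["http://b.example/", "http://c.example/"]),
    "http://b.example/": ("an inverted index maps terms to pages",
                          ["http://a.example/"]),
    "http://c.example/": ("crawl and index billions of pages", []),
}


def crawl(seed, web):
    """Breadth-first crawl from `seed`; returns {url: text} for reachable pages."""
    seen = {seed}
    frontier = deque([seed])
    fetched = {}
    while frontier:
        url = frontier.popleft()
        if url not in web:  # unreachable page: skip it
            continue
        text, links = web[url]
        fetched[url] = text
        for link in links:
            if link not in seen:  # enqueue each URL at most once
                seen.add(link)
                frontier.append(link)
    return fetched


def build_index(pages):
    """Build an inverted index: term -> set of URLs containing that term."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in text.lower().split():
            index[term].add(url)
    return index


def search(index, query):
    """Return the sorted URLs that contain every term of the query."""
    terms = query.lower().split()
    if not terms:
        return []
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings))


pages = crawl("http://a.example/", TOY_WEB)
index = build_index(pages)
print(search(index, "crawl pages"))
```

The query is evaluated as a conjunction of terms, so a page is returned only if it contains all of them; the results are pointers (URLs) to pages, not the pages themselves.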
Notes
- 1.
- 2.
- 3.
http://www.ysearchblog.com/archives/000172.html, accessible through the Web archive at http://web.archive.org/.
- 4.
- 5.
The links information extracted from an HTML page can also be passed on to the indexer for use in ranking functions based on link analysis (see Chap. 7).
- 6.
The total runtime complexity of iterative duplicate detection is $O(n^2)$ for a given candidate set $C_T$ that contains candidates of type $T$ [265].
- 7.
PageRank is a page relevance measure that will be described in Chap. 7.
- 8.
Large websites might be penalized because of the politeness constraint, which might force their resources to be fetched last.
- 9.
- 10.
- 11.
Dynamic indexing is also referred to as online indexing, to emphasize the availability of the system upon document arrival and indexing.
- 12.
References
A. Arasu, J. Cho, H. Garcia-Molina, A. Paepcke, S. Raghavan, Searching the web. ACM Trans. Internet Technol. 1(1), 2–43 (2001)
C. Badue, B. Ribeiro-Neto, R. Baeza-Yates, N. Ziviani, Distributed query processing using partitioned inverted files, in Proceedings of the Eighth International Symposium on String Processing and Information Retrieval, SPIRE 2001, Nov (2001), pp. 10–20
R.A. Baeza-Yates, B. Ribeiro-Neto, Modern Information Retrieval (Addison-Wesley, Boston, 1999)
R. Baeza-Yates, F. Saint-Jean, A three level search engine index based in query log distribution, in String Processing and Information Retrieval, ed. by M. Nascimento, E. Moura, A. Oliveira. Lecture Notes in Computer Science, vol. 2857 (Springer, Berlin, 2003), pp. 56–65
R. Baeza-Yates, C. Castillo, M. Marin, A. Rodriguez, Crawling a country: better strategies than breadth-first for web page ordering, in Special Interest Tracks and Posters of the 14th International Conference on World Wide Web. WWW’05 (ACM, New York, 2005), pp. 864–872
R.A. Baeza-Yates, C. Castillo, F. Junqueira, V. Plachouras, F. Silvestri, Challenges on distributed web retrieval, in Proceedings of the 23rd International Conference on Data Engineering, ICDE 2007, The Marmara Hotel, Istanbul, Turkey, ed. by R. Chirkova, A. Dogac, M.T. Özsu, T.K. Sellis (IEEE Press, New York, 2007), pp. 6–20
R. Baeza-Yates, A. Gionis, F. Junqueira, V. Murdock, V. Plachouras, F. Silvestri, The impact of caching on search engines, in Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR’07 (ACM, New York, 2007), pp. 183–190
R. Baeza-Yates, A. Gionis, F.P. Junqueira, V. Murdock, V. Plachouras, F. Silvestri, Design trade-offs for search engine caching. ACM Trans. Web 2(4), 20 (2008)
R. Baeza-Yates, A. Gionis, F. Junqueira, V. Plachouras, L. Telloli, On the feasibility of multi-site web search engines, in Proceedings of the 18th ACM Conference on Information and Knowledge Management. CIKM’09 (ACM, New York, 2009), pp. 425–434
L.A. Barroso, J. Dean, U. Hölzle, Web search for a planet: the Google cluster architecture. IEEE MICRO 23(2), 22–28 (2003)
R. Blanco, E. Bortnikov, F. Junqueira, R. Lempel, L. Telloli, H. Zaragoza, Caching search engine results over incremental indices, in Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR’10 (ACM, New York, 2010), pp. 82–89
S. Brin, L. Page, The anatomy of a large-scale hypertextual web search engine. Comput. Netw. ISDN Syst. 30(1–7), 107–117 (1998)
A.Z. Broder, S.C. Glassman, M.S. Manasse, G. Zweig, Syntactic clustering of the web. Comput. Netw. ISDN Syst. 29(8–13), 1157–1166 (1997)
B.B. Cambazoglu, F.P. Junqueira, V. Plachouras, S. Banachowski, B. Cui, S. Lim, B. Bridge, A refreshing perspective of search engine caching, in Proceedings of the 19th International Conference on World Wide Web. WWW’10 (ACM, New York, 2010), pp. 181–190
C. Castillo, M. Marin, A. Rodriguez, R. Baeza-Yates, Scheduling algorithms for web crawling, in Proceedings of the WebMedia & LA-Web 2004 Joint Conference 10th Brazilian Symposium on Multimedia and the Web 2nd Latin American Web Congress. LA-WEBMEDIA’04 (IEEE Comput. Soc., Washington, 2004), pp. 10–17
J. Cho, H. Garcia-Molina, L. Page, Efficient crawling through URL ordering, in Proceedings of the Seventh International Conference on World Wide Web 7, WWW7 (Elsevier, Amsterdam, 1998), pp. 161–172
D. Cutting, J. Pedersen, Optimization for dynamic inverted index maintenance, in Proceedings of the 13th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR’90 (ACM, New York, 1990), pp. 405–411
T. Fagni, R. Perego, F. Silvestri, S. Orlando, Boosting the performance of web search engines: caching and prefetching query results by exploiting historical usage data. ACM Trans. Inf. Syst. 24(1), 51–78 (2006)
A. Heydon, M. Najork, Mercator: a scalable, extensible web crawler. World Wide Web J. 2(4), 219–229 (1999)
J.C. Klensin, Role of the domain name system (DNS), Internet RFC 3467, Feb 2003
M. Koster, A method for web robots control, Internet Draft draft-koster-robots-00, Dec (1996)
R. Lempel, S. Moran, Predictive caching and prefetching of query results in search engines, in Proceedings of the 12th International Conference on World Wide Web. WWW’03 (ACM, New York, 2003), pp. 19–28
N. Lester, A. Moffat, J. Zobel, Fast on-line index construction by geometric partitioning, in Proceedings of the 14th ACM International Conference on Information and Knowledge Management. CIKM’05 (ACM, New York, 2005), pp. 776–783
C.D. Manning, P. Raghavan, H. Schütze, Introduction to Information Retrieval (Cambridge University Press, Cambridge, 2008)
E.P. Markatos, On caching search engine query results, in Computer Communications (2000)
S. Melnik, S. Raghavan, B. Yang, H. Garcia-Molina, Building a distributed full-text index for the web. ACM Trans. Inf. Syst. 19(3), 217–241 (2001)
A. Moffat, W. Webber, J. Zobel, R. Baeza-Yates, A pipelined architecture for distributed text query evaluation. Inf. Retr. 10(3), 205–231 (2007)
M. Najork, J.L. Wiener, Breadth-first crawling yields high-quality pages, in Proceedings of the 10th International Conference on World Wide Web. WWW’01 (ACM, New York, 2001), pp. 114–118
F. Naumann, M. Herschel, An Introduction to Duplicate Detection. Synthesis Lectures on Data Management (Morgan & Claypool, San Rafael, 2010)
B. Ribeiro-Neto, E.S. Moura, M.S. Neubert, N. Ziviani, Efficient distributed algorithms to build inverted files, in Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR’99 (ACM, New York, 1999), pp. 105–112
P.C. Saraiva, E. Silva de Moura, N. Ziviani, W. Meira, R. Fonseca, B. Ribeiro-Neto, Rank-preserving two-level caching for scalable search engines, in Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR’01 (ACM, New York, 2001), pp. 51–58
Y. Sun, Z. Zhuang, C.L. Giles, A large-scale study of robots.txt, in WWW, ed. by C.L. Williamson, M.E. Zurko, P.F. Patel-Schneider, P.J. Shenoy (ACM, New York, 2007), pp. 1123–1124
A. Tomasic, H. García-Molina, K. Shoens, Incremental updates of inverted lists for text document retrieval. SIGMOD Rec. 23(2), 289–300 (1994)
Y. Xie, D. O’Hallaron, Locality in search engine queries and its implications for caching, in IEEE Infocom 2002 (2002), pp. 1238–1247
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Ceri, S., Bozzon, A., Brambilla, M., Della Valle, E., Fraternali, P., Quarteroni, S. (2013). Search Engines. In: Web Information Retrieval. Data-Centric Systems and Applications. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-39314-3_6
DOI: https://doi.org/10.1007/978-3-642-39314-3_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-39313-6
Online ISBN: 978-3-642-39314-3
eBook Packages: Computer Science, Computer Science (R0)