Abstract
Though search on the World-Wide Web has focused mostly on unstructured text, there is an increasing amount of structured data on the Web and growing interest in harnessing such data. I will describe several current projects at Google whose overall goal is to leverage structured data and better expose it to our users.
The first project is on crawling the deep web. The deep web refers to content that resides in databases behind forms and is unreachable by search engines because no links lead to these pages. I will describe a system that surfaces pages from the deep web by guessing queries to submit to these forms and adding the resulting pages to the Google index [1]. The pages generated by this system come from millions of forms, hundreds of domains and over 40 languages. Pages from the deep web are served in the top-10 results on google.com for over 1000 queries per second.
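To make the idea of guessing form queries concrete, here is a minimal sketch, assuming a form that submits by HTTP GET and a small set of candidate values per input. The function name `surface_urls`, the example URL and the input names are hypothetical and do not come from the system described in [1]; a production system would need to be far more selective about which inputs and values it tries.

```python
from itertools import product
from urllib.parse import urlencode


def surface_urls(action_url, candidate_values):
    """Yield one GET URL per combination of candidate input values.

    action_url: the form's action URL (hypothetical example below).
    candidate_values: dict mapping input name -> list of guessed values.
    """
    names = sorted(candidate_values)  # fixed order keeps URLs deterministic
    for combo in product(*(candidate_values[n] for n in names)):
        yield action_url + "?" + urlencode(list(zip(names, combo)))


# Illustrative inputs only; these names and values are assumptions.
urls = list(surface_urls(
    "https://example.com/search",
    {"make": ["ford", "honda"], "zip": ["94043"]},
))
```

Each generated URL can then be fetched and indexed like any other page. Naive enumeration grows multiplicatively with the number of inputs, which is why choosing a small set of useful values matters.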
The second project considers the collection of HTML tables on the web. The WebTables Project [2] built a corpus of over 150 million tables from HTML tables on the Web. The WebTables System addresses the challenges of extracting these tables from the Web, and offers search over this collection of tables. The project also illustrates the potential of leveraging the collection of schemas of these tables.
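As a rough illustration of the extraction step, the sketch below pulls rows of cell text out of HTML `<table>` elements using only the Python standard library. It is not the WebTables pipeline from [2]: the class name `TableExtractor` is hypothetical, nested tables are not handled, and no attempt is made to separate relational tables from tables used for page layout.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect each <table> as a list of rows, each row a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []   # all tables seen so far
        self._row = None   # row currently being built, or None
        self._cell = None  # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


# Made-up input, for illustration only.
parser = TableExtractor()
parser.feed(
    "<table><tr><th>name</th><th>year</th></tr>"
    "<tr><td>x</td><td>1</td></tr></table>"
)
tables = parser.tables
```

Treating the first row as a candidate schema (here `name`, `year`) is one simple way to start collecting the kind of schema statistics the abstract mentions.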
Finally, I will discuss current work on computing aspects of queries in order to better organize search results for exploratory queries.
References
[1] Madhavan, J., Ko, D., Kot, L., Ganapathy, V., Rasmussen, A., Halevy, A.: Google’s deep-web crawl. In: Proc. of VLDB, pp. 1241–1252 (2008)
[2] Cafarella, M.J., Halevy, A., Zhang, Y., Wang, D.Z., Wu, E.: WebTables: Exploring the Power of Tables on the Web. In: Proc. of VLDB (2008)
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Halevy, A.Y. (2009). Structured Data on the Web. In: Feldman, Y.A., Kraft, D., Kuflik, T. (eds) Next Generation Information Technologies and Systems. NGITS 2009. Lecture Notes in Computer Science, vol 5831. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-04941-5_2
DOI: https://doi.org/10.1007/978-3-642-04941-5_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-04940-8
Online ISBN: 978-3-642-04941-5
eBook Packages: Computer Science (R0)