Abstract
In this work, we propose an approach for querying heterogeneous data sources on the Web in a declarative fashion. Such an approach provides a generic way to formulate a wide range of information needs and is much more powerful than simple keyword queries. Particularly appealing are the ability to combine (join) information from different sources and the ability to compute simple statistics that can be used to select promising pieces of information. What might sound like a hopeless effort, given the inherent complexity expressible in SQL-style queries, turns out at second glance to be not complicated to understand or to use. Even very simple combinations (i.e., joins) of different data sources (i.e., tables) offer a surprisingly large set of interesting use cases. This holds in particular for sliding window joins, which limit the scope of interest to recent information obtained, for instance, from the live stream of Twitter Tweets. This goes far beyond keyword queries enriched with operators such as allintext:, allintitle:, or site:, as offered, for instance, by the Google search engine.
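To make the idea of a sliding window join concrete, the following is a minimal sketch in Python. It is not the system's implementation or query syntax; the `Item` record, the source names, the `keyword` join attribute, and the 60-second window are all assumptions chosen for illustration. It joins items from different sources that share a keyword and lie close together in time, then computes a simple statistic (matches per keyword).

```python
from collections import Counter, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    source: str     # hypothetical source name, e.g. "news" or "tweets"
    timestamp: int  # seconds since an arbitrary epoch
    keyword: str    # join attribute, assumed to be extracted beforehand


def sliding_window_join(stream, window):
    """Yield pairs of items from different sources that share a keyword
    and are at most `window` seconds apart.

    `stream` must be ordered by timestamp.
    """
    recent = deque()  # items still inside the window, oldest first
    for item in stream:
        # Evict items that are too old to join with the current item.
        while recent and item.timestamp - recent[0].timestamp > window:
            recent.popleft()
        for other in recent:
            if other.source != item.source and other.keyword == item.keyword:
                yield (other, item)
        recent.append(item)


# Illustrative data only; not taken from the article.
stream = [
    Item("news", 0, "election"),
    Item("tweets", 30, "election"),
    Item("tweets", 50, "football"),
    Item("news", 200, "election"),
    Item("tweets", 220, "election"),
]

pairs = list(sliding_window_join(stream, window=60))

# A simple statistic over the join result: number of matches per keyword.
matches_per_keyword = Counter(newer.keyword for _, newer in pairs)
```

The window bounds the number of items that must be kept in memory, which is what makes such a join feasible over an unbounded stream: only recent items can ever produce a match.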
Notes
While news sites and blogs were crawled continuously over this period of time, the time window in which Tweets were collected comprises only a couple of days in late July 2011.
Acknowledgements
This work has been supported by the Excellence Cluster on Multimodal Computing and Interaction (MMCI).
Cite this article
Pinkel, C., Alvanaki, F. & Michel, S. Sequoia—An Approach to Declarative Information Retrieval. Datenbank Spektrum 12, 101–108 (2012). https://doi.org/10.1007/s13222-012-0087-5