
Information Systems Frontiers, Volume 15, Issue 3, pp 311–329

Beyond search: Retrieving complete tuples from a text-database

  • Alexander Löser
  • Christoph Nagel
  • Stephan Pieper
  • Christoph Boden
Article

Abstract

Web users commonly need to query structured information from Web pages. To support this scenario, we propose a novel query processor that systematically discovers instances of semantic relations in Web search results and joins these relation instances into complex result tuples using conjunctive queries. The query processor transforms a structured user query into keyword queries that are submitted to a search engine, forwards the search results to a relation extractor, and then combines the extracted relations into complex result tuples. The processor automatically learns discriminative and effective keywords for different types of semantic relations. In this way, it leverages the index of a search engine to query potentially billions of pages. However, relation extractors may fail to return a relation for a result tuple, and user-defined data sources may return fewer than k complete result tuples. We therefore propose an adaptive routing model based on information theory for retrieving missing attributes of incomplete result tuples. At any point during query execution, the model determines the most promising next incomplete tuple and attribute type for returning any-k complete result tuples. We report a thorough experimental evaluation over multiple relation extractors. Our query processor returns complete result tuples while processing only very few Web pages.
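
The pipeline described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the in-memory `PAGES` dictionary stands in for a Web search engine, the `KEYWORDS` are fixed by hand rather than learned, `extract` is a toy pattern matcher rather than a real relation extractor, and the greedy rule "work on the tuple with the fewest missing attributes first" is only a simple stand-in for the information-theoretic routing model, whose details the abstract does not give.

```python
from dataclasses import dataclass, field

# Hypothetical in-memory "search engine": keyword query -> page texts.
PAGES = {
    '"headquartered in" Apple': ["Apple is headquartered in Cupertino."],
    '"CEO of" Apple': ["Tim Cook is the CEO of Apple."],
    '"headquartered in" Siemens': ["Siemens is headquartered in Munich."],
    '"CEO of" Siemens': [],  # no page available, so this attribute stays missing
}

# Hypothetical hand-picked keywords per attribute type (the paper learns them).
KEYWORDS = {"headquarters": '"headquartered in"', "ceo": '"CEO of"'}


def search(keyword_query):
    """Stand-in for submitting a keyword query to a Web search engine."""
    return PAGES.get(keyword_query, [])


def extract(attribute, company, page):
    """Toy relation extractor based on fixed string patterns."""
    if attribute == "headquarters":
        marker = f"{company} is headquartered in "
        if page.startswith(marker):
            return page[len(marker):].rstrip(".")
    if attribute == "ceo":
        marker = f" is the CEO of {company}."
        if page.endswith(marker):
            return page[: -len(marker)]
    return None


@dataclass
class ResultTuple:
    company: str
    values: dict = field(default_factory=dict)

    def missing(self, attributes):
        return [a for a in attributes if a not in self.values]


def fill(tup, attribute):
    """Try to retrieve one missing attribute via search plus extraction."""
    query = f"{KEYWORDS[attribute]} {tup.company}"
    for page in search(query):
        value = extract(attribute, tup.company, page)
        if value is not None:
            tup.values[attribute] = value
            return True
    return False


def run_query(companies, attributes, k):
    """Return up to k complete tuples, preferring tuples closest to completion."""
    tuples = [ResultTuple(c) for c in companies]
    tried = set()      # (tuple index, attribute) pairs already attempted
    complete = []
    while len(complete) < k:
        # Greedy stand-in for routing: fewest missing attributes first.
        candidates = [
            (len(t.missing(attributes)), i, a)
            for i, t in enumerate(tuples)
            for a in t.missing(attributes)
            if (i, a) not in tried
        ]
        if not candidates:
            break      # nothing left to try; fewer than k tuples are complete
        _, i, attribute = min(candidates)
        tried.add((i, attribute))
        fill(tuples[i], attribute)
        if not tuples[i].missing(attributes) and tuples[i] not in complete:
            complete.append(tuples[i])
    return complete
```

Calling `run_query(["Apple", "Siemens"], ["headquarters", "ceo"], k=2)` on this toy data yields only the Apple tuple, because no page supplies a CEO for Siemens. This mirrors the situation the abstract describes, in which an extractor fails to return a relation and fewer than k complete tuples are available.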

Keywords

  • Structured query execution
  • Text data
  • Keyword query generation

Notes

Acknowledgments

The research leading to these results has received funding from the European Union’s Seventh Framework Programme (FP7/2007-2013) under grant agreement no. FP7-ICT-2009-5-257859, ‘Risk and Opportunity management of huge-scale BUSiness community cooperation’ (ROBUST). Alexander Löser also received funding from the Federal Ministry of Economics and Technology (BMWi) under grant agreement 01MD11014A, ‘MIA-Marktplatz für Informationen und Analysen’ (MIA).

Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  • Alexander Löser (1)
  • Christoph Nagel (1)
  • Stephan Pieper (1)
  • Christoph Boden (1)
  1. Database Systems and Information Management Group (DIMA), Technische Universität Berlin (TUB), Berlin, Germany
