Abstract
Exploiting textual information from large document collections such as the Web with structured queries is a frequently requested but still unmet requirement of many users. We present BlueFact, a framework for efficiently retrieving documents containing structured, factual information from a full-text index. This is an essential building block for information extraction systems that enable ad-hoc analytical queries on unstructured text data as well as knowledge harvesting in a digital archive scenario.
Our approach is based on the observation that documents share a set of common grammatical structures and words for expressing facts. Our system identifies these keyword phrases using structural, syntactic, lexical and semantic features in an iterative, cost-effective training process and systematically queries the search engine index with the automatically generated phrases. Next, BlueFact retrieves a list of document identifiers, combines the observed keywords as evidence for factual information and infers the relevance of each document identifier. Finally, we forward the documents in the order of their estimated relevance to an information extraction service. That way, BlueFact can efficiently retrieve all the structured, factual information contained in an indexed collection of text documents.
We report the results of a comprehensive experimental evaluation over 20 different fact types on the Reuters News Corpus Volume I (RCV1). BlueFact’s scoring model and feature generation methods significantly outperform existing approaches in terms of fact retrieval performance. BlueFact issues significantly fewer queries against the index, requires significantly less execution time and achieves very high fact recall across different domains.
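The retrieval loop described in the abstract (query the index with keyword phrases, collect document identifiers, combine phrase matches as evidence, order documents by estimated relevance) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy in-memory index, the function names `build_index`, `search` and `rank_documents`, the hand-picked phrase weights, and the additive scoring rule. It is not BlueFact's scoring model or feature generation method, which the abstract does not specify.

```python
from collections import defaultdict


def build_index(docs):
    """Toy full-text index: map each lowercased token to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search(index, docs, phrase):
    """Return the ids of documents that contain the (non-empty) phrase as a token sequence."""
    tokens = phrase.lower().split()
    # Candidate documents must contain every token of the phrase.
    candidates = set.intersection(*(index.get(t, set()) for t in tokens))
    needle = " " + " ".join(tokens) + " "
    hits = set()
    for doc_id in candidates:
        # Normalise whitespace and pad with spaces so only whole tokens match.
        haystack = " " + " ".join(docs[doc_id].lower().split()) + " "
        if needle in haystack:
            hits.add(doc_id)
    return hits


def rank_documents(index, docs, phrase_weights):
    """Sum the weights of all matching phrases per document (assumed scoring rule)."""
    scores = defaultdict(float)
    for phrase, weight in phrase_weights.items():
        for doc_id in search(index, docs, phrase):
            scores[doc_id] += weight
    # Highest score first; ties broken by document id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical documents and keyword phrases, chosen only for this example.
docs = {
    "d1": "Acme has acquired Widget Corp and appointed a new chief executive",
    "d2": "Shares of Acme rose after the company has acquired a rival",
    "d3": "The weather in Berlin was sunny",
}
phrase_weights = {"has acquired": 2.0, "appointed a new": 1.0}

index = build_index(docs)
ranked = rank_documents(index, docs, phrase_weights)
print(ranked)
```

In this sketch, documents matching no phrase never receive a score, so they would not be forwarded to the (comparatively expensive) extraction service at all, while the remaining documents are forwarded in descending score order.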
Notes
The average execution time for an average-sized document in the Reuters news corpus (1.4 KB) was 1.3 seconds. The average extraction time for the largest document in the corpus (52.45 KB) was 28 seconds with the OpenCalais.com information extraction service we used.
Reuters Corpus Volume I (RCV1) available at http://trec.nist.gov/data/reuters/reuters.html.
Noun phrase.
Acknowledgements
The research leading to these results has received funding from the European Union’s Seventh Framework Programme (FP7/2007-2013) under grant agreement No. FP7-ICT-2009-5-257859, ‘Risk and Opportunity management of huge-scale BUSiness community cooperation’ (ROBUST). Alexander Löser also receives funding from the Federal Ministry of Economics and Technology (BMWi) under grant agreement No. 01MD11014A, ‘MIA-Marktplatz für Informationen und Analysen’ (MIA).
Cite this article
Boden, C., Löser, A., Nagel, C. et al. Fact-Aware Document Retrieval for Information Extraction. Datenbank Spektrum 12, 89–100 (2012). https://doi.org/10.1007/s13222-012-0088-4
DOI: https://doi.org/10.1007/s13222-012-0088-4