Abstract
Learning or writing regular expressions to identify instances of a specific concept within text documents with a high precision and recall is challenging. It is relatively easy to improve the precision of an initial regular expression by identifying false positives covered and tweaking the expression to avoid the false positives. However, modifying the expression to improve recall is difficult since false negatives can only be identified by manually analyzing all documents, in the absence of any tools to identify the missing instances. We focus on partially automating the discovery of missing instances by soliciting minimal user feedback. We present a technique to identify good generalizations of a regular expression that have improved recall while retaining high precision. We empirically demonstrate the effectiveness of the proposed technique as compared to existing methods and show results for a variety of tasks such as identification of dates, phone numbers, product names, and course numbers on real world datasets.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
References
Appelt, D.E.: Introduction to information extraction. AI Commun. 12(3), 161–172 (1999)
Babbar, R., Singh, N.: Clustering based approach to learning regular expressions over large alphabet for noisy unstructured text. In: AND Workshop, pp. 43–50 (2010)
Bex, G.J., Neven, F., Schwentick, T., Tuyls, K.: Inference of concise DTDs from XML data. In: VLDB, pp. 115–126 (2006)
Chiticariu, L., Krishnamurthy, R., Li, Y., Raghavan, S., Reiss, F., Vaithyanathan, S.: SystemT: An algebraic approach to declarative information extraction. In: ACL, pp. 128–137 (2010)
Ciravegna, F.: Adaptive information extraction from text by rule induction and generalisation. In: IJCAI, pp. 1251–1256 (2001)
Denis, F.: Learning regular languages from simple positive examples. Machine Learning 44(1/2), 37–66 (2001)
Fernau, H.: Algorithms for learning regular expressions from positive data. Inf. Comput. 207(4), 521–541 (2009)
Garofalakis, M.N., Gionis, A., Rastogi, R., Seshadri, S., Shim, K.: XTRACT: A system for extracting document type descriptors from XML documents. In: SIGMOD, pp. 165–176 (2000)
Klimt, B., Yang, Y.: Introducing the Enron corpus. In: CEAS (2004)
Lang, K.: 20 Newsgroups (1997), http://people.csail.mit.edu/jrennie/20Newsgroups
Li, Y., Krishnamurthy, R., Raghavan, S., Vaithyanathan, S., Jagadish, H.V.: Regular expression learning for information extraction. In: EMNLP, pp. 21–30 (2008)
Mitchell, T.M.: Generalization as search. Artif. Intell. 18(2), 203–226 (1982)
Nie, Z., Wen, J., Zhang, B.: 2D conditional random fields for web information extraction. In: ICML, pp. 1044–1051 (2005)
Pearl, J.: Heuristics: intelligent search strategies for computer problem solving. Addison-Wesley Longman Publishing Co., Inc., Boston (1984)
Sarawagi, S., Cohen, W.W.: Semi-markov conditional random fields for information extraction. In: NIPS, pp. 1185–1192 (2004)
Wu, T., Pottenger, W.M.: A semi-supervised active learning algorithm for information extraction from textual data: Research articles. JASIST 56(3), 258–271 (2005)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Murthy, K., P., D., Deshpande, P.M. (2012). Improving Recall of Regular Expressions for Information Extraction. In: Wang, X.S., Cruz, I., Delis, A., Huang, G. (eds) Web Information Systems Engineering - WISE 2012. WISE 2012. Lecture Notes in Computer Science, vol 7651. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35063-4_33
Download citation
DOI: https://doi.org/10.1007/978-3-642-35063-4_33
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-35062-7
Online ISBN: 978-3-642-35063-4
eBook Packages: Computer ScienceComputer Science (R0)