Improving Recall of Regular Expressions for Information Extraction

Murthy, Karin; P., Deepak; Deshpande, Prasad M.

doi:10.1007/978-3-642-35063-4_33

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7651)

Included in the following conference series:

International Conference on Web Information Systems Engineering

Abstract

Learning or writing regular expressions that identify instances of a specific concept in text documents with both high precision and high recall is challenging. Improving the precision of an initial regular expression is relatively easy: one identifies the false positives it covers and tweaks the expression to exclude them. Improving recall is harder because, in the absence of tools that surface missing instances, false negatives can be found only by manually analyzing all documents. We focus on partially automating the discovery of missing instances while soliciting minimal user feedback. We present a technique that identifies good generalizations of a regular expression, that is, generalizations with improved recall that retain high precision. We empirically demonstrate the effectiveness of the proposed technique compared with existing methods, and show results on real-world datasets for a variety of tasks, such as identifying dates, phone numbers, product names, and course numbers.
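
The trade-off described in the abstract can be illustrated with a small sketch. This is not the technique proposed in the paper: the candidate expressions are written by hand, the two documents and the gold set are invented toy data, and the helper name precision_recall is hypothetical. The sketch only shows how a generalization of an initial expression can raise recall, and how an overly loose generalization costs precision.

```python
import re


def precision_recall(pattern, documents, gold):
    """Score a regex against a set of gold instances (exact string match)."""
    regex = re.compile(pattern)
    found = set()
    for doc in documents:
        found.update(m.group(0) for m in regex.finditer(doc))
    true_pos = len(found & gold)
    precision = true_pos / len(found) if found else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    return precision, recall


# Toy data (assumption): three phone numbers, one date, one reference number.
documents = [
    "Call 555-123-4567 or 555.987.6543 before 2012-11-28.",
    "Fax: 555 222 3333, ref 12-345-6789.",
]
gold = {"555-123-4567", "555.987.6543", "555 222 3333"}

# Hand-written candidates (assumption): an initial expression and two
# generalizations of it, one moderate and one too permissive.
initial = r"\b\d{3}-\d{3}-\d{4}\b"
generalized = r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"
too_general = r"\b\d+[-. ]\d+[-. ]\d+\b"

for name, pattern in [
    ("initial", initial),
    ("generalized", generalized),
    ("too_general", too_general),
]:
    p, r = precision_recall(pattern, documents, gold)
    print(f"{name:12s} precision={p:.2f} recall={r:.2f}")
```

On this toy data the initial expression misses the numbers written with dots or spaces, the moderate generalization recovers them, and the permissive one also matches the date and the reference number. In practice the gold set is not available up front, which is why the abstract emphasizes soliciting user feedback.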

Author information

Authors and Affiliations

IBM Research - India, Bangalore, India
Karin Murthy, Deepak P. & Prasad M. Deshpande

Editor information

Editors and Affiliations

School of Computer Science, Fudan University, 825 Zhangheng Rd., Shanghai, 201203, China
X. Sean Wang
Department of Computer Science, College of Engineering, Science and Engineering Offices, The University of Illinois at Chicago, 851 South Morgan Street (M/C 152), 60607-7053, Chicago, Illinois, USA
Isabel Cruz
Department of Informatics and Telecommunications, University of Athens, GR15784, Ilisia, Athens, Greece
Alex Delis
Centre for Applied Informatics, Victoria University, PO Box 14428, 8001, Melbourne, VIC, Australia
Guangyan Huang

Cite this paper

Murthy, K., P., D., Deshpande, P.M. (2012). Improving Recall of Regular Expressions for Information Extraction. In: Wang, X.S., Cruz, I., Delis, A., Huang, G. (eds) Web Information Systems Engineering - WISE 2012. WISE 2012. Lecture Notes in Computer Science, vol 7651. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35063-4_33

DOI: https://doi.org/10.1007/978-3-642-35063-4_33
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-35062-7
Online ISBN: 978-3-642-35063-4
eBook Packages: Computer Science, Computer Science (R0)
