TEG—a hybrid approach to information extraction

Feldman, Ronen; Rosenfeld, Benjamin; Fresko, Moshe

doi:10.1007/s10115-005-0204-y

Regular Paper
Published: 20 April 2005

Volume 9, pages 1–18, (2006)

Knowledge and Information Systems

Abstract

This paper describes a hybrid statistical and knowledge-based information extraction model, able to extract entities and relations at the sentence level. The model attempts to retain and improve the high accuracy levels of knowledge-based systems while drastically reducing the amount of manual labour by relying on statistics drawn from a training corpus. The implementation of the model, called TEG (trainable extraction grammar), can be adapted to any IE domain by writing a suitable set of rules in an SCFG (stochastic context-free grammar)-based extraction language and training them using an annotated corpus. The system does not contain any purely linguistic components, such as a PoS tagger or shallow parser, but allows external linguistic components to be used if necessary. We demonstrate the performance of the system on several named entity extraction and relation extraction tasks. The experiments show that our hybrid approach outperforms both purely statistical and purely knowledge-based systems, while requiring orders of magnitude less manual rule writing and smaller amounts of training data. We also demonstrate the robustness of our system under conditions of poor training-data quality.
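
To illustrate the general idea of training grammar rules from an annotated corpus, the following Python fragment is a minimal sketch of relative-frequency (maximum-likelihood) estimation of rule probabilities in a stochastic context-free grammar. It is not TEG's rule language or its actual training procedure, neither of which is described in this abstract. The function name `estimate_rule_probabilities`, the nonterminals `Person` and `Acquisition`, and the rule applications are all hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def estimate_rule_probabilities(observed_rules):
    """Relative-frequency estimate of SCFG rule probabilities.

    observed_rules: iterable of (lhs, rhs) pairs, one per rule application
    found in the annotated corpus; rhs is a tuple of symbols.
    Returns a dict mapping (lhs, rhs) to P(rhs | lhs).
    """
    rule_counts = Counter(observed_rules)

    # Total number of times each left-hand side was expanded.
    lhs_totals = defaultdict(int)
    for (lhs, _rhs), count in rule_counts.items():
        lhs_totals[lhs] += count

    # Probability of a rule = its count / count of all rules with the same lhs.
    return {
        (lhs, rhs): count / lhs_totals[lhs]
        for (lhs, rhs), count in rule_counts.items()
    }


# Hypothetical rule applications, as if read off an annotated corpus.
observed = [
    ("Person", ("FirstName", "LastName")),
    ("Person", ("FirstName", "LastName")),
    ("Person", ("Title", "LastName")),
    ("Acquisition", ("Company", "acquired", "Company")),
]

probs = estimate_rule_probabilities(observed)
for (lhs, rhs), p in sorted(probs.items()):
    print(f"{lhs} -> {' '.join(rhs)}  [{p:.2f}]")
```

In this sketch the rules themselves are written by hand and only their probabilities come from data, which mirrors the division of labour the abstract describes between manual rule writing and corpus statistics.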

Author information

Authors and Affiliations

Computer Science Department, Bar-Ilan University, Ramat Gan, 52900, Israel
Ronen Feldman, Benjamin Rosenfeld & Moshe Fresko

Corresponding author

Correspondence to Ronen Feldman.

Additional information

Ronen Feldman is a senior lecturer at the Mathematics and Computer Science Department of Bar-Ilan University in Israel, and the Director of the Data Mining Laboratory. He received his B.Sc. in Math, Physics and Computer Science from the Hebrew University, his M.Sc. in Computer Science from Bar-Ilan University, and his Ph.D. in Computer Science from Cornell University in NY. He was an Adjunct Professor at NYU Stern Business School. He is the founder of ClearForest Corporation, a Boston-based company specializing in the development of text mining tools and applications. He has given more than 30 tutorials on text mining and information extraction and authored numerous papers on these topics. He is currently finishing his book “The Text Mining Handbook”, to be published by Cambridge University Press.

Benjamin Rosenfeld is a research scientist at ClearForest Corporation. He received his B.Sc. in Mathematics and Computer Science from Bar-Ilan University. He is the co-inventor of the DIAL information extraction language.

Moshe Fresko is finalizing his Ph.D. in the Computer Science Department at Bar-Ilan University in Israel. He received his B.Sc. in Computer Engineering from Bogazici University, Istanbul, Turkey, in 1991, and his M.Sc. in 1994. He is also an adjunct lecturer at the Computer Science Department of Bar-Ilan University and functions as the Information-Extraction Group Leader in the Data Mining Laboratory.

About this article

Cite this article

Feldman, R., Rosenfeld, B. & Fresko, M. TEG—a hybrid approach to information extraction. Knowl Inf Syst 9, 1–18 (2006). https://doi.org/10.1007/s10115-005-0204-y

Received: 08 November 2004
Revised: 13 January 2005
Accepted: 05 February 2005
Published: 20 April 2005
Issue Date: January 2006
DOI: https://doi.org/10.1007/s10115-005-0204-y
