TEG—a hybrid approach to information extraction

  • Regular Paper
  • Knowledge and Information Systems

Abstract

This paper describes a hybrid statistical and knowledge-based information extraction model, able to extract entities and relations at the sentence level. The model attempts to retain and improve the high accuracy levels of knowledge-based systems while drastically reducing the amount of manual labour by relying on statistics drawn from a training corpus. The implementation of the model, called TEG (trainable extraction grammar), can be adapted to any IE domain by writing a suitable set of rules in an SCFG (stochastic context-free grammar)-based extraction language and training them using an annotated corpus. The system does not contain any purely linguistic components, such as a PoS tagger or shallow parser, but allows external linguistic components to be used if necessary. We demonstrate the performance of the system on several named entity extraction and relation extraction tasks. The experiments show that our hybrid approach outperforms both purely statistical and purely knowledge-based systems, while requiring orders of magnitude less manual rule writing and smaller amounts of training data. We also demonstrate the robustness of our system under conditions of poor training-data quality.
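
As a rough illustration of what training the rules of a stochastic context-free grammar on an annotated corpus can involve, here is a minimal sketch. It assumes plain relative-frequency (maximum-likelihood) estimation of rule probabilities from rule usage counts. The abstract does not specify TEG's rule syntax or training procedure, so the function name, the rule representation, and the toy counts below are all hypothetical.

```python
from collections import Counter, defaultdict


def estimate_rule_probabilities(rule_counts):
    """Relative-frequency estimate of SCFG rule probabilities.

    rule_counts maps (lhs, rhs) to the number of times that rule was
    used when parsing the annotated training corpus. The result maps
    each rule to count(rule) / count(all rules with the same lhs), so
    the probabilities of the expansions of each nonterminal sum to 1.
    """
    totals = defaultdict(int)
    for (lhs, _rhs), count in rule_counts.items():
        totals[lhs] += count
    return {
        (lhs, rhs): count / totals[lhs]
        for (lhs, rhs), count in rule_counts.items()
    }


# Hypothetical usage counts for a toy grammar; not taken from the paper.
counts = Counter({
    ("Person", ("FirstName", "LastName")): 6,
    ("Person", ("Title", "LastName")): 3,
    ("Person", ("LastName",)): 1,
    ("Title", ("Dr.",)): 2,
    ("Title", ("Mr.",)): 2,
})

probs = estimate_rule_probabilities(counts)
for (lhs, rhs), p in sorted(probs.items()):
    print(f"{lhs} -> {' '.join(rhs)}  [{p:.2f}]")
```

In this sketch the hand-written rules supply the structure and the corpus supplies only the numbers, which is one way to read the abstract's claim that statistics reduce manual labour. A real system would also need smoothing for rules that never occur in the training data.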

Author information

Corresponding author

Correspondence to Ronen Feldman.

Additional information

Ronen Feldman is a senior lecturer at the Mathematics and Computer Science Department of Bar-Ilan University in Israel, and the Director of the Data Mining Laboratory. He received his B.Sc. in Math, Physics and Computer Science from the Hebrew University, his M.Sc. in Computer Science from Bar-Ilan University, and his Ph.D. in Computer Science from Cornell University in NY. He was an Adjunct Professor at NYU Stern Business School. He is the founder of ClearForest Corporation, a Boston-based company specializing in the development of text mining tools and applications. He has given more than 30 tutorials on text mining and information extraction and authored numerous papers on these topics. He is currently finishing his book “The Text Mining Handbook”, to be published by Cambridge University Press.

Benjamin Rosenfeld is a research scientist at ClearForest Corporation. He received his B.Sc. in Mathematics and Computer Science from Bar-Ilan University. He is the co-inventor of the DIAL information extraction language.

Moshe Fresko is finalizing his Ph.D. in the Computer Science Department at Bar-Ilan University in Israel. He received his B.Sc. in Computer Engineering from Bogazici University, Istanbul, Turkey, in 1991, and his M.Sc. in 1994. He is also an adjunct lecturer at the Computer Science Department of Bar-Ilan University and functions as the Information-Extraction Group Leader in the Data Mining Laboratory.

About this article

Cite this article

Feldman, R., Rosenfeld, B. & Fresko, M. TEG—a hybrid approach to information extraction. Knowl Inf Syst 9, 1–18 (2006). https://doi.org/10.1007/s10115-005-0204-y

  • DOI: https://doi.org/10.1007/s10115-005-0204-y
