Skip to main content

A Hybrid Machine Learning Approach for Information Extraction from Free Text

  • Conference paper
From Data and Information Analysis to Knowledge Engineering

Abstract

We present a hybrid machine learning approach for information extraction from unstructured documents by integrating a learned classifier based on the Maximum Entropy Modeling (MEM), and a classifier based on our work on Data-Oriented Parsing (DOP). The hybrid behavior is achieved through a voting mechanism applied by an iterative tag-insertion algorithm. We have tested the method on a corpus of German newspaper articles about company turnover, and achieved 85.2% F-measure using the hybrid approach, compared to 79.3% for MEM and 51.9% for DOP when running them in isolation.

Thanks to Volker Morbach for his great help during the implementation and evaluation phase of the project. This work was supported by a research grant from BMBF to the DFKI project Quetal (FKZ: 01 IW C02).

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 159.00
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  • BENDER, O., OCH, F., and NEY, H. (2003): Maximum Entropy Models for Named Entity Recognition In: Proceedings of CoNLL-2003, pp. 148–151.

    Google Scholar 

  • BOD, R., SCHA, R. and SIMA’AN, K. (2003): Data-Oriented Parsing. CSLI Publications, University of Chicago Press.

    Google Scholar 

  • CHIEU, H. L. and NG, H. T. (2002): A Maximum Entropy Approach to Information Extraction from Semi-Structured and Free Text. In Proceedings of AAAI 2002.

    Google Scholar 

  • DARROCH, J. N. and RATCLIFF, D. (1972). Generalized Iterative Scaling for Log-Linear Models. Annals of Mathematical Statistics, 43, pages 1470–1480.

    MathSciNet  Google Scholar 

  • FLORIAN, R., ITTYCHERIAH, A., JING, H., and ZHANG, T. (2003): Named Entity Recognition through Classifier Combination. In: Proceedings of CoNLL-2003, pp. 168–171.

    Google Scholar 

  • FREITAG, D. (1998): Multistrategy Learning for Information Extraction. In Proceedings of the 15th ICML, pages 161–169.

    Google Scholar 

  • NEUMANN, G. (2003): A Data-Driven Approach to Head-Driven Phrase Structure Grammar. In: R. Scha, R. Bod, and K. Sima’an (eds.) Data-Oriented Parsing, pages 233–251.

    Google Scholar 

  • NEUMANN, G. and PISKORSKI, J. (2002): A Shallow Text Processing Core Engine. Journal of Computational Intelligence, 18, 451–476.

    Google Scholar 

  • PIETRA, S. D., PIETRA, V. J. and LAFFERTY, J. D. (1997): Inducing Features of Random Fields. Journal of IEEE Transactions on Pattern Analysis and Machine Intelligence, 19, 380–393.

    Google Scholar 

  • RATNAPARKHI, A. (1998): Maximum Entropy Models for Natural Language Ambiguity Resolution. Ph.D. Thesis, University of Pennsylvania, Philadelphia, PA.

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2006 Springer Berlin · Heidelberg

About this paper

Cite this paper

Neumann, G. (2006). A Hybrid Machine Learning Approach for Information Extraction from Free Text. In: Spiliopoulou, M., Kruse, R., Borgelt, C., Nürnberger, A., Gaul, W. (eds) From Data and Information Analysis to Knowledge Engineering. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-31314-1_47

Download citation

Publish with us

Policies and ethics