Machine Learning for Information Extraction in Informal Domains

Freitag, Dayne

doi:10.1023/A:1007601113994

Published: May 2000

Machine Learning, Volume 39, pages 169–202 (2000)

Dayne Freitag¹

Abstract

We consider the problem of learning to perform information extraction in domains where linguistic processing is problematic, such as Usenet posts, email, and finger plan files. In place of syntactic and semantic information, other sources of information can be used, such as term frequency, typography, formatting, and mark-up. We describe four learning approaches to this problem, each drawn from a different paradigm: a rote learner, a term-space learner based on Naive Bayes, an approach using grammatical induction, and a relational rule learner. Experiments on 14 information extraction problems defined over four diverse document collections demonstrate the effectiveness of these approaches. Finally, we describe a multistrategy approach which combines these learners and yields performance competitive with or better than the best of them. This technique is modular and flexible, and could find application in other machine learning problems.
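To make the simplest of the four paradigms concrete, the following is a minimal sketch of a rote learner for a single extraction field. It is not taken from the article's implementation: the class name `RoteExtractor`, the token-and-span input format, and the add-one smoothing of the confidence estimate are all assumptions chosen for illustration. The idea shown is only that field instances seen in training are memorized verbatim and later matched against new text, with a confidence based on how often each memorized phrase was actually a field instance.

```python
from collections import Counter


class RoteExtractor:
    """Hypothetical rote learner: memorizes labeled phrases and matches them verbatim."""

    def __init__(self):
        self.as_field = Counter()  # phrase -> times it was labeled as the field
        self.seen = Counter()      # phrase -> times it occurred in training text
        self.max_len = 0           # length in tokens of the longest memorized phrase

    def _matches(self, tokens):
        """Yield every memorized phrase occurring in tokens, with its start index."""
        for n in range(1, self.max_len + 1):
            for i in range(len(tokens) - n + 1):
                phrase = tuple(tokens[i:i + n])
                if phrase in self.as_field:
                    yield phrase, i

    def train(self, documents):
        """documents: list of (tokens, spans); each span is a (start, end) slice."""
        documents = list(documents)
        # First pass: memorize every labeled field instance.
        for tokens, spans in documents:
            for start, end in spans:
                phrase = tuple(tokens[start:end])
                self.as_field[phrase] += 1
                self.max_len = max(self.max_len, len(phrase))
        # Second pass: count all occurrences, labeled or not, of memorized phrases.
        for tokens, _ in documents:
            for phrase, _ in self._matches(tokens):
                self.seen[phrase] += 1

    def extract(self, tokens):
        """Return (phrase, confidence) pairs for memorized phrases, best first."""
        found = {}
        for phrase, _ in self._matches(tokens):
            # Assumed estimate: add-one smoothed fraction of labeled occurrences.
            found[phrase] = (self.as_field[phrase] + 1) / (self.seen[phrase] + 2)
        return sorted(found.items(), key=lambda kv: -kv[1])


# Toy training data (invented for this sketch): the field is "speaker".
training = [
    ("speaker : alan turing room 5409".split(), [(2, 4)]),
    ("alan turing will visit ; speaker : grace hopper".split(), [(7, 9)]),
]
rote = RoteExtractor()
rote.train(training)
predictions = rote.extract("talk by grace hopper and alan turing".split())
```

In this toy run, the phrase that was labeled every time it appeared in training ranks above the phrase that also appeared once unlabeled. A learner of this kind cannot extract anything it has not seen before, which is one reason to combine it with learners that generalize.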

Author information

Authors and Affiliations

Justsystem Pittsburgh Research Center, 4616 Henry Street, Pittsburgh, PA, 15213, USA
Dayne Freitag

About this article

Cite this article

Freitag, D. Machine Learning for Information Extraction in Informal Domains. Machine Learning 39, 169–202 (2000). https://doi.org/10.1023/A:1007601113994

Issue Date: May 2000
DOI: https://doi.org/10.1023/A:1007601113994
