Acquisition of Domain Knowledge

  • Conference paper
Information Extraction in the Web Era (SCIE 2002)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2700)

Abstract

Linguistic knowledge in Natural Language understanding systems is commonly stratified across several levels. This is true of Information Extraction as well. Typical state-of-the-art Information Extraction systems require syntactic-semantic patterns for locating facts or events in text; domain-specific word or concept classes for semantic generalization; and a specialized lexicon of terms that may not be found in general-purpose dictionaries, among other kinds of knowledge.

We describe an approach to unsupervised, or minimally supervised, knowledge acquisition. The approach is based on bootstrapping a comprehensive knowledge base from a small set of seed elements. It is embodied in algorithms for discovering patterns, concept classes, and lexicon entries from raw, unannotated text.
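
As a rough illustration of what bootstrapping from seeds can look like, the sketch below grows a set of accepted patterns from a single seed over a toy corpus. It is not the chapter's algorithm: the document representation (each document as a set of pattern strings), the scoring formula, the threshold `min_score`, and all names are assumptions made for this example.

```python
import math


def bootstrap_patterns(docs, seeds, max_iters=10, min_score=0.3):
    """Grow a pattern set from seeds (hypothetical sketch).

    docs  : list of sets; each set holds the pattern strings found in one document
    seeds : set of pattern strings assumed to be relevant to the domain
    """
    accepted = set(seeds)
    for _ in range(max_iters):
        # Every pattern seen in the corpus that has not been accepted yet.
        candidates = set().union(*docs) - accepted
        best, best_score = None, min_score
        for pattern in sorted(candidates):  # sorted for deterministic ties
            matched = [d for d in docs if pattern in d]
            # A document counts as relevant if it contains an accepted pattern.
            relevant = sum(1 for d in matched if d & accepted)
            if relevant == 0:
                continue
            # Assumed score: precision on relevant documents, weighted by support.
            score = (relevant / len(matched)) * math.log(relevant + 1)
            if score > best_score:
                best, best_score = pattern, score
        if best is None:
            break  # no candidate clears the threshold; stop
        accepted.add(best)  # accept one pattern per iteration
    return accepted


# Toy corpus: three management-succession documents, three sports documents.
docs = [
    {"company-appoint-person", "person-resign"},
    {"company-appoint-person", "person-succeed-person"},
    {"person-resign", "person-succeed-person"},
    {"team-win-game"},
    {"team-win-game", "person-resign"},
    {"team-win-game"},
]
result = bootstrap_patterns(docs, {"company-appoint-person"})
print(sorted(result))
```

On this toy corpus the loop accepts the two succession-related patterns and stops before accepting the sports pattern, because that pattern co-occurs too rarely with documents already judged relevant. Accepting only one pattern per iteration is a conservative choice meant to limit drift away from the seed domain.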

We present the results of knowledge acquisition, and examine them in the context of prior work. We discuss problems in evaluating the quality of the acquired knowledge, and methodologies for evaluation.

© 2003 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Yangarber, R. (2003). Acquisition of Domain Knowledge. In: Pazienza, M.T. (eds) Information Extraction in the Web Era. SCIE 2002. Lecture Notes in Computer Science, vol 2700. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-45092-4_1

  • DOI: https://doi.org/10.1007/978-3-540-45092-4_1

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-40579-5

  • Online ISBN: 978-3-540-45092-4

  • eBook Packages: Springer Book Archive
