Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 1040))

Abstract

Many corpus-based methods for natural language processing are based on supervised training, requiring expensive manual annotation of training corpora. This paper investigates reducing annotation cost by sample selection. In this approach, the learner examines many unlabeled examples and selects for labeling only those that are most informative at each stage of training. In this way it is possible to avoid redundantly annotating examples that contribute little new information. The paper first analyzes the issues that need to be addressed when constructing a sample selection algorithm, arguing for the attractiveness of committee-based selection methods. We then focus on selection for training probabilistic classifiers, which are commonly applied to problems in statistical natural language processing. We report experimental results of applying a specific type of committee-based selection during training of a stochastic part-of-speech tagger, and demonstrate substantially improved learning rates over complete training using all of the text.
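The selection step described in the abstract can be illustrated with a minimal sketch. It assumes that disagreement among committee members is measured by the entropy of their votes and that an example is sent for annotation when that entropy exceeds a fixed threshold; the abstract specifies neither choice. The names `vote_entropy`, `select_for_labeling`, and `threshold`, as well as the toy committee, are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter
from typing import Callable, Hashable, Iterable, List, Sequence

# Hypothetical type: a committee member maps an example to a label.
Classifier = Callable[[str], Hashable]


def vote_entropy(labels: Sequence[Hashable]) -> float:
    """Entropy (in bits) of the committee's votes; zero means full agreement."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def select_for_labeling(
    committee: Sequence[Classifier],
    unlabeled: Iterable[str],
    threshold: float = 0.5,  # assumed cut-off, not a value from the paper
) -> List[str]:
    """Return the examples on which the committee disagrees strongly enough."""
    selected = []
    for example in unlabeled:
        votes = [member(example) for member in committee]
        if vote_entropy(votes) > threshold:
            selected.append(example)
    return selected


# Toy committee of three hand-written "taggers" (illustrative only).
committee = [
    lambda w: "NOUN" if w.endswith("s") else "VERB",
    lambda w: "NOUN" if w[0].isupper() else "VERB",
    lambda w: "NOUN" if len(w) > 4 else "VERB",
]
picked = select_for_labeling(committee, ["runs", "Paris", "go", "table"])
```

In this toy run the committee is unanimous on "Paris" and "go", so only the words on which the members split would be passed to an annotator. In a real setting the committee members would be model variants trained on the data labeled so far, rather than fixed rules.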

Editor information

Stefan Wermter Ellen Riloff Gabriele Scheler

Copyright information

© 1996 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Engelson, S.P., Dagan, I. (1996). Sample selection in natural language learning. In: Wermter, S., Riloff, E., Scheler, G. (eds) Connectionist, Statistical and Symbolic Approaches to Learning for Natural Language Processing. IJCAI 1995. Lecture Notes in Computer Science, vol 1040. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-60925-3_50

  • DOI: https://doi.org/10.1007/3-540-60925-3_50

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-60925-4

  • Online ISBN: 978-3-540-49738-7

  • eBook Packages: Springer Book Archive
