Sample selection in natural language learning

Engelson, Sean P.; Dagan, Ido

doi:10.1007/3-540-60925-3_50

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 1040)

Included in the following conference series:

International Joint Conference on Artificial Intelligence

Abstract

Many corpus-based methods for natural language processing are based on supervised training, requiring expensive manual annotation of training corpora. This paper investigates reducing annotation cost by sample selection. In this approach, the learner examines many unlabeled examples and selects for labeling only those that are most informative at each stage of training. In this way it is possible to avoid redundantly annotating examples that contribute little new information. The paper first analyzes the issues that need to be addressed when constructing a sample selection algorithm, arguing for the attractiveness of committee-based selection methods. We then focus on selection for training probabilistic classifiers, which are commonly applied to problems in statistical natural language processing. We report experimental results of applying a specific type of committee-based selection during training of a stochastic part-of-speech tagger, and demonstrate substantially improved learning rates over complete training using all of the text.
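
The selection step described in the abstract can be illustrated with a minimal sketch. It assumes a committee of classifiers that each vote on a label, and it measures disagreement by the entropy of those votes; the abstract does not say which disagreement measure the paper uses, so vote entropy, the threshold value, and all names below (vote_entropy, select_for_labeling, the toy committee) are hypothetical choices made for illustration only.

```python
import math
from collections import Counter


def vote_entropy(votes):
    """Entropy (in bits) of the label distribution among the committee's votes."""
    counts = Counter(votes)
    total = len(votes)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def select_for_labeling(examples, committee, threshold=0.5):
    """Keep only the examples on which the committee disagrees enough.

    The threshold is an assumed tuning parameter, not a value from the paper.
    """
    selected = []
    for example in examples:
        votes = [member(example) for member in committee]
        if vote_entropy(votes) > threshold:
            selected.append(example)
    return selected


# Toy committee: three hypothetical classifiers that tag a word as NOUN or VERB.
committee = [
    lambda w: "NOUN" if w.endswith("s") else "VERB",
    lambda w: "NOUN" if len(w) > 4 else "VERB",
    lambda w: "NOUN",
]
words = ["runs", "tables", "go", "walk"]
chosen = select_for_labeling(words, committee)
print(chosen)
```

In this toy run the committee is unanimous on "tables", so that word is skipped, while the words with split votes are passed on for annotation. A real system would presumably draw committee members from models trained on the labeled data gathered so far.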

Author information

Authors and Affiliations

Department of Mathematics and Computer Science, Bar-Ilan University, 52900, Ramat Gan, Israel
Sean P. Engelson & Ido Dagan

Editor information

Stefan Wermter, Ellen Riloff, Gabriele Scheler

About this paper

Cite this paper

Engelson, S.P., Dagan, I. (1996). Sample selection in natural language learning. In: Wermter, S., Riloff, E., Scheler, G. (eds) Connectionist, Statistical and Symbolic Approaches to Learning for Natural Language Processing. IJCAI 1995. Lecture Notes in Computer Science, vol 1040. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-60925-3_50

DOI: https://doi.org/10.1007/3-540-60925-3_50
Published: 07 June 2005
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-60925-4
Online ISBN: 978-3-540-49738-7
eBook Packages: Springer Book Archive