Abstract
Many corpus-based methods for natural language processing rely on supervised training, which requires expensive manual annotation of training corpora. This paper investigates reducing annotation cost by sample selection. In this approach, the learner examines many unlabeled examples and selects for labeling only those that are most informative at each stage of training, thereby avoiding redundant annotation of examples that contribute little new information. We first analyze the issues that need to be addressed when constructing a sample selection algorithm, arguing for the attractiveness of committee-based selection methods. We then focus on selection for training probabilistic classifiers, which are commonly applied to problems in statistical natural language processing. We report experimental results of applying a specific type of committee-based selection during training of a stochastic part-of-speech tagger, and demonstrate substantially improved learning rates over complete training using all of the text.
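The general idea of committee-based selection can be illustrated with a minimal sketch. It is not the specific method evaluated in the paper: the abstract does not say how committee members are generated or how disagreement is measured, so the vote-entropy measure, the `threshold` parameter, and the names `vote_entropy` and `select_for_labeling` are all assumptions made for illustration. Each committee member is simply a function that maps an example to a label.

```python
import math
from collections import Counter
from typing import Callable, Hashable, Iterable, List, Sequence


def vote_entropy(votes: Sequence[Hashable]) -> float:
    """Entropy (in bits) of the labels voted by the committee.

    0.0 means all members agree; higher values mean more disagreement.
    (Assumed disagreement measure, chosen for illustration.)
    """
    total = len(votes)
    counts = Counter(votes)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def select_for_labeling(
    examples: Iterable[str],
    committee: Sequence[Callable[[str], Hashable]],
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the paper
) -> List[str]:
    """Return the unlabeled examples on which the committee disagrees most."""
    selected = []
    for x in examples:
        votes = [member(x) for member in committee]
        if vote_entropy(votes) > threshold:
            selected.append(x)  # informative: worth sending to an annotator
    return selected


# Toy committee of three hand-written "taggers" (hypothetical rules).
committee = [
    lambda w: "V" if w == "run" else "N",
    lambda w: "N",
    lambda w: "V" if w == "run" else "N",
]

to_annotate = select_for_labeling(["dog", "run", "table"], committee)
print(to_annotate)
```

Only the word on which the toy committee splits its votes is passed on for annotation; examples that every member labels identically are skipped, which is the source of the annotation savings the abstract describes.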
© 1996 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Engelson, S.P., Dagan, I. (1996). Sample selection in natural language learning. In: Wermter, S., Riloff, E., Scheler, G. (eds) Connectionist, Statistical and Symbolic Approaches to Learning for Natural Language Processing. IJCAI 1995. Lecture Notes in Computer Science, vol 1040. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-60925-3_50
DOI: https://doi.org/10.1007/3-540-60925-3_50
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-60925-4
Online ISBN: 978-3-540-49738-7
eBook Packages: Springer Book Archive