Abstract
Statistical language modelling always faces the problem of sparse data. One way to reduce this problem is to group words into equivalence classes. In this paper we present a clustering algorithm that builds abstract word equivalence classes. The algorithm finds a local optimum according to a maximum-likelihood criterion. Experiments were carried out on an English corpus of 1.1 million words and a German corpus of 100,000 words. Compared to a word bigram model, a bigram class model built on the clustered equivalence classes yields a significant improvement in perplexity. Depending on the size of the training material, the automatically clustered word classes even outperform manually determined categories.
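The abstract does not spell out the procedure, so the following is only a minimal sketch of how a maximum-likelihood class clustering of this kind could look. It assumes a class bigram model of the form p(w|v) = p(w|g(w)) · p(g(w)|g(v)), where g maps each word to its class, and an exchange-style search that moves one word at a time to whichever class most increases the training log-likelihood, stopping when no move helps (a local optimum). The function names, the round-robin initialisation, and the `max_sweeps` parameter are illustrative assumptions and are not taken from the chapter. The likelihood is recomputed from scratch for every candidate move, which is only practical for toy data.

```python
import math
from collections import Counter


def class_bigram_log_likelihood(bigrams, word_class):
    """Training log-likelihood of p(w|v) = p(w|g(w)) * p(g(w)|g(v)),
    with all probabilities set to their maximum-likelihood estimates."""
    class_pairs = Counter()  # N(g(v), g(w))
    pred = Counter()         # class counts in predecessor position
    succ = Counter()         # class counts in successor position
    word_succ = Counter()    # word counts in successor position
    for (v, w), n in bigrams.items():
        cv, cw = word_class[v], word_class[w]
        class_pairs[(cv, cw)] += n
        pred[cv] += n
        succ[cw] += n
        word_succ[w] += n

    def xlogx(counts):
        return sum(n * math.log(n) for n in counts.values())

    return xlogx(class_pairs) + xlogx(word_succ) - xlogx(pred) - xlogx(succ)


def exchange_clustering(tokens, num_classes, max_sweeps=20):
    """Greedy exchange search (assumed procedure, see lead-in)."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    vocab = sorted(set(tokens))
    # Hypothetical initialisation: spread words over classes round-robin.
    word_class = {w: i % num_classes for i, w in enumerate(vocab)}
    best = class_bigram_log_likelihood(bigrams, word_class)
    for _ in range(max_sweeps):
        moved = False
        for w in vocab:
            original = word_class[w]
            best_class = original
            for c in range(num_classes):
                if c == original:
                    continue
                word_class[w] = c  # tentatively move w to class c
                ll = class_bigram_log_likelihood(bigrams, word_class)
                if ll > best + 1e-12:
                    best, best_class = ll, c
            word_class[w] = best_class  # keep the best move, if any
            moved = moved or best_class != original
        if not moved:  # no single-word move improves the likelihood
            break
    return word_class, best


def perplexity(log_likelihood, num_bigrams):
    """Perplexity corresponding to a total log-likelihood over num_bigrams events."""
    return math.exp(-log_likelihood / num_bigrams)
```

Calling `exchange_clustering(tokens, num_classes)` on a list of word tokens returns the word-to-class mapping and its training log-likelihood, which `perplexity` converts into the measure used in the abstract. Perplexity measured on the training data itself is optimistic; a comparison like the one reported above would need held-out text and smoothed estimates.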
© 1993 Springer Science+Business Media Dordrecht
Cite this chapter
Kneser, R., Ney, H. (1993). Forming Word Classes by Statistical Clustering for Statistical Language Modelling. In: Köhler, R., Rieger, B.B. (eds) Contributions to Quantitative Linguistics. Springer, Dordrecht. https://doi.org/10.1007/978-94-011-1769-2_15
DOI: https://doi.org/10.1007/978-94-011-1769-2_15
Publisher Name: Springer, Dordrecht
Print ISBN: 978-94-010-4777-7
Online ISBN: 978-94-011-1769-2
eBook Packages: Springer Book Archive