Vari-gram language model based on word clustering

Yuan, Li-chi

doi:10.1007/s11771-012-1109-z

Published: 29 March 2012

Volume 19, pages 1057–1062, (2012)

Journal of Central South University

Li-chi Yuan (袁里驰)^1,2

Abstract

The category-based statistical language model is an important approach to the sparse-data problem, but it has two bottlenecks: 1) Word clustering: it is hard to find a clustering method that performs well at low computational cost. 2) Domain adaptation: class-based methods lose predictive ability when applied to text from different domains. To address these problems, a definition of word similarity based on mutual information is presented, and a definition of word-set similarity is derived from it. Experiments show that the similarity-based word clustering algorithm outperforms the conventional greedy clustering method in both speed and performance, reducing perplexity from 283 to 218. In addition, an absolute weighted difference method is presented and used to construct a vari-gram language model with good predictive ability. Compared with the category-based model, the vari-gram model reduces perplexity from 234.65 to 219.14 on Chinese corpora and from 195.56 to 184.25 on English corpora.
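
The abstract does not give the model equations, so the following is only a minimal sketch of the textbook class-based bigram, P(w_i | w_{i-1}) ≈ P(c_i | c_{i-1}) · P(w_i | c_i), together with the perplexity measure in which the results are reported. The function names, the toy corpus and the hand-assigned classes are illustrative assumptions. The sketch does not reproduce the paper's similarity-based clustering or its absolute weighted difference method.

```python
import math
from collections import Counter


def train_class_bigram(tokens, word2class):
    """Return prob(prev_word, word) for an unsmoothed class-based bigram.

    Assumption: classes are supplied by hand in word2class; the paper
    derives them from a mutual-information word similarity instead.
    """
    classes = [word2class[w] for w in tokens]
    word_count = Counter(tokens)                       # count(w)
    class_count = Counter(classes)                     # count(c)
    class_bigram = Counter(zip(classes, classes[1:]))  # count(c_prev, c)
    prev_count = Counter(classes[:-1])                 # count(c_prev) as a history

    def prob(prev_word, word):
        c_prev, c = word2class[prev_word], word2class[word]
        p_class = class_bigram[(c_prev, c)] / prev_count[c_prev]  # P(c | c_prev)
        p_word = word_count[word] / class_count[c]                # P(w | c)
        return p_class * p_word

    return prob


def perplexity(prob, tokens):
    """Perplexity over consecutive word pairs: 2 ** (-mean log2 probability).

    No smoothing is applied, so every pair must have non-zero probability.
    """
    pairs = list(zip(tokens, tokens[1:]))
    log_sum = sum(math.log2(prob(prev, word)) for prev, word in pairs)
    return 2 ** (-log_sum / len(pairs))


def relative_reduction(before, after):
    """Fractional drop in perplexity, e.g. 0.10 for a 10% reduction."""
    return (before - after) / before


# Hypothetical toy data, for illustration only.
toy_tokens = "the cat sat the dog sat".split()
toy_classes = {"the": "DET", "cat": "N", "dog": "N", "sat": "V"}
toy_model = train_class_bigram(toy_tokens, toy_classes)
toy_ppl = perplexity(toy_model, toy_tokens)
```

Applied to the figures quoted in the abstract, `relative_reduction` gives roughly 23% for the clustering result (283 to 218), 6.6% for the Chinese corpora (234.65 to 219.14) and 5.8% for the English corpora (195.56 to 184.25).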


Author information

Authors and Affiliations

School of Information Technology, Jiangxi University of Finance and Economics, Nanchang, 330013, China
Li-chi Yuan (袁里驰)
School of Information Science and Engineering, Central South University, Changsha, 410083, China
Li-chi Yuan (袁里驰)

Corresponding author

Correspondence to Li-chi Yuan (袁里驰).

Additional information

Foundation item: Project(60763001) supported by the National Natural Science Foundation of China; Project(2010GZS0072) supported by the Natural Science Foundation of Jiangxi Province, China; Project(GJJ12271) supported by the Science and Technology Foundation of Provincial Education Department of Jiangxi Province, China

About this article

Cite this article

Yuan, Lc. Vari-gram language model based on word clustering. J. Cent. South Univ. Technol. 19, 1057–1062 (2012). https://doi.org/10.1007/s11771-012-1109-z

Received: 16 May 2011
Accepted: 16 November 2011
Published: 29 March 2012
Issue Date: April 2012
DOI: https://doi.org/10.1007/s11771-012-1109-z
