Abstract
Category-based statistical language models are an important approach to the sparse data problem. However, they face two bottlenecks: 1) word clustering: it is hard to find a clustering method that performs well at low computational cost; 2) loss of predictive power: class-based models tend to lose the ability to adapt to text from different domains. To address these problems, a definition of word similarity based on mutual information is presented, and a definition of word-set similarity is derived from it. Experiments show that the similarity-based word clustering algorithm outperforms the conventional greedy clustering method in both speed and performance, reducing perplexity from 283 to 218. In addition, an absolute weighted difference method is presented and used to construct a vari-gram language model with good predictive ability. Compared with the category-based model, the vari-gram model reduces perplexity from 234.65 to 219.14 on Chinese corpora and from 195.56 to 184.25 on English corpora.
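To make the perplexity figures above concrete, the following is a minimal sketch of how a category-based bigram model can score text. It assumes the standard class-based factorization P(w_i | w_{i-1}) ≈ P(w_i | c_i) · P(c_i | c_{i-1}), which may differ from the exact model used in the article. The function name, the toy corpus, and the hand-written word-to-class map are all hypothetical, and the estimates are unsmoothed maximum-likelihood counts.

```python
import math
from collections import Counter


def class_bigram_perplexity(train, test, word2class):
    """Perplexity of an unsmoothed class-based bigram model (illustrative only).

    Assumes every test word and every test class transition was seen in
    training; an unseen event gives probability 0 and math.log2 raises
    ValueError.
    """
    classes = [word2class[w] for w in train]
    word_count = Counter(train)                         # count(w)
    class_count = Counter(classes)                      # count(c)
    class_bigram = Counter(zip(classes, classes[1:]))   # count(c_prev, c_cur)
    history_count = Counter(classes[:-1])               # count(c_prev) as a history

    log_prob = 0.0
    n = 0
    for prev, cur in zip(test, test[1:]):
        c_prev, c_cur = word2class[prev], word2class[cur]
        p_word = word_count[cur] / class_count[c_cur]                      # P(w | c)
        p_class = class_bigram[(c_prev, c_cur)] / history_count[c_prev]    # P(c | c_prev)
        log_prob += math.log2(p_word * p_class)
        n += 1
    return 2 ** (-log_prob / n)


# Hypothetical toy data: the word bigram "dog ran" never occurs in training,
# but the class bigram NOUN -> VERB does, so the model still assigns it mass.
word2class = {"the": "DET", "cat": "NOUN", "dog": "NOUN", "sat": "VERB", "ran": "VERB"}
train = "the cat sat the dog sat the cat ran".split()
test = "the dog ran".split()
perplexity = class_bigram_perplexity(train, test, word2class)
```

On this toy data the perplexity comes out at about 3. The example shows why grouping words into classes helps with sparse data: an unseen word pair is still scored through its class pair. A practical model would add smoothing and learn the classes automatically rather than take them from a fixed map.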
Additional information
Foundation item: Project(60763001) supported by the National Natural Science Foundation of China; Project(2010GZS0072) supported by the Natural Science Foundation of Jiangxi Province, China; Project(GJJ12271) supported by the Science and Technology Foundation of Provincial Education Department of Jiangxi Province, China
About this article
Cite this article
Yuan, Lc. Vari-gram language model based on word clustering. J. Cent. South Univ. Technol. 19, 1057–1062 (2012). https://doi.org/10.1007/s11771-012-1109-z
DOI: https://doi.org/10.1007/s11771-012-1109-z