
Vari-gram language model based on word clustering

  • Published in: Journal of Central South University

Abstract

Category-based statistical language models are an important method for coping with data sparseness, but they face two bottlenecks: 1) word clustering, where it is hard to find a clustering method that combines good performance with low computational cost; and 2) loss of prediction ability, since class-based methods adapt poorly to text from different domains. To address these problems, a definition of word similarity using mutual information was presented, and from it a definition of word-set similarity was derived. Experiments show that the similarity-based word clustering algorithm outperforms the conventional greedy clustering method in both speed and performance, reducing perplexity from 283 to 218. In addition, an absolute weighted difference method was presented and used to construct a vari-gram language model with good prediction ability. Compared with the category-based model, the perplexity of the vari-gram model is reduced from 234.65 to 219.14 on Chinese corpora and from 195.56 to 184.25 on English corpora.
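The abstract builds word clustering on a mutual-information-based word similarity. The paper's exact formulation is not reproduced on this page, so the sketch below is only an illustration of one common MI-based approach: it estimates pointwise mutual information from bigram counts over a toy corpus and compares two words by the cosine of their MI context vectors. The toy corpus, the cosine combination, and all names here are assumptions, not the author's definitions.

```python
from collections import Counter
from math import log, sqrt

# Toy corpus standing in for real training text (an assumption).
corpus = "the cat sat on the mat the dog sat on the rug".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
N = len(corpus)        # unigram sample size
M = len(corpus) - 1    # bigram sample size

def mi(w, c):
    """Pointwise mutual information of word w followed by context word c."""
    joint = bigrams[(w, c)] / M
    if joint == 0:
        return 0.0
    return log(joint / ((unigrams[w] / N) * (unigrams[c] / N)))

vocab = sorted(unigrams)

def similarity(w1, w2):
    """Similarity of two words as the cosine of their MI context vectors."""
    v1 = [mi(w1, c) for c in vocab]
    v2 = [mi(w2, c) for c in vocab]
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = sqrt(sum(a * a for a in v1))
    n2 = sqrt(sum(b * b for b in v2))
    return dot / (n1 * n2) if n1 and n2 else 0.0
```

On this toy corpus, "cat" and "dog" come out highly similar (both are followed by "sat"), which is the kind of distributional evidence a similarity-based clustering algorithm would use to merge words into one class.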




Author information


Corresponding author

Correspondence to Li-chi Yuan  (袁里驰).

Additional information

Foundation item: Project(60763001) supported by the National Natural Science Foundation of China; Project(2010GZS0072) supported by the Natural Science Foundation of Jiangxi Province, China; Project(GJJ12271) supported by the Science and Technology Foundation of Provincial Education Department of Jiangxi Province, China


About this article

Cite this article

Yuan, L.-c. Vari-gram language model based on word clustering. J. Cent. South Univ. Technol. 19, 1057–1062 (2012). https://doi.org/10.1007/s11771-012-1109-z


