Chinese Pinyin-Text Conversion on Segmented Text

  • Wei Liu
  • Louise Guthrie
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5729)


Most current research and applications on Pinyin-to-Chinese conversion employ a hidden Markov model (HMM) built on a character-based language model, because Chinese text is written without word boundaries. However, in some tasks that involve Pinyin-to-Chinese conversion, such as Chinese text proofreading, the original Chinese text is known. This allows the words to be extracted and a word-based language model to be developed. In this paper we compare the two models and conclude that a word-based bi-gram language model achieves higher conversion accuracy than a character-based bi-gram language model.
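To illustrate the character-based approach the abstract contrasts with, the following is a minimal sketch of Viterbi decoding over a bigram language model, where each Pinyin syllable maps to a set of candidate characters. The homophone table and probabilities are toy values invented for the example, not data from the paper; a word-based model would differ only in that lattice states span multi-character words.

```python
import math

# Toy homophone table: pinyin syllable -> candidate characters
# (assumed illustrative data, not from the paper).
candidates = {
    "zhong": ["中", "种", "重"],
    "guo":   ["国", "果", "过"],
}

# Toy character-bigram probabilities; "<s>" marks the sentence start.
bigram = {
    ("<s>", "中"): 0.4, ("<s>", "种"): 0.3, ("<s>", "重"): 0.3,
    ("中", "国"): 0.8, ("中", "果"): 0.1, ("中", "过"): 0.1,
    ("种", "果"): 0.6, ("重", "过"): 0.5,
}
FLOOR = 1e-4  # fallback probability for unseen bigrams

def logp(prev, cur):
    return math.log(bigram.get((prev, cur), FLOOR))

def viterbi(pinyin):
    """Return the most probable character sequence for a pinyin sequence."""
    # states: character -> (best log-probability, best path ending there)
    states = {"<s>": (0.0, [])}
    for syl in pinyin:
        nxt = {}
        for cur in candidates[syl]:
            # Extend every surviving path by `cur`, keep the best one.
            nxt[cur] = max(
                (lp + logp(prev, cur), path + [cur])
                for prev, (lp, path) in states.items()
            )
            states = states  # previous column is reused for each candidate
        states = nxt
    return max(states.values())[1]

print("".join(viterbi(["zhong", "guo"])))  # → 中国
```

Because "中" followed by "国" has the highest joint bigram probability in this toy table, the decoder selects it over the competing homophone paths.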





Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Wei Liu (1)
  • Louise Guthrie (1)

  1. Department of Computer Science, University of Sheffield, UK
