Abstract
This paper presents the main progress and achievements in Chinese information processing. It focuses on six areas: Chinese syntactic analysis, Chinese semantic analysis, machine translation, information retrieval, information extraction, and speech recognition and synthesis. The important techniques of each branch, and the key problems each is likely to face in the near future, are also discussed.
Additional information
Survey: Supported by the National Natural Science Foundation of China (Grant Nos. 60375019, 60373101 and 60575041).
Sheng Li is a professor and Ph.D. supervisor at Harbin Institute of Technology. He is a standing director of the Chinese Information Processing Society, an appraiser of the National Science Foundation, and director of the MOE-MS Key Laboratory of NLP & Speech at HIT. His research interests include machine translation, information retrieval, and natural language processing. In recent years, he has completed more than 10 projects funded by the Natural Science Foundation of China or the 863 Hi-Tech Project. He has won 4 Second Prizes and 3 Third Prizes of the Ministry Science and Technology Progress Award. He has published more than 70 academic papers in journals and conferences at home and abroad.
Tie-Jun Zhao is a professor and Ph.D. supervisor at Harbin Institute of Technology and vice director of the MOE-MS Key Laboratory of NLP & Speech at HIT. He is a member of the NLP subject committee of the Chinese Information Society, a member of the editorial board of the Journal of Chinese Information Processing, a member of the committee of the China Language Data Consortium, a member of the Harbin Expert Group on Information Security, and a senior member of the China Computer Federation. His research interests include natural language processing, content-based web information processing, and applied artificial intelligence. He has won 3 prizes of the Ministry Science & Technology Award. He has published over 60 academic papers and 2 books.
Cite this article
Li, S., Zhao, TJ. Chinese Information Processing and Its Prospects. J Comput Sci Technol 21, 838–846 (2006). https://doi.org/10.1007/s11390-006-0838-6