Chinese Abbreviation Identification Using Abbreviation-Template Features and Context Information

Sun, Xu; Wang, Houfeng

doi:10.1007/11940098_26

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4285)

Included in the following conference series:

International Conference on Computer Processing of Oriental Languages

Abstract

Chinese abbreviations are frequently used without being defined, which creates considerable difficulty for NLP. In this study, we propose the definition-independent abbreviation identification problem and treat it as a classification task in which abbreviation candidates are classified as either ‘abbreviation’ or ‘non-abbreviation’ according to the posterior probability. To identify new abbreviations from existing ones, we add generalization capability to the abbreviation lexicon by replacing words with word classes, thereby creating abbreviation-templates. An SVM model that uses abbreviation-template features together with context information serves as the classifier. Evaluation on a raw Chinese corpus yields encouraging performance. Our experiments further demonstrate an improvement after integration with morphological analysis, substring analysis and person name identification.
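
The template idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class labels (C1, C2), the toy lexicon, and the function names below are all hypothetical, and the sketch assumes that a template is simply the sequence of word classes of a known abbreviation's tokens. It only shows how replacing tokens with classes lets one known abbreviation license an unseen one; the context features and the SVM classifier are not shown.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical word-class map (token -> class label). The labels are invented
# for illustration; how the paper derives its word classes is not shown here.
WORD_CLASS: Dict[str, str] = {"北": "C1", "南": "C1", "大": "C2"}

# Hypothetical abbreviation lexicon: each entry is a known abbreviation,
# already split into tokens.
LEXICON: List[List[str]] = [["北", "大"]]


def to_template(tokens: List[str]) -> Tuple[str, ...]:
    """Replace each token with its class; unknown tokens keep their surface form."""
    return tuple(WORD_CLASS.get(token, token) for token in tokens)


def build_templates(lexicon: List[List[str]]) -> Set[Tuple[str, ...]]:
    """Generalize every lexicon entry into a class-level template."""
    return {to_template(entry) for entry in lexicon}


def template_feature(candidate: List[str], templates: Set[Tuple[str, ...]]) -> int:
    """Binary feature: 1 if the candidate's class sequence matches a template."""
    return int(to_template(candidate) in templates)


TEMPLATES = build_templates(LEXICON)

# "南大" is absent from the lexicon but shares the class sequence of "北大".
print(template_feature(["南", "大"], TEMPLATES))
# Reversing the tokens gives a class sequence that no template covers.
print(template_feature(["大", "南"], TEMPLATES))
```

In a full system, a feature like this would be one input among several (including context features) passed to the classifier.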

Supported by National Social Science Foundation of China (No. 05BYY043) and National Natural Science Foundation of China (No. 60473138, No. 60675035).

Author information

Authors and Affiliations

Department of Computer Science and Technology, School of Electronic Engineering and Computer Science, Peking University, Beijing, 100871, China
Xu Sun & Houfeng Wang

Editor information

Editors and Affiliations

Graduate School of Information Science, Nara Institute of Science and Technology, 630-0192, Takayama, Ikoma, Nara, Japan
Yuji Matsumoto
Dept of ECE, University of Illinois at Urbana Champaign, IL 61801, Urbana, USA
Richard W. Sproat
Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong
Kam-Fai Wong
State Key Lab of Intelligent Tech. & Sys., Tsinghua University,
Min Zhang

About this paper

Cite this paper

Sun, X., Wang, H. (2006). Chinese Abbreviation Identification Using Abbreviation-Template Features and Context Information. In: Matsumoto, Y., Sproat, R.W., Wong, KF., Zhang, M. (eds) Computer Processing of Oriental Languages. Beyond the Orient: The Research Challenges Ahead. ICCPOL 2006. Lecture Notes in Computer Science, vol 4285. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11940098_26

DOI: https://doi.org/10.1007/11940098_26
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-49667-0
Online ISBN: 978-3-540-49668-7
eBook Packages: Computer Science (R0)
