Abstract
Characters have commonly been regarded as the minimal processing unit in Natural Language Processing (NLP). However, many non-Latin languages have logographic writing systems with a large character set of thousands of characters or more. Each character is composed of even smaller parts, which previous work has often ignored. Among these languages, Chinese is a typical case: every character contains several components called radicals. In this paper, we propose a novel architecture employing two stacked Long Short-Term Memory Networks (LSTMs) to learn sub-character level representations and capture a deeper level of semantic meaning. To make the study concrete and substantiate the effectiveness of our neural architecture, we take Chinese Word Segmentation as a case study. Our networks employ a shared radical-level embedding to handle both Simplified and Traditional Chinese Word Segmentation without an extra Traditional-to-Simplified conversion step; in this highly end-to-end fashion, word segmentation is significantly simplified compared with previous work. Radical-level embeddings also capture semantic meaning below the character level and improve learning performance. Tying radical and character embeddings together reduces the parameter count, while semantic knowledge is shared and transferred between the two levels, which substantially boosts performance. On 3 out of 4 Bakeoff 2005 datasets, our method surpassed state-of-the-art results by up to 0.4%. Our results are reproducible; source code and corpora are available on GitHub (https://github.com/hankcs/sub-character-cws).
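The input side of such a model can be illustrated with a minimal sketch. It shows how a shared radical vocabulary lets Simplified and Traditional characters reuse the same embedding rows, and how a single table indexed by both radicals and characters needs fewer rows than two separate tables. The decomposition table, the helper names (`radicals_of`, `build_vocab`, `encode`) and the single-table reading of "tying" are assumptions made for illustration; the abstract does not specify how the tying is implemented, and the LSTM layers themselves are omitted.

```python
# Toy radical decompositions, for illustration only. A real system would load
# them from a decomposition dictionary; this table is an assumption.
DECOMPOSITION = {
    "好": ["女", "子"],
    "妈": ["女", "马"],  # Simplified
    "媽": ["女", "馬"],  # Traditional
    "明": ["日", "月"],
    "林": ["木", "木"],
}


def radicals_of(char):
    """Return the radical sequence of a character.

    Characters without an entry are treated as their own single radical.
    """
    return DECOMPOSITION.get(char, [char])


def build_vocab(symbols):
    """Assign each distinct symbol an integer id in first-seen order."""
    vocab = {}
    for symbol in symbols:
        vocab.setdefault(symbol, len(vocab))
    return vocab


def encode(sentence, vocab):
    """Map every character of a sentence to the ids of its radicals."""
    return [[vocab[r] for r in radicals_of(ch)] for ch in sentence]


# One radical vocabulary shared by Simplified and Traditional characters.
radical_vocab = build_vocab(r for ch in DECOMPOSITION for r in DECOMPOSITION[ch])

simplified = encode("妈", radical_vocab)   # [[0, 2]]
traditional = encode("媽", radical_vocab)  # [[0, 3]]

# Some characters are also radicals (女, 日, 木). With two separate tables they
# need two rows each; with one tied table (assumed reading) they need one.
characters = list(DECOMPOSITION) + ["女", "日", "木"]
all_radicals = [r for ch in characters for r in radicals_of(ch)]

separate_rows = len(set(all_radicals)) + len(set(characters))  # 7 + 8 = 15
tied_rows = len(build_vocab(all_radicals + characters))        # 12

print(simplified, traditional)
print(separate_rows, tied_rows)
```

In this toy setting 妈 and 媽 both start with the id of 女, so whatever is learned for that radical on one script is available to the other without a conversion step. In a full model, each inner list of radical ids would be fed to the radical-level LSTM to build a character representation, which the character-level LSTM then consumes.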
Notes
- 1.
- 2. https://github.com/facebookresearch/fastText (with a tiny modification to output n-gram vectors)
- 3. http://www.sighan.org/bakeoff2003/score (this script rounds a score to one digit)
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
He, H. et al. (2018). Dual Long Short-Term Memory Networks for Sub-Character Representation Learning. In: Latifi, S. (eds) Information Technology - New Generations. Advances in Intelligent Systems and Computing, vol 738. Springer, Cham. https://doi.org/10.1007/978-3-319-77028-4_55
DOI: https://doi.org/10.1007/978-3-319-77028-4_55
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-77027-7
Online ISBN: 978-3-319-77028-4
eBook Packages: Engineering, Engineering (R0)