
Overview of Character-Based Models for Natural Language Processing

  • Conference paper
Computational Linguistics and Intelligent Text Processing (CICLing 2017)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10761)


Abstract

Character-based models have become increasingly popular for many natural language processing tasks, especially due to the success of neural networks. They make it possible to model text sequences directly, without the need for tokenization, and can therefore enhance the traditional preprocessing pipeline. This paper provides an overview of character-based models for a variety of natural language processing tasks. We group existing work into three categories: tokenization-based approaches, bag-of-n-gram models, and end-to-end models. For each category, we present prominent example studies, with a particular focus on recent character-based deep learning work.
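To make the second category concrete, the following minimal sketch (written for this overview page, not taken from the paper) builds a bag-of-character-n-grams representation directly from raw text; the function name, boundary markers, and n-gram range are illustrative assumptions.

```python
from collections import Counter

def char_ngrams(text, n_min=2, n_max=4):
    """Count all character n-grams of length n_min..n_max.

    No tokenizer is involved: whitespace, hyphens and punctuation are
    treated as ordinary characters, so n-grams may span word boundaries.
    """
    padded = f"<{text}>"  # boundary markers, an arbitrary convention here
    grams = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            grams[padded[i:i + n]] += 1
    return grams

# Example: a multiword expression is represented without any segmentation step.
print(char_ngrams("San Francisco-Los Angeles flights", 3, 3).most_common(5))
```

Count vectors like this (or hashed versions of them) can be fed to any standard classifier, which is how bag-of-character-n-gram features are commonly used, for example in text categorization or language identification.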


Notes

  1. There are also difficult cases in English, such as “Yahoo!” or “San Francisco-Los Angeles flights” (a naive tokenizer’s handling of these is sketched after these notes).

  2. In our view, morpheme-based models are not true instances of character-level models, as linguistically motivated morphological segmentation is a step equivalent to tokenization, only on a different level. We therefore do not cover most work on morphological segmentation in this paper.
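As a concrete illustration of Note 1, the hypothetical snippet below shows what a naive rule-based tokenizer does to those examples; the splitting rule and function name are assumptions chosen for illustration, not a tokenizer used in the paper.

```python
import re

def naive_tokenize(text):
    # Split on whitespace and hyphens, then strip trailing punctuation.
    return [tok.strip(".,!?") for tok in re.split(r"[\s\-]+", text) if tok]

print(naive_tokenize("Yahoo! announced San Francisco-Los Angeles flights"))
# ['Yahoo', 'announced', 'San', 'Francisco', 'Los', 'Angeles', 'flights']
# "Yahoo!" loses its exclamation mark, and the route "San Francisco-Los Angeles"
# is split so that the two city names can no longer be recovered as units.
```

Character-level models sidestep such decisions by never committing to token boundaries in the first place.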


Author information


Corresponding author

Correspondence to Hinrich Schütze.


Copyright information

© 2018 Springer Nature Switzerland AG

About this paper


Cite this paper

Adel, H., Asgari, E., Schütze, H. (2018). Overview of Character-Based Models for Natural Language Processing. In: Gelbukh, A. (ed.) Computational Linguistics and Intelligent Text Processing. CICLing 2017. Lecture Notes in Computer Science, vol. 10761. Springer, Cham. https://doi.org/10.1007/978-3-319-77113-7_1


  • DOI: https://doi.org/10.1007/978-3-319-77113-7_1


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-77112-0

  • Online ISBN: 978-3-319-77113-7

  • eBook Packages: Computer Science, Computer Science (R0)
