Abstract
Communication has become increasingly dynamic with the popularization of social networks and applications that allow people to express themselves and communicate instantly. In this scenario, the quality of distributed representation models is affected by new words that appear frequently or that derive from spelling errors. These words, unknown to the models and called out-of-vocabulary (OOV) words, must be properly handled so as not to degrade the quality of natural language processing (NLP) applications, which depend on appropriate vector representations of the texts. To better understand this problem and to find the best techniques for handling OOV words, in this study we present a comprehensive performance evaluation of deep learning models for representing OOV words. We performed an intrinsic evaluation using a benchmark dataset and an extrinsic evaluation using different NLP tasks: text categorization, named entity recognition, and part-of-speech tagging. Although the results indicated that the best technique for handling OOV words differs across tasks, Comick, a deep learning method that infers an embedding from the context and the morphological structure of the OOV word, achieved promising results.
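To make the strategy described above concrete, the sketch below illustrates, in simplified form, how an OOV embedding can be inferred from the two signals the abstract attributes to Comick: the context and the morphological structure of the unknown word. It is a minimal illustration under our own assumptions, not the authors' implementation; the function names, the fastText-style character n-grams, and the equal weighting of the two parts are all hypothetical.

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=5):
    """Character n-grams with boundary markers (fastText-style)."""
    w = f"<{word}>"
    return [w[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)]

def infer_oov_vector(oov_word, context_words, word_vecs, ngram_vecs, dim=300):
    """Estimate a vector for an OOV word by combining (a) the mean embedding
    of its in-vocabulary context words and (b) the mean of its character
    n-gram vectors. Equal weighting is an illustrative assumption."""
    ctx = [word_vecs[w] for w in context_words if w in word_vecs]
    context_part = np.mean(ctx, axis=0) if ctx else np.zeros(dim)
    grams = [ngram_vecs[g] for g in char_ngrams(oov_word) if g in ngram_vecs]
    morph_part = np.mean(grams, axis=0) if grams else np.zeros(dim)
    return (context_part + morph_part) / 2.0
```

Comick itself learns how to combine these signals with a neural network trained to reproduce known embeddings; the sketch only fixes the inputs and output such a method works with.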
We gratefully acknowledge the support provided by the São Paulo Research Foundation (FAPESP; grants #2017/09387-6 and #2018/02146-6), CAPES, and CNPq.
Notes
1. In this study, we use the word sample to denote an instance or a text document.
2. Transformers. Available at https://huggingface.co/transformers. Accessed 7 October 2020. (A tokenization sketch follows these notes.)
3. HiCE. Available at https://github.com/acbull/HiCE. Accessed 7 October 2020.
4. PyTorch GitHub. Available at https://bit.ly/2B7LS3U. Accessed 7 October 2020.
5. HiCE. Available at https://github.com/acbull/HiCE. Accessed 7 October 2020.
6. Keras. Available at https://keras.io/. Accessed 7 October 2020.
7. TensorFlow. Available at https://www.tensorflow.org/. Accessed 7 October 2020.
8. NER dataset with unusual and unseen entities in the context of emerging discussions.
9. NER dataset with sentences that contain technical terms from the biology domain.
10. POS tagging dataset of Twitter messages.
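As referenced in note 2, modern Transformer models sidestep the OOV problem at the token level through subword tokenization. A minimal sketch, assuming the Hugging Face Transformers package is installed; the subword split shown in the comment is indicative, not guaranteed:

```python
# Subword tokenization rarely yields an unknown token for ordinary text:
# an unseen or misspelled word is decomposed into known WordPiece units.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("misspeled"))  # e.g. ['miss', '##pel', '##ed']
```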
Copyright information
© 2020 Springer Nature Switzerland AG
Cite this paper
Lochter, J.V., Silva, R.M., Almeida, T.A. (2020). Deep Learning Models for Representing Out-of-Vocabulary Words. In: Cerri, R., Prati, R.C. (eds.) Intelligent Systems. BRACIS 2020. Lecture Notes in Computer Science, vol. 12319. Springer, Cham. https://doi.org/10.1007/978-3-030-61377-8_29
Print ISBN: 978-3-030-61376-1
Online ISBN: 978-3-030-61377-8