Abstract
We examine named entity recognition (NER), an essential and widely used first step in many natural language processing tasks, including chatbots and machine translation. We focus on applying NER to noisy texts such as tweets, which is difficult because of the casual and unstructured language common in these media. We use the largest available labeled datasets for Turkish NER, covering three informal platforms: Twitter, Facebook, and Donanimhaber. We choose Turkish as a representative agglutinative language, whose structure differs significantly from that of other well-known languages such as English, French, and German. The methodologies and insights gained from this study can be extended to other agglutinative languages, such as Finnish, Hungarian, Japanese, and Korean. We apply NER to these datasets with 16 different named entity tags, using a framework in which bidirectional long short-term memory (BiLSTM) networks are followed by conditional random fields (CRF), together known as the BiLSTM-CRF model. Our experiments yield an F1 score of 84% on a combined dataset, which indicates that deep learning models can be used effectively for business applications in informal settings in agglutinative languages such as Turkish.
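To make the reported metric concrete, the following is a minimal sketch of how an F1 score can be computed for NER output. It assumes BIO-style tags and exact span matching; the abstract does not state which tagging scheme or matching criterion was used, and the tag names `PER` and `LOC` are placeholders rather than the 16 tags of the study.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, entity_type)


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO tag sequence (assumed scheme)."""
    spans: Set[Span] = set()
    start, etype = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes a trailing span
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def span_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Micro-averaged F1 over exactly matching (start, end, type) spans."""
    tp = n_gold = n_pred = 0
    for g, p in zip(gold, pred):
        gold_spans, pred_spans = bio_to_spans(g), bio_to_spans(p)
        tp += len(gold_spans & pred_spans)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: one of two gold entities is recovered.
gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "I-PER", "O", "O"]]
score = span_f1(gold, pred)
```

In this toy case precision is 1 and recall is 0.5, so the score is 2/3. A token-level F1 would give a different number on the same predictions, which is why the matching criterion matters when comparing results.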
Availability of data and materials
Data sharing is not applicable to this article as no new data were analyzed in this study.
Notes
For instance, employing i as opposed to ı, or g instead of ğ.
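The substitution described in this note can be illustrated with a short sketch. The mapping and the function name `fold_to_ascii` are hypothetical and only mimic how informal writers replace Turkish letters with ASCII look-alikes; they are not part of the study's pipeline.

```python
# Hypothetical mapping from Turkish letters to the ASCII letters often typed instead.
ASCII_FOLD = str.maketrans("ıİğĞşŞçÇöÖüÜ", "iIgGsScCoOuU")


def fold_to_ascii(text: str) -> str:
    """Mimic informal typing by replacing Turkish letters with ASCII look-alikes."""
    return text.translate(ASCII_FOLD)


# "ılık" (lukewarm) and "ilik" (marrow) are distinct words that collide after folding.
collides = fold_to_ascii("ılık") == "ilik"
```

Because the folding is many-to-one, it cannot be reversed by a simple lookup, which is one reason informal text is harder to process than edited text.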
References
Yilmaz, S.F., Balaban, I., Tekin, S.F., Kozat, S.S.: Hybrid framework for named entity recognition in Turkish social media. In 2020 28th Signal Processing and Communications Applications Conference (SIU), pp. 1–4 (2020)
Sang, E.F., De Meulder, F.: Introduction to the CoNLL-2003 shared task: language-independent named entity recognition (2003) arXiv preprint arXiv:cs/0306050
Chen, X., Du, J., Zhang, H.: Lipreading with DenseNet and resBi-LSTM. SIViP 14, 981–989 (2020)
Bontcheva, K., et al.: TwitIE: an open-source information extraction pipeline for microblog text. In Proceedings of the International Conference Recent Advances in Natural Language Processing (RANLP), pp. 83–90 (2013)
Mohit, B.: Named entity recognition. In Natural Language Processing of Semitic Languages, pp. 221–245 (2014)
Mollá, D., et al.: Named entity recognition for question answering (2006)
Babych, B., Hartley, A.: Improving machine translation quality with automatic named entity recognition. In Proceedings of the 7th International EAMT Workshop on MT and Other Language Technology Tools. Association for Computational Linguistics, pp. 1–8 (2003)
Shi, Y., et al.: A natural language-inspired multilabel video streaming source identification method based on deep neural networks. SIViP 15, 1161–1168 (2021)
Ritter, A., et al.: Named entity recognition in tweets: an experimental study. In Proceedings of the Conference Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, pp. 1524–1534 (2011)
Şahinuç, F., Yilmaz, E.H., Toraman, C., Koç, A.: The effect of gender bias on hate speech detection. SIViP 1–7 (2022)
Yeniterzi, R., et al.: Turkish named-entity recognition. In Turkish Natural Language Processing, pp. 115–132. Springer (2018)
Alazaidah, R., Ahmad, F.K.: Trending challenges in multi label classification. Int. J. Adv. Comput. Sci. Appl. (2016)
Tür, G.: A statistical information extraction system for Turkish, Ph.D. dissertation, Bilkent Univ. (2000)
Küçük, D., Yazici, A.: A hybrid named entity recognizer for Turkish with applications to different text genres. In Computing and Information Science, pp. 113–116. Springer (2011)
Tatar, S., Cicekli, I.: Automatic rule learning exploiting morphological features for named entity recognition in Turkish. J. Inf. Sci. 37(2), 137–151 (2011)
Tür, G., Hakkani-Tür, D., Oflazer, K.: A statistical information extraction system for Turkish. Nat. Lang. Eng. 9(2), 181–210 (2003)
Küçük, D., et al.: Named entity recognition experiments on Turkish texts. In International Conference on Flexible Query Answering Systems, pp. 524–535. Springer (2009)
Şeker, G.A., Eryiğit, G.: Initial explorations on using CRFs for Turkish named entity recognition. In Proceedings of the COLING, pp. 2459–2474 (2012)
Demir, H., Özgür, A.: Improving named entity recognition for morphologically rich languages using word embeddings. In ICMLA (2014)
Çelikkaya, G., et al.: Named entity recognition on real data: a preliminary investigation for Turkish. In Proceedings of the 7th International Conference on Information, Communication and Computing Technology, IEEE, pp. 1–5 (2013)
Eken, B., Tantug, C.: Recognizing named entities in Turkish tweets. In Proceedings of the Fourth International Conference on Software Engineering and Application, Dubai, UAE (2015)
Küçük, D., Steinberger, R.: Experiments to improve named entity recognition on Turkish tweets (2014) arXiv preprint arXiv:1410.8668
Vural, N.M., Ilhan, F., Yilmaz, S.F., Ergüt, S., Kozat, S.S.: Achieving online regression performance of LSTMs with simple RNNs. IEEE Trans. Neural Netw. Learn. Syst. 33(12), 7632–7643 (2022)
Yilmaz, S.F., Kaynak, E.B., Koç, A., Dibeklioğlu, H., Kozat, S.S.: Multi-label sentiment analysis on 100 languages with dynamic weighting for label imbalance. IEEE Trans. Neural Netw. Learn. Syst. (2021)
Jin, Y., Xie, J., Guo, W., Luo, C., Wu, D., Wang, R.: LSTM-CRF neural network with gated self attention for Chinese NER. IEEE Access 7, 136694–136703 (2019)
Akkaya, E.K.: Deep neural networks for named entity recognition on social media, Master’s thesis, Fen Bilimleri Enstitüsü (2018)
Yilmaz, S.F., Balaban, I., Kozat, S.S.: Improved named entity recognition in Turkish news via word lookup methods. In 2020 28th Signal Processing and Communications Applications Conference (SIU), pp. 1–4 (2020)
Nakayama, H., et al.: doccano: Text annotation tool for human (2018) [Online]. Available: https://github.com/doccano/doccano
Eryiğit, G.: ITU Turkish NLP web service. In Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics, pp. 1–4 (2014)
Akın, A.A., Akın, M.D.: Zemberek, an open source NLP framework for Turkic languages. Structure 10, 1–5 (2007)
Manning, C., et al.: The Stanford CoreNLP natural language processing toolkit. In 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 55–60 (2014)
Hassan, H., Menezes, A.: Social text normalization using contextual graph random walks. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Vol. 1: Long Papers), pp. 1577–1586 (2013)
Giritlioğlu, D., Mandira, B., Yilmaz, S.F., Ertenli, C.U., Akgür, B.F., Kınıklıoğlu, M., Kurt, A.G., Mutlu, E., Gürel, Ş.C., Dibeklioğlu, H.: Multimodal analysis of personality traits on videos of self-presentation and induced behavior. J. Multimodal User Interfaces 15(4), 337–358 (2021)
Mandıra, B., Giritlioglu, D., Yilmaz, S.F., Ertenli, C.U., Akgür, B.F., Kınıklıoglu, M., Kurt, A.G., Doganlı, M.N., Mutlu, E., Gürel, S.C., et al.: Spatiotemporal and multimodal analysis of personality traits. In 15th International Summer Workshop on Multimodal Interfaces (2019)
Collobert, R., et al.: Natural language processing (almost) from scratch. J. Mach. Learn. Res. 12, 2493–2537 (2011)
Grave, E., et al.: Learning word vectors for 157 languages. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018) (2018)
Kuru, O., et al.: CharNER: character-level named entity recognition. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pp. 911–921 (2016)
Gungor, O., et al.: Morphological embeddings for named entity recognition in morphologically rich languages (2017) arXiv preprint arXiv:1706.00506
Ma, X., Hovy, E.: End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Berlin, Germany: Association for Computational Linguistics, pp. 1064–1074 (2016)
Lesk, M.E., Schmidt, E.: Lex: A lexical analyzer generator (1975)
Aho, A.V., Corasick, M.J.: Efficient string matching: an aid to bibliographic search. Commun. ACM 18(6), 333–340 (1975)
Huang, Z., Xu, W., Yu, K.: Bidirectional LSTM-CRF models for sequence tagging (2015) arXiv preprint arXiv:1508.01991
Reimers, N., Gurevych, I.: Reporting score distributions: performance study of LSTM-networks for sequence tagging (2017) arXiv preprint arXiv:1707.09861
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization (2014) arXiv preprint arXiv:1412.6980
Eşref, Y., Can, B.: Using morpheme-level attention mechanism for Turkish sequence labelling. In 27th Signal Processing and Communications Applications Conference (SIU). IEEE, pp. 1–4 (2019)
Güneş, A., Tantug, A.C.: Turkish named entity recognition with deep learning. In 26th Signal Processing and Communications Applications Conference (SIU). IEEE, pp. 1–4 (2018)
Funding
Not applicable.
Author information
Contributions
All authors agreed on the content of this study. SY and FM conducted the analysis following the agreed steps. All authors examined the results and conclusions and wrote them up together.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethical approval
This article does not contain any studies with human participants.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This work has appeared in part at the 2022 IEEE Signal Processing and Communications Applications Conference [1] and was done when Selim F. Yilmaz was affiliated with Bilkent University, Ankara, Turkey.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Yilmaz, S.F., Mutlu, F.B., Balaban, I. et al. TMD-NER: Turkish multi-domain named entity recognition for informal texts. SIViP 18, 2255–2263 (2024). https://doi.org/10.1007/s11760-023-02898-0
DOI: https://doi.org/10.1007/s11760-023-02898-0