Abstract
We analyzed two conversational corpora in Finnish: a public library question-answering (QA) dataset and a private medical chat dataset. We developed response retrieval (ranking) models using TF-IDF, StarSpace, ESIM and BERT methods. These four represent techniques ranging from simple classical methods to recent pretrained transformer neural networks. We evaluated the effect of different preprocessing strategies, including raw text, casing, lemmatization and spell-checking, for the different methods. Using our medical chat data, we also developed a novel three-stage preprocessing pipeline with speaker role classification. We found the BERT model pretrained on Finnish (FinBERT) to be the unambiguous winner in ranking accuracy, reaching 92.2% for the medical chat data and 98.7% for the library QA data in the 1-out-of-10 response ranking task, where the chance level was 10%. The best accuracies were reached using uncased text with spell-checking (BERT models) or lemmatization (non-BERT models). Preprocessing had less impact on the BERT models than on the classical and other neural network models. Furthermore, we found the TF-IDF method still a strong baseline for the vocabulary-rich library QA task, even surpassing the more advanced StarSpace method. Our results highlight the complex interplay between preprocessing strategies and model type when choosing the optimal approach for chat-data modelling. Our study is the first work on dialogue modelling with neural networks for the Finnish language, and the first of its kind to use real medical chat data. Our work contributes towards the development of automated chatbots in the professional domain.
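The 1-out-of-10 response ranking task described above can be sketched with the TF-IDF baseline: given a dialogue context, rank the true response and nine distractors by similarity to the context. The following is a minimal pure-Python sketch; the function names and toy data are illustrative assumptions, not the paper's implementation (which also covers StarSpace, ESIM and BERT, and uses the actual Finnish corpora).

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each whitespace-tokenized document to a sparse TF-IDF vector (dict)."""
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))
    n = len(tokenized)
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # Smoothed IDF so terms occurring in every document keep a nonzero weight.
        vectors.append({t: c * (math.log((1 + n) / (1 + df[t])) + 1.0)
                        for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_candidates(context, candidates):
    """Return candidate indices sorted by TF-IDF cosine similarity to the context."""
    vecs = tfidf_vectors([context] + candidates)
    ctx, cand_vecs = vecs[0], vecs[1:]
    return sorted(range(len(candidates)), key=lambda i: -cosine(ctx, cand_vecs[i]))

# Toy example: the true response (index 0) plus distractors; a real
# 1-out-of-10 evaluation would use nine distractors per context.
context = "how do I renew a library book"
candidates = [
    "you can renew your library book online",    # true response
    "the weather is sunny today",
    "our cafe opens at nine in the morning",
]
ranking = rank_candidates(context, candidates)
top1_correct = ranking[0] == 0  # counts toward 1-of-N ranking accuracy
```

Averaging `top1_correct` over a test set yields the ranking accuracy reported in the paper; the chance level for ten candidates is 10%.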
Acknowledgement
The authors wish to acknowledge CSC – IT Center for Science, Finland, for computational resources.
Cite this paper
Kauttonen, J., Aunimo, L. (2020). Dialog Modelling Experiments with Finnish One-to-One Chat Data. In: Filchenkov, A., Kauttonen, J., Pivovarova, L. (eds) Artificial Intelligence and Natural Language. AINL 2020. Communications in Computer and Information Science, vol 1292. Springer, Cham. https://doi.org/10.1007/978-3-030-59082-6_3
Print ISBN: 978-3-030-59081-9
Online ISBN: 978-3-030-59082-6