Abstract
The ever-growing volume of user-generated content on social media provides a nearly unlimited corpus of unlabeled data, even in languages where resources are scarce. In this paper, we demonstrate that state-of-the-art results on two Thai social text categorization tasks can be achieved by pretraining a language model on a large, noisy Thai social media corpus of over 1.26 billion tokens and then fine-tuning it on the downstream classification tasks. Because the content is linguistically noisy and domain-specific, we apply data preprocessing steps designed specifically for Thai social media to make the text easier for the models to learn. We compare four modern language models — ULMFiT, ELMo with biLSTM, OpenAI GPT, and BERT — and systematically evaluate them along several dimensions: speed of pretraining and fine-tuning, perplexity, downstream classification benchmarks, and performance with limited pretraining data.
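To make the pretrain-then-fine-tune workflow described above concrete, the sketch below follows the general ULMFiT recipe (Howard and Ruder) using the fastai v1 API, the library underlying the referenced thai2fit project. This is a minimal sketch, not the paper's actual pipeline: the CSV files, column names, and hyperparameters are assumptions, and the paper's Thai-specific tokenization and preprocessing steps are omitted.

```python
import pandas as pd
from fastai.text import (TextLMDataBunch, TextClasDataBunch, AWD_LSTM,
                         language_model_learner, text_classifier_learner)

# Hypothetical data: unlabeled social media text for LM pretraining,
# and a labeled set for the downstream classification task.
lm_df = pd.read_csv('thai_social_unlabeled.csv')   # column: 'text'
clas_df = pd.read_csv('thai_social_labeled.csv')   # columns: 'text', 'label'
split = int(len(lm_df) * 0.9)

# Stage 1: pretrain the language model on the unlabeled corpus.
# Note: fastai's default tokenizer assumes space-delimited words and is
# not suited to Thai; a Thai word segmenter would be plugged in here.
data_lm = TextLMDataBunch.from_df('.', train_df=lm_df[:split],
                                  valid_df=lm_df[split:], text_cols='text')
learn_lm = language_model_learner(data_lm, AWD_LSTM,
                                  drop_mult=0.3, pretrained=False)
learn_lm.fit_one_cycle(1, 1e-2)   # pretrain the LM on the social corpus
learn_lm.save_encoder('ft_enc')   # keep the encoder weights for reuse

# Stage 2: build the classifier on top of the pretrained encoder,
# with gradual unfreezing as prescribed by ULMFiT.
data_clas = TextClasDataBunch.from_df('.', train_df=clas_df, valid_df=clas_df,
                                      text_cols='text', label_cols='label',
                                      vocab=data_lm.vocab)
learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn.load_encoder('ft_enc')
learn.fit_one_cycle(1, 2e-2)                         # classifier head only
learn.freeze_to(-2)                                  # unfreeze one more layer
learn.fit_one_cycle(1, slice(1e-2 / (2.6 ** 4), 1e-2))
learn.unfreeze()                                     # fine-tune all layers
learn.fit_one_cycle(2, slice(1e-3 / (2.6 ** 4), 1e-3))
```

The discriminative learning rates (`slice(lr / 2.6**4, lr)`) and gradual unfreezing are the standard ULMFiT fine-tuning schedule; the exact schedules and corpus splits used in the paper may differ.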
Notes
- 1.
- 2. The dictionary is available in our GitHub repository: https://github.com/Knight-H/thai-lm.
References
BERT-th (2019). https://github.com/ThAIKeras/bert
Scrapy (2019). https://github.com/scrapy/scrapy
thai2fit (2019). https://github.com/cstorm125/thai2fit
Aroonmanakun, W.: Thoughts on word and sentence segmentation in Thai (2007)
Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014)
Cho, K., et al.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 (2014)
Dai, A.M., Le, Q.V.: Semi-supervised sequence learning. arXiv preprint arXiv:1511.01432 (2015)
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
Howard, J., Ruder, S.: Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146 (2018)
Lertpiya, A., et al.: A preliminary study on fundamental Thai NLP tasks for user-generated web content. In: 2018 International Joint Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP), pp. 1–8 (2018). https://doi.org/10.1109/iSAI-NLP.2018.8692946
Merity, S., Keskar, N.S., Socher, R.: Regularizing and optimizing LSTM language models. arXiv preprint arXiv:1708.02182 (2017)
Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)
Pennington, J., Socher, R., Manning, C.D.: GloVe: global vectors for word representation. In: Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014)
Peters, M.E., et al.: Deep contextualized word representations. arXiv preprint arXiv:1802.05365 (2018)
PyThaiNLP 2.0 (2019). https://github.com/PyThaiNLP/pythainlp
Radford, A., Narasimhan, K., Salimans, T., Sutskever, I.: Improving language understanding by generative pre-training (2018)
Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. arXiv preprint arXiv:1409.3215 (2014)
Vaswani, A., et al.: Attention is all you need. arXiv preprint arXiv:1706.03762 (2017)
Acknowledgements
The authors would like to thank Mr. Can Udomcharoenchaikit for his continuous and insightful research suggestions throughout the completion of this paper.
Copyright information
© 2020 Springer Nature Switzerland AG
Cite this paper
Horsuwan, T., Kanwatchara, K., Vateekul, P., Kijsirikul, B. (2020). A Comparative Study of Pretrained Language Models on Thai Social Text Categorization. In: Nguyen, N., Jearanaitanakij, K., Selamat, A., Trawiński, B., Chittayasothorn, S. (eds) Intelligent Information and Database Systems. ACIIDS 2020. Lecture Notes in Computer Science(), vol 12033. Springer, Cham. https://doi.org/10.1007/978-3-030-41964-6_6
DOI: https://doi.org/10.1007/978-3-030-41964-6_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-41963-9
Online ISBN: 978-3-030-41964-6
eBook Packages: Computer Science (R0)