Abstract
The ever-growing volume of user-generated content on social media provides a nearly unlimited corpus of unlabeled data, even in languages where resources are scarce. In this paper, we demonstrate that state-of-the-art results on two Thai social text categorization tasks can be achieved by pretraining a language model on a large, noisy Thai social media corpus of over 1.26 billion tokens and then fine-tuning it on the downstream classification tasks. Because the content is linguistically noisy and domain-specific, we apply data preprocessing steps designed specifically for Thai social media to make the text easier for the models to learn. We compare four modern language models — ULMFiT, ELMo with biLSTM, OpenAI GPT, and BERT — and systematically evaluate them along several dimensions: speed of pretraining and fine-tuning, perplexity, downstream classification benchmarks, and performance with limited pretraining data.
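To make the pretrain-then-fine-tune workflow described above concrete, the sketch below follows the general ULMFiT recipe (Howard and Ruder) using the fastai v1 API, the library underlying the referenced thai2fit project. This is a minimal sketch, not the paper's actual pipeline: the CSV files, column names, and hyperparameters are assumptions, and the paper's Thai-specific tokenization and preprocessing steps are omitted.

```python
import pandas as pd
from fastai.text import (TextLMDataBunch, TextClasDataBunch, AWD_LSTM,
                         language_model_learner, text_classifier_learner)

# Hypothetical data: unlabeled social media text for LM pretraining,
# and a labeled set for the downstream classification task.
lm_df = pd.read_csv('thai_social_unlabeled.csv')   # column: 'text'
clas_df = pd.read_csv('thai_social_labeled.csv')   # columns: 'text', 'label'
split = int(len(lm_df) * 0.9)

# Stage 1: pretrain the language model on the unlabeled corpus.
# Note: fastai's default tokenizer assumes space-delimited words and is
# not suited to Thai; a Thai word segmenter would be plugged in here.
data_lm = TextLMDataBunch.from_df('.', train_df=lm_df[:split],
                                  valid_df=lm_df[split:], text_cols='text')
learn_lm = language_model_learner(data_lm, AWD_LSTM,
                                  drop_mult=0.3, pretrained=False)
learn_lm.fit_one_cycle(1, 1e-2)   # pretrain the LM on the social corpus
learn_lm.save_encoder('ft_enc')   # keep the encoder weights for reuse

# Stage 2: build the classifier on top of the pretrained encoder,
# with gradual unfreezing as prescribed by ULMFiT.
data_clas = TextClasDataBunch.from_df('.', train_df=clas_df, valid_df=clas_df,
                                      text_cols='text', label_cols='label',
                                      vocab=data_lm.vocab)
learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn.load_encoder('ft_enc')
learn.fit_one_cycle(1, 2e-2)                         # classifier head only
learn.freeze_to(-2)                                  # unfreeze one more layer
learn.fit_one_cycle(1, slice(1e-2 / (2.6 ** 4), 1e-2))
learn.unfreeze()                                     # fine-tune all layers
learn.fit_one_cycle(2, slice(1e-3 / (2.6 ** 4), 1e-3))
```

The discriminative learning rates (`slice(lr / 2.6**4, lr)`) and gradual unfreezing are the standard ULMFiT fine-tuning schedule; the exact schedules and corpus splits used in the paper may differ.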
Notes
- 1.
- 2. The dictionary is available in our GitHub repository: https://github.com/Knight-H/thai-lm.
References
BERT-th (2019). https://github.com/ThAIKeras/bert
Scrapy (2019). https://github.com/scrapy/scrapy
thai2fit (2019). https://github.com/cstorm125/thai2fit
Aroonmanakun, W.: Thoughts on word and sentence segmentation in Thai (2007)
Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014)
Cho, K., et al.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 (2014)
Dai, A.M., Le, Q.V.: Semi-supervised sequence learning. arXiv preprint arXiv:1511.01432 (2015)
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
Howard, J., Ruder, S.: Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146 (2018)
Lertpiya, A., et al.: A preliminary study on fundamental Thai NLP tasks for user-generated web content. In: 2018 International Joint Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP), pp. 1–8 (2018). https://doi.org/10.1109/iSAI-NLP.2018.8692946
Merity, S., Keskar, N.S., Socher, R.: Regularizing and optimizing LSTM language models. arXiv preprint arXiv:1708.02182 (2017)
Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)
Pennington, J., Socher, R., Manning, C.D.: GloVe: global vectors for word representation. In: Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014)
Peters, M.E., et al.: Deep contextualized word representations. arXiv preprint arXiv:1802.05365 (2018)
PyThaiNLP 2.0 (2019). https://github.com/PyThaiNLP/pythainlp
Radford, A., Narasimhan, K., Salimans, T., Sutskever, I.: Improving language understanding by generative pre-training (2018)
Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. arXiv preprint arXiv:1409.3215 (2014)
Vaswani, A., et al.: Attention is all you need. arXiv preprint arXiv:1706.03762 (2017)
Acknowledgements
The authors would like to thank Mr. Can Udomcharoenchaikit for his continuous and insightful research suggestions throughout the completion of this paper.
Copyright information
© 2020 Springer Nature Switzerland AG
Cite this paper
Horsuwan, T., Kanwatchara, K., Vateekul, P., Kijsirikul, B. (2020). A Comparative Study of Pretrained Language Models on Thai Social Text Categorization. In: Nguyen, N., Jearanaitanakij, K., Selamat, A., Trawiński, B., Chittayasothorn, S. (eds) Intelligent Information and Database Systems. ACIIDS 2020. Lecture Notes in Computer Science(), vol 12033. Springer, Cham. https://doi.org/10.1007/978-3-030-41964-6_6
DOI: https://doi.org/10.1007/978-3-030-41964-6_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-41963-9
Online ISBN: 978-3-030-41964-6
eBook Packages: Computer Science (R0)