
A Comparative Study of Pretrained Language Models on Thai Social Text Categorization

  • Conference paper
  • In: Intelligent Information and Database Systems (ACIIDS 2020)

Abstract

The ever-growing volume of user-generated content on social media provides a nearly unlimited corpus of unlabeled text, even in languages where resources are scarce. In this paper, we demonstrate that state-of-the-art results on two Thai social text categorization tasks can be achieved by pretraining a language model on a large, noisy Thai social media corpus of over 1.26 billion tokens and then fine-tuning it on the downstream classification tasks. Because the content is linguistically noisy and domain specific, we apply data preprocessing steps designed specifically for Thai social media to make the text easier for the model to learn from. We compare four modern language models: ULMFiT, ELMo with biLSTM, OpenAI GPT, and BERT, evaluating them systematically across several dimensions, including speed of pretraining and fine-tuning, perplexity, downstream classification benchmarks, and performance when pretraining data is limited.
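To make the pretrain-then-fine-tune workflow concrete, the sketch below shows how a language model pretrained on unlabeled Thai social text could be fine-tuned for a downstream classification task. It is only an illustrative sketch, not the authors' pipeline: it assumes the Hugging Face transformers library, a hypothetical checkpoint name "thai-social-bert" standing in for a BERT-style model pretrained on a Thai social media corpus, and a toy two-example labeled dataset.

```python
# Minimal sketch of the pretrain-then-fine-tune idea from the abstract.
# Assumptions (not from the paper): Hugging Face `transformers`, a
# hypothetical checkpoint "thai-social-bert" pretrained on Thai social
# media text, and a toy labeled dataset for a binary categorization task.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "thai-social-bert"  # hypothetical pretrained Thai checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=2  # classification head trained from scratch
)

# Toy downstream data: Thai social media posts with binary labels.
texts = ["ร้านนี้อร่อยมาก", "บริการแย่มาก ไม่ประทับใจ"]
labels = torch.tensor([1, 0])

# Tokenize and fine-tune: pretrained encoder weights are reused, and the
# whole model is updated with a small learning rate on the labeled task.
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):  # a few passes over the toy batch
    optimizer.zero_grad()
    outputs = model(**batch, labels=labels)
    outputs.loss.backward()
    optimizer.step()
```

In the paper's comparison, each of the four models (ULMFiT, ELMo with biLSTM, OpenAI GPT, and BERT) goes through an analogous fine-tuning step after pretraining, and pretraining quality is tracked with perplexity, i.e. the exponential of the average negative log-likelihood per token.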


Notes

  1. github.com/fedelopez77/langdetect.

  2. The dictionary is referenced at our GitHub https://github.com/Knight-H/thai-lm.

References

  1. Bert-th. (2019). https://github.com/ThAIKeras/bert

  2. Scrapy. (2019). https://github.com/scrapy/scrapy

  3. thai2fit. (2019). https://github.com/cstorm125/thai2fit

  4. Aroonmanakun, W.: Thoughts on word and sentence segmentation in Thai (2007)


  5. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014)

  6. Cho, K., et al.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv e-prints arXiv:1406.1078, June 2014

  7. Dai, A.M., Le, Q.V.: Semi-supervised sequence learning. arXiv e-prints arXiv:1511.01432, November 2015

  8. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv e-prints arXiv:1810.04805, October 2018

  9. Howard, J., Ruder, S.: Universal language model fine-tuning for text classification. arXiv e-prints arXiv:1801.06146, January 2018

  10. Lertpiya, A., et al.: A preliminary study on fundamental Thai NLP tasks for user-generated web content. In: 2018 International Joint Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP), pp. 1–8, November 2018. https://doi.org/10.1109/iSAI-NLP.2018.8692946

  11. Merity, S., Keskar, N.S., Socher, R.: Regularizing and optimizing LSTM language models. arXiv e-prints arXiv:1708.02182, August 2017

  12. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv e-prints arXiv:1301.3781, January 2013

  13. Pennington, J., Socher, R., Manning, C.D.: GloVe: global vectors for word representation. In: Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014)


  14. Peters, M.E., et al.: Deep contextualized word representations. arXiv e-prints arXiv:1802.05365, February 2018

  15. Pythainlp 2.0. (2019). https://github.com/PyThaiNLP/pythainlp

  16. Radford, A., Narasimhan, K., Salimans, T., Sutskever, I.: Improving language understanding by generative pre-training (2018)


  17. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. arXiv e-prints arXiv:1409.3215, September 2014

  18. Vaswani, A., et al.: Attention is all you need. arXiv e-prints arXiv:1706.03762, June 2017


Acknowledgements

The authors would like to thank Mr. Can Udomcharoenchaikit for his continuous and insightful research suggestions throughout the completion of this paper.

Author information

Corresponding authors

Correspondence to Peerapon Vateekul or Boonserm Kijsirikul.


Copyright information

© 2020 Springer Nature Switzerland AG

About this paper


Cite this paper

Horsuwan, T., Kanwatchara, K., Vateekul, P., Kijsirikul, B. (2020). A Comparative Study of Pretrained Language Models on Thai Social Text Categorization. In: Nguyen, N., Jearanaitanakij, K., Selamat, A., Trawiński, B., Chittayasothorn, S. (eds) Intelligent Information and Database Systems. ACIIDS 2020. Lecture Notes in Computer Science, vol. 12033. Springer, Cham. https://doi.org/10.1007/978-3-030-41964-6_6


  • DOI: https://doi.org/10.1007/978-3-030-41964-6_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-41963-9

  • Online ISBN: 978-3-030-41964-6

  • eBook Packages: Computer Science, Computer Science (R0)
