Automatic Discovery of Abusive Thai Language Usages in Social Networks

  • Suppawong Tuarob
  • Jarernsri L. Mitrpanont
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10647)


Social networks have become a standard means of communication, allowing a massive number of users to interact and consume information anywhere and at any time. In Thailand, millions of users have access to social networks, a majority of whom are young children. The colloquial nature of social media inherently encourages expressions that do not conform to standard language, some of which may be considered abusive and offensive. Such ill-mannered language is increasingly used by a large number of Thai social media users. If adolescents are exposed to such abusive language without proper guidance, they may involuntarily grow accustomed to it. To address this issue, we present a set of machine-learning-based algorithms that automatically detect abusive Thai language in social networks. Our best results yield an F-measure of 86% (88.73% precision and 83.53% recall).
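As a quick consistency check on the reported figures, the sketch below recomputes the F-measure from the stated precision and recall. It assumes the balanced F-measure (F1, the harmonic mean of precision and recall); the abstract does not say which weighting was used, and the function name `f_measure` is purely illustrative.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall.

    Assumption: equal weighting of precision and recall (beta = 1).
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract, expressed as fractions.
precision = 0.8873
recall = 0.8353

f1 = f_measure(precision, recall)
print(f"F-measure: {f1:.2%}")
```

Under this assumption the result rounds to 86%, which agrees with the reported value.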


Keywords: Abusive language detection; Thai natural language processing; Large-scale social networks



This research project was partially supported by the Faculty of Information and Communication Technology, Mahidol University.



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. Faculty of Information and Communication Technology, Mahidol University, Salaya, Thailand
