Impact of Content Features for Automatic Online Abuse Detection

  • Etienne Papegnies
  • Vincent Labatut
  • Richard Dufour
  • Georges Linarès
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10762)


Online communities have gained considerable importance in recent years due to the increasing number of people connected to the Internet. Moderating user content in online communities is mainly performed manually, and reducing the workload through automatic methods is of great financial interest to community maintainers. The industry often uses basic approaches such as bad-word filtering and regular expression matching to assist the moderators. In this article, we consider the task of automatically determining whether a message is abusive. This task is complex since messages are written in a non-standardized way, including spelling errors, abbreviations, community-specific codes, etc. First, we evaluate the system that we propose using standard features of online messages. Then, we evaluate the impact of adding pre-processing strategies, as well as original features developed specifically for the community of an online in-browser strategy game. Finally, we analyze the usefulness of this wide range of features using feature selection. This work can lead to two possible applications: (1) automatically flagging potentially abusive messages to draw the moderator’s attention to a narrow subset of messages; and (2) fully automating the moderation process by deciding whether a message is abusive without any human intervention.
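To make the baseline mentioned above concrete, the following is a minimal Python sketch of bad-word filtering through regular expression matching. It is not the system evaluated in the paper: the word list `BAD_WORDS` and the function `flag_message` are hypothetical names chosen for illustration. The sketch also shows why such a baseline can struggle with non-standardized writing, since an obfuscated spelling is not matched.

```python
import re

# Hypothetical word list, for illustration only (not taken from the paper).
BAD_WORDS = {"idiot", "moron"}

# A single case-insensitive pattern with word boundaries, so that a listed
# word is only matched as a whole token ("idiotic" does not match "idiot").
BAD_WORD_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(w) for w in sorted(BAD_WORDS)) + r")\b",
    re.IGNORECASE,
)


def flag_message(message: str) -> bool:
    """Return True if the message contains a listed word as a whole token."""
    return BAD_WORD_RE.search(message) is not None


if __name__ == "__main__":
    # The second message uses an obfuscated spelling and is not flagged.
    for text in ["You are an IDIOT", "you are an 1d10t", "nice move"]:
        print(text, "->", flag_message(text))
```

A filter of this kind is cheap to run and easy to audit, but it only reacts to exact listed tokens, which is one reason the abstract treats richer features and pre-processing as worth evaluating.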



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Etienne Papegnies (1)
  • Vincent Labatut (1)
  • Richard Dufour (1)
  • Georges Linarès (1)
  1. LIA - University of Avignon, Avignon, France
