Authorship Attribution of Internet Comments with Thousand Candidate Authors

  • Jurgita Kapočiūtė-Dzikienė
  • Andrius Utka
  • Ligita Šarkutė
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 538)


In this paper we report the first authorship attribution results for the Lithuanian language using Internet comments with a thousand candidate authors. The task is complicated for the following reasons: a large number of candidate authors, extremely short non-normative texts, and problems associated with a morphologically rich and vocabulary-rich language.

The effectiveness of the proposed similarity-based method was investigated using lexical, morphological, and character features, as well as several dimensionality reduction techniques. The best results, by a small margin, were obtained with word-level character tetra-grams and the entire feature set. However, the technique based on randomized feature sets achieved very similar performance even with only a few thousand features, and it outperformed implementations of the method based on sophisticated feature ranking.
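The following Python sketch illustrates the general idea of similarity-based attribution over randomized feature sets with word-level character tetra-grams. It is not the authors' implementation: the cosine similarity, the voting scheme, the handling of words shorter than four characters, and all names and parameters (`word_char_ngrams`, `attribute`, `n_iter`, `fraction`) are assumptions made for illustration.

```python
import math
import random
from collections import Counter


def word_char_ngrams(text, n=4):
    """Return character n-grams that never cross a word boundary."""
    grams = []
    for word in text.lower().split():
        if len(word) < n:
            grams.append(word)  # assumption: short words are kept whole
        else:
            grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams


def profile(texts, n=4):
    """Count n-gram frequencies over all texts of one author."""
    counts = Counter()
    for text in texts:
        counts.update(word_char_ngrams(text, n))
    return counts


def cosine(a, b, features):
    """Cosine similarity of two profiles restricted to a feature subset."""
    dot = sum(a[f] * b[f] for f in features)
    norm_a = math.sqrt(sum(a[f] ** 2 for f in features))
    norm_b = math.sqrt(sum(b[f] ** 2 for f in features))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def attribute(unknown_text, author_texts, n_iter=50, fraction=0.5, seed=0):
    """Vote for the most similar author over random feature subsets."""
    rng = random.Random(seed)
    profiles = {a: profile(ts) for a, ts in author_texts.items()}
    query = profile([unknown_text])
    vocabulary = sorted(set().union(*profiles.values()))
    k = max(1, int(len(vocabulary) * fraction))
    votes = Counter()
    for _ in range(n_iter):
        subset = rng.sample(vocabulary, k)  # one randomized feature set
        best = max(sorted(profiles),
                   key=lambda a: cosine(query, profiles[a], subset))
        votes[best] += 1
    return votes.most_common(1)[0][0]
```

A call such as `attribute("labas rytas", {"a": [...], "b": [...]})` returns the label of the candidate whose profile wins the most iterations. Drawing a fresh subset on every iteration is what lets a few thousand features stand in for the full set in this kind of scheme.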

The best obtained f-score and accuracy values exceeded random and majority baselines by more than 10.9 percentage points.
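For orientation, the two baselines can be computed from the class distribution alone. The sketch below assumes the commonly used definitions (the majority baseline is the share of the largest class, and the random baseline is the sum of squared class shares); the abstract does not state the formulas, so treat these definitions and the function name `baselines` as assumptions.

```python
from collections import Counter


def baselines(labels):
    """Return (random_baseline, majority_baseline) for a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    shares = [c / total for c in counts.values()]
    random_baseline = sum(p * p for p in shares)  # assumed: sum of squared shares
    majority_baseline = max(shares)               # assumed: largest class share
    return random_baseline, majority_baseline
```

With a thousand roughly balanced candidate authors, both values are close to 0.001 under these definitions.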


Similarity-based paradigm; Randomized feature set; Internet comments; The Lithuanian language



Research was funded by a grant (No. LIT-8-69) from the Research Council of Lithuania.



Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Jurgita Kapočiūtė-Dzikienė (1)
  • Andrius Utka (1)
  • Ligita Šarkutė (2)
  1. Vytautas Magnus University, Kaunas, Lithuania
  2. Kaunas University of Technology, Kaunas, Lithuania
