Problems of Authorship Identification of the National Language Electronic Discourse

  • Algimantas Venčkauskas
  • Robertas Damaševičius
  • Romas Marcinkevičius
  • Arnas Karpavičius
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 538)


The paper presents a comprehensive overview and analysis of authorship identification methods for national language electronic discourse. First, an overview and analysis of methods for the English language is presented. Next, adaptations of general methods, as well as language-specific methods for national languages, are considered. The challenges of authorship identification in electronic discourse and the requirements for developing authorship identification systems for forensic applications are then discussed. Finally, recommendations for developers of authorship identification methods and tools are presented.


Authorship identification, Text analysis, Text mining, National languages, Forensic linguistics, Expert system



The authors acknowledge the contribution of the project “Lithuanian Cybercrime Centre of Excellence for Training, Research and Education”, Grant Agreement No. HOME/2013/ISEC/AG/INT/4000005176, co-funded by the Prevention of and Fight against Crime Programme of the European Union.



Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Algimantas Venčkauskas (1)
  • Robertas Damaševičius (2), corresponding author
  • Romas Marcinkevičius (2)
  • Arnas Karpavičius (1)

  1. Computer Science Department, Kaunas University of Technology, Kaunas, Lithuania
  2. Software Engineering Department, Kaunas University of Technology, Kaunas, Lithuania
