Feature Exploration for Authorship Attribution of Lithuanian Parliamentary Speeches

  • Jurgita Kapočiūtė-Dzikienė
  • Andrius Utka
  • Ligita Šarkutė
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8655)


This paper reports the first authorship attribution results obtained with automatic computational methods for the Lithuanian language. Using supervised machine learning techniques, we experimentally investigated the influence of different feature types (lexical, character, and syntactic), focusing on a few authors within three datasets containing transcripts of parliamentary speeches and debates. To keep interfering factors to a minimum, all datasets were composed by selecting candidates who hold the same political views (avoiding ideology-based classification) and who served in overlapping parliamentary terms (avoiding a topic classification task).

Experiments revealed that content-based features are more useful than function words or part-of-speech tags; moreover, lemma n-grams (sometimes used in concatenation with morphological information) outperform word or document-level character n-grams. Because Lithuanian is highly inflective and rich in both morphology and vocabulary, and because the transcripts are written in normative language, morphological tools proved especially helpful.
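The feature types compared above can be illustrated with a minimal sketch. It is not the authors' pipeline: the helper names `token_ngrams` and `char_ngrams` are hypothetical, the sentence is an English toy example, and the lemmas are supplied by hand rather than by a morphological analyser. It only shows why lemma n-grams can be less sparse than word n-grams in an inflected language.

```python
from collections import Counter


def token_ngrams(tokens, n):
    """Count contiguous n-grams over a token sequence (words or lemmas)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def char_ngrams(text, n):
    """Count character n-grams over the whole text, spaces included."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


# Toy data (assumption): surface word forms and their hand-assigned lemmas.
words = ["the", "members", "voted", "and", "the", "member", "votes"]
lemmas = ["the", "member", "vote", "and", "the", "member", "vote"]

word_bigrams = token_ngrams(words, 2)    # every inflected pair is distinct
lemma_bigrams = token_ngrams(lemmas, 2)  # inflected variants collapse together
char_trigrams = char_ngrams(" ".join(words), 3)

print(len(word_bigrams), len(lemma_bigrams))
```

In this toy case the word bigrams are all distinct, while the lemma bigrams repeat, so the same amount of text yields denser counts per feature. Such counts would then serve as input to a supervised classifier.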


Keywords: Authorship attribution, supervised ML, Lithuanian





Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Jurgita Kapočiūtė-Dzikienė (1)
  • Andrius Utka (1)
  • Ligita Šarkutė (2)
  1. Vytautas Magnus University, Kaunas, Lithuania
  2. Kaunas University of Technology, Kaunas, Lithuania
