Feature Exploration for Authorship Attribution of Lithuanian Parliamentary Speeches
This paper reports the first authorship attribution results for the Lithuanian language obtained with automatic computational methods. Using supervised machine learning techniques, we experimentally investigated the influence of different feature types (lexical, character, and syntactic), focusing on a few authors within three datasets containing transcripts of parliamentary speeches and debates. To keep interfering factors to a minimum, all datasets were composed by selecting candidates who hold the same political views (to avoid ideology-based classification) and who served in overlapping parliamentary terms (to avoid turning the task into topic classification).
Experiments revealed that content-based features are more useful than function words or part-of-speech tags; moreover, lemma n-grams (sometimes used in concatenation with morphological information) outperform word n-grams and document-level character n-grams. Because Lithuanian is highly inflective and rich in both morphology and vocabulary, and because we were dealing with normative language, morphological tools were maximally helpful.
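The difference between the feature types compared above can be illustrated with a minimal sketch. It is not the extraction pipeline used in the paper: the helper names `word_ngrams` and `char_ngrams` are hypothetical, and the lemma list is hand-assigned for illustration (a real morphological analyzer may assign different lemmas).

```python
from collections import Counter

def word_ngrams(tokens, n):
    """Count contiguous n-grams over a list of tokens (or lemmas)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def char_ngrams(text, n):
    """Count character n-grams over the whole document string."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

# Toy input (assumption): two inflected variants of the same phrase.
tokens = ["gerbiami", "kolegos", "gerbiamas", "kolega"]
# Hand-assigned lemmas (assumption), not the output of a real lemmatizer.
lemmas = ["gerbiamas", "kolega", "gerbiamas", "kolega"]

text = " ".join(tokens)
word_bigrams = word_ngrams(tokens, 2)    # surface forms: every bigram is distinct
lemma_bigrams = word_ngrams(lemmas, 2)   # lemmas: inflected variants collapse
char_trigrams = char_ngrams(text, 3)     # document-level character trigrams
```

In this toy case the surface-form bigrams are all distinct, while the lemma bigrams merge the two inflected variants into one repeated feature. This is one plausible reason why lemma-based features could suit a highly inflective language.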
Keywords: Authorship attribution, supervised ML, Lithuanian