Authorship Attribution in Russian in Real-World Forensics Scenario

Panicheva, Polina; Litvinova, Tatiana

doi:10.1007/978-3-030-31372-2_25

Conference paper
First Online: 27 September 2019

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11816)

Abstract

Recent demands in authorship attribution, specifically, cross-topic authorship attribution with small numbers of training samples and very short texts, impose new challenges on corpora design, feature and algorithm development. In the current work we address these challenges by performing authorship attribution on a specifically designed dataset in Russian. We present a dataset of short written texts in Russian, where both authorship and topic are controlled. We propose a pairwise classification design closely resembling a real-world forensic task. Semantic coherence features are introduced to supplement well-established n-gram features in challenging cross-topic settings. Distance-based measures are compared with machine learning algorithms. The experiment results support the intuition that for very small datasets, distance-based measures perform better than machine learning techniques. Moreover, pairwise classification results show that in difficult cross-topic cases, content-independent features, i.e., part-of-speech n-grams and semantic coherence, are promising. The results are supported by feature significance analysis for the proposed dataset.
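
As an illustration only, the distance-based pairwise setup mentioned in the abstract might be sketched as follows. This is a minimal sketch, not the authors' implementation: the choice of character trigrams, the cosine distance, and all function names are assumptions made for this example, since the abstract does not specify which distance measures or feature parameters were used.

```python
from collections import Counter
from math import sqrt


def char_ngram_profile(text, n=3):
    """Relative frequencies of overlapping character n-grams in `text`.

    The n-gram length (3) is an assumption chosen for illustration.
    """
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    total = len(grams)
    if total == 0:
        return {}
    return {gram: count / total for gram, count in Counter(grams).items()}


def cosine_distance(p, q):
    """Cosine distance (1 - cosine similarity) between two sparse profiles."""
    dot = sum(value * q.get(gram, 0.0) for gram, value in p.items())
    norm_p = sqrt(sum(value * value for value in p.values()))
    norm_q = sqrt(sum(value * value for value in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 1.0  # treat an empty profile as maximally distant
    return 1.0 - dot / (norm_p * norm_q)


def attribute_pairwise(questioned, sample_a, sample_b, n=3):
    """Assign the questioned text to the candidate with the nearer profile.

    Hypothetical pairwise decision rule: exactly two candidate authors,
    each represented by one known writing sample.
    """
    q = char_ngram_profile(questioned, n)
    dist_a = cosine_distance(q, char_ngram_profile(sample_a, n))
    dist_b = cosine_distance(q, char_ngram_profile(sample_b, n))
    return "A" if dist_a <= dist_b else "B"
```

A call such as `attribute_pairwise(questioned_text, known_text_a, known_text_b)` returns the label of the candidate whose sample is closer. A rule of this kind has no parameters to fit, which is one plausible reason distance-based measures can remain usable when only a few short training texts are available.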

Notes

1. The model is made available for free download by the RusVectōrēs project at https://rusvectores.org/ru/models/.
2. https://scikit-learn.org/.

Acknowledgment

The authors acknowledge the support of this study by the Russian Science Foundation, grant № 18-78-10081 “Modelling of the idiolect of a modern Russian speaker in the context of the problem of authorship attribution”.

Author information

Authors and Affiliations

National Research University Higher School of Economics, 16 Soyuza Pechatnikov Street, St. Petersburg, 190121, Russia
Polina Panicheva
RusProfiling Lab, Voronezh State Pedagogical University, 86 Lenina Street, Voronezh, 394043, Russia
Polina Panicheva & Tatiana Litvinova
Plekhanov Russian University of Economics, Stremyanny Lane 36, Moscow, 117997, Russia
Tatiana Litvinova

Authors

Polina Panicheva
Tatiana Litvinova

Corresponding author

Correspondence to Polina Panicheva.

Editor information

Editors and Affiliations

Rovira i Virgili University, Tarragona, Spain
Carlos Martín-Vide
Queen Mary University of London, London, UK
Matthew Purver
Jožef Stefan Institute, Ljubljana, Slovenia
Senja Pollak

About this paper

Cite this paper

Panicheva, P., Litvinova, T. (2019). Authorship Attribution in Russian in Real-World Forensics Scenario. In: Martín-Vide, C., Purver, M., Pollak, S. (eds) Statistical Language and Speech Processing. SLSP 2019. Lecture Notes in Computer Science, vol 11816. Springer, Cham. https://doi.org/10.1007/978-3-030-31372-2_25

DOI: https://doi.org/10.1007/978-3-030-31372-2_25
Published: 27 September 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-31371-5
Online ISBN: 978-3-030-31372-2
eBook Packages: Computer Science, Computer Science (R0)
