Author Clustering with and Without Topical Features

Panicheva, Polina; Litvinova, Olga; Litvinova, Tatiana

doi:10.1007/978-3-030-26061-3_36

Conference paper
First Online: 24 July 2019

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11658)

Abstract

Typically, the task of authorship attribution has been solved using supervised machine learning methods. Only recently have unsupervised methods, namely author clustering, been applied to authorship attribution. Clustering could be useful in realistic scenarios, as it represents a natural grouping of documents without a priori authorship information, although the problem of feature selection remains unsolved. This is particularly true for cross-domain scenarios. Studies have shown that in cross-domain settings some domain-specific text features introduce noise into authorship attribution. In the current work we introduce a modification of the unmasking technique aimed at selecting and removing the features most influenced by topic change. We apply the proposed technique to identify topical features and assess the quality of author clustering with different feature sets on a real-world dataset of forum texts in Russian. The main assumption is that topical features result in topic-based rather than authorship-based clustering, and that removing them could increase the performance of document clustering against the authorship ground truth. We test this assumption by first clustering cross-topic documents with state-of-the-art authorship attribution features. Second, we remove the most significant topical features and cluster the texts with the resulting feature set. Both clustering results are evaluated against ground-truth authorship. The results demonstrate that removing some topical features increases author clustering performance; however, one should be cautious about the number of features removed.
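The evaluation step described in the abstract, scoring a clustering against ground-truth authorship, can be illustrated with a minimal sketch. The abstract does not name the metric used, so BCubed precision, recall and F1 (a common extrinsic clustering measure) are an assumption here. The function name `bcubed` and the four-document toy data are hypothetical and only serve to show why a topic-driven grouping scores poorly against authorship labels.

```python
from collections import defaultdict


def bcubed(pred, gold):
    """Return BCubed (precision, recall, F1) for two equally long label lists.

    pred[i] is the cluster assigned to document i, gold[i] its true author.
    The choice of BCubed is an assumption; the abstract names no metric.
    """
    if len(pred) != len(gold) or not pred:
        raise ValueError("pred and gold must be non-empty and equally long")

    clusters = defaultdict(set)  # cluster label -> document indices
    authors = defaultdict(set)   # author label  -> document indices
    for i, (c, a) in enumerate(zip(pred, gold)):
        clusters[c].add(i)
        authors[a].add(i)

    precision = recall = 0.0
    for c, a in zip(pred, gold):
        # Documents that share both the cluster and the author of this one
        # (always at least 1, because the document itself is counted).
        same = len(clusters[c] & authors[a])
        precision += same / len(clusters[c])
        recall += same / len(authors[a])

    n = len(pred)
    precision /= n
    recall /= n
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical toy data: two authors, each writing on two topics.
gold_authors = ["a1", "a1", "a2", "a2"]
topic_driven = [0, 1, 0, 1]   # documents grouped by topic
author_driven = [0, 0, 1, 1]  # documents grouped by author

print("topic-driven :", bcubed(topic_driven, gold_authors))
print("author-driven:", bcubed(author_driven, gold_authors))
```

In this toy case the topic-driven grouping pairs every document with one by the other author, so it scores well below the author-driven grouping. This mirrors the main assumption of the abstract, not the reported results of the paper.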

Acknowledgment

Authors acknowledge support of this study by the Russian Science Foundation, grant №18-78-10081 “Modelling of the idiolect of a modern Russian speaker in the context of the problem of authorship attribution”.

Author information

Authors and Affiliations

National Research University Higher School of Economics, 16 Soyuza Pechatnikov st., St. Petersburg, 190121, Russia
Polina Panicheva
RusProfiling Lab, Voronezh State Pedagogical University, 86 Lenina st., Voronezh, 394043, Russia
Polina Panicheva, Olga Litvinova & Tatiana Litvinova
Plekhanov Russian University of Economics, Stremyanny lane 36, Moscow, 117997, Russia
Tatiana Litvinova

Corresponding author

Correspondence to Tatiana Litvinova.

Editor information

Editors and Affiliations

Utrecht University, Utrecht, The Netherlands
Albert Ali Salah
St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences, St. Petersburg, Russia
Alexey Karpov
Moscow State Linguistic University, Moscow, Russia
Rodmonga Potapova

About this paper

Cite this paper

Panicheva, P., Litvinova, O., Litvinova, T. (2019). Author Clustering with and Without Topical Features. In: Salah, A., Karpov, A., Potapova, R. (eds) Speech and Computer. SPECOM 2019. Lecture Notes in Computer Science, vol 11658. Springer, Cham. https://doi.org/10.1007/978-3-030-26061-3_36

DOI: https://doi.org/10.1007/978-3-030-26061-3_36
Published: 24 July 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-26060-6
Online ISBN: 978-3-030-26061-3
eBook Packages: Computer Science, Computer Science (R0)
