Abstract
Traditionally, authorship attribution has been addressed with supervised machine learning methods. Only recently have unsupervised methods been applied to it, in the form of author clustering. Clustering can be useful in realistic scenarios because it yields a natural grouping of documents without a priori authorship information, although the problem of feature selection remains unsolved. This is particularly true in cross-domain scenarios: studies have shown that some domain-specific text features introduce noise into authorship attribution. In the current work we introduce a modification of the unmasking technique aimed at selecting and removing the features most influenced by topic change. We apply the proposed technique to identify topical features and assess the quality of author clustering with different feature sets on a real-world dataset of Russian-language forum texts. The main assumption is that topical features lead to topic-based rather than authorship-based text clustering, and that removing them could improve the performance of document clustering against the authorship ground truth. We test this assumption by first clustering cross-topic documents with state-of-the-art authorship attribution features. Second, we remove the most significant topical features and cluster the texts with the resulting feature set. Both clustering results are evaluated against ground-truth authorship. The results demonstrate that removing some topical features increases author clustering performance; however, one should be cautious about the number of removed features.
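The feature-removal and evaluation steps described in the abstract can be illustrated with a minimal sketch. It is not the paper's method: the abstract does not specify the features, the unmasking modification, or the evaluation metric. The sketch assumes whitespace tokens as features, ranks them once by the difference of their mean frequency between two topics (a simple stand-in for unmasking, which would instead retrain a classifier and remove its strongest features round by round), and assumes a BCubed F-score as the clustering metric. All function names are hypothetical.

```python
from collections import Counter


def token_frequencies(text):
    """Relative frequencies of whitespace tokens (assumed feature type)."""
    counts = Counter(text.lower().split())
    total = sum(counts.values()) or 1
    return {tok: c / total for tok, c in counts.items()}


def rank_topical_features(vectors, topics, top_k):
    """Return the top_k features whose mean frequency differs most between
    two topics. Stand-in for unmasking, not the paper's procedure."""
    labels = sorted(set(topics))
    if len(labels) != 2:
        raise ValueError("this sketch compares exactly two topics")

    def mean_freq(feature, label):
        group = [v.get(feature, 0.0) for v, t in zip(vectors, topics) if t == label]
        return sum(group) / len(group)

    vocabulary = set().union(*vectors)
    scores = {
        f: abs(mean_freq(f, labels[0]) - mean_freq(f, labels[1]))
        for f in vocabulary
    }
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda f: (-scores[f], f))[:top_k]


def drop_features(vectors, features):
    """Remove the given features from every document vector."""
    banned = set(features)
    return [{f: w for f, w in v.items() if f not in banned} for v in vectors]


def bcubed_f1(clusters, authors):
    """BCubed F-score of cluster labels against ground-truth authors
    (assumed metric; the abstract does not name one)."""
    n = len(clusters)
    if n == 0:
        raise ValueError("no documents to evaluate")
    precision = recall = 0.0
    for i in range(n):
        in_cluster = [j for j in range(n) if clusters[j] == clusters[i]]
        by_author = [j for j in range(n) if authors[j] == authors[i]]
        correct = sum(1 for j in in_cluster if authors[j] == authors[i])
        precision += correct / len(in_cluster)
        recall += correct / len(by_author)
    precision /= n
    recall /= n
    return 2 * precision * recall / (precision + recall)


# Toy data, invented for illustration only.
docs = [
    "the car engine is loud",
    "the car is fast",
    "the soup is hot",
    "the soup recipe is easy",
]
topics = ["auto", "auto", "food", "food"]

vectors = [token_frequencies(d) for d in docs]
topical = rank_topical_features(vectors, topics, top_k=2)
reduced = drop_features(vectors, topical)
```

Any clustering algorithm could then be run once on `vectors` and once on `reduced`, with `bcubed_f1` comparing each result against the known authors. The `top_k` parameter corresponds to the caution in the abstract: removing too many features may also discard authorial signal.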
Acknowledgment
The authors acknowledge the support of this study by the Russian Science Foundation, grant №18-78-10081 “Modelling of the idiolect of a modern Russian speaker in the context of the problem of authorship attribution”.
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Panicheva, P., Litvinova, O., Litvinova, T. (2019). Author Clustering with and Without Topical Features. In: Salah, A., Karpov, A., Potapova, R. (eds) Speech and Computer. SPECOM 2019. Lecture Notes in Computer Science, vol 11658. Springer, Cham. https://doi.org/10.1007/978-3-030-26061-3_36
DOI: https://doi.org/10.1007/978-3-030-26061-3_36
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-26060-6
Online ISBN: 978-3-030-26061-3
eBook Packages: Computer Science, Computer Science (R0)