Author Clustering with and Without Topical Features

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11658)

Abstract

Authorship attribution has typically been solved with supervised machine learning methods. Only recently have unsupervised methods been applied to it, in the form of author clustering. Clustering can be useful in a realistic scenario because it yields a natural grouping of documents without a priori authorship information, although the problem of feature selection remains unsolved. This is particularly true for cross-domain scenarios: studies have shown that in cross-domain settings some domain-specific text features introduce noise into authorship attribution. In the current work we introduce a modification of the unmasking technique aimed at selecting and removing the features most influenced by topic change. We apply the proposed technique to identify topical features and assess the quality of author clustering with different feature sets on a real-world dataset of forum texts in Russian. The main assumption is that topical features lead to topic-based rather than authorship-based clustering, and that removing them could improve the performance of document clustering against the authorship ground truth. We test this assumption in two steps. First, we cluster cross-topic documents with state-of-the-art authorship attribution features. Second, we remove the most significant topical features and cluster the texts with the resulting feature set. Both clustering results are evaluated against ground-truth authorship. The results demonstrate that removing some topical features increases author clustering performance; however, one should be cautious with the number of features removed.
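The idea of dropping topical features before clustering can be illustrated with a minimal sketch. It is not the unmasking-based procedure of the paper: as a stand-in, each feature is scored by how far its mean relative frequency differs between topics, and the top `k` features are removed. The evaluation uses BCubed precision, recall and F1, which is assumed here as one possible extrinsic clustering metric; the abstract does not name the metric used. All function names, the toy documents and the cluster labels are hypothetical.

```python
from collections import defaultdict


def topic_scores(docs, topics):
    """Score each feature by the spread of its mean relative frequency across topics.

    Assumption: a simple stand-in for the paper's unmasking-based ranking.
    """
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for doc, topic in zip(docs, topics):
        total = sum(doc.values()) or 1
        counts[topic] += 1
        for feat, n in doc.items():
            sums[feat][topic] += n / total
    scores = {}
    for feat, per_topic in sums.items():
        means = [per_topic.get(t, 0.0) / counts[t] for t in counts]
        scores[feat] = max(means) - min(means)
    return scores


def drop_topical(docs, topics, k):
    """Remove the k features with the highest topic-separation score."""
    scores = topic_scores(docs, topics)
    topical = set(sorted(scores, key=scores.get, reverse=True)[:k])
    reduced = [{f: n for f, n in doc.items() if f not in topical} for doc in docs]
    return reduced, topical


def bcubed(clusters, authors):
    """Return BCubed precision, recall and F1 of cluster labels against author labels."""
    n = len(clusters)
    precision = recall = 0.0
    for i in range(n):
        same_cluster = [j for j in range(n) if clusters[j] == clusters[i]]
        same_author = [j for j in range(n) if authors[j] == authors[i]]
        correct = sum(1 for j in same_cluster if authors[j] == authors[i])
        precision += correct / len(same_cluster)
        recall += correct / len(same_author)
    precision /= n
    recall /= n
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data (hypothetical): two authors, each writing on two topics.
docs = [
    {"the": 5, "and": 3, "engine": 4},  # author A, topic "cars"
    {"the": 5, "and": 3, "recipe": 4},  # author A, topic "food"
    {"the": 2, "and": 6, "engine": 4},  # author B, topic "cars"
    {"the": 2, "and": 6, "recipe": 4},  # author B, topic "food"
]
topics = ["cars", "food", "cars", "food"]
authors = ["A", "A", "B", "B"]

reduced, topical = drop_topical(docs, topics, k=2)

# Hypothetical cluster labels: one grouping follows topic, the other follows author.
by_topic = bcubed(["c1", "c2", "c1", "c2"], authors)
by_author = bcubed(["c1", "c1", "c2", "c2"], authors)
print(sorted(topical), by_topic, by_author)
```

In this toy setting the two topic words are the only features whose frequency changes with topic, so they are the ones removed, and a grouping that follows topic scores lower against the author labels than one that follows author. Choosing `k` too large would start removing features that carry authorial signal, which mirrors the caution expressed at the end of the abstract.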

Acknowledgment

The authors acknowledge support of this study by the Russian Science Foundation, grant No. 18-78-10081 “Modelling of the idiolect of a modern Russian speaker in the context of the problem of authorship attribution”.

Author information

Corresponding author

Correspondence to Tatiana Litvinova.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Panicheva, P., Litvinova, O., Litvinova, T. (2019). Author Clustering with and Without Topical Features. In: Salah, A., Karpov, A., Potapova, R. (eds) Speech and Computer. SPECOM 2019. Lecture Notes in Computer Science, vol 11658. Springer, Cham. https://doi.org/10.1007/978-3-030-26061-3_36

  • DOI: https://doi.org/10.1007/978-3-030-26061-3_36

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-26060-6

  • Online ISBN: 978-3-030-26061-3

  • eBook Packages: Computer Science, Computer Science (R0)
