Author Identification Using Latent Dirichlet Allocation

  • Conference paper
Computational Linguistics and Intelligent Text Processing (CICLing 2017)

Abstract

We tackle the PAN 2015 author identification task with a Latent Dirichlet Allocation (LDA) model. This method takes into account the vocabulary and the context of words at the same time and, through a statistical process, estimates to what extent the relations between words are present in each document. Processing a set of documents with LDA returns a set of topic distributions; each distribution can be seen as a feature vector and as a fingerprint of its document within the collection. We then applied a Naïve Bayes classifier to the resulting patterns, with varying performance. We obtained state-of-the-art performance for English, surpassing the best FS score reported at PAN 2015, while obtaining mixed results for the other languages.
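The pipeline described in the abstract (LDA topic distributions used as per-document feature vectors, followed by a Naïve Bayes classifier) can be illustrated with a minimal sketch. It is not the authors' implementation: the collapsed Gibbs sampler, the Gaussian variant of Naïve Bayes, the hyperparameters (`alpha`, `beta`, `n_topics`, `n_iter`), the toy documents, and all function names are assumptions made for illustration only.

```python
import math
import random


def lda_gibbs(docs, n_topics, n_iter=100, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA (an assumed inference method).

    Returns one topic distribution (list of n_topics probabilities) per document.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]              # counts per (doc, topic)
    topic_word = [[0] * n_words for _ in range(n_topics)]   # counts per (topic, word)
    topic_total = [0] * n_topics                            # tokens per topic
    assignments = []
    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)
    # Resample each token's topic conditioned on all other assignments.
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, k = word_id[w], assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta) / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1
    # Smoothed per-document topic distribution: the document's "fingerprint".
    return [
        [(doc_topic[d][t] + alpha) / (len(doc) + n_topics * alpha)
         for t in range(n_topics)]
        for d, doc in enumerate(docs)
    ]


def fit_gaussian_nb(X, y, var_floor=1e-3):
    """Per-class log prior, feature means, and floored feature variances."""
    model = {}
    for label in sorted(set(y)):
        rows = [x for x, lab in zip(X, y) if lab == label]
        means = [sum(col) / len(rows) for col in zip(*rows)]
        variances = [
            sum((v - m) ** 2 for v in col) / len(rows) + var_floor
            for col, m in zip(zip(*rows), means)
        ]
        model[label] = (math.log(len(rows) / len(X)), means, variances)
    return model


def predict_gaussian_nb(model, x):
    """Return the label with the highest log posterior for feature vector x."""
    def log_posterior(params):
        log_prior, means, variances = params
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * s) - (v - m) ** 2 / (2 * s)
            for v, m, s in zip(x, means, variances)
        )
    return max(model, key=lambda label: log_posterior(model[label]))


# Hypothetical toy collection: four documents of known authorship, one unknown.
docs = [
    "the sea the ship the storm the sailor the sea".split(),
    "the ship sailed the sea through the storm".split(),
    "stocks bonds markets prices stocks trading markets".split(),
    "markets prices rose as trading in bonds grew".split(),
    "the sailor watched the storm over the sea".split(),   # unknown author
]
labels = ["author_a", "author_a", "author_b", "author_b"]

# LDA is run over the whole collection, so every document gets a distribution.
thetas = lda_gibbs(docs, n_topics=2)
model = fit_gaussian_nb(thetas[:4], labels)
predicted = predict_gaussian_nb(model, thetas[4])
print(thetas[4], predicted)
```

Running LDA over the whole collection, including the document of unknown authorship, mirrors the abstract's description of each distribution as a fingerprint "within the collection"; whether the authors inferred the unknown documents jointly or separately is not stated here.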

The authors thank the Instituto Politécnico Nacional (COFAA, SIP) and the Mexican Government (CONACYT, SNI) for their support. The first author is currently on a research stay at the Laboratoire d’Informatique de Paris Nord, CNRS, Université Paris 13.

Author information

Corresponding author

Correspondence to Hiram Calvo.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Calvo, H., Hernández-Castañeda, Á., García-Flores, J. (2018). Author Identification Using Latent Dirichlet Allocation. In: Gelbukh, A. (ed.) Computational Linguistics and Intelligent Text Processing. CICLing 2017. Lecture Notes in Computer Science, vol 10762. Springer, Cham. https://doi.org/10.1007/978-3-319-77116-8_22

  • DOI: https://doi.org/10.1007/978-3-319-77116-8_22

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-77115-1

  • Online ISBN: 978-3-319-77116-8

  • eBook Packages: Computer Science, Computer Science (R0)
