Abstract
We tackle the PAN 2015 author identification task with a Latent Dirichlet Allocation (LDA) model. This method takes into account the vocabulary and the context of words at the same time and, through a statistical process, estimates to what extent the relations between words are present in each document. Processing a set of documents with LDA returns a set of topic distributions. Each distribution can be seen as a feature vector and as a fingerprint of its document within the collection. We then applied a Naïve Bayes classifier to the obtained patterns, with varying performance. We obtained state-of-the-art performance for English, surpassing the best FS score reported in PAN 2015, while obtaining mixed results for the other languages.
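The pipeline outlined in the abstract (LDA topic distributions used as per-document feature vectors, followed by a Naïve Bayes classifier) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the collapsed Gibbs sampler, the Gaussian variant of Naïve Bayes, the hyperparameters (num_topics, alpha, beta, var_floor), the function names, and the toy corpus are all assumptions made for the example.

```python
import math
import random
from collections import defaultdict


def lda_topic_vectors(docs, num_topics=2, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA (an assumed inference method).

    Returns one topic distribution per document, usable as a feature vector.
    """
    rng = random.Random(seed)
    vocab = {w: i for i, w in enumerate(sorted({w for d in docs for w in d}))}
    vocab_size = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]                # counts per (document, topic)
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # counts per (topic, word)
    topic_total = [0] * num_topics                              # tokens assigned to each topic
    assignments = []
    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][vocab[w]] += 1
            topic_total[k] += 1
        assignments.append(zs)
    # Resample each token's topic conditioned on all other assignments.
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wi, k = vocab[w], assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][wi] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wi] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][wi] += 1
                topic_total[k] += 1
    # Smoothed, normalized topic counts: each row sums to 1.
    return [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic)
    ]


def train_gaussian_nb(vectors, labels, var_floor=1e-3):
    """Fit a per-class prior plus per-feature mean and variance."""
    by_label = defaultdict(list)
    for v, y in zip(vectors, labels):
        by_label[y].append(v)
    model = {}
    for y, rows in by_label.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # var_floor keeps the variance positive when a class has few samples.
        variances = [
            sum((x - m) ** 2 for x in col) / n + var_floor
            for col, m in zip(zip(*rows), means)
        ]
        model[y] = (math.log(n / len(vectors)), means, variances)
    return model


def predict(model, vector):
    """Return the label with the highest log posterior under the fitted model."""
    def log_posterior(y):
        log_prior, means, variances = model[y]
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
            for x, m, var in zip(vector, means, variances)
        )
    return max(model, key=log_posterior)


# Toy corpus and author labels, invented purely for illustration.
docs = [
    "the sea the ship the storm the sea".split(),
    "a ship on the sea in a storm".split(),
    "market prices and stock trading rose".split(),
    "stock market trading and prices fell".split(),
]
labels = ["author_a", "author_a", "author_b", "author_b"]

vectors = lda_topic_vectors(docs, num_topics=2)
model = train_gaussian_nb(vectors, labels)
guess = predict(model, vectors[0])
```

In this sketch LDA is run over the whole collection at once, so every document, known or unknown, receives a topic vector from the same model. How the original system obtained topic vectors for unseen documents is not stated in the abstract.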
The authors wish to thank the Instituto Politécnico Nacional (COFAA, SIP) and the Mexican Government (CONACYT, SNI) for their support. The first author is currently on a research stay at the Laboratoire d’Informatique de Paris Nord, CNRS, Université Paris 13.
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Calvo, H., Hernández-Castañeda, Á., García-Flores, J. (2018). Author Identification Using Latent Dirichlet Allocation. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2017. Lecture Notes in Computer Science, vol 10762. Springer, Cham. https://doi.org/10.1007/978-3-319-77116-8_22
DOI: https://doi.org/10.1007/978-3-319-77116-8_22
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-77115-1
Online ISBN: 978-3-319-77116-8
eBook Packages: Computer Science (R0)