Using Word Embeddings for Computing Distances Between Texts and for Authorship Attribution

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10260)

Abstract

In this paper, word embeddings are used for the task of supervised authorship attribution. While previous methods have looked at, for instance, character n-grams, syntax and, most importantly, token frequencies, the method presented here focusses on the implications of semantic relationships between words. Instead of authors' word choices, this brings the semantic networks of entities as perceived by authors into closer focus. We find that these networks can be used reliably for authorship attribution. The method is generally applicable as a tool to compare different texts and/or authors through word embeddings that have been trained separately. Because separately trained embedding spaces are not aligned, vectors are not compared directly; instead, we compare the sets of most similar words for the words shared between texts, and then aggregate and average the similarities per text pair. On two literary corpora (German, English), we compute embeddings for each text separately. The similarities are then used to detect the author of an unknown text.
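
The comparison described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not state the neighbourhood size, the set-overlap measure or the decision rule, so the sketch assumes cosine-based top-k neighbours, Jaccard overlap, and attribution to the author with the highest mean similarity. The function names (`top_k_neighbours`, `text_similarity`, `attribute`) and the parameter `k` are hypothetical, and each embedding is assumed to be a plain dictionary mapping words to non-zero vectors.

```python
import numpy as np


def top_k_neighbours(emb, word, k):
    """Set of the k words most cosine-similar to `word` inside ONE embedding space."""
    target = np.asarray(emb[word], dtype=float)
    scores = {}
    for other, vec in emb.items():
        if other == word:
            continue
        vec = np.asarray(vec, dtype=float)
        # Assumption: no zero vectors, so the norms are non-zero.
        scores[other] = float(
            np.dot(target, vec) / (np.linalg.norm(target) * np.linalg.norm(vec))
        )
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])


def text_similarity(emb_a, emb_b, k=10):
    """Average neighbour-set overlap over the words shared by two texts.

    Vectors from emb_a and emb_b are never compared with each other;
    only the neighbour sets computed within each space are compared.
    Jaccard overlap is an assumption of this sketch.
    """
    shared = set(emb_a) & set(emb_b)
    if not shared:
        return 0.0
    overlaps = []
    for word in shared:
        near_a = top_k_neighbours(emb_a, word, k)
        near_b = top_k_neighbours(emb_b, word, k)
        union = near_a | near_b
        overlaps.append(len(near_a & near_b) / len(union) if union else 0.0)
    return sum(overlaps) / len(overlaps)


def attribute(unknown_emb, known, k=10):
    """Pick the author whose known texts are, on average, most similar.

    `known` maps an author name to a list of per-text embeddings.
    The mean-similarity decision rule is an assumption of this sketch.
    """
    def mean_similarity(text_embs):
        sims = [text_similarity(unknown_emb, emb, k) for emb in text_embs]
        return sum(sims) / len(sims)

    return max(known, key=lambda author: mean_similarity(known[author]))
```

In practice the per-text embeddings would come from a word2vec-style model trained on each text separately; the dictionaries here only stand in for such models.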

Keywords

Authorship attribution, Word embeddings, Text distance

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. Text Technology Lab/CEDIFOR, Goethe University Frankfurt, Frankfurt, Germany