Using Word Embeddings for Computing Distances Between Texts and for Authorship Attribution
In this paper, word embeddings are used for supervised authorship attribution. Whereas previous methods have examined, for instance, character n-grams, syntax and, most importantly, token frequencies, the method presented here focuses on the semantic relationships between words. This shifts attention from authors' word choices to the semantic networks of entities as authors perceive them. We find that these networks can be used reliably for authorship attribution. The method is generally applicable as a tool for comparing texts and/or authors through word embeddings that have been trained separately. This is achieved not by comparing vectors directly, but by comparing the sets of most similar words for each word shared between two texts, and then aggregating and averaging these similarities per text pair. On two literary corpora (German and English), we compute embeddings for each text separately. The resulting similarities are then used to identify the author of an unknown text.
Keywords: Authorship attribution, Word embeddings, Text distance
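The comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the use of Jaccard overlap between neighbour sets, the restriction of neighbours to the shared vocabulary, the neighbourhood size `k`, and the rule of attributing an unknown text to its most similar known text are all assumptions made for illustration. Embeddings are represented as plain dictionaries mapping words to vectors, and the toy vectors are invented.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_words(emb, word, k, candidates):
    """Set of the k candidate words most similar to `word` within one embedding."""
    scored = [(cosine(emb[word], emb[w]), w) for w in candidates if w != word]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # ties broken alphabetically
    return {w for _, w in scored[:k]}


def text_similarity(emb_a, emb_b, k=1):
    """Average neighbour-set overlap over all words shared by two embeddings.

    Vectors from emb_a are never compared with vectors from emb_b, so the two
    spaces may have been trained separately. Jaccard overlap is an assumption.
    """
    shared = sorted(set(emb_a) & set(emb_b))
    if len(shared) < 2:
        return 0.0
    total = 0.0
    for word in shared:
        near_a = nearest_words(emb_a, word, k, shared)
        near_b = nearest_words(emb_b, word, k, shared)
        total += len(near_a & near_b) / len(near_a | near_b)
    return total / len(shared)


def attribute(unknown_emb, known_embs, k=1):
    """Return the label of the known text most similar to the unknown one.

    `known_embs` maps a label (e.g. an author name) to that text's embedding.
    Nearest-text attribution is a hypothetical decision rule.
    """
    return max(known_embs, key=lambda label: text_similarity(unknown_emb, known_embs[label], k))


# Invented toy embeddings: `rotated` is `base` turned by 90 degrees, so its
# coordinates differ but its neighbourhood structure is the same.
base = {"sea": (1.0, 0.0), "ship": (0.9, 0.1), "love": (0.0, 1.0), "heart": (0.1, 0.9)}
rotated = {"sea": (0.0, 1.0), "ship": (-0.1, 0.9), "love": (-1.0, 0.0), "heart": (-0.9, 0.1)}
other = {"sea": (1.0, 0.0), "love": (0.9, 0.1), "ship": (0.0, 1.0), "heart": (0.1, 0.9)}
```

With these toy vectors, `text_similarity(base, rotated)` is 1.0 even though the coordinates differ, while `text_similarity(base, other)` is 0.0 because every word has a different nearest neighbour. This is the reason for comparing neighbour sets rather than vectors: separately trained spaces share no common coordinate system.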