Soft Computing, Volume 21, Issue 3, pp 627–639

Application of the distributed document representation in the authorship attribution task for small corpora

  • Juan-Pablo Posadas-Durán
  • Helena Gómez-Adorno
  • Grigori Sidorov
  • Ildar Batyrshin
  • David Pinto
  • Liliana Chanona-Hernández
Focus

DOI: 10.1007/s00500-016-2446-x

Cite this article as:
Posadas-Durán, JP., Gómez-Adorno, H., Sidorov, G. et al. Soft Comput (2017) 21: 627. doi:10.1007/s00500-016-2446-x

Abstract

Distributed word representation in a vector space (word embeddings) is a novel technique that represents words in terms of the elements in their neighborhood. Distributed representations can be extended to larger language structures such as phrases, sentences, paragraphs and documents. The capability to encode semantic information of texts and the ability to handle high-dimensional datasets are the reasons why this representation is widely used in various natural language processing tasks such as text summarization, sentiment analysis and syntactic parsing. In this paper, we propose to use the distributed representation at the document level to solve the authorship attribution task. The proposed method learns distributed vector representations at the document level and then uses an SVM classifier to perform automatic authorship attribution. We also propose to use word n-grams (instead of words) as the input data type for learning the distributed representation model. We conducted experiments over six datasets used in state-of-the-art works, and for the majority of the datasets, we obtained comparable or better results. Our best results were obtained using the combination of words and word n-grams as the input data types. Training data were relatively scarce, but this did not affect the distributed representation.
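The idea of combining words and word n-grams as input tokens can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `word_ngrams` and `combined_input`, the whitespace tokenizer, the lowercasing, and the underscore used to join n-gram parts are all assumptions made for illustration. The resulting token list is the kind of sequence that could then be passed to a document embedding model, whose vectors would in turn feed an SVM classifier.

```python
from typing import List


def word_ngrams(tokens: List[str], n: int) -> List[str]:
    """Return contiguous word n-grams, each joined into a single token."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # Slide a window of size n over the token list.
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def combined_input(text: str, max_n: int = 2) -> List[str]:
    """Return words followed by word n-grams for n = 2..max_n.

    The combination scheme (simple concatenation) is an assumption,
    not a detail taken from the paper.
    """
    tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
    features = list(tokens)
    for n in range(2, max_n + 1):
        features.extend(word_ngrams(tokens, n))
    return features


if __name__ == "__main__":
    print(combined_input("the cat sat on the mat", max_n=2))
```

Joining each n-gram into one token lets an embedding model treat it exactly like a word, so the same training procedure applies to either input type.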

Keywords

  • Distributed representation
  • Authorship attribution
  • Author identification
  • Embeddings
  • Word embeddings
  • Stylometry
  • Machine learning
  • SVM
  • Scarce training data

Copyright information

© Springer-Verlag Berlin Heidelberg 2016

Authors and Affiliations

  1. Escuela Superior de Ingeniería Mecánica y Eléctrica Unidad Zacatenco (ESIME-Zacatenco), Instituto Politécnico Nacional (IPN), Mexico City, Mexico
  2. Center for Computing Research (CIC), Instituto Politécnico Nacional (IPN), Mexico City, Mexico
  3. Faculty of Computer Science, Benemérita Universidad Autónoma de Puebla (BUAP), Puebla, Mexico
