Application of the distributed document representation in the authorship attribution task for small corpora

  • Focus
  • Journal: Soft Computing

Abstract

Distributed word representation in a vector space (word embeddings) is a novel technique that represents words in terms of the elements in their neighborhood. Distributed representations can be extended to larger language structures such as phrases, sentences, paragraphs, and documents. The capability to encode semantic information of texts and the ability to handle high-dimensional datasets are the reasons why this representation is widely used in various natural language processing tasks such as text summarization, sentiment analysis, and syntactic parsing. In this paper, we propose to use the distributed representation at the document level to solve the authorship attribution task. The proposed method learns distributed vector representations at the document level and then uses an SVM classifier to perform automatic authorship attribution. We also propose to use word n-grams (instead of words) as the input data type for learning the distributed representation model. We conducted experiments on six datasets used in state-of-the-art works, and for the majority of the datasets, we obtained comparable or better results. Our best results were obtained using the combination of words and word n-grams as the input data types. The training data are relatively scarce, but this did not affect the distributed representation.
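
The pipeline described in the abstract (document-level distributed vectors fed to an SVM, with words and word n-grams as input tokens) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation. The helper names (`word_ngrams`, `build_input_tokens`, `attribute_authors`), the whitespace tokenizer, the underscore used to join n-grams, the way words and n-grams are merged into a single token stream, and all hyperparameter values are assumptions made for the example. It assumes gensim 4.x (where document vectors are exposed as `model.dv`) and scikit-learn.

```python
from typing import List, Sequence


def word_ngrams(tokens: Sequence[str], n: int) -> List[str]:
    """Return contiguous word n-grams, each joined into a single token with '_'."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_input_tokens(text: str, max_n: int = 2) -> List[str]:
    """Combine words (n=1) and word n-grams up to max_n into one token stream.

    Assumption: lowercasing plus whitespace splitting stands in for a tokenizer.
    """
    words = text.lower().split()
    tokens: List[str] = []
    for n in range(1, max_n + 1):
        tokens.extend(word_ngrams(words, n))
    return tokens


def attribute_authors(train_texts, train_authors, test_texts, max_n=2):
    """Learn document vectors with Doc2Vec, then classify them with a linear SVM."""
    # Imported lazily so the n-gram helpers above work without these packages.
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument
    from sklearn.svm import LinearSVC

    # Each training document gets its integer position as its tag.
    tagged = [
        TaggedDocument(build_input_tokens(text, max_n), [i])
        for i, text in enumerate(train_texts)
    ]

    # Hyperparameters are illustrative placeholders, not values from the paper.
    model = Doc2Vec(vector_size=100, window=5, min_count=1, epochs=40, workers=1)
    model.build_vocab(tagged)
    model.train(tagged, total_examples=model.corpus_count, epochs=model.epochs)

    # Vectors learned for training documents; inferred vectors for unseen ones.
    x_train = [model.dv[i] for i in range(len(tagged))]
    x_test = [model.infer_vector(build_input_tokens(t, max_n)) for t in test_texts]

    classifier = LinearSVC().fit(x_train, train_authors)
    return list(classifier.predict(x_test))
```

Calling `attribute_authors(train_texts, train_authors, test_texts)` with lists of raw strings and author labels returns one predicted author per test text. Setting `max_n=1` reduces the input to plain words, which makes it easy to compare the two input types.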

Notes

  1. https://radimrehurek.com/gensim/.

  2. http://www.uni-weimar.de/medien/webis/events/pan-15/pan15-web/.

  3. https://radimrehurek.com/gensim/models/Doc2vec.html.

Acknowledgements

This work was partially supported by the Mexican Government (CONACYT PROJECT 240844, SNI, COFAA - IPN, SIP - IPN 20162204, 20161947, 20161958, 20162064, 20151406, 20151589, 20144274).

Author information

Corresponding author

Correspondence to Juan-Pablo Posadas-Durán.

Ethics declarations

Conflict of interest

The authors report no conflict of interest.

Human and animal rights

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Communicated by H. Ponce.

About this article

Cite this article

Posadas-Durán, JP., Gómez-Adorno, H., Sidorov, G. et al. Application of the distributed document representation in the authorship attribution task for small corpora. Soft Comput 21, 627–639 (2017). https://doi.org/10.1007/s00500-016-2446-x

  • DOI: https://doi.org/10.1007/s00500-016-2446-x
