Soft Computing, Volume 21, Issue 3, pp 627–639

Application of the distributed document representation in the authorship attribution task for small corpora

  • Juan-Pablo Posadas-Durán
  • Helena Gómez-Adorno
  • Grigori Sidorov
  • Ildar Batyrshin
  • David Pinto
  • Liliana Chanona-Hernández
Focus

Abstract

Distributed word representation in a vector space (word embeddings) is a novel technique that represents words in terms of the elements in their neighborhood. Distributed representations can be extended to larger language structures such as phrases, sentences, paragraphs and documents. The capability to encode semantic information of texts and the ability to handle high-dimensional datasets are the reasons why this representation is widely used in various natural language processing tasks such as text summarization, sentiment analysis and syntactic parsing. In this paper, we propose to use the distributed representation at the document level to solve the authorship attribution task. The proposed method learns distributed vector representations at the document level and then uses an SVM classifier to perform automatic authorship attribution. We also propose to use word n-grams (instead of words) as the input data type for learning the distributed representation model. We conducted experiments over six datasets used in state-of-the-art works, and for the majority of the datasets, we obtained comparable or better results. Our best results were obtained using the combination of words and word n-grams as the input data types. The training data are relatively scarce, but this did not affect the distributed representation.
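The input data type described in the abstract, words combined with word n-grams, can be illustrated with a minimal sketch. The whitespace tokenization, lowercasing, underscore joining, and the names `word_ngrams` and `words_plus_ngrams` are assumptions made for illustration; the abstract does not specify the preprocessing actually used. The resulting token sequence is the kind of input that could be passed to a document-level embedding model, whose vectors would then be given to an SVM classifier.

```python
from typing import List


def word_ngrams(tokens: List[str], n: int) -> List[str]:
    """Return contiguous word n-grams, each joined into a single token."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def words_plus_ngrams(text: str, max_n: int = 3) -> List[str]:
    """Tokenize on whitespace (an assumption) and append word n-grams for n = 2..max_n."""
    tokens = text.lower().split()
    features = list(tokens)  # start with the plain words
    for n in range(2, max_n + 1):
        features.extend(word_ngrams(tokens, n))
    return features


if __name__ == "__main__":
    # Words followed by word bigrams for a short example sentence.
    print(words_plus_ngrams("the quick brown fox", max_n=2))
```

Treating each n-gram as a single token lets an embedding model handle it exactly as it would a word, so the same training procedure applies to either input type or to their combination.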

Keywords

Distributed representation; Authorship attribution; Author identification; Embeddings; Word embeddings; Stylometry; Machine learning; SVM; Scarce training data


Copyright information

© Springer-Verlag Berlin Heidelberg 2016

Authors and Affiliations

  1. Escuela Superior de Ingeniería Mecánica y Eléctrica Unidad Zacatenco (ESIME-Zacatenco), Instituto Politécnico Nacional (IPN), Mexico City, Mexico
  2. Center for Computing Research (CIC), Instituto Politécnico Nacional (IPN), Mexico City, Mexico
  3. Faculty of Computer Science, Benemérita Universidad Autónoma de Puebla (BUAP), Puebla, Mexico
