Automatic Adaptation of Author’s Stylometric Features to Document Types

  • Jan Rygl
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8655)


Many Internet users encounter anonymous documents and texts with counterfeit authorship. The number of questionable documents exceeds the capacity of human experts, so a universal automated authorship identification system that supports all types of documents is needed. In this paper, five predominant document types are analysed in the context of authorship verification: books, blogs, discussions, comments and tweets. A method for the automatic selection of authors’ stylometric features using double-layer machine learning is proposed and evaluated. Experiments are conducted on ten disjoint training and test sets, and a method for the efficient training of a large number of machine learning models is introduced (163,700 models were trained).
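The abstract does not spell out the double-layer architecture, so the following is only a minimal sketch of one common reading of the idea, not the paper's implementation. It assumes that a first layer turns each stylometric feature family into a single similarity score for a pair of documents, and that a second layer learns how much weight each score deserves. The feature families, the cosine similarity, the logistic-regression second layer and all function names are hypothetical choices made for illustration.

```python
import math
from collections import Counter

# Hypothetical feature families; the abstract does not list the real ones.
FUNCTION_WORDS = ("the", "and", "of", "to", "a", "in", "that", "it")
PUNCTUATION = ".,;:!?-"

def word_length_profile(text):
    """Relative frequency of word lengths 1..10 (longer words count as 10)."""
    words = text.split()
    counts = Counter(min(len(w), 10) for w in words)
    total = max(len(words), 1)
    return [counts[k] / total for k in range(1, 11)]

def punctuation_profile(text):
    """Relative frequency of each punctuation mark per character."""
    total = max(len(text), 1)
    return [text.count(ch) / total for ch in PUNCTUATION]

def function_word_profile(text):
    """Relative frequency of a few function words."""
    words = [w.strip(PUNCTUATION).lower() for w in text.split()]
    counts = Counter(words)
    total = max(len(words), 1)
    return [counts[w] / total for w in FUNCTION_WORDS]

EXTRACTORS = (word_length_profile, punctuation_profile, function_word_profile)

def cosine(u, v):
    """Cosine similarity; 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similarity_vector(doc_a, doc_b):
    """Layer 1: one similarity score per feature family."""
    return [cosine(f(doc_a), f(doc_b)) for f in EXTRACTORS]

def train_second_layer(pairs, labels, epochs=500, lr=0.5):
    """Layer 2: logistic regression over layer-1 scores (1 = same author)."""
    weights = [0.0] * len(EXTRACTORS)
    bias = 0.0
    vectors = [similarity_vector(a, b) for a, b in pairs]
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            z = bias + sum(w * xi for w, xi in zip(weights, x))
            error = 1.0 / (1.0 + math.exp(-z)) - y
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias

def same_author_probability(model, doc_a, doc_b):
    """Apply both layers to a new pair of documents."""
    weights, bias = model
    x = similarity_vector(doc_a, doc_b)
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))
```

Under these assumptions, training one second-layer model per document type (books, blogs, discussions, comments, tweets) would give each type its own weight vector, which is one way the importance of a feature family could adapt to the type of document.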


Keywords: authorship verification, feature selection, machine learning, stylome, stylometric features





Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Jan Rygl, Natural Language Processing Centre, Faculty of Informatics, Masaryk University, Brno, Czech Republic
