Automatic Adaptation of Author’s Stylometric Features to Document Types
Many Internet users encounter anonymous documents and texts with counterfeit authorship. The number of questionable documents exceeds the capacity of human experts, so a universal automated authorship identification system supporting all types of documents is needed. In this paper, five predominant document types are analysed in the context of authorship verification: books, blogs, discussions, comments and tweets. A method for automatically selecting authors’ stylometric features using double-layer machine learning is proposed and evaluated. Experiments are conducted on ten disjoint training and test sets, and a method for efficiently training a large number of machine learning models is introduced (163,700 models were trained).
Keywords: authorship verification, feature selection, machine learning, stylome, stylometric features