Abstract
The paper proposes an original methodology of authorship attribution based on deviations from the Zipf distribution, using statistical data obtained with a concordance program and computations performed in a table processor. The methodology involves finding distances between input texts and a reference text based on deviations of stop-word frequencies. The results show that the proposed methodology enables efficient authorship attribution and that it can be used in the educational process to develop students' skills and competencies in natural language processing.
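The idea of comparing texts by how their stop-word frequencies deviate from a Zipf distribution can be illustrated with a minimal Python sketch. This is not the author's procedure, which relies on a concordancer and a table processor. The stop-word list, the expected-frequency rule (top frequency divided by rank), and the sum-of-absolute-differences distance are all assumptions made for illustration, as are the function names.

```python
import re
from collections import Counter

# Hypothetical, illustrative stop-word list (not taken from the paper).
STOP_WORDS = ("the", "of", "and", "to", "a", "in")


def zipf_deviations(text, stop_words=STOP_WORDS):
    """Return, per stop word, observed minus Zipf-expected relative frequency.

    Assumption: the expected frequency at rank r is the top-ranked
    stop word's relative frequency divided by r.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t in stop_words)
    if not tokens or not counts:
        return {w: 0.0 for w in stop_words}
    total = len(tokens)
    # Rank stop words by descending count; ties are broken alphabetically.
    ranked = sorted(stop_words, key=lambda w: (-counts[w], w))
    top = counts[ranked[0]] / total
    deviations = {}
    for rank, word in enumerate(ranked, start=1):
        observed = counts[word] / total
        expected = top / rank
        deviations[word] = observed - expected
    return deviations


def distance(text, reference, stop_words=STOP_WORDS):
    """Sum of absolute differences between two deviation profiles.

    Assumption: this simple distance stands in for whatever measure
    the paper actually uses.
    """
    d_text = zipf_deviations(text, stop_words)
    d_ref = zipf_deviations(reference, stop_words)
    return sum(abs(d_text[w] - d_ref[w]) for w in stop_words)
```

Under these assumptions, an input text would be attributed to the candidate author whose reference text yields the smallest distance. Realistic use would require texts far longer than a sentence and a fuller stop-word list.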
Notes
On the approval of the federal state educational standard of higher education in the field of study 03.03.02 Linguistics (bachelor's level): Order of the Ministry of Education and Science of Russia no. 940 of August 7, 2014. URL: http://fgosvo.ru/uploadfiles/fgosvob/450302_Lingvistika.pdf. Accessed June 25, 2020.
Funding
This research was carried out with the support of the Russian Foundation for Basic Research (project no. 20-07-00124).
Ethics declarations
The authors declare that they have no conflicts of interest.
About this article
Cite this article
Yatsko, V.A. A Methodology of Using a Concordancer and Table Processor for Authorship Attribution. Autom. Doc. Math. Linguist. 54, 269–274 (2020). https://doi.org/10.3103/S0005105520050088