This paper describes the procedure and specific features of applying a new method of automatic text classification based on calculating the deviations of the stop-word distribution from the Zipfian score. To neutralize differences in text length, the author describes and applies a text undersampling methodology. The concept of an iterative threshold level is introduced to reduce text dimensionality to several dozen units. To evaluate the method’s efficiency, the author has developed indicators of discriminative and similarative power, which underlie a generalized efficiency score. Fourteen tests, including a comparison with the cosine similarity measure, demonstrated the high efficiency of the proposed method for authorship attribution of fiction texts and clustering of political texts.
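The abstract does not give the formulas, so the following Python sketch only illustrates the general idea under stated assumptions. The stop-word list, the truncation-based undersampling, and the definition of the Zipfian expectation (top stop-word count divided by rank) are all hypothetical choices made for illustration and are not taken from the paper. The cosine function is the standard measure that the abstract mentions as a comparison baseline.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; the paper's actual list is not given in the abstract.
STOP_WORDS = ("the", "of", "and", "a", "to", "in")


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def undersample(tokens, length):
    """Assumed undersampling: keep only the first `length` tokens so that
    texts of different sizes are compared on an equal footing."""
    return tokens[:length]


def zipf_deviations(tokens, stop_words=STOP_WORDS):
    """Return, for each stop word, observed count minus an assumed Zipfian
    expectation of top_count / rank (rank taken among the stop words)."""
    counts = Counter(t for t in tokens if t in stop_words)
    # Sort by descending count, breaking ties alphabetically for determinism.
    ranked = sorted(stop_words, key=lambda w: (-counts[w], w))
    top = counts[ranked[0]]
    return {w: counts[w] - top / rank for rank, w in enumerate(ranked, start=1)}


def cosine_similarity(u, v):
    """Standard cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

To compare two texts under these assumptions, one could tokenize both, undersample them to the length of the shorter one, compute `zipf_deviations` for each, and compare the resulting values in a fixed stop-word order. Which distance the paper actually uses on the deviation vectors is not stated in the abstract.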
This study was supported by the Russian Foundation for Basic Research, project no. 20-07-00124.
Translated by L. Solovyova
Cite this article
Yatsko, V.A. A New Method of Automatic Text Document Classification. Autom. Doc. Math. Linguist. 55, 122–133 (2021). https://doi.org/10.3103/S0005105521030080
Keywords
- automatic text classification
- methods and algorithms
- Zipf distribution
- reduction of text dimensionality
- threshold levels
- efficiency indices