A New Method of Automatic Text Document Classification

Abstract

This paper describes the procedure and specific features of applying a new method of automatic classification based on calculating the deviations of the stop-word distribution from the Zipfian score. To neutralize discrepancies in text lengths, the author describes and applies a text undersampling methodology. The concept of an iterative threshold level is introduced to reduce text dimensionality to several dozen units. To evaluate the method’s efficiency, the author has developed indicators of discriminative and similarative power that underlie a generalized efficiency score. Fourteen tests, including a comparison with the cosine similarity measure, demonstrated the high efficiency of the proposed method for authorship attribution of fiction texts and clustering of political texts.
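The abstract does not give the formulas, so the following Python fragment is only a minimal sketch of the general idea, not the author's implementation. It assumes that stop words are ranked among themselves by frequency, that the Zipfian expectation at rank r is the top count divided by r, and that undersampling means truncating every text to the length of the shortest one. The stop-word list and all function names are hypothetical. The cosine measure is included only as the baseline the abstract mentions; the iterative threshold level and the efficiency indicators are not sketched because their definitions are not available here.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real one would be far longer.
STOP_WORDS = ("the", "of", "and", "to", "a", "in")


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def undersample(token_lists):
    """Truncate every token list to the length of the shortest one.

    Assumption: this simple truncation stands in for the paper's
    undersampling methodology, which the abstract does not specify.
    """
    shortest = min(len(tokens) for tokens in token_lists)
    return [tokens[:shortest] for tokens in token_lists]


def zipf_deviations(tokens, stop_words=STOP_WORDS):
    """Return observed-minus-expected counts for each stop word.

    Assumption: the expected count at rank r is top_count / r, where
    top_count is the count of the most frequent stop word in the text.
    Stop words absent from the text get a deviation of 0.0.
    """
    counts = Counter(t for t in tokens if t in stop_words)
    deviations = {word: 0.0 for word in stop_words}
    ranked = counts.most_common()
    if not ranked:
        return deviations
    top_count = ranked[0][1]
    for rank, (word, count) in enumerate(ranked, start=1):
        deviations[word] = count - top_count / rank
    return deviations


def cosine_similarity(u, v):
    """Standard cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

A typical use would be to tokenize two texts, pass both token lists through `undersample`, and then compare the per-word values returned by `zipf_deviations` for each text.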

Notes

  1. http://yatsko.zohosites.com/tf-idf-ranker1.html

  2. https://www.laurenceanthony.net/software/antconc/

  3. https://www.kaggle.com/nltkdata/reuters

Funding

This study was supported by the Russian Foundation for Basic Research, project no. 20-07-00124.

Author information

Corresponding author

Correspondence to V. A. Yatsko.

Additional information

Translated by L. Solovyova

About this article

Cite this article

Yatsko, V.A. A New Method of Automatic Text Document Classification. Autom. Doc. Math. Linguist. 55, 122–133 (2021). https://doi.org/10.3103/S0005105521030080

Keywords:

  • automatic text classification
  • methods and algorithms
  • Zipf distribution
  • reduction of text dimensionality
  • threshold levels
  • efficiency indices