Automatic text classification method based on Zipf’s law

Automatic Documentation and Mathematical Linguistics

Abstract

This paper describes a method for automatic text classification based on analysing the deviation of the word distribution from Zipf’s law, combined with the zonal data-processing approach. Deviation is understood as the difference between a word’s actual numerical score and the score predicted for it by Zipf’s law. The proposed method divides the input and reference texts into J0, J1, and J2 zones and builds a numerical series from the words contained in the J0 zone. This series records the difference between the real scores of words and the scores calculated according to Zipf’s law. The proposed method can significantly reduce text dimensionality and thus improve the running speed of automatic text classification.
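
The notion of deviation from Zipf’s law can be illustrated with a minimal sketch. It is not the author’s implementation: the abstract does not specify how a word’s score is computed or where the zone boundaries lie, so the sketch assumes that the score is the raw word count, that the Zipf prediction for rank r is the top count divided by r, and that no zoning is applied. The function name `zipf_deviation_series` and the `top_k` parameter are hypothetical.

```python
import re
from collections import Counter


def zipf_deviation_series(text, top_k=None):
    """Return (word, actual, expected, deviation) tuples ordered by rank.

    Assumptions (not taken from the paper): the "score" of a word is its
    raw count, and the Zipf prediction for rank r is count(rank 1) / r.
    """
    # Crude tokenizer: lowercase runs of ASCII letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    # most_common sorts by descending count; ties keep first-seen order.
    ranked = Counter(words).most_common(top_k)
    if not ranked:
        return []
    top_count = ranked[0][1]
    series = []
    for rank, (word, count) in enumerate(ranked, start=1):
        expected = top_count / rank  # idealized Zipf frequency
        series.append((word, count, expected, count - expected))
    return series


sample = "the cat the dog the cat sat"
for word, actual, expected, deviation in zipf_deviation_series(sample):
    print(f"{word:>4}  actual={actual}  zipf={expected:.2f}  dev={deviation:+.2f}")
```

Under these assumptions, two texts could be compared through their deviation series rather than through full word-frequency vectors, which is one way to read the dimensionality reduction that the abstract mentions.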

Author information

Corresponding author

Correspondence to V. A. Yatsko.

Additional information

Original Russian Text © V.A. Yatsko, 2015, published in Nauchno-Tekhnicheskaya Informatsiya, Seriya 2, 2015, No. 5, pp. 7–18.

About this article

Cite this article

Yatsko, V.A. Automatic text classification method based on Zipf’s law. Autom. Doc. Math. Linguist. 49, 83–88 (2015). https://doi.org/10.3103/S0005105515030048
