Abstract
This paper describes a method for automatic text classification that analyses how far the word distribution of a text deviates from Zipf’s law, combined with the zonal data-processing approach. Deviation is understood as the difference between the actual numerical score of a word and the score that Zipf’s law predicts for it. The proposed method divides the input and reference texts into the J0, J1, and J2 zones and builds a numerical series from the words contained in the J0 zone. This series records, for each word, the difference between its real score and its score calculated according to Zipf’s law. The proposed method can significantly reduce text dimensionality and thus improve the running speed of automatic text classification.
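The notion of deviation used above can be illustrated with a minimal Python sketch. It rests on assumptions that the abstract does not state: the "numerical score" of a word is taken to be its raw frequency, Zipf's law is applied in its simplest form (expected frequency at rank r equals the top frequency divided by r), and the series is built over all words rather than only those in the J0 zone, since the zone boundaries are not defined here. The function name `zipf_deviations` is hypothetical.

```python
import re
from collections import Counter


def zipf_deviations(text):
    """Return (word, rank, actual, expected, deviation) tuples ordered by rank.

    Assumptions (not taken from the paper): the score of a word is its raw
    frequency, and the Zipf prediction is top_frequency / rank.
    """
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    # Sort by descending frequency; break ties alphabetically so that the
    # ranking is deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    if not ranked:
        return []
    top_frequency = ranked[0][1]
    rows = []
    for rank, (word, actual) in enumerate(ranked, start=1):
        expected = top_frequency / rank  # Zipf's law with exponent 1 (assumed)
        rows.append((word, rank, actual, expected, actual - expected))
    return rows


if __name__ == "__main__":
    for row in zipf_deviations("the the the the cat cat sat"):
        print(row)
```

A classifier along the lines the abstract describes would compare such a series for an input text with the series for each reference text. How that comparison is carried out is not specified in the abstract, so it is left out of the sketch.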
Additional information
Original Russian Text © V.A. Yatsko, 2015, published in Nauchno-Tekhnicheskaya Informatsiya, Seriya 2, 2015, No. 5, pp. 7–18.
About this article
Cite this article
Yatsko, V.A. Automatic text classification method based on Zipf’s law. Autom. Doc. Math. Linguist. 49, 83–88 (2015). https://doi.org/10.3103/S0005105515030048
DOI: https://doi.org/10.3103/S0005105515030048