Abstract
We consider a method for automatic text classification (i.e., classification without human involvement) based on universal source coding (or “data compression”). We show that under certain restrictions the proposed method is consistent, i.e., the classification error tends to zero as the text length increases. As a practical example, we consider the classification of scientific texts (research papers, books, etc.). Experiments show the proposed method to be highly efficient.
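The general idea of classifying by compression can be illustrated with a minimal sketch. It is not the authors' procedure: it assumes Python's standard `zlib` (DEFLATE) as a stand-in for a universal code, and it assumes a simple decision rule in which a text is assigned to the class whose sample corpus lets it be encoded with the fewest additional bytes. The names `compressed_len`, `classify`, and `corpora` are hypothetical and chosen only for this illustration.

```python
import zlib
from typing import Dict


def compressed_len(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data (assumed stand-in for a universal code)."""
    return len(zlib.compress(data, 9))


def classify(text: str, corpora: Dict[str, str]) -> str:
    """Return the label whose corpus yields the smallest extra compressed length.

    For each class, the cost is len(C(corpus + text)) - len(C(corpus)):
    a rough estimate of how many bytes are needed to encode the text
    once the compressor has already seen the class corpus.
    """
    if not corpora:
        raise ValueError("corpora must contain at least one class")
    encoded_text = text.encode("utf-8")
    best_label = None
    best_cost = None
    for label, corpus in corpora.items():
        encoded_corpus = corpus.encode("utf-8")
        cost = compressed_len(encoded_corpus + encoded_text) - compressed_len(encoded_corpus)
        if best_cost is None or cost < best_cost:
            best_label, best_cost = label, cost
    return best_label


if __name__ == "__main__":
    # Toy corpora, invented for this illustration only.
    corpora = {
        "physics": "quantum mechanics describes the energy levels of the electron. ",
        "biology": "dna encodes the genetic information of every living organism. ",
    }
    print(classify("the energy levels of the electron", corpora))
```

One caveat on this design choice: DEFLATE only looks back over a 32 KB window, so with large class corpora most of the corpus would no longer influence how the text is encoded. A compressor with a longer memory would be needed for realistic data.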
Additional information
Original Russian Text © B.Ya. Ryabko, A.E. Gus’kov, I.V. Selivanova, 2017, published in Problemy Peredachi Informatsii, 2017, Vol. 53, No. 3, pp. 100–111.
About this article
Cite this article
Ryabko, B.Y., Gus’kov, A.E. & Selivanova, I.V. Information-Theoretic method for classification of texts. Probl Inf Transm 53, 294–304 (2017). https://doi.org/10.1134/S0032946017030115
DOI: https://doi.org/10.1134/S0032946017030115