Information-Theoretic method for classification of texts

Ryabko, B. Ya.; Gus’kov, A. E.; Selivanova, I. V.

doi:10.1134/S0032946017030115


Source Coding
Published: 13 May 2017

Volume 53, pages 294–304, (2017)

Problems of Information Transmission

B. Ya. Ryabko^1,2,
A. E. Gus’kov^1,3 &
I. V. Selivanova^2,3


Abstract

We consider a method for automatic (i.e., unmanned) text classification based on methods of universal source coding (or “data compression”). We show that under certain restrictions the proposed method is consistent, i.e., the classification error tends to zero with increasing text lengths. As an example of practical use of the method we consider the classification problem for scientific texts (research papers, books, etc.). The proposed method is experimentally shown to be highly efficient.
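The general idea of classifying by compression can be illustrated with a minimal sketch. It is not the authors' algorithm, which is not given in this abstract: the choice of zlib as the compressor, the function names `compressed_size` and `classify`, and the toy corpora are all assumptions made for illustration. A text is assigned to the class whose reference corpus lets the compressor encode it in the fewest additional bytes.

```python
import zlib
from typing import Dict


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed form of data."""
    return len(zlib.compress(data, 9))


def classify(text: str, corpora: Dict[str, str]) -> str:
    """Return the label whose reference corpus makes `text` cheapest to encode."""
    if not corpora:
        raise ValueError("at least one labelled corpus is required")
    sample = text.encode("utf-8")
    costs = {}
    for label, corpus in corpora.items():
        reference = corpus.encode("utf-8")
        # Extra bytes needed for the sample once the compressor has
        # already seen the reference text of this class.
        costs[label] = (
            compressed_size(reference + b" " + sample)
            - compressed_size(reference)
        )
    return min(costs, key=costs.get)


# Toy reference corpora, invented for illustration only.
CORPORA = {
    "information theory": (
        "the entropy rate of a stationary ergodic source bounds the "
        "compression ratio of any universal code and a universal code "
        "adapts to unknown source statistics"
    ),
    "biology": (
        "the enzyme catalyzes hydrolysis of the peptide bond and the "
        "protein folds into a stable tertiary structure inside the "
        "cell membrane"
    ),
}

if __name__ == "__main__":
    query = "the entropy rate of a stationary ergodic source"
    print(classify(query, CORPORA))
```

DEFLATE, the algorithm behind zlib, only looks back 32 KiB, so with this compressor the reference corpora must stay small; a compressor with a longer memory would be needed for realistic corpora.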



Author information

Authors and Affiliations

Institute of Computational Technologies, Siberian Branch of the Russian Academy of Sciences, Novosibirsk, Russia
B. Ya. Ryabko & A. E. Gus’kov
Novosibirsk State University, Novosibirsk, Russia
B. Ya. Ryabko & I. V. Selivanova
Russian National Public Library for Science and Technology, Siberian Branch of the Russian Academy of Sciences, Novosibirsk, Russia
A. E. Gus’kov & I. V. Selivanova


Corresponding author

Correspondence to B. Ya. Ryabko.

Additional information

Original Russian Text © B.Ya. Ryabko, A.E. Gus’kov, I.V. Selivanova, 2017, published in Problemy Peredachi Informatsii, 2017, Vol. 53, No. 3, pp. 100–111.


About this article

Cite this article

Ryabko, B.Y., Gus’kov, A.E. & Selivanova, I.V. Information-Theoretic method for classification of texts. Probl Inf Transm 53, 294–304 (2017). https://doi.org/10.1134/S0032946017030115


Received: 21 October 2015
Published: 13 May 2017
Issue Date: July 2017
DOI: https://doi.org/10.1134/S0032946017030115
