Information-Theoretic method for classification of texts

Source Coding
Problems of Information Transmission

Abstract

We consider a method for automatic text classification (i.e., classification performed without human involvement) based on universal source coding (also known as data compression). We show that, under certain restrictions, the proposed method is consistent: the classification error tends to zero as the text length increases. As a practical example, we consider the classification of scientific texts (research papers, books, etc.). Experiments show the proposed method to be highly efficient.
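
The general idea of classifying by compression can be illustrated with a minimal Python sketch. It is not the authors' algorithm: it assumes that the general-purpose `zlib` compressor can stand in for a universal code, and the names `compressed_size`, `coding_cost`, and `classify` are hypothetical. A text is assigned to the class whose sample text lets the compressor encode it with the fewest additional bytes.

```python
import zlib
from typing import Mapping


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data (maximum compression level)."""
    return len(zlib.compress(data, 9))


def coding_cost(text: str, sample: str) -> int:
    """Extra bytes needed to encode `text` once `sample` has already been seen.

    Assumption: zlib is used here only as a convenient stand-in for a
    universal code; its 32 KB window limits how much of `sample` it can use.
    """
    base = sample.encode("utf-8")
    query = text.encode("utf-8")
    return compressed_size(base + b" " + query) - compressed_size(base)


def classify(text: str, class_samples: Mapping[str, str]) -> str:
    """Return the label whose sample text gives the smallest coding cost."""
    return min(class_samples, key=lambda label: coding_cost(text, class_samples[label]))
```

In use, `class_samples` maps each class label to a representative text of that class, and `classify(text, class_samples)` returns the label with the lowest cost. Short samples such as single sentences are enough to see the effect, but a realistic experiment would need much longer texts and a compressor without a small window limit.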

Author information

Corresponding author

Correspondence to B. Ya. Ryabko.

Additional information

Original Russian Text © B.Ya. Ryabko, A.E. Gus’kov, I.V. Selivanova, 2017, published in Problemy Peredachi Informatsii, 2017, Vol. 53, No. 3, pp. 100–111.

Cite this article

Ryabko, B.Y., Gus’kov, A.E. & Selivanova, I.V. Information-Theoretic method for classification of texts. Probl Inf Transm 53, 294–304 (2017). https://doi.org/10.1134/S0032946017030115
