
Problems of Information Transmission, Volume 53, Issue 3, pp 294–304

Information-Theoretic method for classification of texts

  • B. Ya. Ryabko
  • A. E. Gus’kov
  • I. V. Selivanova
Source Coding

Abstract

We consider a method for automatic (i.e., unmanned) text classification based on methods of universal source coding (or “data compression”). We show that under certain restrictions the proposed method is consistent, i.e., the classification error tends to zero with increasing text lengths. As an example of practical use of the method we consider the classification problem for scientific texts (research papers, books, etc.). The proposed method is experimentally shown to be highly efficient.
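The abstract does not spell out the classification procedure, so the following is only a minimal sketch of the general compression-based idea, not the authors' method. It assumes that a text is assigned to the class whose reference corpus lets a compressor encode it most cheaply, and it uses zlib as a stand-in compressor (zlib is not a universal code in the strict sense used in the paper). The names `compressed_size`, `classify`, and `CORPORA`, as well as the toy corpora, are hypothetical.

```python
import zlib
from typing import Mapping


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data (maximum compression level)."""
    return len(zlib.compress(data, 9))


def classify(text: str, corpora: Mapping[str, str]) -> str:
    """Return the label whose reference corpus makes `text` cheapest to encode.

    The cost for a label is the number of extra compressed bytes needed
    when `text` is appended to that label's reference corpus.
    """
    sample = text.encode("utf-8")

    def extra_bytes(label: str) -> int:
        reference = corpora[label].encode("utf-8")
        return compressed_size(reference + sample) - compressed_size(reference)

    return min(corpora, key=extra_bytes)


# Toy reference corpora (hypothetical; real use would need far longer texts).
CORPORA = {
    "physics": (
        "The electron spin couples to the magnetic field and the quantum "
        "state evolves under the Hamiltonian. Photon scattering amplitudes "
        "follow from perturbation theory."
    ),
    "biology": (
        "The cell membrane regulates protein transport and the genome "
        "encodes enzymes for metabolism. Ribosome assembly depends on "
        "messenger RNA translation."
    ),
}
```

Calling `classify("the quantum state evolves under the Hamiltonian", CORPORA)` picks the label with the smallest extra compressed size. The consistency result stated in the abstract concerns growing text lengths, so a toy example of this size only illustrates the mechanics.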



Copyright information

© Pleiades Publishing, Inc. 2017

Authors and Affiliations

  • B. Ya. Ryabko (1, 2)
  • A. E. Gus’kov (1, 3)
  • I. V. Selivanova (2, 3)

  1. Institute of Computational Technologies, Siberian Branch of the Russian Academy of Sciences, Novosibirsk, Russia
  2. Novosibirsk State University, Novosibirsk, Russia
  3. Russian National Public Library for Science and Technology, Siberian Branch of the Russian Academy of Sciences, Novosibirsk, Russia
