Classification by compression: Application of information-theory methods for the identification of themes of scientific texts

  • Information Analysis
Automatic Documentation and Mathematical Linguistics

Abstract

A method for the automatic classification of scientific texts based on data compression is proposed. The method was implemented and evaluated on data from the arXiv.org archive of scientific texts and from the CyberLeninka scientific electronic library (CyberLeninka.ru). In the experiments, the method correctly identified the themes of scientific texts in 75–95% of cases; its accuracy depends on the quality of the source data.
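
The abstract does not state which compressor or decision rule the authors used, so the following is only a minimal sketch of the general idea of classification by compression, not the authors' implementation. It assumes the standard `zlib` compressor and a simple rule: a text is assigned to the theme whose reference corpus grows least in compressed size when the text is appended to it. The names `compressed_size`, `classify`, and `corpora` are hypothetical and introduced here for illustration.

```python
import zlib
from typing import Mapping


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data (maximum compression level)."""
    return len(zlib.compress(data, 9))


def classify(text: str, corpora: Mapping[str, str]) -> str:
    """Return the label whose reference corpus encodes `text` most cheaply.

    Assumption: the extra compressed bytes needed for corpus + text, compared
    with the corpus alone, approximate how well the text fits that theme.
    """
    sample = text.encode("utf-8")

    def extra_cost(label: str) -> int:
        corpus = corpora[label].encode("utf-8")
        return compressed_size(corpus + b"\n" + sample) - compressed_size(corpus)

    # The theme with the smallest increase in compressed size wins.
    return min(corpora, key=extra_cost)
```

With `corpora` mapping theme names to concatenated sample texts, `classify(new_text, corpora)` returns the best-matching theme name. Note that `zlib` (DEFLATE) only looks back 32 KiB, so a sketch like this would need a compressor with a longer context for reference corpora of realistic size.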

Author information

Corresponding author

Correspondence to I. V. Selivanova.

Additional information

Original Russian Text © I.V. Selivanova, B.Ya. Ryabko, A.E. Guskov, 2017, published in Nauchno-Tekhnicheskaya Informatsiya, Seriya 2: Informatsionnye Protsessy i Sistemy, 2017, No. 6, pp. 8–15.

About this article

Cite this article

Selivanova, I.V., Ryabko, B.Y. & Guskov, A.E. Classification by compression: Application of information-theory methods for the identification of themes of scientific texts. Autom. Doc. Math. Linguist. 51, 120–126 (2017). https://doi.org/10.3103/S0005105517030116

  • DOI: https://doi.org/10.3103/S0005105517030116
