Classification by compression: Application of information-theory methods for the identification of themes of scientific texts

Selivanova, I. V.; Ryabko, B. Ya.; Guskov, A. E.

doi:10.3103/S0005105517030116

Information Analysis
Published: 19 August 2017

Automatic Documentation and Mathematical Linguistics, volume 51, pages 120–126 (2017)

I. V. Selivanova^1, B. Ya. Ryabko^2,3 & A. E. Guskov^2,3

Abstract

A method for the automatic classification of scientific texts based on data compression is proposed. The method was implemented and evaluated on data from the arXiv.org archive of scientific texts and from the CyberLeninka scientific electronic library (CyberLeninka.ru). In the experiments, the method correctly identified the theme of a scientific text with a probability of 75–95%; its accuracy depends on the quality of the source data.
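The abstract does not specify the compressor or the decision rule, so the following is only a minimal sketch of one common compression-based scheme, not the authors' implementation. It assumes that a document is assigned to the theme whose training corpus it extends most cheaply, measured as the growth in compressed size when the document is appended to that corpus. The choice of zlib, the function names, and the toy corpora are all illustrative assumptions.

```python
import zlib
from typing import Dict


def compressed_size(text: str) -> int:
    """Size in bytes of the zlib-compressed UTF-8 encoding of text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def extra_bytes(corpus: str, document: str) -> int:
    """Growth of the compressed corpus when the document is appended to it."""
    return compressed_size(corpus + " " + document) - compressed_size(corpus)


def classify(document: str, corpora: Dict[str, str]) -> str:
    """Return the label whose corpus compresses the document most cheaply."""
    return min(corpora, key=lambda label: extra_bytes(corpora[label], document))


# Toy training corpora (hypothetical data, far smaller than a real archive).
corpora = {
    "physics": "quantum field theory describes particles and their "
               "interactions in spacetime. " * 3,
    "biology": "gene expression and protein folding regulate living "
               "cells and tissues. " * 3,
}

if __name__ == "__main__":
    print(classify("quantum field theory describes particles", corpora))
    print(classify("protein folding regulate living cells", corpora))
```

The intuition is that a compressor that has already seen text on a given theme encodes further text on that theme with fewer bytes. In practice the corpora would be far larger and the compressor would likely be stronger than zlib, whose 32 KB window limits how much of a corpus it can exploit.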

Author information

Authors and Affiliations

The State Public Scientific Technological Library, Siberian Branch, Russian Academy of Sciences, Novosibirsk, 123298, Russia
I. V. Selivanova
Novosibirsk State University, Novosibirsk, 630090, Russia
B. Ya. Ryabko & A. E. Guskov
Institute of Computational Technologies, Siberian Branch, Russian Academy of Sciences, Novosibirsk, 630090, Russia
B. Ya. Ryabko & A. E. Guskov

Corresponding author

Correspondence to I. V. Selivanova.

Additional information

Original Russian Text © I.V. Selivanova, B.Ya. Ryabko, A.E. Guskov, 2017, published in Nauchno-Tekhnicheskaya Informatsiya, Seriya 2: Informatsionnye Protsessy i Sistemy, 2017, No. 6, pp. 8–15.

About this article

Cite this article

Selivanova, I.V., Ryabko, B.Y. & Guskov, A.E. Classification by compression: Application of information-theory methods for the identification of themes of scientific texts. Autom. Doc. Math. Linguist. 51, 120–126 (2017). https://doi.org/10.3103/S0005105517030116

Received: 03 February 2017
Published: 19 August 2017
Issue Date: June 2017
DOI: https://doi.org/10.3103/S0005105517030116
