Abstract
A method for the automatic classification of scientific texts based on data compression is proposed. The method was implemented and evaluated on data from an archive of scientific texts (arXiv.org) and from the CyberLeninka scientific electronic library (CyberLeninka.ru). In the experiments, the method identified the themes of scientific texts correctly in 75–95% of cases; its accuracy depends on the quality of the original data.
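The abstract does not spell out the procedure, so the following is only a minimal sketch of the general idea behind classification by compression, not the authors' implementation. It assumes that each theme is represented by a sample corpus and that a text belongs to the theme whose corpus makes it cheapest to compress. The compressor (zlib) and the names `compressed_size` and `classify` are illustrative assumptions and do not come from the article.

```python
import zlib
from typing import Dict


def compressed_size(data: bytes) -> int:
    """Return the length in bytes of the zlib-compressed input (level 9)."""
    return len(zlib.compress(data, 9))


def classify(text: str, class_corpora: Dict[str, str]) -> str:
    """Pick the label whose sample corpus best 'explains' the text.

    Assumption (not from the article): the score of a label is the extra
    compressed size caused by appending the text to that label's corpus.
    A smaller increase means the text shares more structure with the corpus.
    """
    if not class_corpora:
        raise ValueError("at least one class corpus is required")

    encoded_text = text.encode("utf-8")
    best_label = None
    best_cost = None
    for label, corpus in class_corpora.items():
        base = corpus.encode("utf-8")
        # Cost of the text given the corpus: C(corpus + text) - C(corpus).
        cost = compressed_size(base + b"\n" + encoded_text) - compressed_size(base)
        if best_cost is None or cost < best_cost:
            best_label, best_cost = label, cost
    return best_label
```

Calling `classify(abstract_text, {"physics": physics_sample, "biology": biology_sample})` returns the label with the smallest compression overhead. Note that zlib only finds repetitions within a 32 KB window, so a sketch like this is meaningful only for small sample corpora; a stronger compressor would be needed for realistic collections.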
Additional information
Original Russian Text © I.V. Selivanova, B.Ya. Ryabko, A.E. Guskov, 2017, published in Nauchno-Tekhnicheskaya Informatsiya, Seriya 2: Informatsionnye Protsessy i Sistemy, 2017, No. 6, pp. 8–15.
About this article
Cite this article
Selivanova, I.V., Ryabko, B.Y. & Guskov, A.E. Classification by compression: Application of information-theory methods for the identification of themes of scientific texts. Autom. Doc. Math. Linguist. 51, 120–126 (2017). https://doi.org/10.3103/S0005105517030116