Abstract—
This article considers a fundamentally new information-theoretic approach to classifying scientific texts, based on compression algorithms. A comparative classification of full-text documents from arXiv.org and short annotations from Scopus showed that the proposed method achieves an accuracy of 87–92% and is, in general, not inferior to existing methods. These conclusions were confirmed by an expert assessment.
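To illustrate the general idea of classification by compression, here is a minimal sketch. It is not the authors' implementation: the choice of `zlib` as the compressor, the function names `compressed_size` and `classify`, and the toy class corpora are all assumptions made for illustration. A document is assigned to the class whose reference corpus lets the compressor encode it with the fewest additional bytes.

```python
import zlib
from typing import Mapping


def compressed_size(text: str) -> int:
    """Size in bytes of the zlib-compressed UTF-8 encoding of text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def classify(document: str, class_corpora: Mapping[str, str]) -> str:
    """Return the label whose corpus makes the document cheapest to encode.

    The cost of a document given a class is approximated by the growth in
    compressed size when the document is appended to that class's corpus.
    """
    def extra_bytes(corpus: str) -> int:
        return compressed_size(corpus + " " + document) - compressed_size(corpus)

    return min(class_corpora, key=lambda label: extra_bytes(class_corpora[label]))


# Hypothetical toy corpora; real reference sets would be far larger.
corpora = {
    "physics": "quantum field theory particle collider energy photon lattice gauge " * 20,
    "biology": "protein cell genome enzyme bacteria membrane species evolution " * 20,
}

label = classify("quantum field theory and photon lattice gauge energy", corpora)
print(label)
```

The sketch relies on the compressor reusing phrases it has already seen in the corpus, so a document that shares vocabulary with a class adds few bytes to that class's compressed size. Stronger compressors and larger reference corpora would be expected to sharpen the signal.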
ACKNOWLEDGMENTS
For the expert assessment and comments, the authors are grateful to:
Vyacheslav Nikolaevich Glinskikh, Doctor of Physical and Mathematical Sciences, Corresponding Member of the Russian Academy of Sciences, Head of the Laboratory of Multiscale Geophysics;
Dmitry Vasilyevich Metelkin, Doctor of Geological and Mineralogical Sciences, Associate Professor, Chief Researcher of the Laboratory of Geodynamics and Paleomagnetism, Chief Researcher of the Laboratory of Geodynamics and Paleomagnetism of the Central and Eastern Arctic, Faculty of Geology and Geophysics, Novosibirsk State University;
Nikolay Valerianovich Sennikov, Doctor of Geological and Mineralogical Sciences, Professor, Head of the Laboratory of Paleontology and Stratigraphy of the Paleozoic, Head of the Department of Historical Geology and Paleontology, Faculty of Geology and Geophysics, Novosibirsk State University;
Tatyana Mikhailovna Parfenova, Candidate of Geological and Mineralogical Sciences, Deputy Director for Research, Senior Researcher, Laboratory of Oil and Gas Geochemistry;
Irina Viktorovna Filimonova, Doctor of Economics, Professor and Head, Center for the Economics of Subsoil Use of Oil and Gas, Trofimuk Institute of Petroleum-Gas Geology and Geophysics, Siberian Branch, Russian Academy of Sciences, Head, Department of Political Economy, Faculty of Economics, Novosibirsk State University.
About this article
Cite this article
Selivanova, I.V., Kosyakov, D.V., Dubovitskii, D.A. et al. Expert, Journal, and Automatic Classification of Full Texts and Annotations of Scientific Articles. Autom. Doc. Math. Linguist. 55, 178–189 (2021). https://doi.org/10.3103/S0005105521040075
DOI: https://doi.org/10.3103/S0005105521040075