Expert, Journal, and Automatic Classification of Full Texts and Annotations of Scientific Articles

  • TEXT PROCESSING AUTOMATION
  • Automatic Documentation and Mathematical Linguistics

Abstract

This article considers a fundamentally new information-theoretic approach to classifying scientific texts, based on compression algorithms. A comparative classification of full-text documents from arXiv.org and short annotations from Scopus showed that the proposed method reaches an accuracy of 87–92% and, in general, is not inferior to existing methods. These conclusions were confirmed by an expert assessment.
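The general idea of classification by compression can be illustrated with a minimal sketch. The abstract does not name the compressor or the decision rule, so everything below is an assumption made for illustration: the sketch uses zlib from the Python standard library, and it assigns a text to the class whose reference corpus grows the least in compressed size when the text is appended to it. The function names and the toy corpora are hypothetical and are not taken from the article.

```python
import zlib
from typing import Mapping


def compressed_size(text: str) -> int:
    """Return the number of bytes zlib needs for the UTF-8 encoded text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def classify(query: str, corpora: Mapping[str, str]) -> str:
    """Pick the class whose corpus 'explains' the query best.

    Assumed rule: a query that shares vocabulary with a corpus adds
    fewer bytes when compressed together with that corpus.
    """
    def extra_bytes(label: str) -> int:
        base = corpora[label]
        return compressed_size(base + " " + query) - compressed_size(base)

    return min(corpora, key=extra_bytes)


# Toy reference corpora (hypothetical, for illustration only).
corpora = {
    "physics": (
        "the quantum field theory describes particle interactions and "
        "gauge symmetry in the standard model of particle physics"
    ),
    "biology": (
        "the cell membrane regulates protein transport and "
        "gene expression in the living organism during cell division"
    ),
}

label_a = classify("gauge symmetry in the standard model of particle physics", corpora)
label_b = classify("protein transport and gene expression in the living organism", corpora)
print(label_a, label_b)
```

In practice the reference corpora would be much larger than these toy strings, and the choice of compressor is likely to affect accuracy; the sketch only shows the mechanics of the comparison.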

ACKNOWLEDGMENTS

For the expert assessment and comments, the authors are grateful to:

Vyacheslav Nikolaevich Glinskikh, Doctor of Physical and Mathematical Sciences, Corresponding Member of the Russian Academy of Sciences, Head of the Laboratory of Multiscale Geophysics;

Dmitry Vasilyevich Metelkin, Doctor of Geological and Mineralogical Sciences, Associate Professor, Chief Researcher of the Laboratory of Geodynamics and Paleomagnetism, Chief Researcher of the Laboratory of Geodynamics and Paleomagnetism of the Central and Eastern Arctic, Faculty of Geology and Geophysics, Novosibirsk State University;

Nikolay Valerianovich Sennikov, Doctor of Geological and Mineralogical Sciences, Professor, Head of the Laboratory of Paleontology and Stratigraphy of the Paleozoic, Head of the Department of Historical Geology and Paleontology, Faculty of Geology and Geophysics, Novosibirsk State University;

Tatyana Mikhailovna Parfenova, Candidate of Geological and Mineralogical Sciences, Deputy Director for Research, Senior Researcher, Laboratory of Oil and Gas Geochemistry;

Irina Viktorovna Filimonova, Doctor of Economics, Professor and Head, Center for the Economics of Subsoil Use of Oil and Gas, Trofimuk Institute of Petroleum-Gas Geology and Geophysics, Siberian Branch, Russian Academy of Sciences, Head, Department of Political Economy, Faculty of Economics, Novosibirsk State University.

Author information

Corresponding authors

Correspondence to I. V. Selivanova, D. V. Kosyakov, D. A. Dubovitskii or A. E. Guskov.

About this article

Cite this article

Selivanova, I.V., Kosyakov, D.V., Dubovitskii, D.A. et al. Expert, Journal, and Automatic Classification of Full Texts and Annotations of Scientific Articles. Autom. Doc. Math. Linguist. 55, 178–189 (2021). https://doi.org/10.3103/S0005105521040075

  • DOI: https://doi.org/10.3103/S0005105521040075
