Expert, Journal, and Automatic Classification of Full Texts and Annotations of Scientific Articles

Selivanova, I. V.; Kosyakov, D. V.; Dubovitskii, D. A.; Guskov, A. E.

doi:10.3103/S0005105521040075

TEXT PROCESSING AUTOMATION
Published: 02 November 2021

Automatic Documentation and Mathematical Linguistics, Volume 55, pages 178–189 (2021)

Abstract—

In this article we consider a fundamentally new information-theoretic approach to the classification of scientific texts based on compression algorithms. A comparative classification of full-text documents from arXiv.org and short annotations from Scopus showed that the accuracy of the proposed method is 87–92% and is, in general, not inferior to that of existing methods. These conclusions were confirmed by an expert assessment.
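
The abstract does not spell out the procedure, so the following is only a minimal sketch of one common form of compression-based classification, not the authors' implementation. It assumes that a text is assigned to the class whose reference corpus lets a general-purpose compressor encode it most cheaply. The use of zlib, the function names, and the toy corpora are all illustrative assumptions.

```python
import zlib
from typing import Mapping


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed data (compression level 9)."""
    return len(zlib.compress(data, 9))


def classify(text: str, class_corpora: Mapping[str, str]) -> str:
    """Return the label whose corpus yields the smallest extra cost.

    Extra cost = C(corpus + text) - C(corpus), where C is compressed size.
    A small cost means the corpus already contains much of the text's
    vocabulary and phrasing. This rule is an assumption for illustration.
    """
    best_label, best_cost = None, None
    for label, corpus in class_corpora.items():
        base = corpus.encode("utf-8")
        joined = base + b" " + text.encode("utf-8")
        cost = compressed_size(joined) - compressed_size(base)
        if best_cost is None or cost < best_cost:
            best_label, best_cost = label, cost
    return best_label


# Toy reference corpora (hypothetical, far smaller than real training sets).
corpora = {
    "physics": "quantum field theory lattice gauge boson scattering amplitude " * 20,
    "biology": "protein gene expression cell membrane enzyme genome sequence " * 20,
}

print(classify("gauge boson scattering in lattice quantum field theory", corpora))
```

In practice the reference corpora would be built from labeled full texts or annotations, and the choice of compressor affects the result; zlib's 32 KB window, for example, limits how much of a large corpus it can exploit.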

ACKNOWLEDGMENTS

For the expert assessment and comments, the authors are grateful to:

Vyacheslav Nikolaevich Glinskikh, Doctor of Physical and Mathematical Sciences, Corresponding Member of the Russian Academy of Sciences, Head of the Laboratory of Multiscale Geophysics;

Dmitry Vasilyevich Metelkin, Doctor of Geological and Mineralogical Sciences, Associate Professor, Chief Researcher of the Laboratory of Geodynamics and Paleomagnetism, Chief Researcher of the Laboratory of Geodynamics and Paleomagnetism of the Central and Eastern Arctic, Faculty of Geology and Geophysics, Novosibirsk State University;

Nikolay Valerianovich Sennikov, Doctor of Geological and Mineralogical Sciences, Professor, Head of the Laboratory of Paleontology and Stratigraphy of the Paleozoic, Head of the Department of Historical Geology and Paleontology, Faculty of Geology and Geophysics, Novosibirsk State University;

Tatyana Mikhailovna Parfenova, Candidate of Geological and Mineralogical Sciences, Deputy Director for Research, Senior Researcher, Laboratory of Oil and Gas Geochemistry;

Irina Viktorovna Filimonova, Doctor of Economics, Professor and Head, Center for the Economics of Subsoil Use of Oil and Gas, Trofimuk Institute of Petroleum-Gas Geology and Geophysics, Siberian Branch, Russian Academy of Sciences, Head, Department of Political Economy, Faculty of Economics, Novosibirsk State University.

Author information

Authors and Affiliations

State Public Scientific Technological Library, Siberian Branch, Russian Academy of Sciences, Novosibirsk, Russia
I. V. Selivanova, D. V. Kosyakov & A. E. Guskov
Novosibirsk State University, Novosibirsk, Russia
D. A. Dubovitskii

Corresponding authors

Correspondence to I. V. Selivanova, D. V. Kosyakov, D. A. Dubovitskii or A. E. Guskov.

About this article

Cite this article

Selivanova, I.V., Kosyakov, D.V., Dubovitskii, D.A. et al. Expert, Journal, and Automatic Classification of Full Texts and Annotations of Scientific Articles. Autom. Doc. Math. Linguist. 55, 178–189 (2021). https://doi.org/10.3103/S0005105521040075

Received: 11 May 2021
Published: 02 November 2021
Issue Date: July 2021
DOI: https://doi.org/10.3103/S0005105521040075

