Introduction

Citations should be classified according to their use within the text, not only counted from the bibliography, as is currently most common (Moravcsik & Murugesan, 1975; Swales, 2004). Citation analysis has been widely used for evaluating research performance (Aksnes et al., 2019; Lukman et al., 2018), rankings (Aksnes et al., 2012; Massucci & Docampo, 2019), studies on scientific developments (Murillo et al., 2021; Pallottino et al., 2018), and plagiarism detection (Gipp et al., 2013), among others. However, these analyses are still primarily based on references in the bibliography. This method has been criticized as biased, subjective, inconsistent, non-standardized, widely misused, and invalid (Anninos, 2014; Belter, 2015; Molas-Gallart & Ràfols, 2018; Wallin, 2005). Understanding citations’ textual contexts helps improve the accuracy of such analyses.

Citations in the text can be examined via their intensity (frequency of citation), location (in the introduction, method, or result section), and textual context (Boyack et al., 2018; Lu et al., 2017; Nazir et al., 2020; Yaniasih & Budi, 2021a; Zhao & Strotmann, 2020). The context can reveal the author’s intent in citing an article, which is often referred to as the function or purpose of the citation. There are many citation function categories, such as “introducing,” “relating to,” “using,” and “comparing with” other literature (Lin, 2018; Teufel et al., 2006). In addition, the author’s opinion of the cited article can be captured through sentiment, i.e., “positive,” “negative,” or “neutral” polarity (Ikram & Afzal, 2019; Yousif et al., 2019a, 2019b). Furthermore, the role of the cited article can be identified, be it “data,” “method,” or “supplemental” (Zhao et al., 2019).

Figure 1 presents an example of a citation in the text and its meanings. The sentence in the figure reads, “Bertin et al. analyses 45.000 articles from PLOS journals. Their research found that the citation distribution in the text varies by journal series…”. This sentence does not indicate the author’s polarity, so the sentiment is recorded as “neutral.” The sentence provides information about the research’s findings, so the role is “result.” The sentence also shows that the author’s purpose in citing the article is “relating” to the cited article.

Fig. 1 Examples of citation context, in-text citation, and citation in a bibliography

In recent years, the citation context has been evaluated utilizing various data, methods, and discussions. Most of the evaluated data were articles from journals written in English and published in developed countries. Most of these journals cover topics related to computer science (computational linguistics, bioinformatics, neural information) (Bakhti et al., 2018b; Cohan et al., 2019; Ikram & Afzal, 2019; Mercier et al., 2018; Rachman et al., 2019; Su et al., 2019; Tuarob et al., 2019; Wang et al., 2019; Yousif et al., 2019a, 2019b; Zhao et al., 2019), health sciences and medicine (Kilicoglu et al., 2019; Xu et al., 2015; Yan et al., 2019), library and information science (Aljuaid et al., 2020; Taskin & Al, 2017), and some natural science topics. Few studies span multiple domains; the majority employ a single one. Regarding approach, citation contexts have been analyzed using manual and rule-based methods (Dehdarirad & Yaghtin, 2022), traditional machine learning (Aljuaid et al., 2020; Amjad & Ihsan, 2020), and deep learning (Muppidi et al., 2020; Zhang et al., 2022). Simultaneous analyses of two meanings have also been performed, such as sentiment and function (Huang et al., 2021; Jha et al., 2017; Jia, 2018; Yousif et al., 2019a, 2019b), as well as function and role (Zhao et al., 2019). To fully comprehend the relationship and significance of a citation, it is necessary to recognize its three meanings together (Moravcsik & Murugesan, 1975). However, no single approach that addresses all three citation meanings has yet been reported.

This paper aims to analyze three citation contexts, i.e., sentiment, role, and function. The goal is to address the following research problems: (1) the pattern of three citation meanings in different scientific domains has not been extensively studied, and (2) there is currently no automatic model that can examine three citation meanings concurrently. The analysis is carried out in five fields of science: food, energy, health, computer, and social sciences. These five fields represent significant, yet substantially different, fields of science. From a technological perspective, this study proposes to perform simultaneous, automatic classification using a deep learning multi-output model and compare it to the existing state-of-the-art model (single-output approach). The multi-output model can provide more efficient and accurate classifications than the separate classification models. The novelty and contribution summary of this paper is presented in Table 1.

Table 1 The novelty and contribution summary

Literature review

Citation context

Citation analysis has been widely discussed and implemented in library and information science, computer science, and quantitative science studies. This analysis examines the number, pattern, and network of citations in published documents. Citation analysis arose from the assumption that citations can provide information about the relationships between articles, the history of idea development, and the discovery of specific research topics (De Bellis, 2010). Typical citation analysis to date counts the citations listed in the bibliography. This traditional method is considered less valid because it measures only the quantity, not the quality, of citations (Shahid et al., 2015).

In-text citation analysis has been recommended as a way to improve citation analysis methods. There are three variables of in-text citation: intensity, location, and sentence context. Early in-text citation research found that perfunctory citations tended to appear in the introduction section, while essential citations appeared in the methodology, results, and discussion sections (Maricic et al., 1998). Another finding showed that citations in the methodology section were more relevant than those in the literature review section (Athar & Teufel, 2012).

The citation context variable analyzes the linguistic meaning of the sentence containing a citation. Moravcsik and Murugesan (1975) initially described the citation context analysis scheme. They asked what a citation means in terms of its connection to and significance for the citing work. The relationship’s meaning can be determined by (1) whether what is cited is conceptual or operational, and (2) whether it is a research base or an alternative (evolutionary or juxtapositional). Furthermore, (3) whether a citation is necessary or only for recognition (organic or perfunctory), and (4) whether it is accepted or rejected (confirmative or negational), determine the citation’s value. This idea has become the main reference point for almost all literature on citation context classification. Point four evolved into citation sentiment analysis. Points two and three lead to an examination of the citation function. Sentiment analysis and citation functions are frequently investigated, discussed, and developed. Point one has become the analysis of citation role, but it has not been examined as extensively as citation sentiment and function.

Citation sentiment

Sentiment analysis identifies and classifies opinions in text or image documents. The subject emerged in the early 2000s and experienced substantial growth after 2009 (Piryani et al., 2017). Product reviews, social media dialogues, news, and blogs are the most frequently evaluated areas. According to Yousif et al. (2019b), citation sentiment analysis of scientific articles was first reported in 2011.

Citation sentiment analysis has emerged and is expanding. There are at least two key reasons why it is essential. The first is to improve bibliometric metrics by accounting for quality rather than quantity, minimizing citation bias, and offering authorship support based on scientific evidence. The second is to detect non-reproducible research, particularly in the biomedical field, where unfavorable attitudes might be an early indicator of research that is not reproducible, thereby saving research time and resources (Xu et al., 2015). However, Catalini et al. (2015) identified that even negative citations have a specific role in the scientific community. In some cases, negative citations can help refine original discoveries and contribute to the overall development of a field.

Since its inception, citation sentiment has been classified both manually and automatically using traditional machine learning. Recent research was conducted by Dehdarirad and Yaghtin (2022), who manually classified citation sentiment in life science and biomedicine citations; sentiment results were compared statistically between male and female authors, revealing a scientific communication pattern. Several studies have demonstrated that the support vector machine (SVM) model outperforms other machine learning methods at classifying citation sentiment. Xu et al. (2015) classified 4182 sentences in clinical trial papers using SVM and obtained an F1 value of 0.71. Mercier et al. (2018) also obtained an F1 value of 0.71 using a multi-classifier combining SVM and a perceptron on 2100 computer science sentences. Aljuaid et al. (2020) likewise used SVM to classify 8736 sentences in the field of information science and obtained a highest F1 of 0.83. Another machine learning model, Naive Bayes, was used to classify a massive number of citation sentiments (762,355 data points); however, the performance of that model was not reported (Catalini et al., 2015).

The preprocessing method and manual feature selection substantially influence the results of classical machine learning models. Furthermore, citation sentiment analysis is challenging because the data are highly imbalanced, with far fewer negative citations than the other two classes (Ravi et al., 2018). These limitations motivate the use of deep learning approaches.

The deep learning model most frequently used for categorizing citation sentiment is the convolutional neural network (CNN). Kilicoglu et al. (2019) examined rule-based, SVM, CNN, and BiLSTM models; the CNN model produced the best results in the health area, with an F1 value of 0.72 on 4182 data points. Yousif et al. (2019a) achieved an F1 value of 0.88 on 5568 data points in computer science using a combination of CNN and BiLSTM. Muppidi et al. (2020) used a combination of CNN, LSTM, and word2vec to perform sentiment classification on 7640 sentences and obtained an F1 value of 0.85. Wang et al. (2019) achieved the best result, 0.93, using CRF and CNN on 3500 computer science data points. Table 2 summarizes existing studies in citation sentiment analysis.

Table 2 Existing citation sentiment literature

Citation function

Citation function analysis is well studied; most work focuses on category schemes and classification models. The function category scheme varies based on data attributes, classification goals, and intended use. Because the classified data were algorithm sentences, Tuarob et al. (2019) chose a function scheme consisting of “utilized” (“use” and “extend”) and “not utilized” (“mention” and “not algorithm”). Cohan et al. (2019) picked three schema classes (“background/information,” “technique comparisons,” and “outcome comparisons”) because they were necessary for exploring subjects, connected to the structure of scientific articles, and easy to implement with machine learning. Bakhti et al. (2018a) introduced a citation scheme with five functions: “useful,” “contrast,” “mathematical,” “accurate,” and “neutral”; this generic categorization was relevant to many scientific disciplines and easily recognized by humans. Rachman et al. (2019) adapted the citation functions “problem,” “other,” “use data,” “use model,” and “use tool” to construct a document-summarizing system. Yaniasih and Budi (2021b) used Indonesian journals to quantify citation value for ranking science using five classes (“background,” “use,” “extend,” “compare,” and “related”). This study adapts that scheme because the data are comparable and the implementation objective is similar.

Automatic citation function categorization makes extensive use of traditional machine learning and deep learning. However, most studies have used a single-output approach, in which a model performs only one classification task. Since multiple citation meanings share the same input data, it is possible to process them simultaneously using a multi-output model. The existing state of the art in citation function classification using both single-output and multi-output models is presented in Table 3.

Table 3 Existing citation function literature

Table 3 compares single- and multi-output models. The best single-output results are an F1 score of 0.78 using Naïve Bayes (Taskin & Al, 2017) and 0.90 using SVM (Tuarob et al., 2019). The multi-output models mostly used deep learning with automatic feature extraction and performed exceptionally well. Cohan et al. (2019) classified citation function and location simultaneously using structural scaffold features, GloVe, and ELMo in a multi-task bidirectional long short-term memory (BiLSTM) network, obtaining an F1 score of 0.84. Su et al. (2019) classified citation function and provenance using a convolutional neural network (CNN), achieving accuracies of 0.69 for function and 0.79 for provenance. Yousif et al. (2019a) reported the best results for joint citation sentiment and purpose classification: a model combining CNN and BiLSTM achieved an F1 value of 0.88 for sentiment and 0.84 for purpose. Zhao et al. (2019) used multi-task learning to classify citation roles and functions; a recurrent neural network (RNN) with a pre-trained BERT model produced an F1 value of 0.78, better than some single-task models.

Citation role

Some studies did not distinguish between citation role and function, combined them, or used the terms interchangeably. Kwan and Chan (2014) stated that the role of a citation is identical to its function. Agarwal et al. (2010) designed a class schema of citation meanings and referred to them as role labels; nevertheless, their label categories encompassed a combination of roles (material/method) and functions (contemporary, contrast, evaluation, explanation, modality, and similarity). Jurgens et al. (2016) used “citation role” as a higher-level term that can be separated into two meanings: centrality and citation function. The centrality of a reference reveals whether it is cited because it plays a vital role or merely provides broader context. This approach resembles the citation role scheme of Bedi et al. (2022), which categorizes citations as baseline or non-baseline: a cited source belongs to the baseline class when it serves as the basis for, or point of comparison in, the study.

Different from the research above, the term citation role in this article refers to the category of citation context that answers whether the meaning of the cited article is conceptual or operational (Moravcsik & Murugesan, 1975). Numerous studies have produced classification schemes based on this principle. Considering the nature of computer journal articles, Guo et al. (2014) modified the idea and divided the operational class into method, dataset, and performance evaluation. They employed this scheme to classify 2156 sentences, yielding an F1 score of 0.53 with a random forest. This study advances the scheme of Zhao et al. (2019), which classified citation roles as data, tool, code, algorithm, document, website, paper, license, and media. These nine fine-grained classes were aggregated into three more general categories: materials (data), techniques (tools, code, algorithms), and supplements (documents, websites, papers, licenses, and media). These categories pertain to writing styles, particularly in computer science and engineering, where identifying tasks, techniques, and materials is crucial when attributing sources (Augenstein et al., 2017; Luan et al., 2017). Zhao et al. (2019) proposed a multi-tasking model called SciResCLF and obtained an F1 score of 0.78. Since the data in this study are not limited to the computing or engineering fields, the citation role scheme comprises “data” (material), “method,” “result,” and “supplement.”

The above review of previous research identified some shortcomings of citation context analysis. The first drawback is that the data used are still limited in number, domain scope, and language. Most of the research used citation data from computer science journals, with a small part from health and medicine, even though fields of science strongly influence citation characteristics (Levitt & Thelwall, 2008, 2009). If scientific evidence comes from only one or two specific domains, a significant gap remains in the development of in-text citation analysis. In addition, almost all data sets were from English-language journals. This study attempts to fill these data gaps: the data used are citations in Indonesian journal articles from five science fields, namely food, energy, health, computer, and social sciences. The second shortcoming of the in-text citation literature is that most existing research performs manual or separate automatic classification of citation contexts; few studies classify even two citation meanings simultaneously. To the best of our knowledge, this paper is the first in library and information science and computer science to analyze three citation meanings, sentiment, role, and function, at once.

Methods

The study consisted of four phases: data collection, manual classification, selection of automatic classification models, and model performance evaluation. Following is a detailed description of the process at each stage. Figure 2 illustrates all phases of the process.

Fig. 2 The sequence of research phases

Collection of datasets

The data analyzed were sentences containing citations in Indonesian scientific journal articles published in 2019. The journals came from five disciplines: food, energy, health, social, and computer science. They were processed using the GROBID parsing tool (Lopez, 2009), which converts PDF documents into lists of sentences ready to be classified. A total of 852 articles were processed, yielding 9173 sentences. The statistics for the dataset are presented in Table 4.
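This parsing step can be reproduced with GROBID’s standard REST service. The following is a minimal sketch, assuming a local GROBID instance on its default port; treating paragraphs that contain bibliographic reference markers as citation contexts is an illustrative simplification (splitting them into individual sentences would be a further step):

```python
# Minimal sketch: send a PDF to a local GROBID service and keep the
# TEI paragraphs that contain bibliographic citation markers.
import requests
from bs4 import BeautifulSoup

GROBID_URL = "http://localhost:8070/api/processFulltextDocument"

def citation_contexts(pdf_path):
    with open(pdf_path, "rb") as f:
        tei = requests.post(GROBID_URL, files={"input": f}).text
    soup = BeautifulSoup(tei, "xml")
    contexts = []
    for p in soup.find_all("p"):
        # <ref type="bibr"> marks an in-text bibliographic citation in TEI
        if p.find("ref", attrs={"type": "bibr"}):
            contexts.append(p.get_text(" ", strip=True))
    return contexts

print(citation_contexts("article.pdf"))
```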

Table 4 Statistics for the dataset

The number of journals and articles analyzed was limited because the data set only included journals in the SINTA 1 and 2 categories. SINTA is a journal indexer that evaluates the quality of Indonesian journals (https://sinta.kemdikbud.go.id/). Despite SINTA’s selection, there were still journals whose writing structure and format did not meet scientific writing standards and thus could not be processed further.

Classifying citation contexts employs smaller data sets than studying citation frequency and location. Citation context datasets require significant preprocessing, such as manual annotation, which may reduce the number of usable instances due to label imbalance or limited processing resources. The majority of previous studies examined fewer than 10,000 sentences. Raza et al. (2020) classified 5161 and 4989 sentences; Ikram and Afzal (2019) classified 8736 and 4182; Kilicoglu et al. (2019) classified 4182. Only Yan et al. (2019) used over 12,000 sentences. Perier-Camby et al. (2019) employed 3000 sentences for function classification. As for the classification of two citation meanings, Zhao et al. (2019) used 2814 sentences for roles and functions, Yousif et al. (2019a) utilized 3568 and 1768 for sentiment and function, Su et al. (2019) classified 1432 and 1492 for source and function, and Cohan et al. (2019) used 1941 and 11,020 for location and function. The preceding literature tables (Tables 2 and 3) present the variation in the amount of data used in citation context analyses. Given this circumstance, although the number of data sets in this study is limited, it is comparable to, and quite substantial relative to, most previous studies.

Manual classification

Citation analyses often involve big data, meaning that manually classifying such numbers of citations is impossible; hence, they must be classified automatically by computer. Small, human-annotated data sets are needed as training data for computer algorithms to perform this automatic categorization.

In this stage, the collected dataset was first classified manually, i.e., class labels were assigned. Three people with similar educational backgrounds carried out the manual classification according to the scientific field. The labels for the sentiment were “positive” when the citation confirms the cited article, “negative” when the citation criticizes or rejects the cited article, and “neutral” when no polarity arises (Yousif et al., 2019b). The role labels consisted of “data,” “method,” “result,” and “supplemental”. The function labels included “introducing,” “relating,” “utilizing,” “explaining,” and “comparing.” The function scheme improved on the previous research scheme (Yaniasih & Budi, 2021b), resulting in more balanced data.

The degree of agreement between the three annotators was measured using Fleiss’ kappa. The value for sentiment was 0.69, for role 0.78, and for function 0.61, indicating substantial agreement between annotators (Landis & Koch, 1977). The data used were those labels approved by at least two annotators, amounting to 8566 sentences. The result of the manual classification, called the actual class labels, is presented in Fig. 3. An algorithm was then trained on the labeled data until it could classify and predict instances correctly, with the aim of accurately classifying large amounts of real data.
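As an illustration, Fleiss’ kappa and the two-annotator agreement filter can be computed as in the sketch below, which assumes integer-coded labels aligned by sentence and uses statsmodels; the function and variable names are illustrative, not from the paper:

```python
# Sketch: inter-annotator agreement (Fleiss' kappa) and majority-vote filtering.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

def agreement(labels_a, labels_b, labels_c):
    """Fleiss' kappa for three annotators labeling the same sentences."""
    ratings = np.array([labels_a, labels_b, labels_c]).T   # (n_sentences, 3 raters)
    counts, _ = aggregate_raters(ratings)                  # (n_sentences, n_categories)
    return fleiss_kappa(counts)

def majority_label(a, b, c):
    """Keep a sentence only if at least two annotators agree, as in the paper."""
    if a == b or a == c:
        return a
    if b == c:
        return b
    return None  # no majority: drop the sentence

# Toy example with integer-coded sentiment (0=neutral, 1=positive, 2=negative)
ann1 = [0, 0, 1, 2, 0]
ann2 = [0, 0, 1, 0, 0]
ann3 = [0, 1, 1, 2, 0]
print("Fleiss kappa:", agreement(ann1, ann2, ann3))
print("Kept labels:", [majority_label(*t) for t in zip(ann1, ann2, ann3)])
```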

Fig. 3 Number per class label of the citation context data set

Selection of automatic classification model

The model proposed for automatic classification involved a convolutional neural network (CNN). CNNs have been used extensively and successfully for processing images, text, and speech (Alom et al., 2019; Khamparia & Singh, 2019; Shrestha & Mahmood, 2019). Several studies have also shown that CNNs can classify citation meanings well (Bakhti et al. 2018b; Kilicoglu et al., 2019).

A CNN consists of two main parts: feature extraction and classification. The feature extraction section consists of convolution and pooling (sub-sampling) layers. The convolution layer extracts information from a specific part of the input (in this study, the input is sentences); each section’s information is then mapped as features. Features are passed on to subsequent convolution layers, and a subsampling layer is utilized to obtain a more compact representation of the features. The feature extraction layer’s output becomes the classification layer’s input. The classification layer is a fully connected network that uses multiple parameters to determine the score for each class. The network is trained using gradient descent and backpropagation, and classification uses a softmax layer in which the class with the highest score is selected for each input (Alom et al., 2019; Khamparia & Singh, 2019; Shrestha & Mahmood, 2019). The fundamental structure of a CNN is depicted in Fig. 4.
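To make the convolution and pooling steps concrete, the following toy sketch (with illustrative sizes, not the paper’s configuration) shows how one filter slides over a sentence’s word embeddings and how max pooling condenses the responses into a single feature:

```python
# Toy sketch of 1D convolution + max pooling over an embedded sentence.
import numpy as np

sentence = np.random.rand(10, 8)   # 10 tokens, 8-dim word embeddings
kernel = np.random.rand(3, 8)      # one filter spanning 3 consecutive tokens

# Convolution: slide the filter over token windows, one response per position
features = np.array([np.sum(sentence[i:i + 3] * kernel)
                     for i in range(10 - 3 + 1)])

# Max pooling: keep the strongest response as this filter's summary feature
pooled = features.max()
print(features.shape, pooled)      # (8,) and a scalar
```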

Fig. 4 CNN basic architecture for sentence classification

The model selection stage consisted of compiling the basic model, optimizing the hyperparameters, and evaluating the optimized model. The basic model consisted of the input, embedding, CNN, max-pooling, flattening, dense, dropout, and output layers. The input was a citation context sentence with three labels: sentiment, role, and function. In the embedding process, each word in the sentence was represented as a numeric vector. Before classification, the word embeddings were convolved and their dimensions reduced.
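A minimal sketch of such a basic model in Keras is shown below, assuming a shared embedding/convolution trunk feeding three softmax heads. The layer sizes and vocabulary are placeholders rather than the paper’s tuned values; only the class counts (3 sentiments, 4 roles, 5 functions) follow the schemes described above:

```python
# Sketch of a shared-trunk, three-head (multi-output) CNN in Keras.
from tensorflow import keras
from tensorflow.keras import layers

def build_multi_output_cnn(vocab_size=20000, max_len=100):
    inputs = keras.Input(shape=(max_len,), dtype="int32")
    x = layers.Embedding(vocab_size, 128)(inputs)      # word embedding
    x = layers.Conv1D(128, 5, activation="relu")(x)    # feature extraction
    x = layers.GlobalMaxPooling1D()(x)                 # pooling
    x = layers.Dense(64, activation="relu")(x)
    x = layers.Dropout(0.5)(x)
    # One softmax head per citation meaning
    sentiment = layers.Dense(3, activation="softmax", name="sentiment")(x)
    role = layers.Dense(4, activation="softmax", name="role")(x)
    function = layers.Dense(5, activation="softmax", name="function")(x)
    model = keras.Model(inputs, [sentiment, role, function])
    model.compile(
        optimizer="adam",
        loss={k: "sparse_categorical_crossentropy"
              for k in ("sentiment", "role", "function")},
        metrics=["accuracy"],
    )
    return model
```

A single-output variant simply keeps one head; the shared trunk is what lets the multi-output model train all three tasks in one pass.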

The basic model had several hyperparameters that needed to be optimized to increase model performance (Wu et al., 2019; Yang & Shami, 2020). The optimized hyperparameters included the embedding size, filters, kernel size, dense units, dropout rate, learning rate, and batch size. Optuna was utilized for the optimization process because it can be used for both single-output and multi-output models, produces good performance outcomes, and provides various supporting features (Akiba et al., 2019). The optimal model was determined by the hyperparameter values that yielded the lowest validation loss.
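The search itself can be expressed as an Optuna objective that minimizes validation loss, as in the sketch below; random toy data stands in for the real sentence matrices, and the search-space values are illustrative rather than the paper’s ranges:

```python
# Sketch of an Optuna search over the hyperparameters listed above.
import numpy as np
import optuna
from tensorflow import keras
from tensorflow.keras import layers

x = np.random.randint(0, 1000, size=(200, 50))   # toy token-id sentences
y = np.random.randint(0, 3, size=(200,))         # toy 3-class labels

def objective(trial):
    model = keras.Sequential([
        layers.Embedding(1000, trial.suggest_categorical("embedding", [64, 128, 256])),
        layers.Conv1D(trial.suggest_categorical("filters", [64, 128]),
                      trial.suggest_int("kernel", 3, 7), activation="relu"),
        layers.GlobalMaxPooling1D(),
        layers.Dense(trial.suggest_categorical("dense_units", [32, 64]),
                     activation="relu"),
        layers.Dropout(trial.suggest_float("dropout", 0.1, 0.5)),
        layers.Dense(3, activation="softmax"),
    ])
    model.compile(
        optimizer=keras.optimizers.Adam(
            trial.suggest_float("learning_rate", 1e-4, 1e-2, log=True)),
        loss="sparse_categorical_crossentropy",
    )
    history = model.fit(x, y, validation_split=0.2,
                        batch_size=trial.suggest_categorical("batch", [16, 32, 64]),
                        epochs=3, verbose=0)
    return min(history.history["val_loss"])     # Optuna minimizes this

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=20)
print(study.best_params)
```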

After optimization was carried out on the basic model, the single and multi-output CNN models were obtained. The hyperparameter values and the best optimization results for the single- and multi-output models are presented in Table 5, and the architectures of these models are shown in Fig. 5.

Table 5 Choices and best hyperparameters
Fig. 5 Hyperparameter-optimized models

Model performance evaluation

The optimized model was then evaluated for its performance and compared with several methods used in previous studies to classify citation sentiment, role, and function. The baseline models used for comparison were Naïve Bayes (NB), Logistic Regression (LR), Random Forest (RF), Support Vector Machine (SVM), Long Short-Term Memory (LSTM), and its bidirectional variant (Bi-LSTM). The training and validation process employed cross-validation. Classification ability was measured using the following metrics: accuracy, precision, recall, and macro F1 score (Lever et al., 2016). The macro mean averages across all classes regardless of class size; on imbalanced data, it shows whether the model can detect minority classes well. The metric formulas are depicted in Fig. 6 and Eqs. 1–4.
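For one baseline, the cross-validated, macro-averaged evaluation could look like the sketch below; the TF-IDF features and LinearSVC settings are illustrative assumptions, not necessarily the paper’s exact setup, and the toy sentences are placeholders for the annotated citation contexts:

```python
# Sketch: 5-fold cross-validation with macro-averaged metrics for an SVM baseline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

sentences = ["this method was used", "results agree with prior work"] * 10
labels = ["method", "result"] * 10

pipeline = make_pipeline(TfidfVectorizer(), LinearSVC())
scores = cross_validate(
    pipeline, sentences, labels, cv=5,
    scoring=("accuracy", "precision_macro", "recall_macro", "f1_macro"),
)
for name in ("accuracy", "precision_macro", "recall_macro", "f1_macro"):
    print(name, scores[f"test_{name}"].mean())
```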

Fig. 6 Formulas for calculating model performance metrics
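For reference, Eqs. 1–4 correspond to the standard definitions below, written with per-class true positives (TP), false positives (FP), and false negatives (FN) over the set of classes C; this restates the textbook formulas (with an assumed conventional ordering) rather than reproducing the figure verbatim:

$$\mathrm{Accuracy}=\frac{\text{correct predictions}}{\text{all predictions}} \quad (1)$$

$$P_c=\frac{TP_c}{TP_c+FP_c} \quad (2)$$

$$R_c=\frac{TP_c}{TP_c+FN_c} \quad (3)$$

$$F1_{\mathrm{macro}}=\frac{1}{|C|}\sum_{c\in C}\frac{2\,P_c R_c}{P_c+R_c} \quad (4)$$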

Results and discussion

Citation meanings in five fields of science

The citation sentiment patterns from the manual classification are almost identical across disciplines. On average, “neutral” has the highest percentage (92.60%), followed by “positive” (4.93%), then “negative” (2.47%). The “neutral” category accounts for 87–94% of citations across all disciplines, while “positive” ranks second with 4–9% of citations, except in computer science, where it ranks third with 1.21%. “Negative” classes contain few citations, with around 1% in food science, 2% in energy science, and 3% in both the health and social sciences. The citations per class are presented in Table 6.

Table 6 Citations per class

These results for the five disciplines are generally consistent with previous studies, where most polarity classifications are “neutral” and the number of “negative” citations is always the smallest (Raza et al., 2020). A percentage of “negative” citations below 10% was also found by Xu et al. (2015) in clinical trial papers, Jia (2018) using biomedical data, Catalini et al. (2015) specifically in an immunology journal, and Huang et al. (2021) using a biological dataset. However, the number of “negative” citations in these Indonesian journals is lower than that found in the computer science field by Jha et al. (2017), where the percentage of “negative” sentiments reached 12%. In addition, a study by Yan et al. (2020) found that 15% of the citations in the biomedical field were negative.

The percentage of “negative” sentiments in scientific articles is low, presumably because researchers avoid showing polarity to prevent direct confrontations with peers. Linguistically, sentences in scientific papers are formal, so it is not easy to find sentences with polarity, unlike in product reviews and social media, where the language is more relaxed and expresses the authors’ feelings (Hernandez-Alvarez et al., 2017; Jia, 2018).

Role assignments across the five disciplines are similar: the class with the highest number of citations is “supplemental”, whereas the lowest number of citations is “data.” The “supplemental” class contains 63–73% of the citations, while the “data” class contains a mere 1% across all five disciplines. Somewhat balanced percentages are seen for the “result” and “method” classes. The food, computer, and energy sciences have more “method” citations at 17.55%, 25.47%, and 25.86%, respectively, whereas the health and social sciences have more “result” citations at 24.36% and 15.66%, respectively. These findings differ from the research conducted by Zhao et al. (2019), with citations in the computer and health sciences in the “data” class at 31%, while those in the “supplement” class were at 30%.

A role can be related to citation location. For example, citations in the methods section usually cite “method” or “data,” but citations in the results chapter cite the “results” of articles for comparison. Previous research that analyzed food journals showed that the percentage of citations in the methods section was around 9–16%, whereas the results and discussion section contained 43–54% of the citations (Yaniasih & Budi, 2021a). Manual classification of this research assigned approximately 17% of the citations to the “method” class and 15% to the “result” class. The “method” percentage is not much different from that in the previous research, but the “result” percentage is much lower. This difference is probably due to the authors using supplements to explain material in the results and discussion section. The low number of citations in the “data” class shows that citing data is still rarely done in various fields of science, as also reported by Liu (2015). However, with the increasing amount of data available in the digital era, citing data has become very important (Silvello, 2018).

The pattern for citation function is the same for the highest and second-highest percentages in the health, social, computer, and energy disciplines. The highest percentages, for “introducing,” are recorded in these four disciplines at 45.50%, 41.09%, 46.41%, and 46.35%, respectively. The second-highest percentages are assigned to “relating,” in the range of 27–40%. The third-largest class in the health and social sciences is “explaining” at 12–14%. As for energy and computer science, the “utilizing” class holds the third position at 6–10%. “Utilizing” occupies the fourth position in the social sciences (9.00%) and the fifth (lowest) position in the health sciences (4.28%). The “comparing” class occupies the lowest position in the social, energy, computer, and food sciences. In food science, the order is “relating” (40.81%), “introducing” (32.59%), “utilizing” (10.68%), and “explaining” (9.52%).

The citation function pattern in the five disciplines shows that citations serve more often to “introduce” and “relate to” the cited literature. Based on the typology of citation quality (Moravcsik & Murugesan, 1975), the dominance of “introducing” reveals that many citations are perfunctory and not strictly needed in the research process of the citing articles. This pattern is also found in several studies where the number of perfunctory citations was quite large (Jurgens et al., 2018; Shu et al., 2019). The “relating” function is more frequent than “utilizing,” indicating that the articles’ connection is more conceptual or theoretical than operational. This finding is reinforced by the count of “comparing,” which is lower than that of “explaining”: “comparing” serves to compare the values of results, while “explaining” serves to discuss results using concepts or theories. Other research has counted functions other than “introducing” as essential citations (Lin, 2018); consequently, the total percentage of essential citations was higher than that of perfunctory citations. However, functions other than “introducing” should not be given equal value because they have different levels of importance.

Model performance evaluation

Previously, traditional machine learning and single-output deep learning models were frequently used in citation context research. In contrast, this paper proposes a novel multi-output approach for classifying three citation meanings. The findings of this study indicate that the multi-output model employing a CNN architecture performs better than the classic models. Table 7 compares the proposed model’s performance to the existing state-of-the-art models used as baselines.

Table 7 Performance comparison between classic models and the proposed model

All models achieved accuracies between 0.90 and 0.97 when classifying sentiment data. The accuracy value describes the overall classification correctness of the model. Precision and recall are essential because the sentiment data are imbalanced between classes: precision describes how many of the predicted instances of a class are correct, while recall describes how many of the actual instances the model retrieves. Precision and recall for the NB, LSTM (single- and multi-output), and BiLSTM (single-output) models were poor (< 0.60), indicating these models might not classify reliably. The LR, SVM, and multi-output Bi-LSTM models had good precision but low recall (< 0.60), meaning their predictions were precise but they missed many true instances. Single- and multi-output CNNs were both accurate and reliable, with F1 values of 0.85 and 0.80, respectively; the single-output model obtained higher recall and F1, while the multi-output model was more precise.

For role classification, the single- and multi-output CNN models had the highest accuracy, precision, recall, and F1. The F1 values for the multi-output and single-output CNN models were 0.84 and 0.81, respectively. Unlike in sentiment classification, all models’ evaluation measures were fairly good (> 0.60). The classic machine learning models performed well, particularly LR, SVM, and RF, achieving accuracy, precision, recall, and F1 values of 0.83. The single-output Bi-LSTM model also worked well, with the highest accuracy and precision values; however, because its recall was low, its F1 value was lower than that of the CNN models.

Deep learning models did better than classical machine learning at function categorization. All traditional machine learning models scored F1 below 0.60. All deep learning models had an F1 value over 0.80. The proposed model, multi-output CNN, got the highest F1 score of 0.88.

The multi-output CNN model is superior for role and function classification, while the single-output CNN model performs best for sentiment classification. Multi-output models are increasingly being used because there are many instances in which a single input must serve several tasks simultaneously (Xu et al., 2020). One of the goals of any multi-output model is efficiency. The experimental results show that the multi-output model is more efficient in terms of training time, requiring about 10% of the single-output model’s time for completion.

A more in-depth analysis was conducted on the multi-output model, which had the best performance. The investigation centered on classification performance per class. Because the categories were unbalanced, attention was given to the model’s ability to classify minority classes. For example, “positive” citations should receive a higher weight than “neutral” citations, whereas “negative” citations should be given lower weight than “neutral” citations in citation analyses (Abu-jbara et al., 2013; Kazi & Patwardhan, 2016). For role, “data” and “method” citations should receive greater weight than “supplemental” citations. The multi-output model successfully classified all classes well, including the minority classes (> 50%). Categories with large amounts of data, such as the “neutral” and “supplemental” categories, and all categories in the function classification, obtained F1 scores above 80%. The smallest classes received the lowest recall and F1 scores: “negative” and “data” both obtained an F1 score of 0.66. Figure 7 shows the evaluation metric values per class.

Fig. 7 The evaluation metrics for each class

In this study, three citation meanings were analyzed using manual classification, followed by the construction of a multi-output model for automatic classification. The findings detail citation patterns in five academic fields and propose a deep learning model that performs better than the classic models. However, this study still has some shortcomings. First, the scope and quantity of data are restricted to Indonesian journals. This coverage has disadvantages because citation patterns are influenced by culture, scientific fields, and other factors; still, it also has benefits because, until now, data from international journals in affluent nations have dominated citation research, and this study’s results can enrich the citation portrait of non-developed countries. Second, the role class categories were adopted from Zhao et al. (2019) without a preliminary study of their appropriateness for the Indonesian journal writing style; the proper scheme for Indonesian journals may differ from the one discussed. The annotators’ agreement was only moderate because many sentences did not fit the existing class structure. Broadening the data’s scope and revising the role category schema could address these problems. Meanwhile, the procedures and models that have been developed can be employed again because the outcomes have been successful.

Conclusion

The analysis of sentences containing citations can identify the author’s purpose in citing articles, the author’s opinion of the cited articles, and the roles of the articles being cited. To date, the analyses of these three meanings of citations have been carried out separately. It is essential that a simultaneous analysis be carried out to improve the quality and efficiency of citation analysis methods.

The manual classification of the sentiment, role, and function of citations provided information on the meanings of the citations in several fields of science. Citation sentiment had the same pattern in the five disciplines analyzed: most of the citations were “neutral,” only a few were “positive,” and very few were “negative.” Role classification followed the same pattern, where most of the citations were “supplemental,” and very few were for “data.” Citation function varied between disciplines, but it can be concluded that most fall under “introducing” and “relating,” while few fall under “utilizing” and “comparing.” The analysis above reveals that it is still rare for authors to show polarity in citing articles, data citation is rare, and authors use citations for introducing and relating more than for comparing and utilizing.

Automatic classification of three meanings can be done using traditional machine learning, single-output and multi-output deep learning models. The evaluation results show that the multi-output model utilizing CNN architecture outperforms the classic models for role and function classification but turns in slightly lower performance for sentiment classification. The capability of the multi-output CNN model is also quite good for minority classes, so it can be concluded that the model has good performance.