1 Introduction

The ever-increasing size of digital libraries confronts researchers with the problem of information overload (Mahdi et al. 2020). Finding the most relevant research articles remains challenging for researchers, especially when exploring a new research domain. To address this problem, the present work proposes a citation recommendation system that helps researchers find relevant articles in a huge and complex landscape. Recommender systems built on three main models, namely content-based filtering (CBF), collaborative filtering (CF) and graph-based (GB) filtering, have emerged as a solution for finding similar articles (Ma et al. 2020). Collaborative filtering-based approaches employ information from user profiles, such as past interactions, feedback or ratings and friend networks, to make recommendations about papers (Martins et al. 2020). CF assumes that users with a common interest will like similar items. Several variants of CF-based approaches have been proposed to improve the accuracy of the provided recommendations (Wang et al. 2020b). The quality of CF-based approaches is highly dependent on the available user rating or feedback information. Unavailability or partial availability of such information leads to a sparsity problem because of the many missing values in the user-paper matrix (Ali et al. 2020a). Moreover, collaborative filtering hardly addresses recommendations for a new research problem. Similarly, graph-based (GB) models represent articles and citation information as nodes and edges connected to form a network (Wang et al. 2020a). Article recommendations are made through graph traversal or link prediction. Graph-based approaches suffer from an over-weighting problem, where old and outdated articles retain their position in the network while new articles remain unlinked because they have no direct link with the existing nodes. In contrast, CBF approaches exploit the content of the research article to produce recommendations (Habib and Afzal 2019). CBF only performs well when user preferences and article descriptions are provided; otherwise, such techniques are prone to the well-known cold-start problem (Ali et al. 2020a; Martins et al. 2020). The cold-start problem means that the system cannot draw any conclusion for articles about which it has not yet gathered enough information.

The present work addresses the cold-start problem by computing content-based similarity among articles even when user preferences and article descriptions are not provided. Traditional coarse-grained similarity computation does not consider the multiple facets, or more precisely the semantic facets, that reflect the actual similarity among documents. The current proposal follows the argument made by Bär et al. (2011) that items are similar if they relate on a given facet of similarity. In the case of research articles, these facets are the multiple facets of research, e.g., goal, methodology, findings, results and conclusion. Linguistically, these facets are called rhetoric zones. The present work provides recommendations tailored to the rhetoric zones. For example, a recommendation can be made for articles with a similar problem statement but different methods, or a similar methodology but different findings.

The proposal presented here utilizes a deep learning method to classify the rhetoric zones of research articles and computes zone-wise similarities to obtain a ranked list of relevant citations (shown in Fig. 1). Formally, a query article (\({D}_{q}\)) and a set of articles (\(D\)) containing both relevant and irrelevant documents with respect to \({D}_{q}\) are provided. All these articles, each containing a set of sentences (\({S}_{1},{S}_{2},\dots\)), are transformed into their rhetoric zone representations (\({D}_{i}^{RZ}\) for \(i=1,\dots,n\), and \({D}_{q}^{RZ}\)). This transformation is carried out by the proposed deep learning model. Finally, the goal is to retrieve a ranked list of articles based on the similarity scores between the query and candidate articles.

Fig. 1: Overview of the rhetoric zone classification and similarity

Given \(D=\{{D}_{1},{D}_{2},{D}_{3},\dots ,{D}_{n}\}\) and \({D}_{q}\):

$${D}_{1}^{RZ}=\{{\left({S}_{1},{S}_{2},\dots ,{S}_{w}\right)}^{BACK},{\left({S}_{1},{S}_{2},\dots ,{S}_{x}\right)}^{MOTIV},{\left({S}_{1},{S}_{2},\dots ,{S}_{y}\right)}^{PROB},{\left({S}_{1},{S}_{2},\dots ,{S}_{z}\right)}^{GOAL},\dots \}$$
$${D}_{2}^{RZ}=\{{\left({S}_{1},{S}_{2},\dots ,{S}_{w}\right)}^{BACK},{\left({S}_{1},{S}_{2},\dots ,{S}_{x}\right)}^{MOTIV},{\left({S}_{1},{S}_{2},\dots ,{S}_{y}\right)}^{PROB},{\left({S}_{1},{S}_{2},\dots ,{S}_{z}\right)}^{GOAL},\dots \}$$
$$\vdots$$
$${D}_{n}^{RZ}=\{{\left({S}_{1},{S}_{2},\dots ,{S}_{w}\right)}^{BACK},{\left({S}_{1},{S}_{2},\dots ,{S}_{x}\right)}^{MOTIV},{\left({S}_{1},{S}_{2},\dots ,{S}_{y}\right)}^{PROB},{\left({S}_{1},{S}_{2},\dots ,{S}_{z}\right)}^{GOAL},\dots \}$$
$${D}_{q}^{RZ}=\{{\left({S}_{1},{S}_{2},\dots ,{S}_{w'}\right)}^{BACK},{\left({S}_{1},{S}_{2},\dots ,{S}_{x'}\right)}^{MOTIV},{\left({S}_{1},{S}_{2},\dots ,{S}_{y'}\right)}^{PROB},{\left({S}_{1},{S}_{2},\dots ,{S}_{z'}\right)}^{GOAL},\dots \}$$
$$\mathit{ListOfArticles} \leftarrow \mathit{Rank}\left( \mathit{Similarity}\left({D}_{q}^{RZ},\left({D}_{1}^{RZ},{D}_{2}^{RZ},\dots ,{D}_{n}^{RZ}\right)\right) \right)$$
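This ranking step can be sketched in a few lines of Python. The fragment below is illustrative only: the zone labels are abbreviations of the ten zones listed later in this section, and `zone_similarity` is a hypothetical stand-in for the Sent2vec-based similarity described in Sect. 3.4.

```python
from typing import Dict, List

# Abbreviated labels for the ten rhetoric zones (an assumed naming scheme)
ZONES = ["BACK", "MOTIV", "GOAL", "PROB", "HYPO",
         "METH", "MODEL", "EXP", "RES", "CONC"]

def rank_articles(query_zones: Dict[str, List[str]],
                  corpus_zones: Dict[str, Dict[str, List[str]]],
                  zone_similarity) -> List[str]:
    """Rank candidate articles by average zone-wise similarity to the query.

    query_zones maps a zone label to the query sentences classified under it;
    corpus_zones maps an article id to the same structure. zone_similarity is
    assumed to return a score in [0, 1] for two sentence groups of one zone.
    """
    scores = {}
    for doc_id, doc_zones in corpus_zones.items():
        shared = [z for z in ZONES if query_zones.get(z) and doc_zones.get(z)]
        if not shared:
            scores[doc_id] = 0.0
            continue
        scores[doc_id] = sum(
            zone_similarity(query_zones[z], doc_zones[z]) for z in shared
        ) / len(shared)
    # descending similarity gives the ranked recommendation list
    return sorted(scores, key=scores.get, reverse=True)
```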

Recently, deep learning methods for research paper recommendation have shown significant improvements due to their ability to capture the contextual information and semantic representations of the facets of research articles (Bai et al. 2019; Bansal et al. 2016; Zeng and Acuna 2020). However, very few researchers using deep learning have addressed the cold-start problem, and nearly all the research conducted has focused only on personalized recommendations (Ali et al. 2020a). Personalized recommendation models make recommendations using the user's profile information and history. In contrast, a non-personalized recommendation model generates uniform recommendations for all users, containing relevant and top-rated articles. The proposed technique provides a solution for the cold-start problem and generates non-personalized recommendations by computing rhetoric zone similarity for both article-to-article and article-to-user-query comparisons. A research article contains several rhetoric zones with specific characteristics (Asadi et al. 2019; Badie et al. 2018). These rhetoric zones can be classified as background, motivation, goal, problem, hypothesis, method, model, experiment, results/findings and conclusion (Liakata and Soldatova 2009; Liakata et al. 2009). The proposed rhetoric zone classification method retrieves small chunks of text from published articles for the above rhetoric zones and computes zone-wise similarity. In addition to rhetoric zone similarity, traditional metadata comparison is incorporated to evaluate its effectiveness. The deep learning model is trained and tested on the CORE (Knoth et al. 2017) and ART (Liakata and Soldatova 2009) datasets. Tenfold cross-validation of the trained model resulted in an accuracy of 76.3%. The recommendations made by the proposed technique are evaluated against 2543 articles using average precision and normalized discounted cumulative gain (nDCG) measures involving ten domain experts.

The present work makes a noteworthy contribution to the cold-start problem in citation recommendation by classifying rhetoric zones using a deep learning model and computing similarity among the rhetoric zones. Moreover, the present work performs an extensive evaluation of the effectiveness of the proposed model using a combination of two real-world datasets. The paper is organized as follows: Sect. 2 contains a literature review focusing specifically on deep learning-based research article recommendation techniques. In Sect. 3, the proposed system's methodological details are provided, explaining the process of rhetoric zone classification and similarity computation. Section 4 presents the experimental results. Finally, Sect. 5 concludes the present work with future directions.

2 Related work

Traditional citation recommendation approaches are based on co-citation, bibliographic coupling, metadata analysis, content-based filtering, collaborative filtering and graph-based filtering (Habib and Afzal 2019). Co-citation, direct citation and bibliographic coupling approaches recommend research articles based on citation analysis, mainly the relationship information among articles. A research article is considered relevant to another article if a citation link is present between them. The citation link can be a direct link from one article to another, or it can pass through some intermediate article. The problem with co-citation and bibliographic coupling is that the recommendation works only on the explicit information provided as citation links. Articles that are relevant but not cited directly or indirectly will not appear in citation-based recommendation techniques. The accuracy of citation-based techniques was enhanced by combining the content of the cited articles with the citation information. However, content-based filtering (CBF) methods have their own challenges (Ma et al. 2020).

Content-based filtering (CBF) approaches exploit the content of the research article to produce recommendations. CBF approaches utilize the article's title, keywords, abstract, venue and author information, and in some cases, the whole article itself. However, using the complete article reduces recommendation accuracy, as the article contains a substantial number of wider or general context statements. Suppose the problem addressed by an article is the same as that of the query paper, but the article's methodology is entirely different; considering the content of the complete article then results in a very weak similarity between the article and the query paper. The reason is that a large portion of the article's content describes its methodology, which overlaps little with the content of the query paper. On the other hand, keywords were the first choice of CBF approaches, where keywords extracted from the articles were matched to compute similarity. Several approaches enhance extracted keywords by augmenting them using dictionaries or ontologies (Chughtai et al. 2020). Keyword-based search is a straightforward technique but has several limitations: keywords do not reflect the whole article or the user's needs, keywords can be ambiguous, and vocabularies may be mismatched.

CBF only performs well when user preferences along with article descriptions are provided; otherwise, such techniques are prone to the well-known cold-start problem. The cold-start problem refers to the issue that the system cannot draw any conclusion for articles about which it has not yet gathered enough information. This problem is especially pronounced for newly published articles, as they are not cited by many papers and their collaborative ratings are also unavailable (Abro et al. 2020; Christoforidis et al. 2018).

Several researchers have recently employed deep learning models for citation recommendation. These deep learning citation recommendation approaches have used the papers' content, profile information, keywords and venue information to train deep learning models, which later make recommendations (Ambalavanan and Devarakonda 2020; Jeong et al. 2020; Kumar et al. 2021). Deep learning-based approaches have shown better results than matrix-based and graph-based citation recommendation techniques. However, very few have addressed the cold-start problem, and most consider only the global recommendation context. Global and local are the two types of context-aware citation recommendation (Jeong et al. 2020; Wang et al. 2020a). Global context-aware citation recommendation techniques consider the title and abstract of the query paper and the candidate citation paper to derive recommendations. In the local context, the text near a citation reference is considered for providing recommendations.

A recommender system named HRM (Li et al. 2019) was proposed, which sends out newsletters containing citation recommendations to subscribed users. Citation recommendations are generated based on the user's browsing history (previous search queries and interactions) on the article search engine. The newsletter items are ranked citation recommendations. This approach faces the cold-start problem for new users who have no browsing history or have just subscribed. The HRM system makes recommendations based on entity (author, article, venue) similarity in an embedding space. For new users, a usability approach that records user interaction by monitoring clicks is used to make recommendations. HRM combines entity information with user behavior to generate the newsletter item list.

Another deep learning approach was presented to overcome the cold-start problem in a collaborative filtering scenario (Bansal et al. 2016). This technique used gated recurrent units (GRUs) to train text sequences for collaborative filtering tasks. However, collaborative filtering approaches are prone to the sparsity problem, where data about user interaction is unavailable. Bansal et al. combined article metadata with collaborative information (graph structure) to generate first recommendations for a user. A graph representation built from author and article profiles in heterogeneous information networks was used for citation recommendation by Ma and Wang (2019) in a system named HGRec. HGRec initializes the node vectors using word embeddings of the text extracted from the candidate articles. Later, the graph representation is updated by joining it with node embeddings using a meta-path based proximity measure. Like HRM and Bansal et al. (2016), HGRec uses embeddings for similarity computation. The authors of HIPRec (Xiao Ma et al. 2019), a citation recommendation system, argued that previous techniques computed similarity from a bipartite network of query and candidate articles. HIPRec therefore includes other network information, such as venue, researchers, topic and research domains, to form a meta-graph that increases the accuracy of the recommendations. A greedy approach is employed to extract sub-graphs for final recommendations. HIPRec is implemented using the DBLP dataset, which is mainly a citation graph rather than full-text articles.

In conclusion, nearly all the studies addressing the cold-start problem are based on collaborative filtering, especially those using deep learning techniques. Studies using deep learning have utilized embeddings generated from deep learning models to compute similarities among the modelled information, whether items, authors, venues or topics. Traditional multi-layered perceptron (MLP), support vector machine (SVM) and logistic models (Asadi et al. 2019) were the main choices of previous citation recommendation systems. Auxiliary information such as author information, venue, keywords and social interactions was used in previous citation recommendation systems to overcome the cold-start problem. The same has been reported in recent surveys on deep learning-based citation recommendation (Ali et al. 2020a; Martins et al. 2020). In contrast, the present proposal is based on content-based filtering; for this reason, only deep learning-based hybrid approaches that combine collaborative filtering with, to some extent, the content of the articles to address the cold-start problem are presented here. The present work addresses the gap of solving the cold-start problem through a content-based filtering approach using deep learning models, which is deemed a novelty of the present research.

3 Rhetorical zone classification and similarity

The architecture of the proposed context-aware citation recommendation system is shown in Fig. 2. The system architecture comprises three modules: (i) model training, (ii) model testing, and (iii) similarity computation. The training module takes the dataset in textual format and generates a trained model for classifying rhetoric zones. The testing module uses the trained model to predict the class labels of new articles. Finally, the similarity phase computes the similarity between rhetorically classified articles and generates a ranked list of articles. Performance is evaluated on the accuracy of the trained model and the final ranked list.

Fig. 2: Proposed system architecture

3.1 Dataset

The deep learning models are trained on the ART (Liakata and Soldatova 2009) and CORE (Knoth et al. 2017) datasets, two well-known corpora. The ART corpus consists of 3433 labelled sentences (mean 343, std. dev. 163.91 per class) retrieved from 150 research articles in the physical chemistry and biochemistry domains. These sentences were taken from the abstract and introduction sections of the articles. Every sentence is manually labelled with one of ten rhetoric zones. The ART corpus is a small dataset from a deep learning perspective. At the same time, automated data augmentation for increasing the size of textual datasets is at an early stage of research compared to computer vision. An attempt was made with online and offline data augmentation following consistency regularization, and with the recent AugLy library from Facebook. However, both approaches result in the loss of rhetoric semantics, which is salient to the present research. Moreover, the proposed deep learning model was tested with and without functional regularization such as dropout, but no significant difference was found. For this reason, 24,323 sentences were tokenized and extracted from the introduction sections of the top 500 open-access research articles of the CORE dataset (computer science domain). These sentences were then provided to 60 postgraduate students through an online system for labelling against the same ten rhetoric zones as the ART corpus. The postgraduate students performed the labelling as an ungraded assignment for the research methodology module. Each sentence was provided only once to a group of six students. Students could either assign a class label to a sentence or skip it if they reckoned the sentence belonged to a general category. Labelled sentences were accepted based on the level of agreement among annotators. A total of 18,413 sentences were labelled, among which 13,730 were selected based on a high Cohen's kappa agreement value, i.e., more than 0.8. Combining ART with our manually labelled dataset and applying dataset balancing resulted in 14,689 sentences as the final dataset. For the present work, a simple undersampling technique named the Neighbourhood Cleaning Rule (NCL), which is based on the Edited Nearest Neighbour (ENN) method, is used for balancing the dataset. Undersampling was chosen over oversampling because the variability among class instances is not high (mean 1716.4, std. dev. 172.70), as shown in Fig. 3. Moreover, researchers have reported that simple undersampling outperforms the state-of-the-art Synthetic Minority Over-sampling TEchnique (SMOTE) in many cases, because SMOTE without variable selection biases classifiers towards minority classes (Blagus and Lusa 2013). After balancing the dataset, the mean value of the class instances is 1469, with a standard deviation of 9.01.
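As a rough illustration of this balancing step, the sketch below applies the Neighbourhood Cleaning Rule from the imbalanced-learn library to a vectorized copy of the labelled sentences; the TF-IDF representation and parameter defaults are assumptions made for illustration, not the exact pipeline used in the present work.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from imblearn.under_sampling import NeighbourhoodCleaningRule

def balance_dataset(sentences, labels):
    """Undersample majority classes with NCL (an ENN-based cleaning rule)."""
    # NCL operates in a numeric feature space; TF-IDF is one simple choice
    X = TfidfVectorizer(max_features=5000).fit_transform(sentences)
    ncl = NeighbourhoodCleaningRule()
    ncl.fit_resample(X, labels)
    kept = ncl.sample_indices_  # indices of the sentences surviving the cleaning
    return [sentences[i] for i in kept], [labels[i] for i in kept]
```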

Fig. 3: Dataset class distribution

3.2 Word embedding and feature modelling

Textual data need to be translated into a structure called embeddings (Si et al. 2019) so that deep learning algorithms can process them easily and efficiently. For deep learning, the text's vocabulary is a high-dimensional space that can be modelled into a low-dimensional, learned continuous vector representation called an embedding. In natural language processing (NLP), word embeddings represent words as dense vectors in a low-dimensional space to capture the semantic and syntactic information of the given text (Ali et al. 2020b). Deep learning classifiers can perform mathematical operations on the semantics represented numerically in the word embeddings. Word embeddings support contextual representation; e.g., apple the fruit and Apple the electronics company shall be treated differently based on their separate vectors. Some of the well-known models for word embedding are Word2vec (Mikolov et al. 2013), doc2vec (Han et al. 2018) and, most recently, BERT (Devlin et al. 2019) by Google. The present work utilizes the Word2vec and BERT models for representing the dataset as word embeddings. In addition to word embeddings, the traditional approach of feature extraction using minimum inverse document frequency (Min_IDF) is applied for comparison purposes. Min_IDF collects the most common and important features from the given text. Furthermore, a feature vector of size 2678 was handcrafted by three domain experts using the dataset itself and several other available phrasebooks (the Manchester phrasebank, style-of-writing guides, etc.) that contain general-purpose rhetoric sentences for technical writing support. This manual feature extraction was initiated after an initial analysis of the features extracted by the embedding and Min_IDF methods: both methods mainly formulated unigram features, whereas it is assumed that bi-gram or tri-gram features containing stop words might better reflect the representation of a sentence.

The traditional feature extraction method lacks a representation of the surrounding context of a word, as it merges all possible meanings of the word into a single representation. Word2vec addresses this problem by directly modelling the context of the word in a multidimensional vector representation. This vector representation is the initial task for predictive models in information and semantic retrieval. The Continuous Bag of Words (CBOW) component of Word2vec infers the target word for a given context, while the skip-gram component infers the context for a given word.
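Both components can be illustrated with the gensim library; the toy corpus below is a placeholder, and the 300-dimensional vector size follows Sect. 3.3.

```python
from gensim.models import Word2Vec

# toy placeholder corpus; the real input is the tokenized sentence dataset
corpus = [
    ["citation", "recommendation", "reduces", "information", "overload"],
    ["rhetoric", "zones", "capture", "the", "semantics", "of", "articles"],
]

# sg=0 trains CBOW (predict a word from its context);
# sg=1 trains skip-gram (predict the context from a word)
cbow = Word2Vec(corpus, vector_size=300, window=5, min_count=1, sg=0)
skipgram = Word2Vec(corpus, vector_size=300, window=5, min_count=1, sg=1)

vector = cbow.wv["citation"]  # a 300-dimensional word embedding
```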

BERT embeddings are a recent advancement in modelling the contextual representation of a word or phrase (Ambalavanan and Devarakonda 2020; Jeong et al. 2020). BERT embeddings can model contextual information and dynamically modify a multilayer representation, unlike Word2vec embeddings, which construct a separate vector for each word that remains constant throughout later processing. The process of learning this contextual information for the construction of embeddings is known as pretraining. After pretraining, sentences are formed from the vector representations of their words and fed to classifiers for prediction. BERT models deeper contextual information than its predecessors due to its underlying deep bi-directional transformer technique. BERT employs a self-attention transformer architecture that provides long-distance context comprehension. The fine-tuned BERT language model is integrated into the downstream task to achieve a task-specific architecture.
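A minimal sketch of extracting contextual embeddings with the Hugging Face transformers library is given below. The SciBERT checkpoint name corresponds to the publicly released model discussed in Sect. 3.3; the mean-pooling step is an assumption made here for illustration, not the exact pooling used in the experiments.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")

def embed_sentence(sentence: str) -> torch.Tensor:
    # Token vectors vary with context, unlike static Word2vec vectors
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, tokens, 768)
    return hidden.mean(dim=1).squeeze(0)  # mean-pool into one sentence vector

emb = embed_sentence("The proposed method addresses the cold-start problem.")
```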

3.3 Deep learning classifiers

Long short-term memory (LSTM) is a sequence-based classification method that has shown significant improvements over traditional text classification methods (Jang et al. 2020; Wang et al. 2020a; Zeng and Acuna 2020). Facebook and Microsoft have claimed over 95% accuracy in automatic translation on their billion-scale datasets using LSTMs. LSTM improves the recurrent neural network (RNN) architecture by overcoming the vanishing gradient problem through a gating mechanism. Different gates, such as the input, forget and output gates, decide whether to retain the data from the previous state or lose it during the current state. The LSTM's ability to extract vital information plays an important role in text classification. In recent years, the scope of application of LSTMs has rapidly expanded, and several researchers have revamped the LSTM to gain improved accuracy, for example with the bi-directional long short-term memory (Bi-LSTM).

The Bi-LSTM consists of LSTM units that operate in both directions, keeping track of past and future context information. This is done by combining the outputs of two LSTM layers: one layer processes the sequence forwards, the other backwards. This bi-directional approach captures the dependencies between contexts (Fig. 4). Formally, the rhetorical zone as context \({c}_{i}\) is, for the present case, a combination of words \({w}_{i}\in {{\varvec{R}}}^{{d}_{w}}\) that represents a specific semantics. These words, in the form of embeddings, are the inputs and are assembled into a matrix \({X}_{i}^{c}\in {{\varvec{R}}}^{{d}_{w}\times {N}_{i}}\). The Bi-LSTM applied over the matrix \({X}_{i}^{c}\) is given by the following equations:

$${H}_{t}^{(forward)}={LSTM}^{\left(forward\right)}({W}_{t},{H}_{t-1}^{(forward)})$$
$${H}_{t}^{(backward)}={LSTM}^{\left(backward\right)}({W}_{t},{H}_{t+1}^{(backward)})$$
$${H}_{t}={H}_{t}^{(backward)}\oplus {H}_{t}^{(forward)}$$

where \({H}_{t}^{(forward)}\) and \({H}_{t}^{(backward)}\) represent the hidden states of the forward and backward LSTMs at time \(t\). In the Bi-LSTM, the backward and forward hidden states are concatenated (⊕) together. The LSTM overcomes the short-memory problem through its gated approach, adding a cell state that decides whether retaining a piece of information is useful. The LSTM cell memory consists of an input, a forget and an output gate, mathematically represented as:

$${Input}_{t}=sigmoid({W}_{input}{x}_{t}+{W}_{input}{h}_{t-1}+{Bias}_{input})$$
$${Forget}_{t}=sigmoid({W}_{forget}{x}_{t}+{W}_{forget}{h}_{t-1}+{Bias}_{forget})$$
$${Gate}_{t}=tanh({W}_{gate}{x}_{t}+{W}_{gate}{h}_{t-1}+{Bias}_{gate})$$
$${Output}_{t}=sigmoid({W}_{output}{x}_{t}+{W}_{output}{h}_{t-1}+{Bias}_{output})$$
$${State}_{t}={Forget}_{t}\otimes {State}_{t-1}+{Input}_{t}\otimes {Gate}_{t}$$
$${Hidden}_{t}={Output}_{t}\otimes \mathrm{tanh}\left({State}_{t}\right)$$

where W denotes the weight parameters and \({x}_{t}\) the input at time \(t\). The hidden state at time \(t\) is computed as the element-wise product \(\left( \otimes \right)\) of the output gate and the tangent activation function applied to the LSTM cell state (\({State}_{t}\)). The input, forget, output and cell gates control the information that needs to be retained or passed to the next step. The Bi-LSTM can be combined with an attention technique to make predictions more precise. After the bi-directional LSTM, the input word embeddings are compressed enough to make a prediction using a classifier; however, the correlation between an individual rhetoric zone and the research domain of the article is not clearly visible to the classifier. For this reason, an attention layer with metadata embeddings is concatenated with the extracted features, and a one-dimensional convolutional neural network (1D-CNN) is applied for the final classification task. The attention is given as a maximizing function:

Fig. 4: The Bi-LSTM architecture for rhetoric zone classification

$$\mathrm{log}\,P\left(z \mid {X}^{d}, {M}^{d}\right)= \sum_{i=1}^{k}\mathrm{log}\,P\left({z}_{i} \mid {z}_{\le i},s\right)$$

where \(P\left({z}_{i} \mid {z}_{\le i},s\right)=softmax\left({Vh}_{i}\right)\)

\(P\left({z}_{i} \mid {z}_{\le i},s\right)\) is the conditional probability of the i-th word given all previous words in the rhetoric sentence. \({X}^{d}\) denotes the vector representation of the rhetoric sentences, and k is the number of words in a given rhetoric sentence. The attention is ranked based on the metadata context vector representation \({M}^{d}\), which contains the title, keywords, venue and author information.

In addition to LSTM and Bi-LSTM, the present work has evaluated SciBERT, a large-scale pretrained model based on BERT. SciBERT follows the same multi-layered bidirectional transformer model as BERT; the difference is that SciBERT is pretrained on a corpus of scientific articles. SciBERT is an uncased BERT model trained on a random sample of over 1.4 million scientific articles from the Semantic Scholar dataset. The pretraining carried out for SciBERT was unsupervised, on a multi-domain corpus of scientific articles, to improve performance on NLP tasks such as sentence classification, sequence tagging and dependency parsing. The dataset used by SciBERT closely resembles that of the present work, as it consists of 18% research articles from the computer science domain and 82% from the biomedical domain. SciBERT has a 46% vocabulary overlap with BERT, with a total of 3.17 billion tokens.

The LSTM and Bi-LSTM models are trained using the Adam (Adaptive Moment Estimation) optimization algorithm. The Adam optimizer is based on RMSProp (Root Mean Square Propagation), in which the learning rate is adapted for each parameter; however, the present work used a fixed learning rate of 0.01. The learning rate tunes the parameters of an optimization algorithm and decides the step size for reaching a minimum of the loss function. These models were trained with a softmax activation function, a batch size of 128 and L2 regularization. The embeddings generated by Word2vec and BERT were lowercased unigram vocabulary tokens of length 300. The evaluation metrics for the models were precision, recall and F1-score. Micro averages were used for all evaluation metrics due to the balanced class distribution.
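For concreteness, a simplified Keras version of the Bi-LSTM pipeline with the hyperparameters stated above (Adam with a fixed learning rate of 0.01, a softmax output, L2 regularization and a batch size of 128) is sketched below. The vocabulary size and layer widths are assumptions, and the attention and metadata-embedding layers of this section are omitted for brevity.

```python
import tensorflow as tf

VOCAB_SIZE, EMB_DIM, NUM_ZONES = 30000, 300, 10  # vocabulary size is assumed

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, EMB_DIM),
    tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(128, return_sequences=True)),
    tf.keras.layers.Conv1D(  # the 1D-CNN used for the final classification
        64, 3, activation="relu",
        kernel_regularizer=tf.keras.regularizers.l2(1e-4)),
    tf.keras.layers.GlobalMaxPooling1D(),
    tf.keras.layers.Dense(NUM_ZONES, activation="softmax"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(X_train, y_train, batch_size=128, epochs=10) on tokenized input
```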

Table 1 shows the F1-scores of the LSTM, Bi-LSTM and BERT models on the embeddings and feature sets. These results show model training after ten epochs. Automated feature extraction using minimum inverse document frequency shows the highest results for the background rhetoric zone. The background class has the largest number of vocabulary items overlapping with other classes, as it contains general context sentences; similarly, Min_IDF models the most common vocabulary in the feature set. BERT shows a result similar to Min_IDF for the background zone. Manual features (a total of 2768 features) and Word2vec were not as good as the others, because the background zone has the highest number of features, which were not completely modelled in their case.

Table 1 F-measure scores of the training models (highest values underlined)

The BERT model shows better performance for most classes, i.e., motivation, goal, hypothesis, experiment and results (as shown in Fig. 5). However, manual features with Bi-LSTM demonstrate a 90% result for the problem rhetoric zone. An analysis of the BERT embeddings and the manual features for the problem class reveals that the manual features include bi-gram and tri-gram features such as 'time consuming', 'major barrier', 'remain unstudied' and 'has been neglected' that made the classifier more accurate in predicting the class label. Moreover, varying sizes of BERT embeddings need further evaluation, as the embedding size was limited to 300 tokens in the present work, whereas the manual feature vector for the problem class contains 292 features. A comparison of the BERT and Word2vec embeddings clearly shows that the underlying SciBERT gives BERT a clear advantage over Word2vec. Moreover, a t-test on the F1-scores of BERT (mean 0.633, SD 0.087) compared to Word2vec-LSTM (mean 0.403, SD 0.204) and Word2vec-BiLSTM (mean 0.394, SD 0.202) shows that BERT performed better, i.e., t(10) = 4.558, p = 0.00 and t(10) = 4.578, p = 0.00, respectively. Based on this evaluation, the null hypothesis that the models perform the same is rejected, as there is a significant difference between them. Evaluating all models using a one-way ANOVA also shows a statistically significant difference between models, i.e., F(6,63) = 3.589, p = 0.004. Therefore, the BERT-trained model was selected for classifying the rhetoric zones of research articles during the experiment.

Fig. 5: Comparison of different models (F1-score)

All models and embeddings were implemented on the Google Colab server, with data stored on Google Cloud Storage. The TensorFlow deep learning library was used to implement the LSTM, Bi-LSTM and BERT models. The web interface for user interaction was implemented using the Flask web framework.

3.4 Rhetoric zones similarity

The present work computes the similarity among rhetoric zones after classifying them, to generate the final ranked list. Traditional measures for computing similarity are Jaccard, Dice, Hamming distance and cosine similarity. The problem with these traditional approaches is that they compute similarity based on the mere presence of words, which lacks contextual similarity; a negated sentence is treated as nearly the same as its positive counterpart. More recently, the similarity of rhetoric sentences can be computed directly from embeddings, but in that case the runtime is proportional to the scale of the corpus; i.e., if there are one million sentences or articles in the dataset, then one million pairs need to be classified by the deep learning model. To overcome this problem, the present work follows an efficient approach, generating fixed-size embeddings for every instance of the dataset and for the incoming query. Both embeddings are then classified according to the rhetoric zones, and finally, the similarity between pairs of classified zones is computed.

The present work computes embedding similarity using a recent unsupervised learning technique called Sent2vec (Pagliardini et al. 2018). Sent2vec combines the CBOW model of Word2vec with n-gram tokens and averages the embeddings to summarize them into a single vector in the same latent space. Sent2vec builds on the distributional hypothesis, under which words appearing nearby are considered to have the same context. Formally, Sent2vec learns two embeddings of dimension h for every word in the vocabulary: the source (\(R_w\)) and the target (\(T_w\)). Averaging the source word embeddings (\(R_w\)) of the constituent words forms the sentence embedding. Sent2vec augments the source word embeddings by including the n-grams of each sentence; these n-grams are averaged along with the words. The Sent2vec embedding is modelled by the formula:

$${E}_{S}=\frac{1}{|NG\left(S\right)|}{\sum }_{w\in NG(S)}{R}_{w}$$

NG(S) is a function that generates the list of n-grams appearing in the sentence S. Later, the softmax activation function with negative sampling is applied to predict a missing word. Negative sampling is known to be efficient for predicting a large number of output classes. Sent2vec uses a binary logistic loss function combined with negative sampling to predict the output class. Sent2vec has a low computational overhead for inference and training, as the cost scales only with the sentence length |S|.
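The zone-wise similarity step can then be sketched as below. The Python bindings assumed here are those of the open-source sent2vec implementation accompanying Pagliardini et al. (2018), and the pretrained model file name is a placeholder.

```python
import numpy as np
import sent2vec  # bindings of the epfml/sent2vec implementation (assumed)

model = sent2vec.Sent2vecModel()
model.load_model("sent2vec_model.bin")  # placeholder for a pretrained model

def zone_similarity(query_sentences, candidate_sentences):
    """Cosine similarity between the averaged embeddings of two rhetoric zones."""
    q = model.embed_sentences(query_sentences).mean(axis=0)
    c = model.embed_sentences(candidate_sentences).mean(axis=0)
    denom = np.linalg.norm(q) * np.linalg.norm(c)
    return float(q @ c / denom) if denom else 0.0
```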

4 Experiments and evaluation results

This section presents both the subjective and objective evaluation of the proposed rhetoric zone classification and similarity technique. The subjective evaluation involves domain experts, whilst the objective evaluation is carried out by comparing the proposed technique with other content-based filtering approaches for citation recommendation.

4.1 Expert-based subjective evaluation

A set of related articles was manually compiled for the experiment. Ten senior faculty members from the computer science and biochemistry departments were each requested to provide two research articles related to their research domains. These twenty articles were selected as query papers. For every query paper, each faculty member provided the ten most relevant papers that they had reviewed before and that were published between 2017 and 2019. Using the keywords from the query papers, as well as keywords suggested by the faculty members, the top 20 results from different digital libraries (ScienceDirect, PubMed, Wiley, IEEExplore, CORE and Arxiv) were collected. With this method, 2371 articles with some relevance to the query papers were gathered. Later, the two hundred articles provided by the faculty members were added to the experimental set. A few duplicate articles were removed to form the final set of 2543 articles. Every article was assigned a unique id for identification.

The abstract, introduction section and metadata of the articles were extracted manually and stored as text files. The metadata included the title, venue and keywords of the article. Sentence-wise tokenization was performed to separate every sentence. Pre-processing steps such as special-character removal, citation removal and lower-casing were applied to the sentence tokens. A total of 118,325 sentences were gathered, with an average of 46 sentences per article.

All these sentences were then classified individually using the proposed model. The model predicts a rhetoric zone for every sentence based on its features. Only sentences with a classification probability of more than 0.5 were considered for similarity comparison; any sentence assigned a rhetoric zone label with a probability of less than 0.5 was discarded from further processing. A separate JSON file was created for every rhetoric zone, storing the article id, sentence text and classification probability of each sentence. The same procedure was performed for all twenty query papers, and their JSON files were stored separately. Based on the classification and the selection criteria, a total of 16,975 sentences from related papers and 186 sentences from query papers were classified under the ten rhetoric zones. The distribution of classified sentences is shown in Table 2.
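A sketch of this filtering and storage step is shown below; `predict_proba` stands in for the trained classifier's output and is an assumed interface, and records are written as JSON lines for simplicity.

```python
import json
from collections import defaultdict

THRESHOLD = 0.5  # sentences below this classification probability are discarded

def store_classified_sentences(article_id, sentences, predict_proba):
    """Group confidently classified sentences into one JSON-lines file per zone."""
    per_zone = defaultdict(list)
    for sentence in sentences:
        probs = predict_proba(sentence)  # assumed: dict of zone -> probability
        zone, p = max(probs.items(), key=lambda kv: kv[1])
        if p > THRESHOLD:
            per_zone[zone].append(
                {"article_id": article_id, "sentence": sentence, "probability": p})
    for zone, records in per_zone.items():
        with open(f"{zone}.jsonl", "a") as fh:
            for record in records:
                fh.write(json.dumps(record) + "\n")
```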

Table 2 Rhetoric sentence distribution

After classifying the rhetoric zones, the similarities between the individual rhetoric zones of the query papers and the related papers were computed. Based on the similarity computed using Sent2vec, the top ten articles were selected in order of descending similarity. A mapping between the related papers provided by the faculty members and their corresponding query papers was recorded, through which the evaluation computes the average precision (AP@10) of every query paper for every rhetoric zone. This mapping information is used only for evaluating the results retrieved by the proposed algorithm.

Average precision is an evaluation measure that considers both precision and recall for ranked retrieval results. Furthermore, average precision indicates the positions of the relevant retrieved results in a ranked list by computing the mean of the precision values at each point where a relevant document appears in the list. Usually, average precision is computed over all retrieved documents; however, it can be measured for a given number of results, known as the cut-off rank, giving average precision at k, denoted AP@k. Average precision at k (AP@k) considers only the top k results of the ranked list.

$$\mathrm{AP}@\mathrm{k}=\frac{1}{gPov}{\sum }_{i=1}^{k}\mathrm{P}@i \times \mathrm{relevance}@i$$

The equation shows the formula of average precision at k. The k in the equation refers to the number of retrieved documents that shall be considered for evaluation, gPov is the number of ground-truth positives, P@i is the precision at the i-th item, and relevance@i is a function that returns true if the document at the i-th position is relevant and false otherwise.
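A direct Python translation of this formula might look as follows; limiting gPov to the cut-off rank is a common convention, adopted here so that a perfect top-k list scores 1.

```python
def average_precision_at_k(ranked_ids, relevant_ids, k=10):
    """AP@k: average of P@i at every rank i <= k where a relevant item appears."""
    hits, precision_sum = 0, 0.0
    for i, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:       # relevance@i
            hits += 1
            precision_sum += hits / i    # P@i
    g_pos = min(len(relevant_ids), k)    # ground-truth positives within cut-off
    return precision_sum / g_pos if g_pos else 0.0

# Example: relevant papers appear at ranks 1, 3 and 5 of the ranked list
print(average_precision_at_k(["a", "x", "b", "y", "c"], {"a", "b", "c"}))
```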

Table 3 shows the average precision for the top ten results (AP@10) retrieved by the proposed system. Results were ranked according to rhetoric zone similarity for a given query paper. Every rhetoric zone of every query paper was evaluated, as shown in Table 3; however, the final ranked list shown to the users is ordered according to the average similarity over all rhetoric zones of a query paper. Figure 6 shows the mean average precision (mAP) of the individual rhetoric zones. The highest mAP was achieved by the experiment class, whereas the conclusion class remained the lowest. It has been observed that the features of the background and conclusion classes overlap with each other, which causes misclassifications in several places. The mean average precision provides an indication of a possible weighting scheme that could be assigned to each class for computing similarity.

Table 3 Average precision (AP) of query articles @10
Fig. 6: Mean average precision (mAP) of the rhetoric zones

Average precision is an objective evaluation method in which an automated process measures the performance of the recommendations made by the proposed system. In addition to this objective evaluation, the normalized discounted cumulative gain (nDCG), a subjective evaluation measure, was computed involving ten experts. nDCG evaluates the system based on the graded relevance of the retrieved results. Grading is performed by the experts on a Likert scale. In the present case, all ten faculty members served as experts, and the Likert scale ranged from 1 to 3, with 1 for highly relevant and 3 for low relevance. Every faculty member was provided with a list of the top ten results retrieved by the proposed system for their query paper. A web-based interface was provided to the faculty members for browsing the results and viewing the complete article if required. The equation below shows the formula of nDCG.

$${nDCG}_{K}=\frac{1}{{IDCG}_{K}}\times {\sum }_{i=1}^{K}\frac{{2}^{{rel}_{i}}-1}{{log}_{2} \left(i+1\right)}$$

The nDCG formula is the product of the normalizing factor on the left and the discounted cumulative gain (DCG) on the right. The k represents the rank position used to limit the varying length of the results. DCG penalizes the score if a highly relevant document appears at a lower rank in the result list. The \(rel_{i}\) is the grading value assigned by an expert to the document at the i-th position. The value of nDCG ranges between 0 and 1, with 1 being the ideal ranking. The ideal discounted cumulative gain (\(IDCG_{K}\)) is the maximum possible DCG at the k-th position. Table 4 shows the nDCG@10 results of the proposed rhetoric zone classification and similarity technique.
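The corresponding computation can be sketched in Python as follows; since the Likert scale above assigns 1 to the most relevant grade, the grades are remapped here (an assumption made for illustration) so that a larger gain means higher relevance.

```python
import math

def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k graded results."""
    return sum((2 ** g - 1) / math.log2(i + 1)
               for i, g in enumerate(gains[:k], start=1))

def ndcg_at_k(gains, k=10):
    """nDCG@k: DCG normalized by the ideal (descending-sorted) DCG."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)  # IDCG_k
    return dcg_at_k(gains, k) / ideal if ideal else 0.0

# Expert grades on the 1 (high) .. 3 (low) Likert scale, remapped to gains
grades = [1, 2, 1, 3, 2]
gains = [3 - g for g in grades]   # 1 -> 2, 2 -> 1, 3 -> 0
print(ndcg_at_k(gains, k=10))
```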

Table 4 Results of normalized discounted cumulative gain (nDCG@10)

The nDCG results are better than the AP@10 results for the respective query papers, showing that the articles retrieved by the proposed technique are relevant papers. From the average precision and nDCG results, it can be concluded that the proposed system retrieved, among its top 10 results, related articles that were either provided by the faculty members or available in the dataset. The overall impression of the faculty members who used and evaluated the proposed system was "efficient", as most of them commented that a few of the articles retrieved by the proposed system are highly relevant to the query paper and that they themselves had not found those articles before.

4.2 Comparison with other approaches—objective evaluation

The performance of the proposed rhetoric zone classification and similarity model (RtZone) has been compared with several other citation recommendation models based on the content filtering approach. These models are summarized as follows:

NNRank (Bhagavatula et al. 2018): Neural network ranking is a content-based filtering method that represents query and candidate documents in a vector space model and uses the nearest neighbour technique to rank the relevant results. For the present experiment, the hyperparameters, such as the size of the embeddings (dim = 325) and hidden layers (dimH = 150), the regularization strength (λ1 = 0, λ2 = 1e-6), the number of epochs (epo = 256) and the learning rate (0.001), are kept the same as in the original study.

LSTM-CAV (Wang et al. 2020a): This LSTM-based personalized context-aware citation recommendation model is a hybrid approach using both collaborative and content-based filtering. LSTM-CAV processes author, venue and keyword information along with the content of the article in the form of a distributed vector representation. The LSTM model learns both the article and the citation contexts and then measures the relevance between them. A ranked list is returned based on high relevance scores. LSTM-CAV was evaluated on the AAN dataset using an embedding size of 150 and regularization weights λ1 = 1e−5 and λ2 = 1e−6 with the stochastic optimization method AdaGrad, having a learning rate set to 0.001.

Doc2vec model (Le and Mikolov 2014): The D2V model is an unsupervised learning technique for representing documents as fixed-length dense feature vectors. Compared to the traditional bag-of-words approach, D2V learns the feature vector using a neural network, keeping in view the ordering among words and individual word semantics. Similarity among documents is measured by their representational relevance score. For the present evaluation, the embedding size is kept at 150, consistent with the original Doc2vec experiment.

GRU-MTL (Bansal et al. 2016): A latent vector of text sequences is encoded using gated recurrent units (GRUs) for citation recommendation on the collaborative filtering task. The full text of articles is used here for evaluation, with an embedding size of 200 and hidden layer dimensions of the first and second recurrent neural networks (RNNs) of dimH1 = 200 and dimH2 = 400, respectively.

Scholarfy (Achakulvisut et al. 2016): The vectorization of a document is carried out by Latent Semantic Analysis (LSA) combined with log-entropy and Tf-idf for weighting purposes. Scholarfy uses the abstracts of the articles. Recommendations are made using the Rocchio algorithm to find nearest-neighbour articles. An embedding (vector) size of 150 is selected here for evaluation, the same as in the original Scholarfy experiment.

The real-world bibliographic dataset named the ACL Anthology Network (AAN) (Radev et al. 2013) is used to evaluate the performance of the proposed model against the previous content-based citation recommendation approaches. AAN contains articles on natural language processing (NLP) and computational linguistics collected from different venues. After removing papers with incomplete information, the evaluation dataset contains 27,324 papers. The abstract and introduction sections are used for the evaluation of the proposed model, the abstract only for Scholarfy, and the full text for all other techniques. LSTM-CAV is additionally provided with author, venue and keyword information.

The comparison of the proposed rhetoric zone similarity model against the baseline approaches in terms of recall is shown in Fig. 7. The comparison results show that the recommendations made by the proposed model are more precise than those of all other baseline approaches. The reason is that the proposed model computes similarity among articles based on the semantics of their content rather than considering the individual words appearing in the text, as Doc2vec, Scholarfy and NNRank do. NNRank shows the second-best results for the AAN dataset; however, it does not perform similarly on the ART + CORE dataset, because the nearest neighbour approach is well known to be highly sensitive to irrelevant features and to the scale of the data. The AAN dataset mainly contains papers on a specific topic, whereas ART + CORE contains general computer science articles. Moreover, the top 325 features learned by NNRank in the case of AAN are all relevant to the candidate articles, which does not hold for multi-domain articles. Scholarfy, with its LSA approach, shows behavior similar to NNRank: it remains consistently below average on both datasets due to its dependency on the representations provided by LSA. Comparatively, LSTM-CAV and GRU-MTL generate insignificant results due to their small embedding sizes relative to the full article text. The LSTM-CAV authors report the same as a limitation of their work, noting that increasing the embedding size to 500 or 1000 produces better results in some cases. From these results, it can be concluded that the proposed method based on semantic similarity outperforms state-of-the-art content-based filtering methods that use syntactic similarity for citation recommendation.

Fig. 7: Recall on the (a) AAN dataset and (b) ART + CORE dataset

5 Conclusion

In the present work, a citation recommendation system using deep learning techniques is proposed, which considers both the local and the global context-aware citation recommendation approach and presents a remedy to the cold-start problem. Previous research has heavily addressed the cold-start problem using collaborative filtering techniques relying on pre-computed or available information about articles. The present proposal, however, is based on content filtering, which requires no prior information about the query and candidate articles. Citation recommendations are made by classifying rhetoric zones using Bi-LSTM and BERT models and computing similarity using Sent2vec embeddings of every individual zone. Moreover, metadata information is combined with the rhetoric zone information to produce better results. The proposal is both an offline and an online approach that computes article relatedness based on the semantics of the content as a solution to the cold-start problem. The deep learning model was trained using the well-known ART and CORE datasets. The trained model, with an accuracy of over 80%, was tested on 2543 articles. An objective evaluation using mean average precision and a subjective evaluation using normalized discounted cumulative gain with ten experts were carried out. The evaluation results clearly show the effectiveness of the proposed approach for citation recommendation.

The proposed approach is validated through experiments on citation recommendation. However, several limitations need further study. Currently, all rhetoric zones are assigned the same weight during similarity computation; it has been observed during this research that setting dynamic weights for the rhetoric zones should improve the final ranking of the recommendation list. Moreover, the size of the embedding window is kept at 300, which can be increased or decreased to further evaluate its effectiveness. Sometimes rhetoric zones overlap each other; the present work currently assigns a single class label to a zone, whereas multi-label classification might yield different results. The present work considers rhetoric zones of sentence length from the introduction section only, which can be extended to multiple sentences and multiple sections of the paper. The current work carried out an experiment with a collection of articles to show its validity; it could further be compared to the top-ranking article recommendations of search systems such as Google Scholar, ScienceDirect, CORE and DBLP.