
1 Introduction

Keyphrase extraction is the task of automatically extracting words or phrases that concisely represent the essence of a text. Because of this succinct expression, keyphrases are widely used in many tasks such as document retrieval [13, 25], document categorization [9, 12], opinion mining [3], and summarization [24, 31]. Figure 1 shows an example of the title and abstract of a research paper along with the author-specified keyphrases highlighted in bold.

Existing methods for keyphrase extraction follow a two-step procedure: they first select important phrases from the document as potential keyphrase candidates using heuristic rules [18, 28, 29], and then rank the extracted candidates with either unsupervised [17, 21, 27] or supervised [18, 22, 29] approaches. Unsupervised approaches score the candidate phrases based on the individual words that comprise them. They utilize various scoring measures based on the informativeness of a word with respect to the whole document [10]. Other paradigms use graph-based ranking algorithms, in which each word in the document is mapped to a node and the connecting edges represent the association patterns among the words in the document. The scores of the individual words are then estimated using various graph centrality measures [6, 21, 27]. On the other hand, supervised approaches [4, 14] use binary classification to label the extracted candidate phrases as keyphrases or non-keyphrases, based on various features such as tf-idf, part-of-speech (POS) tags, and the position of phrases in the document. The major limitation of these supervised approaches is that they classify each candidate phrase independently, without accounting for the dependencies that could exist between neighboring labels, and they also ignore the semantic meaning of the text. To overcome this limitation, [8] formulated keyphrase extraction as a sequence labeling task and used linear-chain Conditional Random Fields for this task. However, this approach does not explicitly take into account the long-term dependencies and semantics of the text. More recently, to capture both the semantics of the text and the dependencies among the labels of neighboring words, [1] used a deep learning-based approach called BiLSTM-CRF, which combines a bi-directional Long Short-Term Memory (BiLSTM) layer that models the sequential input text with a Conditional Random Field (CRF) layer that captures the dependencies in the output.

Fig. 1. An example of keyphrase extraction with author-specified keyphrases highlighted in bold.

The above-mentioned approaches treat keyphrase extraction as a sentence-level task in which sentences of the same document are viewed as independent of one another. When labeling a word, local contextual information from the surrounding words is crucial because the context gives insight into the semantic meaning of the word. However, there are many instances in which the local context is ambiguous or lacks sufficient information. If the model has access to supporting information that provides additional context, it may use this information to predict the label correctly. Such supporting information can be found in other sentences of the document from which the query sentence is taken. To utilize this additional supporting information, we propose a document-level attention mechanism inspired by [20, 30]; it dynamically weights the supporting sentences, emphasizing the information in each that is most relevant to the local context. However, leveraging this additional supporting information has the downside of introducing noise into the representations. To alleviate this problem, we use a gating mechanism [20, 30] that balances the influence of the local contextual representations against the additional supporting information from the document-level contextual representations.

To this end, in this paper, we propose Document-level Attention for Keyphrase Extraction (DAKE). It first produces, for each word, a representation that encodes the local context of the query sentence using a BiLSTM; it then uses a document-level attention mechanism to incorporate the most relevant information from each supporting sentence with respect to the local context, and employs a gating mechanism to filter out irrelevant information. Finally, a CRF layer, which captures dependencies among the output labels, decodes the gated local and document-level contextual representations to predict the labels. The main contributions of this paper are as follows:

  • We propose DAKE, a BiLSTM-CRF model augmented with document-level attention and a gating mechanism for improved keyphrase extraction from research papers.

  • Experimental results on a dataset of research papers show that DAKE outperforms previous state-of-the-art approaches.

2 Problem Formulation

We formally describe the keyphrase extraction task as follows: given a sentence \({s =\{w_1,w_2, \ldots , w_n\}}\), where n is the length of the sentence, predict the label sequence \({y =\{y_1,y_2, \ldots , y_n\}}\), where \({y_i}\) is the label corresponding to word \({w_i}\) and can be either KP (keyphrase word) or Not-KP (not a keyphrase word). Every maximal sequence of consecutive KP words in a sentence is a keyphrase.
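
As a concrete illustration, the sketch below shows how a predicted KP/Not-KP label sequence is decoded into keyphrases by collecting maximal runs of KP-tagged words; the sentence and labels are invented for this example, not taken from the dataset.

```python
# The sentence and labels below are invented purely to illustrate the scheme.
sentence = ["We", "propose", "a", "keyphrase", "extraction", "model"]
labels   = ["Not-KP", "Not-KP", "Not-KP", "KP", "KP", "Not-KP"]

def decode_keyphrases(words, tags):
    """Collect every maximal run of consecutive KP-tagged words as a keyphrase."""
    phrases, current = [], []
    for word, tag in zip(words, tags):
        if tag == "KP":
            current.append(word)
        elif current:
            phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases

print(decode_keyphrases(sentence, labels))   # ['keyphrase extraction']
```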

3 Proposed Method

The main components in our proposed architecture, DAKE, are: Word Embedding Layer, Sentence Encoding Layer, Document-level Attention mechanism, Gating mechanism, Context Augmenting Layer and Label Sequence Prediction Layer. The first layer produces word embeddings of the sentence from which the second layer generates word representations that encode the local context from the query sentence. Then the document-level attention mechanism extracts supporting information from other sentences in the document to enrich the current word representation. Subsequently, we utilize a gating mechanism to filter out the irrelevant information from each word representation. The next layer fuses the local and the global contexts into each word representation. Finally, we feed these word representations into the CRF layer which acts as a decoder to predict the label, KP or Not-KP, associated with each word. The model is trained in an end-to-end fashion.

3.1 Word Embedding Layer

Given a document \({D=\{s_1,s_2,\ldots ,s_m\}}\) of m sentences, where a sentence \({s_i=\{w_{i1},w_{i2},\ldots ,w_{in}\}}\) is a sequence of n words, we transform each word \({w_{ij}}\) in the sentence \({s_i}\) into a vector \({\mathbf {x}_{ij}}\) using pre-trained word embeddings.
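
Purely as a sketch of this layer, the snippet below loads GloVe vectors from the standard text format and looks up a 300-dimensional vector per word; the file name and the zero-vector fallback for out-of-vocabulary words are our assumptions, as Sect. 4.3 only specifies GloVe and the dimensionality.

```python
import numpy as np

# Hypothetical loader for GloVe vectors in the standard text format; the file
# name and the zero-vector fallback for unknown words are our own choices.
def load_glove(path="glove.6B.300d.txt", dim=300):
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors, np.zeros(dim, dtype=np.float32)

glove, unk = load_glove()
sentence = ["keyphrase", "extraction", "with", "document-level", "attention"]
x = np.stack([glove.get(w.lower(), unk) for w in sentence])   # shape: (n, 300)
```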

3.2 Sentence Encoding Layer

We use a BiLSTM [11] to obtain the hidden representation \({H_i}\) of the sentence \({s_i}\). A BiLSTM comprises a forward-LSTM, which reads the input sequence in the original direction, and a backward-LSTM, which reads it in the opposite direction. We apply the forward-LSTM to the sentence \({s_i}\) = (\({\mathbf {x}_{i1}}\),\({\mathbf {x}_{i2}}\), ...,\({\mathbf {x}_{in}}\)) to obtain \(\overrightarrow{H_i}\) = (\(\overrightarrow{\mathbf {h}_{i1}}\),\(\overrightarrow{\mathbf {h}_{i2}}\), ...,\(\overrightarrow{\mathbf {h}_{in}}\)). The backward-LSTM applied to \({s_i}\) produces \(\overleftarrow{H_i}\) = (\(\overleftarrow{\mathbf {h}_{i1}}\),\(\overleftarrow{\mathbf {h}_{i2}}\),...,\(\overleftarrow{\mathbf {h}_{in}}\)). We concatenate the outputs of the forward and the backward LSTMs to obtain the local contextual representation \({H_i=\{\mathbf {h}_{i1},\mathbf {h}_{i2},\ldots ,\mathbf {h}_{in}\}}\) where \({\mathbf {h}_{ij}}\) = [\(\overrightarrow{\mathbf {h}_{ij}}\):\(\overleftarrow{\mathbf {h}_{ij}}\)]; here, : denotes the concatenation operation. Succinctly, \(\mathbf {h}_{ij}\,=\,{{\,\mathrm{BiLSTM}\,}}(\mathbf {x}_{ij})\).
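
A minimal PyTorch sketch of this layer, assuming 300-dimensional embeddings and a 300-dimensional hidden state per direction as in Sect. 4.3; the module and variable names are ours.

```python
import torch
import torch.nn as nn

class SentenceEncoder(nn.Module):
    """BiLSTM over the word embeddings of one sentence (batch_first layout)."""
    def __init__(self, emb_dim=300, hidden_dim=300):
        super().__init__()
        self.bilstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                              bidirectional=True)

    def forward(self, x):            # x: (batch, n, emb_dim)
        h, _ = self.bilstm(x)        # h: (batch, n, 2 * hidden_dim)
        return h                     # h[:, j] = [forward state : backward state]

encoder = SentenceEncoder()
h = encoder(torch.randn(1, 12, 300))   # one 12-word sentence -> (1, 12, 600)
```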

3.3 Document-Level Attention

Many keyphrase mentions are tagged incorrectly by current approaches, including the BiLSTM-CRF model [1], due to ambiguous contexts in the input sentence. In cases where a sentence is short or highly ambiguous, the model may either fail to identify keyphrases due to insufficient information or make wrong predictions by using noisy context. We hypothesize that this limitation can be alleviated using additional supporting information from other sentences within the same document. To extract this global context, we need vector representations of the other sentences in the same document D. We utilize BERT [5] as a sentence encoder to obtain representations for the sentences in D. Given an input sentence \(s_{l}\) in D, we extract the final hidden state of the [CLS] token as the representation \(\mathbf {h}'_{l}\) of the sentence, where [CLS] is the special classification embedding in BERT. Then, for each word \({w_{ij}}\) in the input sentence \(s_i\), we apply an attention mechanism to weight the supporting sentences in D as follows:

$$\begin{aligned} e^{l}_{ij}=\mathbf {v}^\top \tanh (W_1 \mathbf {h}_{ij} + W_2 \mathbf {h}'_l + \mathbf {b}_1) \end{aligned}$$
(1)
$$\begin{aligned} \alpha ^{l}_{ij}=\frac{\exp (e^{l}_{ij})}{\sum _{p=1}^m\exp (e^p_{ij})} \end{aligned}$$
(2)

where \({W_1}\), \({W_2}\) are trainable weight matrices and \({\mathbf {b}_1}\) is a trainable bias vector. We compute the final representation of supporting information as \({\tilde{\mathbf {h}}_{ij}=\sum _{l=1}^m \alpha ^l_{ij}\mathbf {h}'_{l}}\). For each word \({w_{ij}}\), \({\tilde{\mathbf {h}}_{ij}}\) captures the document-level supporting evidence with regard to \({w_{ij}}\).
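
The PyTorch sketch below implements Eqs. (1) and (2) for a single word state \({\mathbf {h}_{ij}}\) and the m sentence vectors of D; the dimensions (600 for the BiLSTM word state, 768 for the SciBERT sentence vectors) and the placement of the bias term are our assumptions.

```python
import torch
import torch.nn as nn

class DocumentAttention(nn.Module):
    """Attention over the m sentence vectors of D for a single word state h_ij."""
    def __init__(self, word_dim=600, sent_dim=768, attn_dim=300):
        super().__init__()
        self.W1 = nn.Linear(word_dim, attn_dim, bias=False)
        self.W2 = nn.Linear(sent_dim, attn_dim, bias=True)   # its bias plays the role of b_1
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, h_ij, sent_reps):
        # h_ij: (word_dim,); sent_reps: (m, sent_dim), one [CLS] vector per sentence in D
        scores = self.v(torch.tanh(self.W1(h_ij) + self.W2(sent_reps))).squeeze(-1)  # Eq. (1)
        alpha = torch.softmax(scores, dim=-1)                                        # Eq. (2)
        return alpha @ sent_reps            # weighted supporting evidence, shape (sent_dim,)
```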

3.4 Gating Mechanism

Although the above supporting information from the entire document is valuable for prediction, we must mitigate the influence of distant supporting information, as the prediction should be made primarily based on the local context. Therefore, we apply a gating mechanism to constrain this influence and enable the model to decide how much of the supporting information should be incorporated, as follows:

$$\begin{aligned} \mathbf {r}_{ij}=\sigma (W_3 \tilde{\mathbf {h}}_{ij}+W_4\mathbf {h}_{ij}+ \mathbf {b}_2) \end{aligned}$$
(3)
$$\begin{aligned} \mathbf {z}_{ij}=\sigma (W_5\tilde{\mathbf {h}}_{ij}+W_6 \mathbf {h}_{ij} + \mathbf {b}_3) \end{aligned}$$
(4)
$$\begin{aligned} \mathbf {g}_{ij}= \tanh (W_7 \mathbf {h}_{ij}+ \mathbf {z}_{ij} \odot (W_8 \tilde{\mathbf {h}}_{ij}+ \mathbf {b}_4)) \end{aligned}$$
(5)
$$\begin{aligned} \mathbf {d}_{ij}=\mathbf {r}_{ij}\odot \mathbf {h}_{ij}+ (1-\mathbf {r}_{ij})\odot \mathbf {g}_{ij} \end{aligned}$$
(6)

where \(\odot \) denotes the Hadamard product, \({W_3}\), \({W_4}\), \({W_5}\), \({W_6}\), \({W_7}\), \({W_8}\) are trainable weight matrices, and \({\mathbf {b}_2}\), \({\mathbf {b}_3}\), \({\mathbf {b}_4}\) are trainable bias vectors. \({\mathbf {d}_{ij}}\) is the representation of the gated supporting evidence for \({w_{ij}}\).
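
A PyTorch sketch of Eqs. (3)–(6); we assume all projections map to the dimensionality of the word state \({\mathbf {h}_{ij}}\) so that the Hadamard products are well defined (the exact dimensions are not specified in the text, and the bias placements are ours).

```python
import torch
import torch.nn as nn

class SupportGate(nn.Module):
    """Gate the document-level evidence h_tilde against the local word state h."""
    def __init__(self, word_dim=600, sent_dim=768):
        super().__init__()
        self.W3 = nn.Linear(sent_dim, word_dim, bias=True)    # its bias acts as b_2
        self.W4 = nn.Linear(word_dim, word_dim, bias=False)
        self.W5 = nn.Linear(sent_dim, word_dim, bias=True)    # its bias acts as b_3
        self.W6 = nn.Linear(word_dim, word_dim, bias=False)
        self.W7 = nn.Linear(word_dim, word_dim, bias=False)
        self.W8 = nn.Linear(sent_dim, word_dim, bias=True)    # its bias acts as b_4

    def forward(self, h, h_tilde):
        r = torch.sigmoid(self.W3(h_tilde) + self.W4(h))      # Eq. (3)
        z = torch.sigmoid(self.W5(h_tilde) + self.W6(h))      # Eq. (4)
        g = torch.tanh(self.W7(h) + z * self.W8(h_tilde))     # Eq. (5)
        return r * h + (1.0 - r) * g                          # Eq. (6): gated evidence d_ij
```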

3.5 Context Augmenting Layer

For each word \({w_{ij}}\) of sentence \({s_i}\), we concatenate its local contextual representation \({\mathbf {h}_{ij}}\) and its gated document-level supporting contextual representation \({\mathbf {d}_{ij}}\) to obtain its final representation \({\mathbf {a}_{ij}}=[\mathbf {h}_{ij}:\mathbf {d}_{ij}]\), where : denotes the concatenation operation. These final representations \({A_i=\{\mathbf {a}_{i1},\mathbf {a}_{i2},\ldots ,\mathbf {a}_{in}\}}\) of sentence \(s_i\) are fed to another BiLSTM to further encode the local contextual features along with the supporting contextual information into unified representations \({C_i=\{\mathbf {c}_{i1}, \mathbf {c}_{i2},\ldots ,\mathbf {c}_{in}\}}\), where \({\mathbf {c}_{ij}={{\,\mathrm{BiLSTM}\,}}(\mathbf {a}_{ij})}\). The output of this encoding captures the interaction among the context words conditioned on the supporting information. This is different from the initial encoding layer, which captures the interaction among the words of the sentence independently of the supporting information.
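
A sketch of this layer under the same dimensional assumptions as in the earlier snippets (600-dimensional word states and gated evidence vectors):

```python
import torch
import torch.nn as nn

class ContextAugmenter(nn.Module):
    """Second BiLSTM over [h_ij : d_ij] to fuse local and document-level context."""
    def __init__(self, word_dim=600, hidden_dim=300):
        super().__init__()
        self.bilstm = nn.LSTM(2 * word_dim, hidden_dim, batch_first=True,
                              bidirectional=True)

    def forward(self, h, d):              # both: (batch, n, word_dim)
        a = torch.cat([h, d], dim=-1)     # a_ij = [h_ij : d_ij]
        c, _ = self.bilstm(a)             # c_ij: unified contextual representation
        return c
```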

3.6 Label Sequence Prediction Layer

The obtained contextual representations \(C_i\) of query sentence \(s_i\) are given as input sequence to a CRF layer [16] that produces a probability distribution over the output label sequence using the dependencies among the labels of the entire input sequence. In order to efficiently find the best sequence of labels for an input sentence, the Viterbi algorithm [7] is used.
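
The decoding step can be illustrated with a plain Viterbi decoder over per-word emission scores and tag-transition scores; the random scores below are placeholders, not learned CRF parameters.

```python
import numpy as np

def viterbi(emissions, transitions):
    """emissions: (n, num_tags) scores; transitions[i, j]: score of moving from tag i to tag j."""
    n, num_tags = emissions.shape
    score = emissions[0].copy()                      # best path score ending in each tag at step 0
    backpointers = np.zeros((n, num_tags), dtype=int)
    for t in range(1, n):
        # candidate[i, j]: best path ending in tag i at t-1, then transitioning to tag j at t
        candidate = score[:, None] + transitions + emissions[t][None, :]
        backpointers[t] = candidate.argmax(axis=0)
        score = candidate.max(axis=0)
    best = [int(score.argmax())]                     # best final tag
    for t in range(n - 1, 0, -1):
        best.append(int(backpointers[t, best[-1]]))
    return best[::-1]

# Random placeholder scores for a 6-word sentence with tags 0 = Not-KP, 1 = KP.
tags = viterbi(np.random.randn(6, 2), np.random.randn(2, 2))
```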

4 Experiments

4.1 Dataset

We use the dataset from [19], which comprises metadata of papers from several online digital libraries. The dataset contains metadata for 567,830 papers with a clear train/validation/test split provided by the authors: 527,830 papers are used for model training, 20,000 for validation, and the remaining 20,000 for testing. We refer to these sets as kp527k, kp20k-v, and kp20k, respectively. The metadata of each paper consists of the title, abstract, and author-assigned keyphrases. The title and abstract of each paper are used to extract keyphrases, whereas the author-assigned keyphrases serve as the gold standard for evaluation.

4.2 Baselines and Evaluation Metrics

We compare our approach, DAKE, with the following baselines: Bi-LSTM-CRF [1], CRF [8], Bi-LSTM [1], copyRNN [19], KEA [29], Tf-Idf, TextRank [21], and SingleRank [27]. We also carry out an ablation study to understand the effectiveness of the document-level attention and gating mechanism components by removing them. As in previous work, we evaluate the predictions of each method against the author-specified keyphrases that can be located in the corresponding paper abstracts in the dataset (“gold standard”). We report precision, recall, and F1-score for all experiments and compare methods using the F1-score, the harmonic mean of precision and recall.
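
For concreteness, exact-match precision, recall, and F1 for one document could be computed as in the sketch below; the paper's exact matching and averaging protocol (e.g., whether stemming is applied) may differ.

```python
def prf1(predicted, gold):
    """Exact-match precision, recall, and F1 between two keyphrase sets."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(prf1({"keyphrase extraction", "crf"}, {"keyphrase extraction", "attention"}))
# (0.5, 0.5, 0.5)
```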

4.3 Implementation Details

We use pre-trained word embedding vectors obtained using GloVe [23]. We use SciBERT [2], a BERT model trained on scientific text, as the sentence encoder. For word representations, we use 300-dimensional pre-trained embeddings, and for the sentence encoder, we use 768-dimensional representations obtained using SciBERT. The hidden state of the LSTM is set to 300 dimensions. The model is trained end-to-end using the Adam optimization method [15]. The learning rate is initially set to 0.001 and decayed by 0.5 after each epoch. For regularization, to avoid over-fitting, dropout [26] is applied to each layer. We select the model with the best F1-score on the validation set, kp20k-v.
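
A sketch of this training setup in PyTorch; the stand-in module and the number of epochs are our assumptions, and only the optimizer, initial learning rate, and per-epoch halving follow the text above.

```python
import torch

# Stand-in module; in practice this would be the full DAKE network.
model = torch.nn.LSTM(300, 300, batch_first=True, bidirectional=True)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)                 # Adam, lr = 0.001
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.5)  # halve lr each epoch

for epoch in range(5):                      # the number of epochs is our assumption
    # ... one training pass over kp527k with dropout enabled would go here ...
    scheduler.step()                        # lr: 1e-3 -> 5e-4 -> 2.5e-4 -> ...
```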

5 Results and Discussion

Table 1a shows the results of our approach in comparison to various baselines. Our approach, DAKE, outperforms all baselines in terms of the F1-score. Tf-Idf, TextRank, and SingleRank are unsupervised extractive approaches, while KEA, Bi-LSTM-CRF, CRF, and Bi-LSTM follow a supervised extractive approach. copyRNN is a recently proposed generative model based on sequence-to-sequence learning with a copying mechanism. For the unsupervised models and the sequence-to-sequence learning model, we report the performance at the top-5 predicted keyphrases, since top-5 showed the highest performance for these models in previous work. From Table 1a, we observe that the deep learning-based approaches perform better than the traditional feature-based approaches. This indicates the importance of understanding the semantics of the text for keyphrase extraction. BiLSTM-CRF yields better results in terms of the F1-score than the CRF (improvement in F1-score of \(18.17\%\), from \(17.46\%\) to \(35.63\%\)) and BiLSTM (improvement in F1-score of \(18.88\%\), from \(16.75\%\) to \(35.63\%\)) models alone. This result indicates that combining a BiLSTM, which is powerful in capturing the semantics of the textual content, with a CRF, which captures the dependencies among the output labels, helps boost the performance in identifying keyphrases. Our proposed method, DAKE, outperforms the BiLSTM-CRF approach (improvement in F1-score of \(6.67\%\), from \(35.63\%\) to \(42.30\%\)), which indicates that incorporating additional contextual information from other sentences in the document into the BiLSTM-CRF model helps to further boost the performance.

Table 1. Performance analysis of DAKE

Table 1b shows the ablation study. We observe that document-level attention increases the F1-score of the baseline BiLSTM-CRF by \(0.84\%\) (from \(35.63\%\) to \(36.47\%\)). This validates our hypothesis that additional supporting information boosts the performance of keyphrase extraction. However, leveraging this additional supporting information has the downside of introducing noise into the representations; to alleviate this, we used a gating mechanism, which boosted the F1-score by \(1.62\%\) (from \(36.47\%\) to \(38.09\%\)). Document-level attention alone does not show a large improvement when the model has only one BiLSTM layer, because the final tagging predictions depend mainly on the local context of each word, while the additional context only supplements extra information. Therefore, our model needs another BiLSTM layer to encode the sequential intermediate vectors containing both the additional and the local context, as evidenced by the F1-score improvement of \(4.21\%\) (from \(38.09\%\) to \(42.30\%\)). When the CRF is removed from DAKE, the F1-score falls by \(3.09\%\), showing that the CRF successfully captures the output label dependencies.

6 Conclusion and Future Work

We proposed an architecture, DAKE, for keyphrase extraction from documents. It uses a BiLSTM-CRF network enhanced with a document-level attention mechanism to incorporate contextual information from the entire document, and a gating mechanism to balance the global and the local contexts. It outperforms existing keyphrase extraction methods on a dataset of research papers. In the future, we would like to integrate relationships between documents, such as those available from a citation network, by enhancing our approach with the contexts in which a document is cited.