The entire process of keyphrase extraction using the proposed technique can be divided into three main phases: (i) candidate keyphrase selection, or pre-processing; (ii) candidate keyphrase processing, or simply processing; and (iii) ranking and selecting final keyphrases, or post-processing (see Fig. 3).
Candidate Keyphrase Selection
The proposed technique employs the Part-Of-Speech (POS) Tagging (POST) approach to extract candidate keyphrases from δ. Since keyphrases are generally noun phrases [13], the proposed technique limits the extraction to noun phrases only. For this, the following POS pattern is utilized, which has been demonstrated in [52] to be one of the most suitable patterns for extracting candidate keyphrases.
$$ (<NN.*>+ <JJ.*>?)|(<JJ.*>? <NN.*>+) $$
Note that this is a regular expression written in a simplified format using NLTK’s RegexpParser, where nouns are tagged with NN and adjectives with JJ. More details can be found in [23].
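As an illustration, the second alternative of the pattern (`<JJ.*>? <NN.*>+`) can be sketched in plain Python over pre-tagged tokens; the function name and the tagged example are illustrative, and a real pipeline would apply the full pattern via NLTK’s RegexpParser:

```python
# Sketch: chunk candidates matching the <JJ.*>? <NN.*>+ branch of the POS
# pattern over (word, tag) pairs. Illustrative only; the proposed technique
# applies the full pattern through NLTK's RegexpParser.
def extract_candidates(tagged):
    candidates, i, n = [], 0, len(tagged)
    while i < n:
        j, words = i, []
        if tagged[j][1].startswith("JJ"):      # optional leading adjective
            words.append(tagged[j][0])
            j += 1
        first_noun = j
        while j < n and tagged[j][1].startswith("NN"):
            words.append(tagged[j][0])
            j += 1
        if j > first_noun:                     # at least one noun matched
            candidates.append(" ".join(words))
            i = j
        else:
            i += 1
    return candidates
```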
Once the candidate keyphrases are extracted, they are passed through a cleaning process to filter out those that are less likely to become final keyphrases. The following candidates are removed: (i) any candidate keyphrase that contains non-alphabetic characters, (ii) any candidate keyphrase that contains single alphabetic word(s), and (iii) any candidate keyphrase whose frequency fails to satisfy the lsaf factor (see the “Conceptual Framework” section). The first two conditions filter out candidate keyphrases that generally make no sense to a human reader, and the last one filters out all non-popular candidates from the list.
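A minimal sketch of this cleaning step follows; interpreting condition (ii) as single-character words and realizing the lsaf factor as a plain minimum-frequency threshold are assumptions made here for illustration:

```python
from collections import Counter

def clean_candidates(candidates, min_freq):
    """Apply the three filtering conditions to a list of candidates.
    Condition (iii), the lsaf factor, is approximated here by a
    minimum-frequency threshold (an assumption)."""
    freq = Counter(candidates)
    kept = []
    for c in sorted(set(candidates)):
        words = c.split()
        if not all(w.isalpha() for w in words):   # (i) non-alphabetic characters
            continue
        if any(len(w) == 1 for w in words):       # (ii) single-character words
            continue
        if freq[c] < min_freq:                    # (iii) non-popular candidates
            continue
        kept.append(c)
    return kept
```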
Candidate Keyphrase Processing Using KeyPhrase Extraction (KePhEx) Tree
In conventional unsupervised keyphrase extraction techniques, candidate keyphrases are not processed; instead, they are sent to the ranking phase immediately after selection. On the contrary, an intermediate phase between candidate keyphrase selection and ranking could relieve the burden of ranking unnecessary keyphrases and thus lead to more appropriate keyphrases. The proposed KePhEx tree takes all the formerly mentioned hypotheses (see the “Preliminaries” section) into account for extracting final keyphrases. The KePhEx tree expands (hypothesis 3), shrinks (hypothesis 2), or remains in the same state (hypothesis 1) based on the candidate keyphrases. The advantages of employing the KePhEx tree in keyphrase extraction are threefold: (i) it extracts quality keyphrases from candidate keyphrases, (ii) it provides flexibility during keyphrase extraction, and (iii) it contributes to ranking by providing a value that represents the cohesiveness of a word in a keyphrase with respect to the root.
Among the different classes of tree data structures, the KePhEx tree falls under binary trees. However, although several variants of binary trees exist, it differs from them in that the position and level of every node in the tree are fixed. Moreover, all the predecessors of a node at the upper levels (including the root) are also fixed, unlike in other variants. This is because a good keyphrase must be a coherently connected sequence of words that appears contiguously in the text. Every node in a KePhEx tree holds 2-tuple data, namely a word and its CI or μ value, along with other information. The CI provides two advantages: (i) it assists in finding the cohesiveness of various words with respect to the root of the tree, which is employed as a factor in ranking keyphrases, and (ii) it provides flexibility during keyphrase extraction, as the value of μ increases or decreases based on the presence of that word in candidate keyphrases.
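The 2-tuple carried by each node can be sketched as a small class; the name KePhExNode and the child-pointer layout are illustrative:

```python
class KePhExNode:
    """A KePhEx tree node holding a word and its CI (mu) value."""
    def __init__(self, word):
        self.word = word
        self.mu = 1        # CI value, initialized to 1 when the node is created
        self.left = None   # subtree for words preceding this one in a phrase
        self.right = None  # subtree for words following this one in a phrase
```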
Root Selection
It is important to select a qualified root, since a poorly selected root may lead to a poor keyphrase. In this technique, only nouns are designated as roots, which are selected from the candidate keyphrase list, χ, and saved in another list, η. As noun phrases are the most likely candidates for final keyphrases, selecting nouns as roots increases the chance of extracting quality final keyphrases.
After selecting the roots, the trees are formed taking these roots into consideration. The entire process from tree formation to final keyphrase extraction is segmented into three main steps, namely (i) tree formation, (ii) tree processing, and (iii) keyphrase extraction.
Tree Formation
For forming a KePhEx tree, a root, γ, is selected from η. Afterwards, the proposed system selects the candidate keyphrases that contain γ. Let us denote these as similar candidate keyphrases, defined as follows:
Definition 1
Similar candidate keyphrases, σ, are those candidate keyphrases that contain γ in them—irrespective of its position, and \(\sigma \subseteq \chi \).
A partial sample of σ for γ = servic could be: σ = {scalabl grid servic discoveri base, grid servic, servic discoveri mechan, scalabl web servic permiss, distribut grid servic discoveri architectur, servic discoveri architectur, grid discoveri servic, servic discoveri, grid inform servic, servic discoveri grid comput, servic technolog, servic discoveri function, grid servic call registri, web servic version, discoveri servic, servic properti, thi servic, index servic, servic discoveri, web servic commun, …}. Among them, the first encountered similar candidate keyphrase, σ1 (here, scalabl grid servic discoveri base), is employed in forming the KePhEx tree, and the rest are utilized in processing the tree (see the “Tree Processing” section).
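Selecting σ from χ is a straightforward filter; the sketch below (with illustrative names) keeps every candidate that contains γ as a whole word:

```python
def similar_candidates(chi, gamma):
    """Return sigma: the candidates in chi that contain gamma, in order."""
    return [c for c in chi if gamma in c.split()]
```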
Here, the process of tree formation starts by determining the position of γ in σ1; the tree itself starts forming once γ is assigned as the root and its μ value is initialized to 1. For any other word (wi), its position, \({w_{i}^{p}}\), is determined first to decide in which subtree it should be placed. If the position of γ, γp, is greater than \({w_{i}^{p}}\) (i.e., \(\gamma ^{p} > {w_{i}^{p}}\)), wi is placed in the left subtree; otherwise (i.e., \(\gamma ^{p} < {w_{i}^{p}}\)), in the right subtree. In addition, the depth of wi, \({w_{i}^{d}}\), in a phrase with respect to γ must be calculated to determine the level of the tree at which wi is added, which is defined as follows:
Definition 2
Depth of wi, \({w_{i}^{d}}\), in a keyphrase is the distance of that word from γ irrespective of its direction, which is calculated as, \({w_{i}^{d}} = |\gamma ^{p} - {w_{i}^{p}}|\).
Note that \({w_{i}^{d}}\) of wi in a candidate keyphrase and the level of wi in the KePhEx tree, \({w_{i}^{l}}\), are identical, and hence they are used interchangeably in this paper. Once the subtree of wi is determined using \({w_{i}^{p}}\), \({w_{i}^{d}}\) is calculated. The next condition to be satisfied is that all the predecessors of wi must be in their respective places. This is tracked by traversing the tree from level 0 to l − 1 and comparing the word at each level with the corresponding word in σ1 at that depth. Once these constraints are satisfied, wi qualifies for addition to the tree at level l. For that, a node is created by incorporating wi and initializing its μ to 1.
Once all the words to the left of γ are added to the left subtree, the words to the right of γ are added to the right subtree following the same procedure. Tree formation ends when all the words of σ1 have been added. The entire process is illustrated in Algorithm 1.
A sample tree is depicted in Fig. 4, which is formed using σ1 = scalabl grid servic discoveri base and γ = servic. The tree formation starts by adding servic to the tree as the root and initializing its μ to 1. Afterwards, all the words to its left (i.e., grid and scalabl) are added to the left subtree at their respective levels, where the levels are calculated from their respective depths in σ1. For instance, since \(grid^{d} = 1\), grid is added at level 1 of the left subtree, whereas, since \(scalabl^{d} = 2\), scalabl is added at level 2 of the left subtree. Moreover, when grid is added to the tree, it is verified that its predecessor servic is in the tree; similarly, when scalabl is added, it is verified that grid and servic are its predecessors. Once all the words to the left of servic are added, the words to the right (i.e., discoveri and base) are added to the right subtree employing a similar procedure.
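The formation step can be sketched as below, with nodes as plain dicts; chaining each deeper word directly under the previous one is a simplification of Algorithm 1, and taking the first occurrence of γ in σ1 is an assumption:

```python
def new_node(word):
    return {"word": word, "mu": 1, "left": None, "right": None}

def form_tree(sigma1, gamma):
    """Form the initial KePhEx tree from sigma1 around the root gamma."""
    words = sigma1.split()
    gp = words.index(gamma)            # position of gamma (first occurrence)
    root = new_node(gamma)
    node = root                        # words left of gamma, nearest first,
    for w in reversed(words[:gp]):     # one per level of the left subtree
        node["left"] = new_node(w)
        node = node["left"]
    node = root                        # words right of gamma, one per level
    for w in words[gp + 1:]:
        node["right"] = new_node(w)
        node = node["right"]
    return root
```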
Tree Processing
After forming the tree employing σ1, the rest of the similar candidate keyphrases, \(\sigma ^{\prime }\) = {σ2, σ3, ... , σn}, are utilized to process the tree. For that, the cases mentioned in the “Preliminaries” section are taken into account: no tree processing is needed for case 1; the tree must be trimmed properly to remove unnecessary parts for case 2; and it must be expanded to add necessary parts from all the similar candidate keyphrases in \(\sigma ^{\prime }\) for case 3. This process is described in Algorithm 1.
Let us fetch a similar candidate keyphrase, \({\sigma }_{i}^{\prime }\), from \(\sigma ^{\prime }\) and utilize it for processing the KePhEx tree. First, γp in \({\sigma }_{i}^{\prime }\) needs to be determined. Like tree formation, tree processing also starts from γ, followed by the words to the left of γ and then those to the right. Afterwards, any word (\(w_{i}\in \sigma ^{\prime }_{i}\)) at position \({w_{i}^{p}}\) qualifies for addition to the left subtree if \({w_{i}^{p}} < \gamma ^{p}\); otherwise, when \({w_{i}^{p}} > \gamma ^{p}\), it qualifies for the right subtree. Again, the depth \({w_{i}^{d}}\) is calculated to determine at which level wi may be added, and all the predecessors (from level 0 to l − 1) are checked against the ones in \({\sigma }_{i}^{\prime }\) before its inclusion.
At level l, where wi qualifies for possible inclusion, three events can occur: (i) there is no node, (ii) there is only one node, and (iii) there are two nodes. In the case of the first event, a node is created for wi by initializing μ to 1, and it is added as a left child in the left subtree or as a right child in the right subtree. For the second event, if the word in the existing node is the same as wi, no node is added; otherwise, a node is created as before and added as a new child at the present level of the subtree. Lastly, if both children already exist at that level, the new node containing wi replaces the node whose word has the lowest TF, the rationale being that a word with a higher TF is more likely to form a quality final keyphrase. If the lower-TF node is a leaf, the new node simply replaces it; if it is the root of a subtree, the entire subtree is deleted from the tree and the new node is added in its position. This process is deemed complete when all the words of \({\sigma }_{i}^{\prime }\) have been considered.
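The three events can be sketched as a single placement step; nodes are plain dicts, `tf` is a word-to-frequency map, and the exact child slot chosen for a new node is a simplification of the paper's procedure:

```python
def new_node(word):
    return {"word": word, "mu": 1, "left": None, "right": None}

def add_word(parent, w, tf):
    """Place w one level below parent and return the node holding it."""
    for side in ("left", "right"):            # event (ii): word already present
        child = parent[side]
        if child is not None and child["word"] == w:
            return child
    for side in ("left", "right"):            # events (i)/(ii): a free slot
        if parent[side] is None:
            parent[side] = new_node(w)
            return parent[side]
    # event (iii): both children exist; replace the lower-TF one, dropping
    # its entire subtree along with it
    weakest = min(("left", "right"), key=lambda s: tf.get(parent[s]["word"], 0))
    parent[weakest] = new_node(w)
    return parent[weakest]
```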
Update μ Values
The process of updating μ values starts as soon as the nodes of \({\sigma }_{i}^{\prime }\) have been added to the tree, as demonstrated in Algorithm 2. It starts by determining γp in \({\sigma }_{i}^{\prime }\). If γp is 0, i.e., γ is the leftmost word of \({\sigma }_{i}^{\prime }\), the μ values of all the nodes in the left subtree are decreased. Similarly, if γp is \(|\sigma ^{\prime }_{i}| - 1\), i.e., γ is the rightmost word of \({\sigma }_{i}^{\prime }\), the μ values of all the nodes in the right subtree are decreased. Afterwards, the μ value of the root is increased, and the tree is traversed and compared iteratively, starting from the left subtree and followed by the right subtree.
At a given level l for any wi, three events may occur: (i) wi is absent at l, (ii) wi is present as a left child, and (iii) wi is present as a right child. For the first event, the μ values of all the nodes in both the left and right subtrees are decreased. In the second event, the μ value of the left child is increased, the μ values of the nodes in the right subtree are decreased, and the traversal moves to the next level. In the last event, the μ value of the right child is increased, the μ values of the nodes in the left subtree are decreased, and the traversal moves to the next level. This procedure continues until all the words are taken into account.
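The μ-update walk along one side of γ can be sketched as follows; nodes are plain dicts, and decrementing by exactly 1 is an assumption (Algorithm 2 in the paper governs the full procedure):

```python
def for_each(node, fn):
    """Apply fn to every node of a subtree."""
    if node is not None:
        fn(node)
        for_each(node["left"], fn)
        for_each(node["right"], fn)

def dec(node):
    node["mu"] -= 1

def update_side(root, side_words):
    """Walk one side of gamma, applying the three mu-update events."""
    node = root
    for w in side_words:                      # nearest-to-gamma first
        matched = None
        for side in ("left", "right"):
            child = node[side]
            if child is not None and child["word"] == w:
                matched = side
        if matched is None:                   # event (i): word absent
            for_each(node["left"], dec)
            for_each(node["right"], dec)
            break
        other = "right" if matched == "left" else "left"
        node[matched]["mu"] += 1              # events (ii)/(iii): word present
        for_each(node[other], dec)            # penalize the other subtree
        node = node[matched]
```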
An example of tree processing and updating μ values is demonstrated in Fig. 5, where the tree in Fig. 4 is utilized as the initial tree. Recall that the tree is formed using σ1 in σ, and the rest (i.e., \(\sigma ^{\prime }\)) are utilized to process it. As in Fig. 5a, since \(\sigma ^{\prime }_{1}\) is grid servic and both words already exist in the tree in sequence, the tree remains in the same state as before; however, the μ values of the nodes containing grid and servic are increased, and all others are decreased. In Fig. 5b, among the three words, only mechan does not exist in the right subtree at level 2; therefore, it is added as the left child. Afterwards, μ values are updated based on \(\sigma ^{\prime }_{2}\). Similarly, the tree keeps amending with every encountered \({\sigma }_{i}^{\prime }\), and the μ values are updated accordingly. This continues until all the keyphrases in \(\sigma ^{\prime }\) are processed. Although this example demonstrates only expansion or no change of tree state, shrinkage occurs in the keyphrase extraction phase.
Keyphrase Extraction
This process is initiated by pruning weak nodes from the tree. Here, weak nodes are selected based on their cohesiveness with respect to γ, on the assumption that they are unlikely to be parts of final keyphrases. For that, a constant integer value, named minimum allowable μ (mamu), is utilized: any node whose μ value is lower than mamu is pruned from the tree. For instance, in Fig. 5e, it can be observed that several nodes in the tree have low μ values, i.e., their cohesiveness with respect to γ is weak, and hence they are unlikely to be part of a final keyphrase. The mamu value determines which nodes are kept in the tree and which are pruned. Such a tree is depicted in Fig. 6, where mamu is set to 2.
If a pruned node is the root of a subtree, that entire subtree is also erased from the tree, on the assumption that a weak root would form a weak subtree. The mamu value must be selected with considerable attention, because a smaller mamu value results in more and/or longer keyphrases, whereas a larger mamu value results in fewer and/or shorter keyphrases. Therefore, it is essential to find a suitable mamu value for improved performance of the system; this paper conducts an experiment to find one (see the “Parameter Value Selection” section). This mamu value also provides flexibility during keyphrase extraction.
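Pruning with mamu reduces to a single recursive pass; this sketch (dict nodes, illustrative name) drops a weak node together with its entire subtree:

```python
def prune(node, mamu):
    """Remove every node whose mu is below mamu; a pruned node takes its
    whole subtree with it (a weak root forms a weak subtree)."""
    if node is None or node["mu"] < mamu:
        return None
    node["left"] = prune(node["left"], mamu)
    node["right"] = prune(node["right"], mamu)
    return node
```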
Afterwards, all paths from the root to the leaves are extracted to discover final keyphrases. Since this procedure differs from the conventional tree traversal techniques (namely preorder, inorder, and postorder), none of them is directly applicable in this case. Hence, the inorder traversal technique is adapted to perform the task, as explained in Algorithm 3. This algorithm extracts all the paths from root to leaf and separates them into left paths (paths through the left subtree) and right paths (paths through the right subtree), which are later processed to generate final keyphrases individually (one final keyphrase from one path) or collectively (by joining a path from the left subtree with a path from the right subtree), as demonstrated in Algorithm 4.
In the case of left paths, since they are extracted from root to leaf, they are unlikely to be final keyphrases as-is: they are aligned in the reverse direction and hence miss the coherent relationship. Therefore, all left paths are reversed before extracting final keyphrases. Afterwards, the words are acquired from each path and a keyphrase is formed, whose presence (in its entirety) is then checked in χ, either as a candidate keyphrase or as part of one. A similar technique is followed for the right paths, with the exception that they are not reversed, since they already satisfy the coherent-relationship condition. After acquiring all the final keyphrases from the left and right paths, the paths are also concatenated to generate longer and more meaningful keyphrases. Again, these keyphrases qualify as final keyphrases only if they are found in their entirety in χ as candidate keyphrases or parts of candidate keyphrases.
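The path-extraction and joining logic of Algorithms 3 and 4 can be sketched as below; dict nodes, illustrative names, and a plain substring containment test for the presence check in χ are simplifying assumptions:

```python
def paths_from(root, side):
    """Root-to-leaf word paths through one subtree, including the root."""
    def rec(node):
        if node is None:
            return []
        if node["left"] is None and node["right"] is None:
            return [[node["word"]]]
        return [[node["word"]] + p
                for s in ("left", "right") for p in rec(node[s])]
    return [[root["word"]] + p for p in rec(root[side])]

def final_keyphrases(root, chi):
    """Left paths are reversed, right paths kept as-is, and pairs of them
    are joined through the root; a phrase qualifies only if it appears
    entirely within some candidate in chi."""
    lefts = [list(reversed(p)) for p in paths_from(root, "left")]
    rights = paths_from(root, "right")
    phrases = lefts + rights + [l + r[1:] for l in lefts for r in rights]
    found = lambda s: any(s in c for c in chi)   # substring approximation
    return [s for s in (" ".join(p) for p in phrases) if found(s)]
```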
Flexibility During Keyphrase Extraction
The proposed technique offers flexibility in keyphrase extraction via the mamu value. As an example, Table 1 is generated using the tree in Fig. 6. As expected, different mamu values generate different final keyphrases, which also differ in length and quantity. For instance, the longest keyphrase generated with mamu values from 1 to 3 has length 4, whereas it is 3 for a mamu value of 4, 2 for mamu values from 5 to 7, and so on. On the other hand, for mamu values from 1 to 4, three final keyphrases are extracted, whereas only one is extracted for mamu values from 5 to 45, and none afterwards. From this, we can conclude that a greedy approach may choose a low mamu value and thus obtain considerably more and/or lengthier keyphrases, at some cost in quality; a conservative approach may choose a large mamu value, which in turn provides considerably fewer and/or mostly shorter keyphrases. Hence, to achieve the desired level of performance, the mamu value must be set properly. To this end, an experimental evaluation is performed in the “Results and Discussion” section, and the results are analyzed with detailed evidence.
Table 1 Final keyphrases from the resultant tree in Fig. 6
After extracting all the final keyphrases from the tree for a given γ, the next γ is chosen from the list η and the same procedure is repeated. This continues until all the nouns in η have been considered as γ. Once all the final keyphrases have been extracted, they are passed on for ranking and selection.
Ranking and Selecting Final Keyphrases
Generally, automatic keyphrase extraction techniques extract a large number of final keyphrases. However, various applications, including recommender systems and document indexing techniques, utilize only a certain number of top keyphrases. Therefore, an automatic keyphrase extraction technique must offer the most relevant top-N keyphrases to these applications. Hence, keyphrase extraction is also treated as a ranking problem.
In the proposed ranking technique, the μ value is employed along with the TF to calculate the weight, ωp, of a keyphrase p, as follows:
$$ \omega_{p} = \sum\limits_{k=1}^{N} tf_{k} \times \sum\limits_{k=1}^{N} \mu_{k} $$
(1)
Here, N is the number of words in p. The first factor in Eq. 1, i.e., TF, is utilized to identify the popularity of that keyphrase in a document, on the assumption that non-popular keyphrases are unlikely to become final keyphrases. For that, the TFs of all the words in p are summed. It is worth noting that summation is performed instead of averaging in order to eliminate the bias towards single terms. The second factor captures the cohesiveness of every word in the keyphrase with respect to γ, which is found by summing the μ values of all the words in p.
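Equation 1 translates directly into code; the sketch below assumes per-word TF and μ lookups are available as dicts (illustrative names):

```python
def weight(phrase_words, tf, mu):
    """Eq. 1: omega_p = (sum of word TFs) * (sum of word mu values)."""
    return (sum(tf[w] for w in phrase_words)
            * sum(mu[w] for w in phrase_words))
```

Keyphrases are then ranked by sorting on this weight in descending order.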
After calculating the ω values of all keyphrases, they are sorted into rank order. Since the quantity of final keyphrases is limited, any sorting algorithm is suitable; in the proposed system, the quicksort [27] algorithm is applied to perform the task rapidly. After ranking, the keyphrases are ready to be rendered: when a user or an application requests N keyphrases, the system provides the top-N keyphrases, from rank 1 to N.