A patent keywords extraction method using TextRank model with prior public knowledge

For large collections of patent texts, extracting keywords in an unsupervised way is an important problem. Existing methods analyze only the information contained in the patent texts themselves. In this study, an improved TextRank model is proposed that makes use of prior public knowledge. Specifically, two networks are first constructed: (1) a TextRank network for each patent text, and (2) a prior knowledge network based on public dictionary data, in which edges represent the prior interpretation relationships among the dictionary words in dictionary entries. Then, an improved node rank value evaluation formula is designed for the TextRank networks of patent texts, into which the prior interpretation information of the prior knowledge network is introduced. Finally, patent keywords are extracted as the top-k node words with the highest node rank values. In our experiments, a patent text clustering task is used to examine the performance of the proposed method, and several comparison experiments are carried out. The results demonstrate that the new method obtains markedly better performance than existing methods on the unsupervised patent keywords extraction task.


Introduction
As the number of patent texts keeps growing, how to mine their contents to obtain valuable patent information has attracted widespread attention [1,2]. Generally, patent contents can be well represented by a few key terms, also called patent keywords. These keywords can then be widely used in text mining tasks such as automatic summary generation [3], patent novelty discovery [4], and text clustering and classification [5,6].
Patent texts, as a type of semi-structured text, usually have relatively regular paragraph and sentence structures. Thus, TextRank [7], a graph-based analysis model, has been used effectively to extract patent keywords [8]. Existing TextRank methods use the PageRank [9] formula to calculate node rank values based on the co-occurrence relationships among all candidate words. However, these methods do not consider the inherent differences in term node importance with respect to public common knowledge. Therefore, an improved TextRank model is proposed in this study by introducing a prior knowledge network; it is called PrTextRank in this text.
In PrTextRank, a patent text is first modeled as a classical TextRank network, and public dictionary data are modeled as a prior knowledge network. In the prior knowledge network, node weights are also computed by the PageRank formula, and edge weights are computed from the co-occurrence degrees among node words in dictionary entries. Then, the prior information in the prior knowledge network is integrated into the patent TextRank networks, and a new evaluation method for node rank values is designed. Finally, patent keywords are extracted as the top-k node words with the highest node rank values, as in the standard TextRank model.
The main contributions of this study can be summarized as follows: (i) An improved TextRank model with prior public knowledge is proposed for patent text keywords extraction. (ii) Several experiments on patent text clustering tasks are performed with several clustering methods, in which the proposed method is compared with Term Frequency-Inverse Document Frequency (TF-IDF), TextRank and TopicalPageRank. The experimental results indicate the good performance of the proposed method. (iii) An extended experimental analysis is also carried out on other types of text, including news texts and popular food texts. The corresponding results also show the applicability of the proposed method.
The rest of this text is structured as follows. The second section briefly reviews related work. The third section describes the proposed TextRank model. The fourth section gives our experimental analysis and results. The fifth section summarizes our work.

Keywords extraction
Keywords extraction is the foundation of many text mining tasks and has been widely studied. Supervised and unsupervised strategies are the two basic categories. Supervised keywords extraction methods can be regarded as binary classification, which needs labeled corpus data to train classification functions in order to obtain good performance. Examples include methods based on decision trees [10], Support Vector Machines (SVM) [11], neural networks [12] and so on. In recent years, with the progress of deep learning research, many keywords extraction techniques based on neural networks have emerged. Zhang et al. [13] proposed a target center-based Long Short-Term Memory (LSTM) model (TC-LSTM) to achieve performance improvement. She et al. [14] proposed a deep neural semantic network (DNSN). Feng et al. [15] used reinforcement learning and deep learning to extract entities and relationships from texts, and used a bidirectional LSTM to realize preliminary entity extraction. Then, a Tree-LSTM was designed to capture the most important information mentioned in the relationship.
However, supervised keywords extraction methods rely heavily on labeled corpora. Because manual labeling is time-consuming and laborious, supervised methods are limited in many application scenarios.
At present, more and more researchers are focusing on unsupervised keywords extraction. Methods of this type usually design scoring criteria to rank candidate keywords and extract the top-k words as keywords. Generally speaking, unsupervised extraction methods may be divided into three categories: statistical methods, word graph methods and Latent Dirichlet Allocation (LDA) methods.
The most classical statistical method is TF-IDF [16]. It is widely applicable, but it relies mainly on word frequency information and therefore ignores the semantic features of the text.
Word graph methods are another well-known kind of unsupervised method. Inspired by the great success and wide application of the PageRank algorithm, Mihalcea et al. [7] proposed the TextRank method in 2004, which constructs an undirected weighted graph and uses the PageRank iterative formula to calculate the importance of nodes. However, graph-based keyword extraction algorithms need many iterations, which increases the computational complexity. Gollapalli et al. [17] proposed the CiteTextRank method, which uses citation networks to enrich the information content of the text graph. Florescu et al. [18] put forward the PositionRank method by adding word position information to the PageRank model. In addition, Devika et al. [19] proposed the Semantic graph-based Keywords Extraction Method (SKEM), which extracts keywords by means of semantic information and graph indices.
Extraction methods based on LDA calculate the similarity between candidate keywords and topics. As early as 2010, Liu of Tsinghua University put forward the TopicalPageRank (TPR) algorithm [20], in which the PageRank score of each candidate keyword under each corresponding topic is evaluated. However, LDA-based methods are largely affected by the distribution of training topics. In addition, the number of topics has to be set manually in advance.

Patent analysis based on text mining
With the standardization and specialization of patent texts, text mining has been widely used in patent analysis. When applying text mining methods to patent analysis, most researchers have paid attention to extracting meaningful keywords. For example, Yang et al. [21] used regular expression pattern matching techniques to extract semantic information from patent claims. Noh et al. [22] considered four different factors to determine the best keyword selection and processing strategy for patent texts.
Technology evolution analysis, new technology discovery, and patent search are all important patent analysis tasks. Madani et al. [23] used CiteSpace for bibliometric analysis and cluster analysis of keyword networks to analyze patent evolution problems. Park et al. [24] performed technology opportunity discovery by means of a comprehensive analysis of patent classification and collaborative filtering. Yanagihori et al. [25] created an extended dictionary of word meanings, and applied compound noun analysis to realize similar patent search.

Clustering algorithm
At present, the commonly used clustering methods include partition methods, hierarchical methods, density-based methods and graph clustering methods.
Partition-based clustering algorithms iteratively divide the data into clusters until points in the same cluster are close enough to each other while points in different clusters are far enough apart. Among them, the most classical partition-based clustering algorithm is k-means [26,27]. The k-means algorithm is simple and efficient, but it needs the number of clusters to be preset and it is sensitive to the selection of initial cluster centers.
Hierarchical clustering algorithms combine the nearest points into one cluster, and then combine the nearest clusters into one bigger cluster, until all points form one cluster. Although these methods do not need the number of clusters to be set in advance, their computational complexity is very high.
Density-based clustering algorithms can find clusters of irregular shape by density estimation, and decide whether to continue clustering according to whether the point density in a region exceeds a threshold. A classical algorithm is Density-Based Spatial Clustering of Applications with Noise (DBSCAN) [28]. Although this algorithm can cluster irregular shapes, its performance depends heavily on its parameters.
Based on graph theory, graph clustering algorithms regard data as points connected by edges in a graph space, and realize clustering by cutting the graph. A classic algorithm is spectral clustering [29]. The spectral clustering algorithm takes a similarity matrix as input and is easy to implement, but it is also sensitive to the initial parameter selection.

Overview
To introduce the prior knowledge network, a new model framework is designed in this study, as shown in Fig. 1.
First, the patent text is preprocessed to obtain candidate keywords. The accuracy of the identified candidate keywords directly affects the quality of the finally extracted keywords [30]. At the same time, the public dictionary data are also preprocessed. Then, each patent text is modeled as a TextRank network, and a prior knowledge network is constructed from the public dictionary data. Further, an improved node rank value evaluation formula is designed by incorporating the prior information in the prior knowledge network. Finally, the top-k nodes with the highest node rank values are extracted as patent keywords.

TextRank model
The TextRank model is a typical graph-based keywords extraction method inspired by PageRank. In this model, the text is modeled as an undirected weighted graph G = (V, E, W), in which the candidate keywords form the node set V, and the co-occurrence of two words in a sliding window is regarded as an edge in E. W represents the number of co-occurrences for each edge in E. Figure 2 shows an example of a patent TextRank network. All nodes in the graph have the same size, which means that all nodes have the same initial weight. The thickness of an edge represents the value of W, and thicker edges denote larger W values. It can be seen from Fig. 2 that the edge between "semiconductor device" and "feature" is the thickest, which indicates that these two words appear together in the same sliding window most frequently.
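Inspired by the PageRank principle, the iterative formula (1) is introduced to compute node weights:

$$WS(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} WS(V_j) \tag{1}$$

where $WS(V_i)$ is the weight of node $V_i$, $w_{ji}$ is the weight of the edge from $V_j$ to $V_i$, and d is the damping coefficient of the iterative computation, which is 0.85 by default. $In(V_i)$ represents the set of nodes pointing to $V_i$, and $Out(V_j)$ is the set of nodes pointed to by $V_j$. The formula shows that the weight of a node $V_i$ depends on the edge weight from $V_j$ to $V_i$ relative to the sum of the edge weights from node $V_j$ to all the nodes it points to.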
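The graph construction and the iterative weight computation described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes the standard TextRank update, in which each node receives (1 - d) plus d times the weighted share of its neighbors' scores, and the function names `build_cooccurrence_graph` and `textrank`, the toy token list, and the window size of 3 are all hypothetical.

```python
from collections import defaultdict


def build_cooccurrence_graph(words, window=7):
    """Undirected weighted graph: weights[u][v] = co-occurrence count."""
    weights = defaultdict(lambda: defaultdict(float))
    for i, u in enumerate(words):
        # Pair u with every later word that falls inside the same window.
        for v in words[i + 1:i + window]:
            if u != v:
                weights[u][v] += 1.0
                weights[v][u] += 1.0
    return weights


def textrank(weights, d=0.85, iterations=50):
    """Iterate the PageRank-style update; in an undirected graph In(V) == Out(V)."""
    scores = {node: 1.0 for node in weights}
    out_sum = {node: sum(nbrs.values()) for node, nbrs in weights.items()}
    for _ in range(iterations):
        new_scores = {}
        for i in weights:
            # Each neighbor j passes on the share w_ji / sum_k w_jk of its score.
            incoming = sum(weights[j][i] / out_sum[j] * scores[j] for j in weights[i])
            new_scores[i] = (1 - d) + d * incoming
        scores = new_scores
    return scores


# Toy candidate-keyword sequence (hypothetical, already preprocessed).
tokens = "semiconductor device feature layer semiconductor device feature gate".split()
graph = build_cooccurrence_graph(tokens, window=3)
ranks = textrank(graph)
top = sorted(ranks, key=ranks.get, reverse=True)[:3]
```

All nodes start from the same score, which mirrors the equal initial weights of the classical model; the improved model replaces exactly this uniform starting assumption with prior information.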
The TextRank model uses only the information in the text itself. This makes it suitable for general fields, but it ignores prior text characteristics that matter in some specific applications.

Prior knowledge network
Usually, the entries in a public dictionary have been carefully constructed by domain experts. They cover a wide range of fields and are authoritative. Based on these considerations, a directed prior knowledge network can be constructed from public dictionary data, in which nodes represent dictionary words and edges represent the explanatory relations among dictionary words.
A partial prior knowledge network is shown in Fig. 3. The node sizes in the network are set according to their in-degree values. We believe that the more often a node word is interpreted, the more likely it is to be a professional entry and the more specific its meaning is. It can be seen from the figure that the in-degree of the word "electronic computer" is very high, reflecting that this word has been explained more times and that its content is more specific.
When constructing network edges, we record the number of co-occurrences for the edge between two dictionary words. After computing the edge weights (the co-occurrence counts above) in the prior knowledge network, the node weights ($nw_{PKN}(i)$ for node i) can be further calculated using the PageRank iterative formula (1).
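The construction of the prior knowledge network and its node weights can be illustrated with a small sketch. Several points are assumptions made only for illustration: edges are taken to point from an interpretation word to the entry it helps explain (so that in-degree counts the words interpreting an entry), the dictionary entries are toy data, and the names `build_prior_network` and `node_weights` are hypothetical.

```python
from collections import defaultdict


def build_prior_network(entries):
    """entries: {headword: [interpretation words]}, already segmented and filtered.

    Assumption: an edge points from an interpretation word to the headword,
    and its weight counts how often the two words co-occur in entries.
    """
    weights = defaultdict(lambda: defaultdict(float))
    nodes = set(entries)
    for headword, explanation in entries.items():
        for word in explanation:
            nodes.add(word)
            if word != headword:
                weights[word][headword] += 1.0
    return nodes, weights


def node_weights(nodes, weights, d=0.85, iterations=50):
    """PageRank-style node weights on the directed, weighted network."""
    out_sum = {src: sum(targets.values()) for src, targets in weights.items()}
    incoming = defaultdict(list)
    for src, targets in weights.items():
        for dst, w in targets.items():
            incoming[dst].append((src, w))
    scores = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        scores = {
            n: (1 - d) + d * sum(w / out_sum[src] * scores[src]
                                 for src, w in incoming.get(n, []))
            for n in nodes
        }
    return scores


# Toy dictionary entries (hypothetical).
entries = {
    "electronic computer": ["machine", "calculation", "data"],
    "calculator": ["machine", "calculation"],
    "data": ["information"],
}
pkn_nodes, pkn_weights = build_prior_network(entries)
pkn_scores = node_weights(pkn_nodes, pkn_weights)
```

In this toy network, "electronic computer" is interpreted by the most words and therefore receives the largest node weight, which matches the intuition stated for Fig. 3.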

Prior keywords importance
In the traditional TextRank model, all candidate keywords are assigned the same initial importance value. In this study, we consider that some prior information about keywords can be obtained for a patent text from a public dictionary. First of all, we introduce the classical TF-IDF calculation method. Term frequency (TF) refers to the frequency with which a certain word appears in a given text document. The formula is as follows:

$$tf_{ij} = \frac{n_{ij}}{\sum_{k} n_{kj}} \tag{2}$$

where $n_{ij}$ represents the number of occurrences of the word $t_i$ in document $d_j$, and the denominator represents the total number of occurrences of all words in document $d_j$.
In [31], the Inverse Document Frequency (IDF) is introduced to describe how many documents in the corpus contain a certain word. If only a few documents contain a certain keyword i, the keyword has good discrimination ability and its IDF value is higher. The concrete formula for IDF is

$$idf_i = \log \frac{|D|}{1 + |\{ j : t_i \in d_j \}|} \tag{3}$$

where |D| represents the number of documents in the corpus. The denominator in formula (3) contains the number of documents that include the word $t_i$; 1 is added to prevent a meaningless zero value.
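The TF and IDF definitions above can be written out in a few lines. This is a minimal sketch: the natural logarithm is an assumption (the base is not stated in the text), and the function names `tf` and `idf` and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def tf(document):
    """Term frequency: occurrences of each word divided by all word occurrences."""
    counts = Counter(document)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def idf(corpus):
    """Inverse document frequency with 1 added to the document count."""
    n_docs = len(corpus)
    doc_freq = Counter(word for doc in corpus for word in set(doc))
    return {word: math.log(n_docs / (1 + df)) for word, df in doc_freq.items()}


# Toy corpus of tokenized documents (hypothetical).
corpus = [
    ["gate", "layer", "gate"],
    ["layer", "circuit"],
    ["layer", "signal"],
    ["layer", "antenna"],
]
tf_first = tf(corpus[0])
idf_all = idf(corpus)
tfidf_first = {word: tf_first[word] * idf_all[word] for word in tf_first}
```

Note that with the added 1, a word that occurs in every document (here "layer") gets a slightly negative IDF, so it is pushed below words that are specific to a few documents.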
Here, we believe that the information in the prior knowledge network can be used to define different initial importance values for candidate keywords. Intuitively, the words that appear in the definitions of other words tend to be more common words, so they are less likely to be keywords. In contrast, other words are more likely to be keywords.
Inspired by this idea, we introduce a node popularity for each node word in the TextRank network. Concretely, the node popularity $pd_i$ is defined by formula (4), where $nw_{PKN}(i, n)$ represents the weights of the n nodes associated with patent node word $v_i$ in the prior knowledge network. Following the above idea, the larger $pd_i$ is, the more often the patent node word $v_i$ explains other words, and so the lower the importance of $v_i$ is. Then, the prior importance of patent node word $v_i$ in document $D_j$ is introduced by formula (5). In other words, if a candidate word is likely to be a valid keyword, its prior importance should be higher.

Transfer factor computation
In the TextRank model, to compute node rank values, the edge weights, also called transfer factor values, should be computed according to the requirements of the practical problem. On the one hand, we consider that the transfer factor value on an edge is related to the co-occurrence frequency $f_{n'}$ of the two node words n and n′ connected by that edge.
On the other hand, we may use some prior information in the prior knowledge network to calculate the value of the transfer factor. Here, an associative memory strategy is employed. Associative memory [32] is the association of new information with known things, and it is a core function of the human brain. The learning process of the human brain [33] can be regarded as a process of generating, deleting and associating neurons, that is, a process of associative memory. So, if each entity in the prior knowledge network is regarded as a neuron, and a connection between two entities is regarded as an association relationship, we may calculate the associative relationship between two entities by calculating their connection strength. A concrete formula (6) is proposed in [34], where $\langle n_{ijk}, n'_{ijk} \rangle$ denotes a connection between two nodes n and n′ in the prior knowledge network, and Co represents the number of co-occurrences of the two node words in dictionary entries. $I_n$ and $I_{n'}$ respectively indicate the relative position index values in that sentence. Besides, M represents the maximum number of associative jumps allowed when connecting two nodes n and n′, and N is the total number of nodes in the prior knowledge network. Here, M = 2 is set as default to keep the computing complexity low.

Fig. 3 A part of prior knowledge network

In addition, in formula (6), if there is no valid associative access within M steps, it is considered that there is no valid association between the two node words, and $U^{p}_{n'}$ is set to 1/10. This means that, if two node words have prior associative memory information and also appear in the same sliding window in a given patent text, their transfer factor value should be higher. Finally, the transfer factor calculation of formula (7) is introduced into our improved TextRank model.
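The rule that two node words are treated as unassociated when no associative path of at most M jumps connects them can be sketched with a bounded breadth-first search. This is only an illustration under stated assumptions: the connection strength itself (formula (6) from [34]) is not reproduced here, so it is passed in as a hypothetical callable `strength`, and the helper names and the toy adjacency data are likewise hypothetical.

```python
from collections import deque


def within_m_steps(adjacency, source, target, max_jumps=2):
    """Return True if target is reachable from source in at most max_jumps edges."""
    frontier = deque([(source, 0)])
    seen = {source}
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_jumps:
            continue  # Do not expand beyond the allowed number of jumps.
        for neighbor in adjacency.get(node, ()):
            if neighbor == target:
                return True
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return False


def prior_association(adjacency, n, n_prime, strength, max_jumps=2, fallback=0.1):
    """Use the supplied strength function if a path exists, else the 1/10 fallback."""
    if within_m_steps(adjacency, n, n_prime, max_jumps):
        return strength(n, n_prime)
    return fallback


# Toy prior knowledge network as adjacency lists (hypothetical).
adjacency = {"a": ["b"], "b": ["c"], "c": ["d"]}
# Placeholder for the connection strength; NOT the formula from [34].
placeholder_strength = lambda n, n_prime: 0.8
near = prior_association(adjacency, "a", "c", placeholder_strength)
far = prior_association(adjacency, "a", "d", placeholder_strength)
```

Limiting the search to M = 2 jumps keeps the cost per word pair small, which is the stated reason for the default.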

Novel keywords rank value computation and extraction
Based on the above description, the improved node rank value is computed by formula (8), where d is the same damping coefficient as in the original TextRank model, $pi(v_i, D_l)$ denotes the prior importance computed in formula (5), and $W^{p}_{ij}$ represents the transfer factor value defined in formula (7).
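To make the roles of the prior importance and the transfer factor concrete, the sketch below shows one plausible way to combine them in a PageRank-style iteration and then pick the top-k node words. It should be read as an assumption, not as the improved formula of this study: here the normalized prior importance replaces the uniform (1 - d) term, and the transfer factors replace the plain co-occurrence weights. The names `prior_rank` and `top_k` and all numeric values are hypothetical.

```python
def prior_rank(transfer, prior, d=0.85, iterations=50):
    """transfer[j][i]: transfer factor on edge j -> i; prior[i]: prior importance.

    Assumed update: score_i = (1 - d) * prior_i / sum(prior)
                              + d * sum_j transfer_ji / sum_k transfer_jk * score_j
    Every node that appears in transfer must also appear in prior.
    """
    total_prior = sum(prior.values())
    out_sum = {j: sum(targets.values()) for j, targets in transfer.items()}
    scores = {node: 1.0 / len(prior) for node in prior}
    for _ in range(iterations):
        new_scores = {}
        for i in prior:
            incoming = sum(
                targets[i] / out_sum[j] * scores[j]
                for j, targets in transfer.items()
                if i in targets and out_sum[j] > 0
            )
            new_scores[i] = (1 - d) * prior[i] / total_prior + d * incoming
        scores = new_scores
    return scores


def top_k(scores, k):
    """Return the k node words with the highest rank values."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy transfer factors and prior importance values (hypothetical).
transfer = {
    "semiconductor device": {"feature": 3.0, "layer": 1.0},
    "feature": {"semiconductor device": 3.0, "layer": 1.0},
    "layer": {"semiconductor device": 1.0, "feature": 1.0},
}
prior = {"semiconductor device": 0.6, "feature": 0.1, "layer": 0.3}
scores = prior_rank(transfer, prior)
keywords = top_k(scores, 2)
```

In the toy data, "semiconductor device" and "feature" have identical graph positions, so any difference in their final scores comes only from the prior importance. This isolates the effect that the prior knowledge is meant to contribute.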
Based on the above discussion, the overall procedure of PrTextRank ends by adding the top-k node words to the keyword list and returning that list.
As typical representatives of semi-structured texts, patent texts contain not only structured information such as the application date and the IPC classification number, but also unstructured information such as the title indicating the technical focus, the abstract summarizing the technical contents and the claims revealing the detailed technical scope. In this study, the title, abstract and claims of patent texts are used.
In addition, this study uses the dataset from paper [35], which includes 953 Sohu sports news texts. At the same time, 953 texts were collected from Foodbk [36] as another text set. In total, the 1906 texts of these two types are taken as dataset II.

Experiments and analysis
First, the experimental datasets and performance evaluation indices are introduced. Then, the performance results of four keyword extraction methods under three clustering methods are reported. Finally, the applicability of PrTextRank is further tested on another dataset.

Datasets
The experimental patent text corpus is provided by Changzhou Baiteng Technology Company. According to the IPC classification standard, we use three categories of patent texts under the electrical category, each of which has 1,000 patent texts; a total of 3,000 patent texts are used as dataset I.
The construction of our prior knowledge network is based on the Chinese Dictionary [37], which contains items covering various fields together with their explanatory information. When constructing the prior knowledge network, the interpretation words are preprocessed by word segmentation and stop word removal, and then the network is constructed from the relationships between entry items and their interpretation words.

Algorithm settings
In order to examine the performance of our proposed method, we use the extracted keywords to represent patent texts and to perform cluster analysis. If the extracted keywords represent the patent texts well, high-quality clustering results will be obtained. Three clustering methods, K-means [27], DBSCAN [28] and classical spectral clustering [29], are used for performance analysis.
In our experiments, the sliding window size is set to 7, the number of iterations for computing rank values is set to 50, the damping coefficient is set to 0.85 by default, and the number of TopicalPageRank topics is 5. Because dataset I has 3 categories, the number of clusters for K-means is set to 3. Spectral clustering uses k-nearest neighbors to build the similarity matrix, with n_neighbors set to 100 and a final number of clusters of 3. In DBSCAN clustering, ϵ is 1 and MinPts is 4.

Evaluation indices
For performance evaluation, three common evaluation indices are used: Precision (P), Recall (R) and F1-Score (F1) [38]. They can be written as follows:

$$P = \frac{1}{k} \sum_{i=1}^{k} \frac{TP_i}{TP_i + FP_i}$$

$$R = \frac{1}{k} \sum_{i=1}^{k} \frac{TP_i}{TP_i + FN_i}$$

$$F1 = \frac{2 \times P \times R}{P + R}$$

where k represents the number of clusters, and $TP_i$, $FN_i$ and $FP_i$ represent the numbers of true positives, false negatives and false positives of category i, respectively. $TP_i + FP_i$ represents the number of samples predicted to be positive, and $TP_i + FN_i$ represents the actual number of positive samples.
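The three indices can be computed from the true categories and the predicted cluster labels as sketched below. Two assumptions are made for illustration: precision and recall are averaged over the k categories (macro averaging) and F1 is then derived from the averaged values, and the cluster labels have already been mapped to category labels. The function name `macro_scores` and the toy labels are hypothetical.

```python
def macro_scores(true_labels, predicted_labels):
    """Return (precision, recall, f1) averaged over the categories."""
    pairs = list(zip(true_labels, predicted_labels))
    categories = sorted(set(true_labels))
    precisions, recalls = [], []
    for c in categories:
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        # Guard against empty denominators for categories never predicted.
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    k = len(categories)
    precision = sum(precisions) / k
    recall = sum(recalls) / k
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy example with three categories (hypothetical).
true_labels = [0, 0, 1, 1, 2, 2]
predicted_labels = [0, 0, 1, 2, 2, 2]
p, r, f = macro_scores(true_labels, predicted_labels)
```

For the toy labels, category 1 loses one sample to category 2, which lowers the recall of category 1 and the precision of category 2 while leaving category 0 untouched.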

Results and analysis
Here, the clustering performance obtained with full-text words and with keywords is compared first. Next, the performance of PrTextRank under different algorithm settings is examined. Then, the performance of PrTextRank is compared with that of related methods. Finally, the keywords extraction performance of the compared methods on non-patent documents (dataset II) is reported.

Validity analysis of keywords clustering
To verify the validity of using keywords for text clustering, we use full-text words and PrTextRank keywords as two experimental conditions, under which three clustering algorithms are considered. In addition, the baseline Convolutional Neural Network (CNN) mentioned in [39] is also considered as a comparison method. For the CNN, the 3000 patent texts in dataset I are divided into training data and test data in proportions of 80% and 20%. A total of 12,100 features are extracted from the training and test data as CNN input, and each feature is represented as a 200-dimensional vector. In our experiments, the number of filters is set to 128, the filter window size is 3, and the convolution stride is 1. Max-pooling is selected in the pooling layer, so that the largest feature item is retained. To prevent over-fitting, dropout is applied. The vector dimension is compressed by adding a flatten layer. Finally, through two dense layers, the vector length is reduced to 3, corresponding to the three types of patent texts.
The corresponding results are shown in Fig. 4. As shown in Fig. 4, the clustering performance using keywords is slightly lower than that using full-text words. However, this performance loss is very small. Besides, the result obtained by CNN classification reaches the maximum value of 84.5%, which is 2.7% higher than the maximum value obtained by the PrTextRank keywords method with spectral clustering. However, the CNN method is supervised and needs training in advance, while PrTextRank is an unsupervised method.
Therefore, effective text clustering can be performed by using a few keywords to represent the whole patent text.

Performance analysis with different keywords number
Here, we use k-means clustering to compare the clustering performance under different numbers of keywords. The performance results are shown in Fig. 5. Because only the title, abstract and claims of our experimental patent documents are used, we set the number of extracted keywords to range from 3 to 9. From Fig. 5, PrTextRank achieves its best performance when the number of keywords is 6. As the number increases up to 6, the performance also increases, while the performance decreases when the number becomes larger. A likely explanation is that additional keywords are redundant for representing a patent text and thus reduce the clustering performance. In contrast, the best keyword number for TextRank and TF-IDF is 7 according to Fig. 5. In addition, PrTextRank obtains higher performance than the two compared methods.

Performance influence under different algorithm settings
Here, Tables 1, 2 and 3 show the contribution of different feature settings in PrTextRank under different clustering methods. The basic model is TextRank; TextRank + $P_i(V_i, D_j)$ indicates the influence of the prior importance, and TextRank + $W_{n'}$ indicates the influence of the improved transfer factor.
The experimental results show that the two newly designed algorithm strategies are very effective. Specifically, after adding the prior importance of nodes to the classical TextRank model, the precision values increase by 6.16%, 6.26% and 5.79%, respectively. The recall values increase by 1.8%, 2.98% and 6.48%. The F1-Score values increase by 3.88%, 4.68% and 4.48%, respectively. These results suggest that effective patent text keywords tend to be professional rather than common words, and that the considerations introduced earlier into the PrTextRank model are reasonable.
Furthermore, the newly designed transfer factor turns out to be more effective than the prior importance of nodes. This result is consistent with the inherent idea of TextRank, that is, the importance of a node depends on the contributions of its adjacent nodes, and each contribution depends mainly on the edge weight between the two adjacent nodes.
Finally, when the two strategies based on the prior knowledge network (prior node importance and improved transfer factor) are used together, the clustering performance can be further improved. These results demonstrate that the two new considerations in PrTextRank are reasonable and effective.

Performance analysis with related methods
Here, the performance of PrTextRank is further compared with TF-IDF, classical TextRank and TopicalPageRank (abbreviated as TPR). The corresponding results are given in Table 4, in which the average results of 10 runs are reported.
For the F1-Score index, the TextRank method is 2.37%, 4.64% and 2.12% higher than TF-IDF under the three clustering methods, respectively. The performance of TopicalPageRank is clearly better than that of TextRank. In particular, under spectral clustering, the F1-score obtained by TopicalPageRank is 17.22% and 15.1% higher than those of TF-IDF and TextRank, respectively.
However, the F1-scores of PrTextRank are higher still than those of TopicalPageRank, by 5.88%, 7.87% and 3.6% under the three clustering algorithms, respectively. A possible reason is that TopicalPageRank can cover most text topics, whereas a patent text has fewer topics and more professional terms, so using prior knowledge can effectively enhance the weights of professional terms.

Complexity analysis
Compared with traditional TextRank, PrTextRank additionally has to construct a prior knowledge network, which may increase the computational complexity. So, the running cost of PrTextRank and TextRank is further examined on dataset I. For a document containing 51 words, the running time and memory usage of the TextRank method are 0.38 s and 580.0 KB, while the corresponding values for the PrTextRank method are 1.26 s and 1008.0 KB. For 3000 documents containing 12,100 words, the running time and memory usage of the TextRank method are 990.21 s and 79,376.0 KB, while the corresponding values for the PrTextRank method are 3113.59 s and 188,280.0 KB. According to these results, PrTextRank requires extra running time and memory for constructing and using the prior knowledge network. However, considering the clear improvement in performance, this extra computational expense is worthwhile.

Extended experimental analysis
To further verify the performance of PrTextRank, dataset II is used for a general text keywords extraction problem. The corresponding results are shown in Fig. 6.
The results in Fig. 6 show that PrTextRank also achieves good performance on dataset II. Compared with TF-IDF and TextRank, PrTextRank improves the performance by 4.92%, 8.16% and 10.98%, respectively. Compared with the TopicalPageRank method, PrTextRank also improves the performance by 0.44%, 1.68% and 2.78%, respectively. These results are consistent with the experimental results on the patent text dataset. In addition, because dataset II contains sports news texts and popular food texts with completely non-overlapping topics, the keywords extracted by TopicalPageRank can effectively distinguish the different texts. Even so, our proposed method still obtains better results than TopicalPageRank, which reflects the effectiveness of introducing the prior public knowledge network.

Extended analysis with latent semantic index
Keyword-based document retrieval is the simplest and most common retrieval method. However, for many documents with complex structure and content, relying only on keyword matching seems insufficient for document indexing. So, we should also pay attention to the latent semantic information in documents. Latent Semantic Indexing (LSI), proposed by Dumais et al. [40], is a powerful tool for document indexing that finds latent correlations between document words. In more specific applications, LSI may be used for complex document indexing tasks based on the keywords extracted by PrTextRank.

Conclusion and future work
In this study, we proposed a new unsupervised keywords extraction method by introducing prior public knowledge into the traditional TextRank model. Two aspects of prior information are introduced: the prior node importance and an improved transfer factor computing strategy. Our experimental results indicate that the proposed method obtains better performance than traditional methods on patent text keywords extraction problems. In addition, the proposed method does not need additional parameter settings, and can be widely used in different applications.
For future work, the knowledge representation of words and the network construction methods could be further studied. We also plan to explore how to integrate more multi-dimensional heterogeneous information into the evaluation of keyword rank values.

Algorithm (final steps of the PrTextRank procedure): add the top-k keywords to KeywordsList; return KeywordsList.

Dataset I: three categories of patent texts under the electrical category, each of which has 1,000 patent texts, for a total of 3,000 patent texts.

Fig. 4 F1-Score performance results for validity analysis of keywords clustering

Table 1 K-means clustering performance under different algorithm settings. The best results are highlighted in bold

Table 2 DBSCAN clustering performance under different algorithm settings. The best results are highlighted in bold

Table 3 Spectral clustering performance under different algorithm settings