AttenSy-SNER: software knowledge entity extraction with syntactic features and semantic augmentation information

The software knowledge community contains large-scale software knowledge entity information with complex structure and rich semantic correlations. Recognizing and extracting software knowledge entities from the community is significant, as it directly affects entity-centric tasks such as software knowledge graph construction, software document generation and expert recommendation. Since the texts of the software knowledge community are unstructured user-generated content, traditional entity extraction methods are difficult to apply in this domain because of entity variation, entity sparsity, entity ambiguity, out-of-vocabulary (OOV) words and the lack of annotated data sets. This paper proposes a novel software knowledge entity extraction model, named AttenSy-SNER, which integrates syntactic features and semantic augmentation information to extract fine-grained software knowledge entities from unstructured user-generated content. The input representation layer uses the Bidirectional Encoder Representations from Transformers (BERT) model to extract the feature representation of the input sequence. The contextual encoding layer leverages a Bidirectional Long Short-Term Memory (BiLSTM) network and a Graph Convolutional Network (GCN) to capture contextual information and syntactic dependency information, and a semantic augmentation strategy based on the attention mechanism is introduced to enrich the semantic feature representation of sequences. The tag decoding layer uses Conditional Random Fields (CRF) to model the dependencies between output tags and obtain the globally optimal label sequence. Comparative experiments show that the proposed model outperforms benchmark models in the software engineering domain.


Introduction
As a mainstream software knowledge community, StackOverflow serves as a platform for exchanging and sharing software knowledge related to software programming, configuration management and project organization, and it has gradually developed into an important knowledge base for software developers [1]. On this platform, software developers can search for information about specific software knowledge entities (such as libraries, APIs and bug exceptions) and post questions, answers and comments according to their individual requirements, which helps them understand software-specific entities and solve problems in the development process.
So far, about 21 million software-related questions have been posted on StackOverflow, and these Q&A texts contain a large number of specific software knowledge entities with complex structure and rich semantic associations. Traditional text processing technologies based on keywords and topic models treat software knowledge entities as plain text, neglect the domain features of software knowledge community texts, and cannot satisfy software developers' need to acquire intensive software knowledge. Acquiring software knowledge accurately and efficiently from the software knowledge community is therefore an imperative challenge in software knowledge management.
As a method of knowledge representation, a knowledge graph is often applied to model entities, concepts and semantic relationships; it can enhance the expression of knowledge organization structure and enable users to process information quickly, accurately and intelligently [2]. Extracting software knowledge entities and the semantic relationships between them from user-generated texts in the software knowledge community, and then constructing a software knowledge graph, will promote entity-centered applications such as intelligent question answering, software document generation, expert recommendation and software reuse.
Entity extraction aims to recognize mentions of rigid designators in text belonging to pre-defined semantic types such as person, location and organization, and it plays an essential role in knowledge graph construction and natural language understanding [3]. In this paper, software knowledge entity extraction refers to recognizing and extracting software-specific entities (such as programming languages, software development libraries and software projects) from massive unstructured texts in the knowledge community and classifying them into pre-defined categories. Because of error propagation, the accuracy and efficiency of software knowledge entity extraction is crucial for subsequent software entity relationship extraction, software knowledge graph construction and application.
The user-generated content in the software knowledge community StackOverflow is unstructured short text with the following common characteristics: (1) the lack of unified programming language specifications and strict spelling rules produces many spelling mistakes and abbreviations, resulting in the problem of name variation; for example, the software knowledge entity "JavaScript" has multiple variants generated by abbreviation, such as "JS" and "javascript". (2) Many software-specific entity names are common words, resulting in the entity sparsity problem. (3) The same software knowledge entity can belong to different entity types in different contexts, resulting in entity ambiguity; for example, the entity "Mac" can be labeled either "PlatCOS" (operating system) or "SLMDL" (mobile development library). (4) Some rare, distinctive software knowledge entities appear as OOV words that are unrecognizable.
Compared with finance, law, biomedicine and other domains, entity extraction research in the software engineering domain lacks corresponding resources and technologies, and faces huge challenges such as entity variation, entity sparsity, entity ambiguity, many OOV words and the lack of annotated data sets.
In view of the above challenges, and by comprehensively considering word, syntactic, entity context and semantic features, we propose a novel software knowledge entity extraction model that integrates syntactic features and semantic enhancement information. Different from current entity extraction methods in the general domain, this model improves the input feature representation and context encoding based on the BERT model, BiLSTM and GCN. The main contribution of this paper is a novel software knowledge entity extraction model designed around the domain features of software knowledge community text: the model applies a GCN to encode the syntactic dependencies of sequences, and a semantic enhancement strategy based on the attention mechanism is proposed to enhance the semantic representation of words in the sequence.

Related work
Entity extraction is a classic sequence labeling task, which can be formally defined as follows: given an input sequence $X = \{x_1, x_2, \ldots, x_n\}$, an entity extraction model outputs a list of triples $\{Y_s, Y_e, t\}$. Each triple in the list represents a named entity of the input sequence $X$, with $Y_s$ the starting index of the entity, $Y_e$ the ending index, and $t$ a pre-defined entity type. At present, entity extraction models and methods are mainly divided into three categories: rule-based and dictionary-based methods, machine learning-based methods and deep learning-based methods [4]. Rule-based and dictionary-based methods rely on manual feature selection and domain dictionaries [5,6]; they usually show high precision (P) but low recall (R) and poor domain transfer performance. With advances in machine learning, Hidden Markov Models (HMM) [7], Maximum Entropy Models (MEM) [8], Support Vector Machines (SVM) [9] and Conditional Random Fields (CRF) [10] have been widely used in entity extraction tasks. Compared with rule-based and dictionary-based methods, these methods are more adaptable and do not require extra linguistic knowledge, but they rely on feature engineering and a large amount of annotated data: the quality of feature extraction and the size of the annotated data determine the generalization ability of the algorithm, and both are difficult to obtain in practice. Recently, deep learning models such as the Convolutional Neural Network (CNN), Long Short-Term Memory network (LSTM) and GCN have been widely used in entity extraction tasks and have achieved promising performance [11,12]. Compared with other methods, deep learning-based methods adopt an end-to-end approach that automatically learns features from text without additional feature engineering, so they have been applied to entity extraction tasks in various domains.
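As a minimal illustration of this formulation, consider the sentence that reappears later in Fig. 3; the type labels below are hypothetical placeholders, not the paper's actual tag set:

```python
# Sentence: "How to convert a TIFF to a JPG with ImageMagick?"
tokens = ["How", "to", "convert", "a", "TIFF", "to", "a", "JPG",
          "with", "ImageMagick", "?"]

# Entity extraction output: a list of (Y_s, Y_e, t) triples over token indices.
# Type labels here are illustrative placeholders only.
entities = [
    (4, 4, "Standard"),   # "TIFF"
    (7, 7, "Standard"),   # "JPG"
    (9, 9, "Tool"),       # "ImageMagick"
]
```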
For entity extraction tasks in the software knowledge community domain, Ye et al. [13] divided software knowledge entities into five categories: programming languages, platforms, APIs, tools-libraries-frameworks and software standards, and proposed a software-specific named entity recognition method for software engineering social content based on semi-supervised learning. The model achieved better performance through orthographic features, lexical and contextual features, word bitstring features and gazetteer features. Zhao et al. [14] proposed HDSKG, a relational triple extraction framework for the software engineering domain that incorporates a dependency parser with a rule-based method. In this model, an SVM classifier evaluates the domain relevance of candidate relational triples by combining text features, corpus features, concept features and source features; a software engineering knowledge graph with 35,279 relational triples, 44,800 concepts and 9660 unique verb phrases was constructed. Addressing the problem that the HDSKG framework does not fully consider the features of entity concepts and term phrases, Guo et al. [15] proposed an extraction strategy for Wiki pages in the software engineering domain: first, a domain dictionary was built from page titles; second, rules were designed according to the characteristics of software engineering concepts; finally, the domain dictionary was used to improve the precision of entity recognition. These earlier studies on information extraction in the software domain belong to the rule-based and dictionary-based or machine learning-based categories, and their performance relies on manual feature extraction and large amounts of annotated data. Among deep learning-based methods, Reddy et al. [16] proposed a named entity recognition model based on BiLSTM + CRF, classifying entities in the software engineering domain into 22 entity types. For multi-source data such as unstructured data, semi-structured data and code, Lv et al. [17] extracted multi-source software knowledge entities using BiLSTM + CRF, template matching and abstract syntax trees, and used TF-IDF, TextRank and K-means to address the lack of annotated data. Tabassum et al. [18] combined the question-and-answer text of Stack Overflow to construct a named entity recognition corpus for the computer programming domain, consisting of 15,372 sentences annotated with 20 fine-grained entity types, and proposed an entity recognition model based on the attention mechanism, which improved code entity recognition.
In summary, software knowledge entity extraction and software knowledge graph construction have attracted the attention of many scholars and achieved substantial progress. However, because of entity variation, entity sparsity, entity ambiguity, many OOV words and the lack of annotated data sets in the software knowledge community, the task of software knowledge entity extraction has not yet been well solved.

Proposed method
Aiming at the above problems and challenges, we propose a software knowledge entity extraction model, named AttenSy-SNER, which integrates syntactic features and semantic augmentation information to extract fine-grained software knowledge entities from unstructured user-generated content. The AttenSy-SNER model is divided into an input representation layer, a context encoding layer and a tag decoding layer. The overall architecture is shown in Fig. 1.
First, the AttenSy-SNER model uses the pre-trained BERT model for unsupervised learning on massive software knowledge community texts to generate word vector representations for the software engineering domain and enrich the input feature representation for software knowledge entity extraction. Second, since the context information and syntactic dependencies of sentence sequences play a key role in entity extraction, the model uses BiLSTM and GCN to encode the context information and syntactic dependencies of sequences and obtain a multi-feature context representation, which effectively alleviates entity ambiguity. Meanwhile, to address entity variation, entity sparsity and out-of-vocabulary words in software knowledge community text, a semantic enhancement strategy based on the attention mechanism is proposed to enhance the semantic representation of words in the sequences, and the final feature vector representation of entities is obtained by feature fusion. Finally, a CRF model is used as the tag decoder to model the dependencies between output tags, and the globally optimal tag sequence is obtained.

Input representation layer based on BERT model
The input representation layer of the model transforms the sentence sequences of software knowledge community text into low-dimensional, dense distributed vector representations, which are fed to the next layer of the model to obtain context features. Relevant literature [19-21] shows that rich domain features obtained in the input representation layer help improve the performance of domain-specific entity extraction models. As a pre-trained language model, BERT can learn a large amount of prior lexical, syntactic and domain information for downstream tasks through unsupervised training on a large corpus, helping to generate dynamic word vectors conditioned on the current context and thus improving the semantic disambiguation ability of the model. In terms of architecture, BERT uses a bidirectional Transformer encoder combined with the attention mechanism to learn the context of the current word and obtain a better distributed representation. In terms of input representation, the input of BERT consists of Token Embeddings, Segment Embeddings and Position Embeddings. Token Embeddings convert words into 768-dimensional vector representations, with each sequence starting with [CLS] and ending with [SEP]; Segment Embeddings represent sentence-level feature vectors for downstream sentence-level classification tasks; Position Embeddings encode the position information of words into vector representations, adding the key sequential characteristic to the sequence data [22]. In terms of pre-training tasks, BERT obtains character-level, word-level, sentence-level and inter-sentence relationship feature representations through the joint training of the Masked Language Model (MLM) and Next Sentence Prediction (NSP). The MLM task randomly masks 15% of the words and predicts them from context: of these, 80% are replaced with [MASK], 10% are randomly replaced with other words and 10% remain unchanged. The NSP task trains the model to understand the relationship between sentences by predicting whether one sentence follows another, enhancing the model's understanding of sentence-level tasks.
To obtain high-quality word vector representations of the input sequence and improve the performance of entity extraction from software knowledge community text, we use BERT to pre-train on the massive question-and-answer text of StackOverflow and take the pre-trained word vectors as the input of the model. During the training of the AttenSy-SNER model, each word in the input sequence is represented by its corresponding word vector obtained by querying the BERT pre-trained vectors, and is then fed into the model.
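A minimal sketch of this lookup using the Hugging Face transformers API, rather than the authors' own pre-training pipeline; the generic bert-base-uncased checkpoint stands in for their domain-adapted StackOverflow model:

```python
import torch
from transformers import BertTokenizer, BertModel

# Stand-in checkpoint; the paper uses BERT pre-trained on StackOverflow text.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
bert.eval()

sentence = "How to convert a TIFF to a JPG with ImageMagick?"
inputs = tokenizer(sentence, return_tensors="pt")  # adds [CLS] and [SEP]

with torch.no_grad():
    outputs = bert(**inputs)

# (1, seq_len, 768): one contextual 768-dimensional vector per subword token,
# used as the input representation for the downstream encoder layers.
token_vectors = outputs.last_hidden_state
print(token_vectors.shape)
```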

Context encoder layer based on multi-feature fusion
For the fine-grained entity classification task of software knowledge, the acquisition of context information and syntactic dependency can alleviate the problem of entity ambiguity caused by different contexts. Therefore, BiLSTM and GCN are used to encode the context information and syntactic dependency of sentence sequences, respectively.

Context feature encoder based on BiLSTM model
In the task of software knowledge entity extraction, the data are sentence sequences from the software knowledge community, which are suitable for modeling with a Recurrent Neural Network (RNN). The RNN model has the characteristics of parameter sharing and memory: it combines the output of the previous time step with the input of the current time step to determine the output of the current time step. Nevertheless, when processing long sequences, the RNN model suffers from the vanishing gradient and exploding gradient problems and cannot capture long-distance dependencies. The LSTM model proposed by Hochreiter [23] alleviates these problems by introducing a memory cell and a gating mechanism to process long-distance information. The structure of the LSTM recurrent unit is shown in Fig. 2.
The formal representation of the model at time $t$ is as follows [23]:

$$i_t = \sigma(W_i[h_{t-1}; x_t] + b_i)$$
$$f_t = \sigma(W_f[h_{t-1}; x_t] + b_f)$$
$$o_t = \sigma(W_o[h_{t-1}; x_t] + b_o)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c[h_{t-1}; x_t] + b_c)$$
$$h_t = o_t \odot \tanh(c_t)$$

Here $\sigma$ and $\tanh$ are the nonlinear activation functions of the neurons, $i_t$ is the input gate state, $f_t$ is the forget gate state, $c_t$ is the cell state, $o_t$ is the output gate state, $x_t$ is the input vector, $h_t$ is the hidden state vector, $W$ denotes the interlayer weight matrices and $b$ denotes the bias vectors of the neurons.
According to the structure of the LSTM model, at time $t$ the LSTM captures the preceding information of the current sentence sequence through the input gate, forget gate, cell state and output gate. However, it lacks the following context, which also plays an important role in software knowledge entity extraction. Therefore, two LSTMs running in opposite directions are combined into a bidirectional LSTM, which simultaneously captures the context of the sentence sequence on both sides of time $t$; the final output combines the outputs of the forward and backward LSTMs. At time $t$, the hidden state of the forward LSTM is

$$\overrightarrow{h}_t = \mathrm{LSTM}(x_t, \overrightarrow{h}_{t-1})$$

where $\overrightarrow{h}_t$ and $x_t$ are the output and input at time $t$, and $\overrightarrow{h}_{t-1}$ is the hidden state at time $t-1$. The hidden state of the backward LSTM is

$$\overleftarrow{h}_t = \mathrm{LSTM}(x_t, \overleftarrow{h}_{t+1})$$

where $\overleftarrow{h}_{t+1}$ is the hidden state at time $t+1$. The total output of the bidirectional LSTM at time $t$ is the concatenation

$$h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]$$
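A minimal PyTorch sketch of this context encoder; the 768-dimensional inputs and 200 hidden units per direction follow the paper's settings, everything else is illustrative:

```python
import torch
import torch.nn as nn

# Context encoder: BiLSTM over BERT token vectors.
bilstm = nn.LSTM(input_size=768, hidden_size=200,
                 num_layers=1, batch_first=True, bidirectional=True)

bert_vectors = torch.randn(1, 11, 768)  # (batch, seq_len, 768), e.g. from BERT
h, _ = bilstm(bert_vectors)             # (batch, seq_len, 400)
# h[:, t] concatenates the forward and backward hidden states at time t.
print(h.shape)
```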

Syntactic feature encoder based on GCN model
Syntactic features describe the structure of a sentence and the dependencies between its words, obtained through syntactic analysis techniques. Syntactic dependency features in particular describe the local components of sentence sequences and the long-distance dependencies between words. Obtaining the dependency features between words in a sentence sequence can therefore capture the dependency relationships and long-distance dependencies between entities, which helps to recognize potential entities. Compared with the general domain, the data for software knowledge entity extraction in this paper come from the software knowledge community, which contains a large number of phrase entities and long-distance dependencies, so extracting the dependency relations between words in sentence sequences is of great significance. GCN is a convolutional neural network for graph-structured data; by modeling graph nodes and edges, it can capture the dependencies between nodes, and it has gradually been applied to natural language processing tasks such as text classification [24], semantic role labeling [25], relation extraction [26] and machine translation [27]. Since a standard LSTM network cannot model the dependency structure of sentence sequences in software knowledge community text, this paper uses a GCN to encode the dependency graph of the sentence sequence, treating each word as a node; each node generates its feature vector representation by aggregating the information of its neighboring nodes. The input of the GCN-based syntactic feature encoder consists of two parts: the output of the BiLSTM-based context feature encoder and the adjacency matrix constructed by syntactic dependency analysis. The adjacency matrix of the sentence sequence is constructed from a syntactic dependency graph obtained in advance; in this paper we use the StanfordCoreNLP tool to parse the syntactic dependencies of sentence sequences in software knowledge community texts. For example, for the sentence sequence "How to convert a TIFF to a JPG with ImageMagick?", the dependency graph obtained by syntactic dependency analysis is shown in Fig. 3.
In this sentence sequence, the entities are "TIFF", "JPG" and "ImageMagick", and the head word is "convert"; the dependency relation between "TIFF" and the head word "convert" is "obj", indicating that the entity "TIFF" is the object of the predicate "convert".
After the dependency graph of the sentence sequence is constructed, it can be transformed into an adjacency matrix, formally described as follows: for the sentence sequence $X = (x_1, x_2, \ldots, x_t)$, each word is a node; if there is a dependency relation between node $x_i$ and node $x_j$, there is an edge between them and $A_{ij} = 1$, which yields the adjacency matrix $A$. The adjacency matrix of the example sentence above is derived in this way from the dependency graph in Fig. 3. In the $l$-th layer of the GCN syntactic feature encoder, node $i$ aggregates the features of its neighboring nodes through the graph convolution operation to produce its output feature vector:

$$h_i^{(l)} = \mathrm{ReLU}\Bigl(\sum_{j=1}^{t} A_{ij} W^{(l)} h_j^{(l-1)} + b^{(l)}\Bigr)$$

where $\mathrm{ReLU}$ is the nonlinear activation function, $h_i^{(l-1)}$ is the input of node $i$ at the $l$-th layer, $h_i^{(l)}$ is its output, $W^{(l)}$ is the weight matrix and $b^{(l)}$ is the bias vector.
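A minimal sketch of one such graph-convolution layer over a dependency adjacency matrix; the edge list below is illustrative, not the actual parse of Fig. 3:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepGCNLayer(nn.Module):
    """One graph-convolution layer over a syntactic dependency graph."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, h, adj):
        # h:   (batch, seq_len, in_dim)  node features (e.g. BiLSTM outputs)
        # adj: (batch, seq_len, seq_len) dependency adjacency matrix
        return F.relu(adj @ self.linear(h))

seq_len = 11
# Hypothetical undirected dependency edges plus self-loops (not the real parse).
edges = [(2, 4), (2, 7), (2, 9)]
adj = torch.eye(seq_len)
for i, j in edges:
    adj[i, j] = adj[j, i] = 1.0

gcn = DepGCNLayer(400, 400)        # 400 = BiLSTM output dimension
h = torch.randn(1, seq_len, 400)
out = gcn(h, adj.unsqueeze(0))     # (1, seq_len, 400)
print(out.shape)
```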

Semantic enhancement strategy based on attention mechanism
Problems such as entity variation, entity sparsity and out-of-vocabulary words in software knowledge community texts affect the quality of software knowledge entity extraction. Especially with fine-grained entity types, traditional methods based on domain dictionaries or external features introduce noise and error propagation into the extraction results, leading to labeling errors for software knowledge entities.
Relevant literature [28] shows that enhancing the semantic feature representation of entities in a specific domain can effectively alleviate the entity variation, entity sparsity and out-of-vocabulary word problems and improve the accuracy of software knowledge entity extraction. To alleviate these problems, a semantic enhancement strategy based on the attention mechanism is introduced in the context encoding layer of the model to enhance the semantic representation of entities in the sentence sequence. It consists of three steps: similar word extraction based on semantic consistency, semantic contribution weight calculation based on the attention mechanism, and feature vector fusion. The strategy is described in Algorithm 1.
As shown in Algorithm 1, the process is as follows: first, pre-trained word vectors and a software engineering thesaurus are introduced as auxiliary resources, and semantically consistent domain words are extracted through semantic similarity calculation. Then, the attention mechanism is used to calculate the semantic contribution weight of each similar domain word to the target word. Finally, the final vector representation of the word is obtained by fusing the semantically enhanced vector representation.
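Since the original Algorithm 1 figure is not reproduced here, the following is a minimal Python sketch of the three steps under stated assumptions: random vectors stand in for similar-word lookups against the pre-trained vectors and thesaurus (step 1), plain dot-product attention with a softmax is used for the weights (formalized in the subsections below), and fusion is modeled as concatenation:

```python
import torch
import torch.nn.functional as F

def semantic_enhancement(h_i, similar_vecs):
    """Attention-weighted semantic enhancement for one target word.

    h_i:          (d,)   context hidden vector of the target word x_i
    similar_vecs: (n, d) word vectors c_1..c_n of its n similar words
    Returns the fused representation of h_i and the enhancement vector a_i.
    """
    # Step 2: semantic contribution weights via dot-product attention.
    scores = similar_vecs @ h_i          # (n,)
    alpha = F.softmax(scores, dim=0)     # (n,)

    # Step 3: the weighted sum gives the semantic enhancement embedding a_i,
    # which is fused (here: concatenated) with the context vector h_i.
    a_i = alpha @ similar_vecs           # (d,)
    return torch.cat([h_i, a_i], dim=0)

# Step 1 (assumed interface): similar words for "PHP" would be retrieved from
# the pre-trained vectors / thesaurus; random vectors stand in for them here.
h_php = torch.randn(400)
similar = torch.randn(5, 400)  # e.g. JavaScript, Ruby, Groovy, Cython, Sinatra
fused = semantic_enhancement(h_php, similar)
print(fused.shape)             # torch.Size([800])
```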

Similar word extraction based on semantic consistency
External auxiliary resources play an important role in similar word extraction and affect the quality of the semantically enhanced vector representation. On the one hand, in the input representation layer of the AttenSy-SNER model, we use the BERT model to pre-train on 43,594,128 sentence sequences from software knowledge community texts and obtain pre-trained word vectors for the software engineering domain. These pre-trained word vectors contain rich semantic information about software knowledge entities and can be used as an auxiliary corpus for similar word extraction. On the other hand, some research works [29,30] focus on constructing thesauri for the software engineering domain, building them automatically through unsupervised learning from StackOverflow and Wikipedia and realizing functions such as abbreviation recognition and similar word recognition for the software engineering domain.
To improve the quality of similar word extraction and ensure semantic consistency, this paper comprehensively considers similar words extracted from the BERT pre-trained word vectors and the software engineering thesaurus Sethesaurus [29] as auxiliary resources for semantic enhancement. For example, for the word "PHP" in the input sentence "My PHP Version is 7.1", similar words such as "JavaScript", "Ruby", "Groovy", "Cython" and "Sinatra" are extracted by semantic similarity calculation. These similar words are relevant to the software domain and are used as auxiliary resources to enhance the semantic representation of the word "PHP".

Semantic contribution weight calculation based on attention mechanism
Since similar words extracted in different contexts contribute differently to the semantics of the target word, this paper uses the attention mechanism to assign weights according to the semantic contribution of each similar word, while taking the semantic consistency of the similar words into account.
The essence of the attention mechanism is to focus selectively on important information; it is a selection mechanism for allocating information processing capacity. Specifically, it can be described as a mapping from a query in the target data to a series of key-value pairs in the source data: by calculating the similarity between the Query and each Key, the weight of the corresponding Value is obtained [31,32]. Based on the attention mechanism, given a sentence of software knowledge community text $X = (x_1, x_2, \ldots, x_t)$, for each word $x_i \in X$ we extract $n$ similar words $S_i = (s_1, s_2, \ldots, s_n)$ with corresponding word vectors $C_i = (c_1, c_2, \ldots, c_n)$. The semantic contribution weight of each similar word $s_j$ to $x_i$ is then

$$\alpha_{ij} = \frac{\exp(\mathrm{sim}(h_i, c_j))}{\sum_{k=1}^{n} \exp(\mathrm{sim}(h_i, c_k))}$$

where $h_i$ is the context hidden vector of the word $x_i$ in the sentence sequence, and the dot product is adopted as the similarity function $\mathrm{sim}$. After the contribution weight of each similar word to $x_i$ is obtained, the semantic enhancement embedding vector of the current word is computed as the weighted sum

$$a_i = \sum_{j=1}^{n} \alpha_{ij} c_j$$

Feature vector fusion

After the above two steps, the semantic enhancement embedding vector $a_i$ of the current word $x_i$ in the sentence sequence of the software knowledge community text is obtained; it is then fused with the context-encoded hidden vector $h_i$ to serve as the input vector of the linear CRF layer.

Tag decoding layer
The context encoding layer captures the context information of the sentence sequence in the software knowledge community text and produces a tag score for each word. However, directly selecting the highest-scoring tag as the predicted output can yield erroneous or invalid tag sequences, because the dependencies between labels are not considered. For example, software knowledge entities in this paper are annotated with the BIO scheme of the sequence labeling specification: the tag sequence "B-APIWA I-APIWA" is valid, while "B-APIWA I-APIPM" and "O I-APIWA" are invalid. Therefore, a linear CRF layer is added to obtain a globally optimal tag sequence by enforcing the constraints between adjacent tags at the sentence level, avoiding the above problems. Formally, the linear CRF layer [33] transforms the hidden state sequence $h = (h_1, h_2, \ldots, h_t)$ output by the context encoding layer into the optimal tag sequence $y = (y_1, y_2, \ldots, y_t)$. The calculation proceeds as follows. First, for the sentence sequence $X = (x_1, x_2, \ldots, x_t)$ of the software knowledge community text, the total score of a tag sequence $y$ is

$$s(h, y) = \sum_{t=1}^{T} \bigl( Z_{y_{t-1}, y_t} + W_{y_t} h_t \bigr) \tag{14}$$

Then, the probability of the tag sequence $y$ is normalized through the Softmax function:

$$p(y \mid h) = \frac{\exp(s(h, y))}{\sum_{y' \in Y(h)} \exp(s(h, y'))} \tag{15}$$

Finally, the Viterbi dynamic programming algorithm is used to find the tag sequence with the highest score. Here $Z$ is the transition matrix between tags, $Z_{y_{t-1}, y_t}$ is the score of transitioning from tag $y_{t-1}$ to tag $y_t$, $W_{y_t}$ is the parameter vector of the emission score, and $Y(h)$ is the set of all possible tag sequences.
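A minimal sketch of this decoder using the third-party pytorch-crf package; this is one possible implementation, not the authors' code:

```python
import torch
from torchcrf import CRF  # pip install pytorch-crf

num_tags = 81                       # e.g. B-/I- tags for 40 types plus O
crf = CRF(num_tags, batch_first=True)

# Emission scores from the context encoding layer: (batch, seq_len, num_tags)
emissions = torch.randn(1, 11, num_tags)
tags = torch.randint(num_tags, (1, 11))

loss = -crf(emissions, tags)        # negative log-likelihood for training
best_paths = crf.decode(emissions)  # Viterbi decoding: list of tag index lists
print(best_paths[0])
```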

Experimental setup
To evaluate the performance of the proposed AttenSy-SNER model, comparative experiments with benchmark models in the field of entity extraction were carried out. The model was implemented in Python using the deep learning framework PyTorch. The experimental environment was configured with an Intel Xeon Gold 5117 processor (2.0 GHz), an NVIDIA Tesla T4 GPU and 16 GiB of GPU memory, and all experiments in this paper were conducted in this environment.

Construction of pre-trained word vectors based on BERT model
First, we downloaded the official data dump of StackOverflow and used SQL Server 2008 to store all of the site's question-and-answer text from 2008 to 2018.
Second, in the software knowledge community StackOverflow, tags represent the knowledge domain of a question, and users attach 1-5 tags according to the question's topic. A huge amount of software engineering text can therefore be preliminarily classified with the help of the tag system. The more questions a tag covers, the more attention that domain receives and the higher the probability that its questions will be answered or attract high-quality answers. To obtain a high-quality corpus, the strategy for selecting Q&A texts in this paper is as follows: first, all tags in StackOverflow are sorted by the number of posts they cover, and the tags with higher attention are selected; second, posts are selected based on whether they have an accepted answer, their question score, answer scores and view count. The text content of each selected post consists of the question title, the question description, the accepted answer and randomly selected comments. Finally, the text corpus of the software engineering field is obtained after data preprocessing.
The BERT-Base pre-training model is used for unsupervised learning on 43,594,128 sentences (from 2,656,719 questions and 5,526,559 answers and comments) in the software engineering text corpus. According to the corpus and the experimental hardware configuration, the batch size, learning rate and maximum sentence length are set to 32, 0.0001 and 128, respectively, at which the sum of the Masked Language Model loss and the Next Sentence Prediction loss is lowest and the training effect is best. Finally, 768-dimensional word vector representations for the software engineering domain are obtained for the software knowledge entity extraction task.

Construction of annotated data sets in software engineering domain
Due to the lack of an open annotated data set for the software engineering domain, we build one based on the question-and-answer text of StackOverflow. To obtain high-quality corpus text, we adopt the same selection strategy described above.
The pre-definition of entity types is a key step of entity extraction, reflecting the granularity of entities and the goal of knowledge graph construction. Compared with entity types in the general domain, a software knowledge graph is a domain knowledge graph and requires a more detailed entity type definition based on the knowledge requirements of the software engineering field and the goal of the software knowledge graph. In related research there is no unified standard for software knowledge entity types; most are pre-defined according to the application goal of the software knowledge graph. The aim of this paper is to extract knowledge entities related to software programming from the community's question-and-answer texts and to provide entity-centric software knowledge retrieval and recommendation services for software developers. Therefore, combining the Wikipedia category system related to software programming with the knowledge requirements of software developers, this paper extends the entity types of the literature [13] and pre-defines a total of 40 entity types covering eight aspects: programming language, system platform, software API, software tools, software development library, software framework, software standards and software development process. Details are shown in Table 1.
For data annotation, we adopt the BIO scheme, in which "B-" marks the beginning of a software knowledge entity, "I-" marks the inside of the corresponding entity, and "O" marks non-entity tokens. The annotation team consists of 10 members with software engineering backgrounds, including teachers, software developers, graduate students and undergraduates. After 5 rounds of cross-validation, the annotated data set of the software engineering domain is obtained. To ensure scientifically sound and reasonable experimental results, the data set is divided into training, validation and test sets at a ratio of 7:1:2 for the software knowledge entity extraction experiments. Detailed information about the data set is shown in Table 2.
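As an illustration of the BIO scheme on the earlier example sentence; the type labels are placeholders, not necessarily the exact tags of Table 1:

```python
# BIO-annotated example; type labels are illustrative placeholders.
tagged = [
    ("How", "O"), ("to", "O"), ("convert", "O"), ("a", "O"),
    ("TIFF", "B-Standard"), ("to", "O"), ("a", "O"),
    ("JPG", "B-Standard"), ("with", "O"),
    ("ImageMagick", "B-Tool"), ("?", "O"),
]
```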

Parameter setting
In the training of the AttenSy-SNER model, the dimension of the pre-trained word vectors in the input representation layer is set to 768, the number of hidden units of the BiLSTM in the context encoding layer is set to 200, and the GCN is set to 1-3 layers. Categorical cross entropy is used as the loss function and Adam as the optimizer, with an initial learning rate of 0.001. L2 regularization and the Dropout mechanism are adopted to prevent over-fitting during training. The hyperparameter settings of the model are shown in Table 3.

Evaluation metrics
In this paper, the general evaluation metrics of information extraction tasks are used to evaluate model performance: precision (P), recall (R) and F1 score (F1). Precision is the percentage of correctly recognized entities among all entities recognized by the model; recall is the percentage of correctly recognized entities among all true entities; the F1 score is the harmonic mean of precision and recall and serves as the comprehensive performance metric. The metrics are defined as

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \times P \times R}{P + R}$$

where TP (True Positive) is the number of entities the model recognizes correctly, FP (False Positive) is the number of items the model incorrectly recognizes as entities, and FN (False Negative) is the number of true entities the model fails to recognize.
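These definitions translate directly into code; a minimal sketch with made-up counts:

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute P, R and F1 from true/false positive and false negative counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

print(precision_recall_f1(tp=90, fp=10, fn=20))  # (0.9, 0.818..., 0.857...)
```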

Experimental results and discussion
Compared with the sequence labeling benchmark model BiLSTM-CRF, the proposed AttenSy-SNER model improves the input representation layer and the context encoding layer. Therefore, the performance of the model before and after each improvement is compared and analyzed below.

Contribution of pre-trained word vectors to model performance
In the input representation layer of the AttenSy-SNER model, the pre-trained word vectors are obtained by unsupervised training of the BERT model on the massive question-and-answer corpus of StackOverflow. To evaluate the contribution of the pre-trained word vectors to the software knowledge entity extraction task, a comparative experiment was conducted against the benchmark BiLSTM-CRF model. The experimental results are shown in Table 4 and Fig. 4. According to the results, the BERT-BiLSTM-CRF model outperforms the benchmark BiLSTM-CRF model: after embedding the BERT pre-trained word vectors, precision increases by 4%, recall by 7% and the F1 score by 6%. This shows that the Transformer-based BERT model, with its strong text feature extraction ability, enriches the feature representation of the input layer and thereby improves the performance of the software knowledge entity extraction task.

Contribution of context features based on BiLSTM to model performance
To explore the effect of BiLSTM-based context information on software knowledge entity extraction, this paper compares LSTM and BiLSTM with different numbers of layers, adopting BERT-CRF as the benchmark model. The experimental results are shown in Table 5 and Fig. 5.
Compared with the benchmark BERT-CRF model, the precision and F1 score of the models improve after adding context encoding, indicating that context feature extraction helps retain the semantic information of the text. Meanwhile, compared with LSTM, the F1 score of the BiLSTM model increases by 4%, indicating that the bidirectional LSTM captures the context information of sentence sequences more effectively.
As the number of BiLSTM layers increases, the F1 score decreases: with 2 BiLSTM layers, precision and F1 score decrease by 1% and 2%, respectively, indicating that the network may fall into a local optimum or overfit.

Contribution of syntactic features based on GCN to model performance
To explore the influence of GCN-based syntactic dependency relations on the software knowledge entity extraction task, this paper compares extraction performance with different numbers of GCN layers, adopting BERT-BiLSTM-CRF as the reference model. The experimental results are shown in Table 6 and Fig. 6. According to the comparative results, the F1 scores of the models improve after syntactic features are integrated through GCNs of different depths, indicating that syntactic dependency features contribute to the performance of the software knowledge entity extraction task.
At the same time, the results show that the model achieves its highest precision and F1 score with 2 GCN layers, improving on the 1-layer configuration. With 3 GCN layers, however, precision and F1 score decrease, indicating that increasing the number of GCN layers leads to over-fitting.

Contribution of semantic enhancement based on attention mechanism to model performance
The semantic enhancement strategy based on the attention mechanism extracts similar words from the BERT pre-trained word vectors and the software engineering thesaurus Sethesaurus, assigns weights to the semantic contributions of the similar words with the attention mechanism, and obtains the semantically enhanced representation of each word as a weighted sum, thus alleviating the entity variation, entity sparsity and out-of-vocabulary problems. For example, during model training, the similar words corresponding to the software knowledge entity "CentOS" and their semantic contributions are obtained through the semantic enhancement strategy, as shown in Fig. 7.
According to Fig. 7, the entity "Rehel" has the greatest semantic contribution, and entity variants such as "Centos6" and "Rehel7" also contribute to the semantic representation of the target entity as auxiliary resources. Therefore, semantic enhancement based on the attention mechanism can alleviate the entity variation problem in software knowledge community text and improve the domain adaptability of the model.
To evaluate the contribution of the attention-based semantic enhancement strategy to the software knowledge entity extraction task, the AttenSy-SNER model is compared with BiLSTM-CRF, BERT-BiLSTM-CRF and BERT-BiLSTM-GCN-CRF. The experimental results are shown in Table 7 and Fig. 8.
In Table 7, the symbol "✓" indicates that the corresponding feature representation is used, and "✕" that it is not. Based on the comparison results, the AttenSy-SNER model improves over all three baselines: its F1 score increases by 10% and 5% over BiLSTM-CRF and BERT-BiLSTM-CRF, respectively, and also exceeds that of BERT-BiLSTM-GCN-CRF. To explore the effects of the attention-based semantic enhancement strategy on entity sparsity and out-of-vocabulary words, the entity extraction results of the AttenSy-SNER model were further analyzed. The training set contains 5548 software knowledge entities and the test set contains 1559, of which 451 are out-of-vocabulary entities. The results for recognizing out-of-vocabulary entities in the test set are shown in Table 8. They show that the AttenSy-SNER model with the semantic enhancement strategy achieves a higher recall (R) than the three models without it and performs better in recognizing out-of-vocabulary entities. Therefore, the attention-based semantic enhancement strategy can enhance the semantic representation of entities by integrating the vector representations of similar words, which helps to solve the out-of-vocabulary problem in the software knowledge community and thus alleviates the entity sparsity problem.

Contrastive analysis of model training
The training of a deep learning model is a process of constantly updating parameters. To further understand the training behavior of the models, this paper compares the process data of the first 100 training epochs. The relationship between each model's F1 score and the epoch number is shown in Fig. 9.
It can be seen that the F1 score of the BiLSTM-CRF model, which does not use BERT pre-trained word vectors, increases continuously from a low initial value, while the other three models, which do use them, start from a higher initial value and remain at a higher level throughout. This verifies that embedding BERT pre-trained word vectors in the input layer effectively extracts the text features of the software knowledge community, enriches the feature representation of the input layer, and makes an important contribution to the performance of the software knowledge entity extraction task.
Compared with the other models, the AttenSy-SNER model proposed in this paper obtains the highest F1 score from the initial stage by integrating syntactic features and semantic enhancement information; around the 40th epoch its loss function begins to converge, and it continuously maintains the best F1 score. The results show that the BERT pre-trained word vectors, the GCN-based syntactic dependencies and the attention-based semantic enhancement all play an important role in improving the performance of the software knowledge entity extraction task.

Conclusion
In view of problems such as entity variation, entity sparsity, entity ambiguity, out-of-vocabulary words and the lack of annotated data sets in software knowledge community text, we consider word, syntactic, entity context and semantic features and propose the software knowledge entity extraction model AttenSy-SNER. It combines pre-trained word vector representations, syntactic dependency features and entity semantic enhancement information to extract fine-grained software knowledge entities from unstructured user-generated content. To address the lack of open data sets in the software engineering field, BERT-based pre-trained word vectors and an annotated software engineering data set covering 8 aspects and 40 fine-grained entity types are constructed from the question-and-answer texts of StackOverflow. The comparative experimental analysis shows that the AttenSy-SNER model is superior to current benchmark models on the software knowledge entity extraction task, and it paves the way for the subsequent construction of a software knowledge graph.
Funding This work is supported by the Yunnan Science and Technology Major Project (Grant no. 202002AE090010) and Subproject 5 of the Yunnan Science and Technology Major Project (Grant no. 202002AD080002-5).

Availability of data and material Submit soon.
Code availability (software application or custom code) Submit soon.

Conflict of interest
The authors declare no competing interests.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.