Knowledge organization of node enterprises’ technological innovation under supply chain environment

This paper proposes an improved text classification method based on domain ontology to organize the mass of information that records node enterprises’ innovation activities under the supply chain environment. The method can classify the documents of node enterprises under the supply chain without a training set. It achieves a precision of 80% for document classification, which outperforms the baseline method. Besides, the paper constructs a domain ontology of enterprises’ technological innovation under the supply chain that effectively enhances the semantic relationship between words. Therefore, it can summarize and classify the textual information generated by node enterprises in product design, production, storage, logistics, and sales.


Introduction
With the development of technology and the economy, a new supply chain management model has formed in the global business community. As a result, the enterprise technology innovation model has changed from a single enterprise's independent innovation to the collaborative innovation of upstream and downstream enterprises in the supply chain. The supply chain involves multiple entities such as suppliers, manufacturers, retailers, and customers. The innovation activities and processes of all these entities together form the enterprises' technological innovation in a supply chain. Therefore, the supply chain entities should innovate collaboratively to improve the entire supply chain's competitiveness. Figure 1 shows the main entities in a supply chain and the process of technological innovation. Node enterprises and upstream enterprises need to convey market supply and cost information. Node enterprises need to share and integrate knowledge among themselves. Node enterprises and downstream enterprises or customers need to transfer market demand and product information. Therefore, information collection, classification, and knowledge system reconstruction across the entire supply chain are key to promoting an enterprise's technological innovation, which in turn enhances the competitiveness of its products and even of the whole supply chain.
Due to the complexity and the huge amount of information produced by innovation under the supply chain, a text classification method that can automatically process, organize, and mine textual data is in high demand. However, most existing studies focus on exploring the influencing factors and cooperation modes of innovation under the supply chain through empirical study [1][2][3][4]. To develop such a text classification method, a complete knowledge system should be built to search, organize, and analyze each node enterprise's knowledge. Furthermore, such a system makes information sharing, synchronized planning, and process coordination possible between members across different regions and industries. Therefore, this paper constructs an ontology of the enterprise technological innovation field under the supply chain environment and uses it to classify and summarize the textual information. This method can dynamically generate the influential factors of innovation under the supply chain, helping researchers and managers follow these factors and the knowledge produced in this field as it changes. Besides, knowledge organization and sharing among node enterprises can support enterprises' continuous innovation and enhance the entire supply chain's competitiveness. The remainder of this paper is organized as follows. The second section presents the literature review on existing semantic text classification algorithms and semantic similarity methods based on ontology. The third section provides the implementation process of the proposed methodology. The fourth section presents the experiments, results analysis, and performance evaluation. The final section concludes the work and contribution of this paper and presents the limitations and future work.

Supply chain management and enterprises' technological innovation
The supply chain is a functional network built around the core enterprise [5,6]. It connects suppliers, manufacturers, distributors, retailers, and end-users through information flow, logistics, and capital flow. Supply chain management spans all activities from raw materials to final products. The synergy of demand and supply brings enterprises competitive advantages in terms of value and cost. Technological innovation in supply chain management helps enterprises reduce procurement and production costs.
Most researchers have explored the association between efficient supply chain management and enterprises' innovation through empirical inquiry or survey methods [1][2][3][4][5][6][7]. Recently, an increasing number of scholars have recognized the importance of data analysis and text mining for supply chain management. Schniederjans et al. [8] enhanced the supply chain digital research paradigm through a large-scale literature review and a textual analysis of digitization technologies and topics. Kim et al. [9] explored sustainable supply chain management trends and firms' strategic positioning and execution based on news articles and sustainability reports with text-mining algorithms. Chu et al. [10] proposed a text-mining-based global supply chain risk management framework to identify region-specific supply chain risks. Chircu et al. [11] examined the use of business analytics, big data, and business intelligence methods in operations and supply chain management by analyzing 625 published papers with text mining. Rozados et al. [12] summarized the trends and related research of big data analysis in supply chain management.

Semantic text classification algorithms
Text classification is a text-mining algorithm that automatically assigns the analyzed document to one or more predefined categories based on its content [14]. Traditional supervised text classification methods such as Support Vector Machines (SVM), Naïve Bayes, decision trees, Latent Semantic Analysis (LSA), and K-Nearest Neighbor (KNN) generally represent a document by its terms and their feature weights, also known as the "Bag of Words" (BOW) representation model. The number of words in the vocabulary determines the dimension of the word vector, which usually results in a very high-dimensional and sparse document vector [15][16][17][18][19][20][21].
Ontology is a conceptual, structured, and standardized knowledge representation and organization method that can describe semantics and hidden knowledge from enormous amounts of information. Using domain ontology for knowledge representation can explore similar topics or events in the documents. Hence, it can construct a text representation model with the pre-defined semantic relationships between recognized entities and knowledge from the ontology and augment it with important background facts that are not directly present in the document. With this knowledge, the system can distinguish which terms or concepts [...]
1. Some researchers utilized the domain ontology to enrich the semantic feature vector representation and improve text classification accuracy. For example, Elhadad et al. [22] proposed building the feature vector for web text document classification based on the WordNet ontology. Abdollahi et al. [23] utilized the UMLS domain ontology to extract the key features and classify medical text documents.
2. Some researchers utilized the hierarchical taxonomy of domain ontology in the text classification task. For example, Cerri et al. [24] classified proteins in functions organized according to the Gene Ontology hierarchical taxonomy. Liu et al. [25] proposed a text classification method based on the ontology graph and structure.
3. Some researchers proposed methods based on the semantic similarity of concepts in the ontology for text classification. For example, Albitar et al. [26] proposed new text-to-text semantic similarity measures to replace classical similarity measures for text classification.
The above literature survey shows that no existing research uses big data techniques for the knowledge organization of enterprises' technological innovation under the supply chain environment. Traditional text classification methods usually rely on the BOW representation, which ignores the semantic relationships between terms and usually requires a large number of labeled training texts, increasing the manual annotation workload. Using the hierarchy of knowledge from a domain ontology directly in the text classification process captures the semantic relations between terms and skips the classifier training steps, so no pre-categorized training set is needed.
Therefore, this paper has two research points. First, it uses big data techniques to automatically process, organize, and mine the large amounts of textual data generated by node enterprises' technological innovation and to realize the knowledge service among node enterprises in the supply chain. Second, it proposes an improved text classification method that does not require a large amount of training text, so that these textual data can be organized and analyzed automatically to realize the knowledge classification of enterprises' technological innovation.

Methodology
This paper uses the semantic concept model based on the domain ontology of enterprises' technological innovation under the supply chain to improve text classification, and proposes an enhanced text classification method based on the semantic similarity and relatedness between keywords and categories. The target categories and the keyword sets extracted from the collected textual documents are mapped to concepts of the constructed domain ontology, which yields the mapped category-concept set and keyword-concept set. The domain ontology-based semantic similarity calculation and the concept distribution-based relatedness calculation are then used to obtain the weight matrix of semantic similarity and relatedness between keywords and categories. By comparing the weighted values of semantic similarity and relatedness between keywords and categories along the matrix's transverse space, the document is assigned the category that corresponds to the maximum value. The framework of the improved text classification method based on the semantic conceptual model is shown in Fig. 2. According to the framework, the improved methodology has four main steps, detailed as follows. The main parameters used in the following equations are shown in Table 1.

Text preprocessing
The text preprocessing module mainly includes word segmentation, part-of-speech tagging, and stop word removal. First, the collected textual documents are segmented with the Python package Jieba. Chinese word segmentation can incorrectly divide a Chinese phrase into multiple words; for example, the phrase "enterprises technological innovation" was divided into the three small-grained words "enterprises," "technology," and "innovation." Hence, a custom dictionary is used to define particular terms in the field, such as "enterprise technological innovation," "product innovation," and "mechanism innovation". Next, the text is tagged with part-of-speech (POS) labels. Nouns are the most representative of the source document's semantic information, so nouns, gerundial phrases, and adjective-noun collocations were selected as the research objects. Finally, useless words, such as "a, the, we, us, they" and other high-frequency terms without meaning, are filtered out through the stop word dictionary. Stop word removal significantly reduces the size of the index structure and yields the keyword sets. The general process of text preprocessing is shown in Fig. 3.
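As a minimal sketch of the filtering steps described above, the following Python fragment keeps noun-like tokens and drops stop words. It assumes the text has already been segmented and tagged into (word, tag) pairs, for example by a segmenter such as Jieba; the tag names in KEEP_TAGS, the stop word list, and the function name are illustrative assumptions, not the configuration used in this paper.

```python
# Hypothetical stop word list and POS tags; both are illustrative assumptions.
STOP_WORDS = {"a", "the", "we", "us", "they"}
KEEP_TAGS = {"n", "vn", "an"}  # assumed tags for nouns, verbal nouns, adjective-noun terms


def extract_keywords(tagged_tokens):
    """Return the words whose tag is kept and that are not stop words."""
    keywords = []
    for word, tag in tagged_tokens:
        if tag in KEEP_TAGS and word not in STOP_WORDS:
            keywords.append(word)
    return keywords


# Toy input: (word, tag) pairs as a segmenter with a custom dictionary might emit.
tagged = [
    ("the", "r"),
    ("enterprise technological innovation", "n"),
    ("improves", "v"),
    ("product innovation", "n"),
    ("they", "r"),
]
print(extract_keywords(tagged))
```

Keeping multi-word domain terms as single tokens mirrors the role of the custom dictionary: the filter then sees "product innovation" as one noun rather than two separate words.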

Domain ontology-based concept mapping
The key to constructing an improved semantic conceptual vector representation model based on domain ontology is the concept mapping from text keywords to the ontology. The concepts of a domain ontology are usually defined by attributes, keywords, or synonyms in the texts. Hence, there are four situations when mapping text keywords to the domain ontology, as follows.
[...] the keywords can be directly replaced by the ontology concepts.
3. 1:n mapping. When the keyword t_j corresponds to multiple concept attributes c_i in the domain ontology, the mapping concept is determined by the matching degree between the keyword and each concept attribute in the domain ontology, shown in formula (1). The concept with the maximum value in S is selected to replace the keyword t_j, where nc_i represents the number of concepts c_i in the ontology that match the keyword t_j.
4. n:1 and n:m mapping. Since the concepts in the domain ontology are usually professional compound words, it is not easy to find concepts that match the keywords directly and exactly. Therefore, the maximum matching method is used to map multiple feature items to the same concept when mapping keywords to the domain ontology concepts. There are two situations for mapping keywords to multiple concepts. First, when one or more keywords are cross-mapped to multiple concepts, all of the concepts to which the keywords are mapped are kept. For example, if keywords t_1, t_2 are mapped to concept c_1 while keywords t_1, t_2, t_3 are mapped to concept c_2, both c_1 and c_2 are kept. Second, when one or more keywords are mapped to multiple concepts without cross-over, the keywords are unique in the text and the concepts are retained directly.

Table 1 The main parameters used in the equations
Parameter            Description
nc_i                 The number of concepts c_i in the ontology that match the keyword t_j
S_tc                 The matching degree between the keyword and each concept attribute in the domain ontology
TF                   The frequency of the keyword in the data set
(symbol lost)        The threshold value of keyword frequency
K                    The rate at which the weight value decreases with the ontology hierarchy
depth(c_j)           The depth from the root to concept c_j in the ontology
(symbol lost)        The value assigned to the path of each node in the ontology
Dist(c_i, c_j)       The semantic distance between concepts c_i and c_j in the ontology
sim(c_i, c_j)        The semantic similarity between concepts c_i and c_j in the ontology
(symbol lost)        The influence factor of semantic distance on semantic similarity
E_ij                 The keyword pair co-occurrence matrix
f_k                  The number of times that concepts c_i and c_j appear simultaneously in the k-word window in the entire corpus
f_c(c_i)             The frequency of the concept c_i in the entire corpus
(symbol lost)        Relatedness between concepts c_i and c_j
Sim_Rel(c_i, c_j)    Semantic similarity and relatedness between concepts c_i and c_j
(symbol lost)        The weight of semantic similarity
H_j                  The weighted sum of each transverse dimension vector in d_j
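The 1:n case can be illustrated with a short sketch: a keyword is compared with the attribute terms of every concept, and the concept with the highest matching degree replaces it. The matching degree used here (word-level Jaccard overlap) is an assumption standing in for formula (1), and the toy ontology and function names are hypothetical.

```python
def matching_degree(keyword, attribute):
    """Assumed matching degree: word-level Jaccard overlap of two terms."""
    kw_words, attr_words = set(keyword.split()), set(attribute.split())
    return len(kw_words & attr_words) / len(kw_words | attr_words)


def map_keyword(keyword, ontology):
    """Return the concept whose attributes best match the keyword, or None."""
    scores = {
        concept: max(matching_degree(keyword, attr) for attr in attributes)
        for concept, attributes in ontology.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None  # no overlap: keyword stays unmapped


# Hypothetical ontology fragment: concept -> attribute terms (keywords, synonyms).
ontology = {
    "product innovation": ["product innovation", "new product"],
    "market innovation": ["market innovation", "sales promotion"],
}
print(map_keyword("new product launch", ontology))
print(map_keyword("logistics", ontology))
```

A keyword that overlaps with no attribute is left unmapped here; how such keywords are handled in the full method is not specified by this sketch.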

Semantic similarity and relatedness calculation based on domain ontology
Following the literature on ontology-based semantic similarity measures reviewed above, this paper proposes a new calculation method that combines domain ontology-based semantic similarity with concept distribution-based relatedness. The method first obtains the semantic similarity matrix between concepts from their semantic distance in the domain ontology, then obtains the relatedness matrix between concepts from their co-occurrence frequency in the text, and finally fuses the semantic similarity matrix and the relatedness matrix into the final weight matrix. First, a weight is assigned to each node's path in the ontology by formula (2), where K represents the rate at which the weight value decreases with the ontology hierarchy, depth(c_j) represents the depth from the root to c_j in the ontology, and depth(root) = 0. The semantic distance Dist of two concepts is then defined from the path weights between them by formula (3): when the concept nodes c_i and c_j are the same concept, the semantic distance is 0; when there is a direct path between c_i and c_j, the semantic distance is the weight of that path; when c_i and c_j are connected by an indirect path, the semantic distance is the sum of the path weights along it. The path weight assignment has the following properties.
1. The semantic distance between concepts at the upper levels of the domain ontology is greater than that at the lower levels, because the more abstract concepts in the ontology hierarchy are less similar and the more specific concepts are more similar.
2. The semantic distance between a parent class and its subclass is smaller than that between sibling concepts, which indicates that different types of concepts have different weights.
3. The distribution of path weights between concepts is symmetric.
Fig. 3 The process of text preprocessing
Semantic similarity is inversely related to semantic distance. Hence, the semantic similarity sim(c_i, c_j) can be calculated from the semantic distance between concepts. The semantic similarity generally has the following properties.
• 0 ≤ sim(c_i, c_j) ≤ 1 defines the range of the semantic similarity. When c_i and c_j are the same concept, the semantic similarity is 1; when c_i and c_j have nothing in common, the semantic similarity is 0.
• ∀c_i: sim(c_i, c_i) = 1 defines the semantic similarity between c_i and itself as 1.
• If Dist(c_i, c_j) > Dist(c_i, c_k), then sim(c_i, c_j) < sim(c_i, c_k) defines the relationship between semantic distance and semantic similarity: if the semantic distance between concepts c_i and c_j is greater than that between concepts c_i and c_k, the semantic similarity between c_i and c_j is less than that between c_i and c_k.
Therefore, the semantic similarity is calculated by formula (4), in which the influence factor of semantic distance on semantic similarity takes a value greater than 0 and at most 1.
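The properties of distance and similarity listed above can be checked on a toy ontology. The forms used in this sketch are assumptions chosen only to satisfy those properties: each edge into a concept at depth d is weighted 1 / K^d, the distance is the sum of edge weights on the path through the lowest common ancestor, and the similarity is alpha / (Dist + alpha). They are not necessarily the formulas used in this paper, and the concept names and function names are hypothetical.

```python
def ancestors(parent, node):
    """Return the chain [node, ..., root] by following parent links."""
    chain = [node]
    while node in parent:
        node = parent[node]
        chain.append(node)
    return chain


def depth(parent, node):
    """Depth of a node, with depth(root) = 0."""
    return len(ancestors(parent, node)) - 1


def semantic_distance(parent, a, b, k=2.0):
    """Sum of assumed edge weights 1 / k**depth(child) on the path from a to b."""
    chain_a, chain_b = ancestors(parent, a), ancestors(parent, b)
    in_b = set(chain_b)
    common = next(n for n in chain_a if n in in_b)  # lowest common ancestor
    total = 0.0
    for chain in (chain_a, chain_b):
        for node in chain:
            if node == common:
                break
            total += 1.0 / k ** depth(parent, node)
    return total


def semantic_similarity(parent, a, b, k=2.0, alpha=0.5):
    """Assumed similarity: alpha / (Dist + alpha), equal to 1 when Dist is 0."""
    return alpha / (semantic_distance(parent, a, b, k) + alpha)


# Hypothetical ontology fragment stored as child -> parent links.
PARENT = {
    "innovation resources": "root",
    "innovation output": "root",
    "R&D staff": "innovation resources",
    "R&D funds": "innovation resources",
}
print(semantic_distance(PARENT, "R&D staff", "R&D funds"))
print(semantic_distance(PARENT, "innovation resources", "innovation output"))
print(semantic_similarity(PARENT, "R&D staff", "R&D funds"))
```

With these assumed weights, sibling concepts near the root are farther apart than sibling concepts deeper in the hierarchy, and a parent and its subclass are closer than two siblings, in line with the stated properties.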
After text preprocessing, the most representative keywords are selected as the keyword set. To calculate the relatedness of a given keyword pair, the i × j co-occurrence matrix E_ij is generated by formula (5) for the terms within a window of size k in the corpus, where f_k represents the number of times that concepts c_i and c_j appear together in a window of k words across the entire corpus. The co-occurrence matrix E_ij is then processed by the mutual information method based on word distribution to obtain the relatedness matrix of concepts c_i and c_j by formula (6), where f_c(c_i) and f_c(c_j) represent the frequencies of the concepts c_i and c_j in the entire corpus.
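A minimal sketch of the two steps above follows: counting how often two concepts fall inside the same k-word window, then converting the counts into a mutual-information score. The pointwise mutual information form log(f_k · N / (f_c(c_i) · f_c(c_j))), with N the corpus length, is an assumption used for illustration and may differ from the exact formula in this paper; the function names and toy corpus are hypothetical.

```python
import math
from collections import Counter


def cooccurrence(tokens, k):
    """Count unordered concept pairs that appear within a window of k words."""
    pair_counts = Counter()
    for i, left in enumerate(tokens):
        for right in tokens[i + 1:i + k]:  # the k-word window starting at position i
            if left != right:
                pair_counts[frozenset((left, right))] += 1
    return pair_counts


def relatedness(tokens, k):
    """Assumed relatedness: pointwise mutual information of each co-occurring pair."""
    freq = Counter(tokens)
    total = len(tokens)
    scores = {}
    for pair, count in cooccurrence(tokens, k).items():
        a, b = tuple(pair)
        scores[pair] = math.log(count * total / (freq[a] * freq[b]))
    return scores


# Toy corpus of already-mapped concepts.
corpus = ["patent", "R&D", "patent", "quality", "R&D"]
counts = cooccurrence(corpus, k=2)
print(counts[frozenset(("patent", "R&D"))])
print(relatedness(corpus, k=2)[frozenset(("patent", "R&D"))])
```

Pairs are stored as unordered sets, so the resulting matrix is symmetric by construction.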
The co-occurrence frequency information represents the strength of the content relatedness between concepts in the corpus. The similarity of concepts in the domain ontology represents the strength of the semantic relationship between concepts. Combining the semantic similarity and the relatedness between concepts represents documents more accurately. Formula (7) normalizes and fuses the similarity matrix and the co-occurrence matrix of concepts into Sim_Rel(c_i, c_j), using a weight assigned to the semantic similarity.

Table 3 The main elements of the improved text classification method
Element              Description
t_i                  The ith keyword in T_j
C_m                  Target category, with m categories in total
sim(t_i, c_m)        Semantic similarity between the keyword t_i and the category c_m
rel(t_i, c_m)        Relatedness between the keyword t_i and the category c_m
W_im                 Semantic similarity and relatedness between the keyword t_i and the category c_m

Improved text classification algorithm
In this paper, the concept model based on the domain ontology proposed above is applied to text categorization, and an improved text classification method based on the semantic similarity and relatedness between keywords and categories is proposed. The corpus D contains j documents and is denoted as D = {d_1, d_2, ..., d_j}. First, a vector space model is constructed for each text: the keywords whose TF is greater than the threshold are extracted and sorted by TF weight, and the top 20 most representative keywords are selected to represent the document d_j. The semantic similarity matrix M between the keyword t_i and the category C_m is calculated by formula (4). The relatedness matrix Q of the keyword t_i and the category C_m based on the word distribution is calculated by formula (6). The matrix W is obtained by fusing the matrices M and Q through formula (7). The element W_im of the matrix W represents the semantic similarity and relatedness of the keyword t_i to the category C_m; the matrix W generated by the keywords t_i and the categories C_m in the text d_j is shown in Table 2. Finally, the weighted sum of each transverse dimension vector in d_j is obtained by formula (8), and the category C_m corresponding to the maximum value is taken as the text category. Table 3 defines the main elements of the improved text classification method based on the semantic similarity and relatedness of keywords and categories.
Compared with traditional text classification methods based on machine learning, the improved text classification method based on the semantic similarity of keywords and categories has the following advantages. First, it does not require large amounts of labeled training text, so it suits unlabeled textual data. Second, it uses the domain ontology to map concepts, converting text into low-dimensional space vectors and reducing space complexity. Third, it calculates the semantic similarity and relatedness between keywords and categories through the domain ontology, overcoming the traditional vector space representation's neglect of the semantic relationships between concepts.
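The classification step can be sketched as follows, under stated assumptions. The fusion is taken to be a linear combination beta · sim + (1 − beta) · rel, and the document score for a category is taken to be the sum of the fused weights W_im over the document's keywords; both are one plausible reading of formulas (7) and (8), not a confirmed reproduction of them. The toy weights, the value of beta, and the function names are hypothetical.

```python
from collections import Counter


def top_keywords(tokens, threshold, limit=20):
    """Keep keywords whose term frequency exceeds the threshold, most frequent first."""
    tf = Counter(tokens)
    ranked = [word for word, count in tf.most_common() if count > threshold]
    return ranked[:limit]


def classify(keywords, categories, sim, rel, beta=0.5):
    """Fuse similarity and relatedness per keyword, sum per category, take the max."""
    scores = {category: 0.0 for category in categories}
    for keyword in keywords:
        for category in categories:
            pair = (keyword, category)
            # Assumed linear fusion; missing pairs contribute 0.
            w = beta * sim.get(pair, 0.0) + (1 - beta) * rel.get(pair, 0.0)
            scores[category] += w
    return max(scores, key=scores.get), scores


tokens = ["patent", "patent", "quality control", "quality control", "quality control", "sales"]
keywords = top_keywords(tokens, threshold=1)
categories = ["manufacturing capability", "innovation output"]
# Toy similarity and relatedness values, already normalized to [0, 1].
sim = {
    ("quality control", "manufacturing capability"): 0.8,
    ("quality control", "innovation output"): 0.2,
    ("patent", "innovation output"): 0.6,
    ("patent", "manufacturing capability"): 0.1,
}
rel = {
    ("quality control", "manufacturing capability"): 0.6,
    ("patent", "innovation output"): 0.4,
}
label, scores = classify(keywords, categories, sim, rel)
print(keywords, label)
```

No classifier is trained at any point: the label follows directly from the ontology-derived weights, which is why the method needs no labeled training set.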

Experiment and analysis
This paper presents a methodology for classifying texts on enterprises' technological innovation under the supply chain without a training set. However, to compare its performance with other text classification methods, the collected textual data were labeled with pre-defined categories. The result analysis and performance evaluation are as follows. The structure of the domain ontology of enterprises' technology innovation under the supply chain environment is shown in Fig. 4.

Dataset
This paper's experimental data mainly consist of enterprises' application forms for technical center certification, provided by the Beijing Municipal Commission of Economic Informatization. The textual data cover 400 enterprises in Beijing; after data cleaning and selection, there are 867 valid texts, and the overall data size is about 20 M. Table 4 briefly shows the details of the data collection result. The experiments ran on a Windows 10 system with a 2.70 GHz core processor and 8.0 GB of memory, and Python 3.6.2 was used for programming. There are seven pre-defined categories: manufacturing capability, innovation resource, mechanism innovation, innovation output, market innovation, protection measures, and innovation strategy. The labeled textual document set was divided into a 70% training set and a 30% test set. Part of the labeled textual dataset is shown in Table 5.
The performance comparison of text classification based on precision rate
Fig. 7 The performance comparison of text classification based on F-measure
Fig. 8 The result of semantic similarity between part of concepts and categories
Fig. 9 The result of relatedness between part of concepts and categories

Performance comparison with KNN
Depending on the application background, scholars have proposed various indicators for evaluating the performance of text classification systems, including Accuracy, Precision, Recall, F-measure, and Macro-averaging. The most commonly used indicators are the precision rate, the recall rate, and the F-measure. Precision is the ratio of correctly predicted positive observations to the total predicted positive observations. Recall is the ratio of correctly predicted positive observations to all observations in the actual class.
The F-measure combines precision and recall as their harmonic mean. With TP, FP, and FN denoting the numbers of true positives, false positives, and false negatives for a category, the three indicators are defined as follows.
$$Precision = \frac{TP}{TP + FP}$$
$$Recall = \frac{TP}{TP + FN}$$
$$F = \frac{2 \times Precision \times Recall}{Precision + Recall}$$
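The three indicators can be computed per category directly from the true and predicted labels, as the following sketch shows. The function name and the toy labels are illustrative; returning 0.0 when a denominator is zero is a convention assumed here.

```python
def precision_recall_f(y_true, y_pred, label):
    """Compute precision, recall, and F-measure for one category."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)  # true positives
    fp = sum(1 for t, p in pairs if t != label and p == label)  # false positives
    fn = sum(1 for t, p in pairs if t == label and p != label)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Toy labels for two of the seven categories.
y_true = ["innovation output", "innovation output", "market innovation", "market innovation"]
y_pred = ["innovation output", "market innovation", "market innovation", "market innovation"]
for category in ("innovation output", "market innovation"):
    print(category, precision_recall_f(y_true, y_pred, category))
```

Averaging the per-category values over all categories gives the macro-averaged scores mentioned above.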
This paper used the Precision, Recall, and F-measure indicators to compare the performance of the proposed text classification method, based on the semantic similarity of keywords and categories, with that of the KNN classification method based on TF*IDF.
The performance comparison shows that the Recall, Precision, and F-measure of the improved text classification method based on semantic similarity and relatedness are higher than those of the KNN method based on TF*IDF (Table 6; Figs. 5, 6, 7). The mean precision of the improved text classification method proposed in this paper is over 80%. Therefore, compared with the KNN classification method based on TF*IDF, the proposed method based on semantic similarity and relatedness between keywords and categories has better classification performance on texts related to enterprise technology innovation under the supply chain environment.
The improved semantic text classification method proposed in this paper uses domain ontology concept sets instead of keywords as the elements of the feature vector, which enhances the semantic relationship between words, highlights the semantic expression, and improves the classification precision rate. The text representation based on domain ontology reduces the dimension of the space vector and saves calculation time. Furthermore, the improved method can classify texts on node enterprises' technological innovation under the supply chain environment without a labeled training set and achieves a better classification effect. To some extent, this method addresses the problem that text classification often lacks a training set because of the enormous workload of manual labeling.
Fig. 10 The result of semantic similarity and relatedness between part of concepts and categories
Fig. 11 The visualization of relatedness between concepts and categories

Result analysis
The semantic similarity and relatedness between keywords and categories are calculated based on the domain ontology of node enterprises' technological innovation under the supply chain environment. The results are shown in Figs. 8, 9, 10 and 11. Because of limited space, only part of the semantic similarity and relatedness matrix between concepts and categories is shown. The improved semantic text classification method proposed in this paper can effectively classify the information collected from node enterprises in the supply chain and organize the concepts of enterprises' technological innovation in the supply chain based on semantic similarity and relatedness. The concepts here are the key influencing factors of node enterprises' technological innovation within the supply chain. The classification system for key influencing factors can be obtained through the above experimental analysis of semantic similarity and relatedness between keywords and categories. There are seven types of influencing factors of node enterprises' technological innovation under the supply chain: manufacturing capability, innovation resources, mechanism innovation, innovation output, market innovation, protection measures, and innovation strategy. According to the semantic text classification, the seven types of first-class factors can be divided into 20 kinds of second-class factors, shown in Table 7.
The influence of manufacturing capability on enterprises' technological innovation is mainly reflected in transforming R&D results into manufacturing production. The term "quality control" has a high similarity and relatedness with manufacturing capability, reflecting the content of product quality management. Hence, the item belongs to the category of manufacturing capability. Innovation resources mainly refer to the enterprises' investment in technological innovation resources, for example, the investment in staff, funds, and equipment for R&D. Mechanism innovation is innovation activity in the various operating mechanisms to enhance the whole enterprise's competitiveness. Innovation output reflects the products of enterprises' innovation and the innovation benefits. Market innovation refers to innovation in product sales and promotion made by enterprises to meet market demands. Protection measures reflect the protection of intellectual property. Technical knowledge protection can promote technology diffusion and attract foreign capital and technology introduction. Innovation strategy refers to integrating and arranging the enterprise's internal and external innovation resources and technologies from the perspective of the overall system of enterprise operation.

Conclusions
The knowledge of enterprises' technological innovation under the supply chain environment comes from information sources such as the databases or documents collected from the supply chain's node enterprises. Knowledge organization is a process of classifying and analyzing messy, complex, and huge amounts of information. This paper introduces domain ontology to make the knowledge organization system semantic and knowledgeable and constructs an ontology of enterprises' technology innovation under the supply chain. It uses the relationships between the domain ontology concepts to describe the semantic information of the existing enterprises' knowledge management system. An improved semantic text classification method was proposed in this paper, which obtains a document's category by calculating the weighted maximum value of the semantic similarity and relatedness between the text's key feature words and the categories. This method enhances the semantic relationship between words, reduces the dimension of the space vector, and saves calculation time. Furthermore, the improved method can classify documents based on the domain ontology hierarchy without a labeled training set; its mean precision is over 80%.
The contributions of this study are twofold. From an academic perspective, the improved text classification method proposed in this paper performs better than the KNN classification method based on TF*IDF. From a practical standpoint, this paper constructs a domain ontology for enterprises' technological innovation under the supply chain. It helps to summarize and classify the innovation information under the supply chain, helping researchers and managers follow the influential factors of innovation under the supply chain and the knowledge produced in this field as it changes.
However, this paper still has some limitations that future research should address. First, the proposed method requires a domain ontology to provide background knowledge and concept mapping. Future researchers may consider using a general ontology that can be applied to more fields. Second, future researchers can consider more influential factors of similarity and relatedness between concepts to strengthen word association and improve text classification accuracy.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.