TCM2Vec: a detached feature extraction deep learning approach of traditional Chinese medicine for formula efficacy prediction

The intelligent development of traditional Chinese medicine (TCM) is attracting increasing attention. As the main carrier of clinical medication, formulas rely on the synergy of active substances to enhance efficacy and reduce side effects. Related studies show that the relationship between the efficacy of a formula and its herbs is nonlinear. Deep learning is an effective technique for fitting nonlinear relationships, but applying a deep learning model directly works poorly because it ignores the characteristics of formulas. In this paper, we propose a detached feature extraction approach (TCM2Vec) based on deep learning for better feature extraction and efficacy prediction. We build two detached encoders: one is initialized by a cross-feature-based unsupervised pre-training model (FMh2v) that extracts the relationship features of herbs, while the other simulates the multi-dimensional properties of herbs with a normal distribution. We then integrate the relationship features and the medicinal properties for deep feature extraction. We processed 31,114 unlabeled formulas for pre-training and two in-domain classification tasks for prediction and fine-tuning. One task is multi-class with 1,036 formulas; the other is multi-label with 1,723 formulas. On the labeled formulas, different feature extraction models built on the detached encoder are trained to predict efficacy. Compared with the no-pre-training, CBOW and BERT baselines, FMh2v leads to performance gains. Moreover, the detached encoder has a clear positive effect across the efficacy prediction models: ACC increases by 5.80% on average and F1 by 12.06% on average. Overall, the proposed feature extraction is an effective method for obtaining a characteristic representation of TCM formulas, and it provides a reference for adapting artificial intelligence technology to the TCM domain.


Introduction
As part of the traditional culture of the Chinese nation, TCM plays a significant role in health care [39], and its unique and effective treatment methods have received more and more attention. Formulas consisting of multiple herbs are the major form of TCM clinical treatment and embody the ideas of TCM syndrome differentiation and formula compatibility. They are also the key to the intelligent and modern development of TCM. The efficacy of a formula refers to its effect on preventing and treating diseases under the interaction of its herbs. There are many relationships between herbs (TCM-HRs, the representation of herbal relations), such as mutual reinforcement, mutual restraint, mutual suppression and mutual assistance. The properties of a herb (TCM-HPs, the representation of herbal properties) are the internal factor that affects the efficacy of a formula. They include toxicity, the four characters (cold, hot, warm, cool) plus neutral, the five tastes (sour, bitter, sweet, pungent and salty), and the twelve channel tropisms (lung, pericardium, heart, large intestine, triple energizers, small intestine, stomach, gallbladder, bladder, spleen, liver, kidney), 23 descriptions in total. However, the specific relationship between TCM-HPs and the efficacy of TCM is complex and unclear in practice, which makes it challenging to explore the inherent rules. Thus, it is necessary to model the matching patterns behind the herbs if TCM is to develop intelligently as hoped [47].
The strategy for studying the relationship between formula and efficacy has changed over the past few decades. Research cannot be limited to a single herb; the formula also needs to be considered as a whole, because the multi-dimensional properties of its herbs determine the efficacy of the prescription. Some scholars therefore proposed the medicinal combination model of herbs [14,35], which reveals the relationship between a combination of herbs and its efficacy from the overall perspective of TCM. Current research mainly uses traditional machine learning algorithms [24]. However, the relationship between formulas and efficacies is extremely complex, and traditional machine learning algorithms are not well suited to mining deep rules. Data-driven deep learning algorithms [18] have developed significantly and are widely used in the medical domain [1,8]. Since formulas are the carrier of TCM knowledge and information, how to apply deep learning to formula mining has become a novel topic.
The first step of text modeling is the numerical representation of features. As a core technology in the field of natural language processing (NLP) [3], word embedding maps the words of unstructured text into a vector space, so that unstructured text can be represented by structured vectors. Pre-trained language models (PLMs) are the most important method for generating word embeddings. Among them, the neural-network-based Word2Vec [26] has achieved great success. It greatly alleviates the problem of feature sparsity and improves the semantic expressiveness of embedding vectors, laying a good foundation for the application and development of deep learning text mining models [42]. In this research, we propose an improved pre-training method to learn the distributed representation of TCM-HRs and employ a mathematical method to initialize the representation of TCM-HPs, which may be significant for the further development of TCM.
However, there are obvious differences between formulas and general natural language text. Formulas are only weakly sequential in written form: the herb at the front of a formula can be related to the herb at the end rather than to its neighbors, whereas the words in a natural language sentence follow grammatical rules. As noted by Gururangan [9], different tasks call for pre-trained representations adapted to the corresponding domain. For this purpose, we designed a TCM pre-training model based on CBOW and cross features.
Formula-efficacy prediction is a classification task. In recent years, deep learning has achieved good performance in classification tasks such as face recognition [10,41], audio classification [11,22], email spam filtering [7] and text generation [40], and, in the TCM field, tongue image classification [15] and symptom classification [37]. However, there are few studies on the mapping between formula and efficacy. The main reasons are as follows: (1) low-resource settings, in which the small amount of training data is not enough to support the large number of parameters trained in a deep learning model; (2) complex theory, so that using a model directly may produce poor results; and (3) a scarcity of researchers working at the intersection of TCM and computer science.
In this research, the experimental data are crawled from the Internet. In total we obtain 31,114 unlabeled formulas, one multi-class dataset with 1,036 formulas and one multi-label dataset with 1,723 formulas. To predict the efficacy of a formula, we propose TCM2Vec, which includes a feature extraction pre-training model, FMh2v, integrated with TCM theory for a better representation of the relationships between herbs, and a detached embedding network inside the deep learning prediction network. The herbal representation obtained by pre-training can be fine-tuned while predicting the efficacy of formulas. This research provides a new idea for TCM feature extraction and contributes to the study of formula-assisted decision-making and other related work. Our contributions mainly lie in the following aspects:
- The ideas of NLP research are applied to the study of TCM. Integrating the theory of TCM, we propose a pre-training model based on CBOW and cross features. Trained on unlabeled data, this model can extract TCM relationship features and obtain a reliable initial representation of TCM.
- We employ a mathematical method to simulate the distribution of herbal properties in the real world and fine-tune it with a sparse activation during the training of the prediction model.
- We construct a deep learning model with a detached encoder to predict formula-efficacy relations, while fine-tuning the pre-trained herbal vectors on labeled data.
The remainder of this paper is structured as follows. Section 2 reviews related work. Section 3 explains the overall proposed framework. In Section 4, we evaluate the experimental results and compare them with existing techniques, followed by a conclusion and suggestions for future research directions in Section 5.

Feature representation
Feature representation is the foundation of text data mining. Traditional feature methods, such as one-hot [34] and TF-IDF [32], rely mainly on manual extraction and generally suffer from high dimensionality and poor correlation between features. Bengio [2] first proposed learning a distributed representation of words while predicting the next word in a sequence, in order to overcome the curse of dimensionality. Following this idea, Mikolov [27] extended the simple feed-forward neural network to a recurrent neural network to capture longer-distance dependencies. However, both models still resemble a probabilistic language model. Subsequently, Mikolov proposed the Word2Vec model [26], which includes the CBOW and Skip-Gram training modes. CBOW predicts the center word of a context window from the surrounding words in that window using a simple logistic regression classifier. Skip-Gram uses the same architecture but predicts the context words from the center word. Recently, owing to its superior representation learning ability, BERT has become a commonly used pre-training method in various fields.
There are few studies on the distributed representation of herbs, but the research approach has recently been moving towards model-based training. Li [23] first proposed a distributed representation model of TCM and compared N-Gram and CBOW models experimentally. Deng [5] proposed a herb vector training model (QM-BP) based on a multi-layer feed-forward neural network, which took herb-efficacy pairs as training samples. The experiment showed that the quantified value of a herb obtained from the correlation between TCM property and efficacy was more accurate and had better representation performance. Inspired by these ideas and combined with TCM theory, we develop a herb feature learning method based on cross features. During training, the model observes not only the individual features in the window but also the information contributed by the fusion of two features.

Activation function
Sparsity refers to expressing most of an original signal with a linear combination of only a few components. A nonlinear activation function is a simple and efficient way to induce sparsity. Saturating activation functions include Sigmoid [16] and Tanh [30]. They are smooth and easy to differentiate, but the vectors they produce are dense. The non-saturating activation function ReLU was introduced to address the non-zero-centered output and vanishing gradients of saturating functions [38]. As a piecewise function, it can produce sparse vectors. Pathirage [31] used ReLU to constrain the information content of the latent representation and make it low-dimensional. To avoid dead neurons in ReLU, researchers then proposed PReLU [29], RReLU [43], ELU [4] and other parameterized piecewise functions, which perform well in many tasks. Activation functions have also been used in the TCM field. To recommend effective TCM formulas for advantage diseases, Zhou [46] proposed a deep learning network (FordNet) with ReLU as the activation function of the convolution layer. Zhang [45] proposed deep learning models with cloud computing that use Tanh and ReLU activation functions to diagnose spleen and stomach diseases. The medicinal properties of herbs are discrete features. We initialize them with a mathematical method and use a non-saturating activation function to sparsify them so that they fit the true distribution during training.

Predicting model
Traditional text classification models are based on machine learning algorithms such as SVM [36] and naive Bayes [25], while deep learning algorithms are now widely used because they learn features automatically. In 2014, Kim [19] first applied a Convolutional Neural Network (CNN) to English text classification, extracting text features with one-dimensional convolutional kernels and selecting key information with max pooling. However, a conventional CNN can only learn local relations between minimal semantic units. To capture long-range textual dependencies, Johnson [17] proposed DPCNN, a low-complexity, word-level deep convolutional network. To better capture the global information of a text, researchers turned to Recurrent Neural Network (RNN) models, whose recurrent memory units give the network memory. Hu [13] experimentally analyzed the performance of TextRNN in Chinese text classification, and the results showed that, combined with Word2vec, the accuracy of Chinese text classification could be improved from 88.60% (THUCTC) to 94.62% (TextRNN). CNN and RNN are currently two popular models in deep learning, but neither is perfect, and some researchers combine them so that their strengths complement each other and their weaknesses are minimized. Liu [21] proposed the RCNN model, which adopts a bidirectional recurrent structure to obtain global text information. It reduces the noise introduced by traditional window-based neural networks, retains word order information as far as possible when learning the text representation, and then uses a max-pooling layer to obtain the key features.
With the development of artificial intelligence, many Chinese scholars have studied the classification of Chinese medicine cases with deep learning models. To explore the correlation between tongue diagnosis and formulas, Hu [15] constructed a two-channel CNN model trained on tongue images to obtain the formulas corresponding to different tongue diagnoses. Song [37] classified TCM cases with Text-CNN and LSTM. Although there are currently few studies and applications of deep learning in the TCM field, deep learning networks have great development prospects there.

Methodology
The workflow of TCM2Vec is illustrated in Fig. 1. As shown in the figure, our framework consists of two components: TCM-HRs pre-training on the unlabeled dataset, and formula-efficacy prediction with fine-tuning on the small labeled datasets. Their details are introduced in Sections 3.1 and 3.3, respectively. The initialization of TCM-HPs is introduced in Section 3.2.

Initialization of TCM-HRs
The main advantage of PLMs is automatic, data-driven feature extraction without manual feature engineering in advance. Although there are many excellent models in academic research, the appropriate choice depends on the characteristics of the data. Compared with public corpora such as Wikipedia or Books Corpus, our usable data are limited. In addition, formula texts are short, whereas most popular models [6,12,28] are designed for long text. In this research, we therefore choose the context-based CBOW framework as the base model.
To obtain a better representation of herb relationship features, we integrate TCM theoretical knowledge into the pre-training model. Herb pair theory [44] is the basis of formula compatibility, and the composition of a pair is affected by many factors. The relationship between the herbs in a pair can be synergistic or antagonistic. Reasonable compatibility can enhance the efficacy of herbs and reduce the adverse reactions caused by some herbs. For example, in the pair of pinellia ternata and ginger, ginger not only enhances the stomach-warming effect of pinellia ternata, reducing nausea and cough, but also restricts the toxicity of pinellia ternata. Combinations that violate the rules, however, may inhibit efficacy or produce toxicity. In the pair of ginseng and resveratol, ginseng can strengthen the contractility of the heart, while resveratol has antihypertensive properties that counteract ginseng.
Building on herb pairs, a feature interaction layer based on the cross term of the FM algorithm [33] is added to the basic CBOW model (Fig. 2, left) to learn the cross relations of herbs (Eq. 3), and its output is used to predict the score of the center word. The resulting TCM pre-training model is named FMh2v (Fig. 2, right), and its loss is given in Eq. 4, where X_C represents the context of the target word, c is the window size, H_R(·) stands for the latent space, v_i is a latent vector, E(·) is the mapping function of the embedding layer, σ stands for the activation function, and X_neg is the set of negative samples for one input.
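As a rough illustration of this idea (not the authors' implementation, since Eqs. 3 and 4 are not reproduced here), the sketch below scores a candidate center herb from a context window by adding a second-order FM cross term to the usual CBOW mean of context vectors, and evaluates a negative-sampling logistic loss. The names fm_cross, context_repr and neg_sampling_loss, the vocabulary and embedding sizes, and the way the cross term is combined with the CBOW vector are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM = 50, 8                          # hypothetical herb vocabulary and embedding size
E_in = rng.normal(0, 0.1, (VOCAB, DIM))     # context (input) embeddings
E_out = rng.normal(0, 0.1, (VOCAB, DIM))    # center (output) embeddings

def fm_cross(vectors):
    """Element-wise sum of pairwise products, sum_{i<j} v_i * v_j,
    computed in O(n*d) with the factorization-machine identity."""
    s = vectors.sum(axis=0)
    return 0.5 * (s * s - (vectors * vectors).sum(axis=0))

def context_repr(context_ids):
    ctx = E_in[context_ids]
    # CBOW mean of the context vectors plus the pairwise cross feature
    # (this additive combination is an assumption).
    return ctx.mean(axis=0) + fm_cross(ctx)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(context_ids, center_id, negative_ids):
    h = context_repr(context_ids)
    pos = np.log(sigmoid(E_out[center_id] @ h))               # true center herb
    neg = np.log(sigmoid(-(E_out[negative_ids] @ h))).sum()   # sampled negatives
    return -(pos + neg)

loss = neg_sampling_loss(np.array([1, 4, 7, 9]), center_id=3,
                         negative_ids=np.array([11, 20, 35]))
```

The identity used in fm_cross avoids enumerating every herb pair in the window explicitly, which is the usual reason for choosing the FM formulation of second-order interactions.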

Initialization of TCM-HPs
In the real world, each herb has its own 23-dimensional medicinal properties (toxicity, the four characters plus neutral, the five tastes and the twelve channel tropisms). For example, dendrobe is sweet in taste, slightly cold in nature and attributed to the stomach and kidney meridians. However, it is very difficult to collect and process the medicinal information of all Chinese medicines manually. In this research, we use a mathematical method to fit the distribution of TCM medicinal properties. According to the central limit theorem [20], under appropriate conditions the mean of a large number of independent random variables converges to a normal distribution after proper standardization. We treat each dimension of the intrinsic medicinal features as an independent variable. Assuming that the number of herbs appearing in the corpus approaches the number of all Chinese medicines present in nature, each dimension of the intrinsic features follows a normal distribution. The specific realization of each of the 23 dimensions is shown in Eqs. 5 and 6, where h_p denotes the representation of herbal properties, m_i is the value of the i-th dimension, μ_i is the mean of the normal distribution of the i-th dimension, and σ_i represents the variance of the i-th dimension.
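A minimal sketch of this initialization follows, assuming one independent normal distribution per property dimension. Eqs. 5 and 6 are not reproduced here, so the function name init_herb_properties, the number of herbs, and the choice of zero means and unit standard deviations are illustrative assumptions rather than the values used in the experiments.

```python
import numpy as np

N_HERBS, N_PROPS = 100, 23    # 23 property dimensions, as described in the text

def init_herb_properties(n_herbs, mu, sigma, seed=0):
    """Draw property dimension i of every herb from N(mu[i], sigma[i]**2)."""
    rng = np.random.default_rng(seed)
    return rng.normal(loc=mu, scale=sigma, size=(n_herbs, len(mu)))

mu = np.zeros(N_PROPS)        # assumed per-dimension means
sigma = np.ones(N_PROPS)      # assumed per-dimension standard deviations
H_P = init_herb_properties(N_HERBS, mu, sigma)   # one 23-dimensional row per herb
```

The resulting table H_P only provides a starting point; in the approach described here it is meant to be updated during the training of the prediction model.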

Formula-efficacy relations predicting
In this section, we train a fine-tuning model with a detached encoding layer on a formula-efficacy classification task. A fine-tuned model with better classification performance indicates that the parameters of its embedding layer provide a better representation.

Detached encoder
The prediction model is built on the initialized vector representations obtained in Sections 3.1 and 3.2. The two types of TCM feature vectors require different training behavior during network learning. As the network is iteratively trained and updated, the v_R of each medicine should remain a continuous dense vector that encodes the relationships between herbs, while the v_P vector should become sparse to represent discrete properties. For example, as shown in Table 1, a herb has four characters, five tastes, channel tropism and toxicity in TCM theory, but not every dimension of these natural properties is present in a given herb.
Encodings such as one-hot are not conducive to computing the relationships between herbs. Moreover, some herbs exhibit a natural property to different degrees, such as slightly cold, which makes manual encoding very difficult. Therefore, we design a detached encoder that provides separate mapping spaces. In Fig. 1, TCM-HRs-Encoder is the relational feature encoder initialized with FMh2v, and TCM-HPs-Encoder is the medicinal property encoder initialized with the normal distribution of herbal properties. Both encoders are composed of fully connected neurons, but TCM-HPs-Encoder additionally contains an RReLU sparse activation layer. The two vectors are then stacked to obtain the final herbal representation (Eq. 9, where h_R denotes the initialized relationship representation).
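The structure described above can be sketched roughly as follows, assuming a single fully connected layer per encoder and concatenation as the stacking operation. The class name DetachedEncoder, the layer sizes, and the random stand-ins for the pre-trained tables are hypothetical, and Eq. 9 is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def rrelu(x, lower=0.1, upper=0.5, training=True):
    """Randomized leaky ReLU: negative inputs are scaled by a slope drawn
    from U(lower, upper) in training and by the mean slope otherwise."""
    if training:
        slope = rng.uniform(lower, upper, size=x.shape)
    else:
        slope = (lower + upper) / 2.0
    return np.where(x >= 0, x, slope * x)

class DetachedEncoder:
    def __init__(self, h_r, h_p, out_r=16, out_p=8):
        self.h_r, self.h_p = h_r, h_p     # initial relation and property tables
        self.W_r = rng.normal(0, 0.1, (h_r.shape[1], out_r))
        self.W_p = rng.normal(0, 0.1, (h_p.shape[1], out_p))

    def __call__(self, herb_ids, training=False):
        v_r = self.h_r[herb_ids] @ self.W_r                              # relation branch
        v_p = rrelu(self.h_p[herb_ids] @ self.W_p, training=training)    # property branch
        return np.concatenate([v_r, v_p], axis=-1)                       # stacked representation

h_r = rng.normal(0, 0.1, (100, 32))    # stand-in for FMh2v relation vectors
h_p = rng.normal(0, 1.0, (100, 23))    # stand-in for normally initialized properties
encoder = DetachedEncoder(h_r, h_p)
formula = np.array([3, 17, 42, 8])     # a formula as a list of herb indices
herb_repr = encoder(formula)           # shape: (4 herbs, 16 + 8 features)
```

Keeping the two branches separate until the final concatenation is what lets each table be updated under its own activation during fine-tuning.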

Feature extractor and classifier
Feature extraction plays a key role in deep learning research. In this paper, we use four different feature learning techniques to test the performance of our detached encoder: convolutional networks (ConvNet), recurrent neural networks (RecNet), LSTM with a max-pooling layer (RCnnNet) and wide deep convolutional networks (WideDeepNet). After feature extraction, a fully connected layer outputs the prediction. The training objective is defined in Eq. 10 for multi-class data and in Eq. 11 for multi-label data, where y_i denotes the true label and ŷ_i denotes the predicted label.
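Eqs. 10 and 11 are not reproduced here. As a hedged illustration, the sketch below uses the standard choices for these two settings, softmax cross-entropy for multi-class data and sigmoid binary cross-entropy for multi-label data; whether these match the paper's exact objectives is an assumption, and the function names are hypothetical.

```python
import numpy as np

def multiclass_cross_entropy(logits, labels):
    """Mean softmax cross-entropy; labels are integer class indices."""
    z = logits - logits.max(axis=1, keepdims=True)          # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def multilabel_bce(logits, targets):
    """Mean sigmoid binary cross-entropy; targets are 0/1 indicator rows."""
    # max(x, 0) - x*t + log(1 + exp(-|x|)) is the overflow-safe form.
    loss = np.maximum(logits, 0) - logits * targets + np.log1p(np.exp(-np.abs(logits)))
    return loss.mean()

# One efficacy class per formula (multi-class) versus several per formula (multi-label).
mc_loss = multiclass_cross_entropy(np.zeros((3, 4)), np.array([0, 1, 3]))
ml_loss = multilabel_bce(np.zeros((2, 3)), np.array([[1, 0, 1], [0, 0, 1]]))
```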

Data and experiment
Section 4.1 introduces the experimental data for pre-training and prediction. We then present the experimental results in four parts: the FMh2v pre-training model, the performance of the detached encoder, the differences between feature extractors, and the results with different sparse activation functions. Since the quality of a word embedding cannot be evaluated directly, this research takes Accuracy (ACC) and F1 value (F1) as the evaluation metrics, where F1 is computed from Precision (PRE) and Recall (REC). Better classification results of the network indicate that the embedding has better feature representation ability.
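For reference, the metrics can be computed as in the sketch below. The text does not state how F1 is averaged over classes, so the macro average used here is an assumption, and the function names are illustrative.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision_recall_f1(y_true, y_pred, positive):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    pre = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * pre * rec / (pre + rec) if pre + rec else 0.0   # harmonic mean of PRE and REC
    return pre, rec, f1

def macro_f1(y_true, y_pred):
    labels = sorted(set(y_true) | set(y_pred))
    return sum(precision_recall_f1(y_true, y_pred, c)[2] for c in labels) / len(labels)

y_true = [0, 0, 1, 1, 2]
y_pred = [0, 1, 1, 1, 2]
acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
```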

Data
TCM medical records are widely cited by doctors during treatment, but it is difficult to extract formulas from the descriptive natural language of the records because of their low degree of digitalization. Clinics are another good source of large-scale examples, but much of this data is not publicly available. Considering also the heavy manual effort required to collect resources from ancient books or textbooks, we turned to the Internet, which offers massive digital resources.
We treat TCM formulas as objects and their efficacy as labels, and collect formula data from Chinese medicine websites (http://www.zhongyaofangji.com/index.html and https://db.yaozh.com/). The formulas from these websites cannot be used directly, because their online presentation includes not only the composition of herbs but also the dosage, processing methods, units and other information. Since this research works at the level of herbal components, such unnecessary information is removed. To obtain higher-quality data, we also prepare a medicine alias library to unify different names that refer to the same substance. Ultimately, we cleaned and formalized three datasets: 31,114 unlabeled formulas (Formula data one, Fd-1), 1,036 multi-class formulas (Formula data two, Fd-2) and 1,723 multi-label formulas (Formula data three, Fd-3) (Table 2).
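The cleaning step can be pictured with the following sketch. The alias table, the dosage units and the romanized herb names are hypothetical examples chosen for readability; the actual alias library and the format of the crawled pages are not described in enough detail to reproduce.

```python
import re

# Hypothetical alias table: maps variant names to one canonical herb name.
ALIASES = {
    "sheng jiang": "ginger",
    "fresh ginger": "ginger",
    "ban xia": "pinellia ternata",
}

# Matches a dosage such as "9g", "6 g", "3qian" or "1.5 liang" (illustrative unit list).
DOSAGE = re.compile(r"\s*\d+(?:\.\d+)?\s*(?:g|qian|liang)\b", re.IGNORECASE)
# Matches parenthesized processing notes such as "(sliced)".
NOTE = re.compile(r"\([^)]*\)")

def clean_formula(raw):
    """Reduce a raw formula string to an ordered list of canonical herb names."""
    herbs = []
    for item in re.split(r"[,;]", raw):
        name = NOTE.sub("", item)
        name = DOSAGE.sub("", name).strip().lower()
        if not name:
            continue
        name = ALIASES.get(name, name)      # unify aliases of the same substance
        if name not in herbs:               # keep the first occurrence only
            herbs.append(name)
    return herbs

cleaned = clean_formula("Ban Xia 9g, Sheng Jiang 6 g (sliced), ginseng 3qian; fresh ginger 3g")
```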

FMh2v pre-training model
In the FMh2v pre-training experiments, we employ ConvNet as the feature extractor, with convolution kernels of three window sizes (2, 3, 4) to learn information from different perspectives. The classification results of the network are then used to compare four settings: no pre-training (None), Word2Vec, BERT and our FMh2v. The experimental parameters of the pre-training model are shown in Table 3.
The experimental results in Table 4 show that the proposed FMh2v pre-training model improves classification performance on the labeled formula data compared with no pre-training, CBOW and BERT.
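A minimal sketch of such a multi-width convolutional extractor is given below, with ReLU and max-over-positions pooling for each window size. Splitting the 384 kernels reported for ConvNet into 128 filters per window size is an assumption, as are the input dimension and the function names.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv_max_pool(x, kernels):
    """x: (seq_len, dim); kernels: (n_filters, width, dim).
    Slides each filter over consecutive herbs, applies ReLU,
    then keeps the maximum response over all positions."""
    width = kernels.shape[1]
    windows = np.stack([x[i:i + width] for i in range(x.shape[0] - width + 1)])
    feats = np.einsum("pwd,fwd->pf", windows, kernels)    # (positions, n_filters)
    return np.maximum(feats, 0).max(axis=0)

def convnet_features(x, kernel_sets):
    return np.concatenate([conv_max_pool(x, k) for k in kernel_sets])

DIM, N_FILTERS = 24, 128      # 3 window sizes x 128 filters = 384 (assumed split)
kernel_sets = [rng.normal(0, 0.1, (N_FILTERS, w, DIM)) for w in (2, 3, 4)]
formula = rng.normal(size=(6, DIM))    # six herbs, each a 24-dimensional vector
features = convnet_features(formula, kernel_sets)
```

Formulas shorter than the widest window would need padding before this step, which the sketch omits.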

Detached encoder
We now have the herbal relationship representation initialized by FMh2v and the medicinal property representation initialized by the normal distribution. On this basis, we build different types of feature extractors to show the applicability of the detached encoder. In addition to the ConvNet mentioned above, RecNet, RCnnNet and WideDeepNet are used as well; detailed parameter settings are given in Table 5.
We use ACC and F1 to evaluate the effect of feature extraction on Fd-2 in Table 6, where S stands for the fine-tuning model with only the relationship representation and D denotes the prediction model with our detached encoder. As shown in Table 6, the classification performance of all models improves when the detached encoder is added. With ConvNet as the feature extractor, the ACC of the D group is 0.96% higher and its F1 is 1.16% higher than those of the S group; with RecNet, ACC increases from 51.92% to 64.42% and F1 from 43.45% to 60.34%; with RCnnNet, ACC also rises by 0.96% and F1 by 0.77%; and with the wide deep fine-tuning model, ACC and F1 increase by 7.69% and 7.47%, respectively. For multi-label prediction, we adopt ConvNet, RCnnNet and WideDeepNet as base models when fine-tuning the embedding. Table 7 shows a large improvement in the F1 value of each experiment, and ACC rises by 2.96%, 4.74% and 10.8%, respectively.

Effect of feature extractor
In this section, we analyze how different types of feature extractor influence the fine-tuning network. ConvNet achieves the best performance on both Fd-2 and Fd-3, whereas WideDeepNet performs far worse than ConvNet. Technically, both are convolutional neural networks, in which the convolution process is similar to the extraction of N-gram information. ConvNet contains only one convolutional layer with 384 kernels, whereas WideDeepNet has 500 kernels. Our two small-scale datasets are not enough to support the learning of such a large number of parameters. RecNet has only one layer of bidirectional recurrent units, which may suffer from vanishing or exploding gradients. In addition, formula compositions are short, and information may be lost during training. These are probably the reasons why RecNet has the lowest performance. RCnnNet also consists of a bidirectional LSTM layer, but it additionally has a max-pooling layer that can effectively filter noisy features and improve the fitting ability of the model, leading to better performance than RecNet on Fd-2.

Effect of sparse activation function
In this section, we use the prediction framework based on the detached encoder as the benchmark model and compare the effects of different activation functions in TCM-HPs-Encoder to obtain better fine-tuning results. PReLU, ReLU and ELU are also non-saturating nonlinear activation functions, which may help to sparsify the distribution of the input features (Fig. 3). α is a hyper-parameter in PReLU (α = 0.1), RReLU (α ∈ (0.1, 0.5)) and ELU (α = 0.1). As can be seen from Tables 8 and 9, RReLU achieves good performance with ConvNet_D, RCnnNet_D and WideDeepNet_D on Fd-3. Although RReLU is not always the best choice according to the evaluation metrics, its parameter α, drawn from a controllable range, can effectively avoid dead neurons and reduces the inconvenience of setting a fixed parameter.
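For orientation, the four activation functions can be written as below with the α values quoted above. RReLU is shown in its inference form, which uses the mean of the sampling range as a fixed slope; in PReLU the slope is normally learned, and fixing it at 0.1 here only mirrors the setting quoted in the text. The function names and the sample input are illustrative.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def prelu(x, alpha=0.1):
    return np.where(x >= 0, x, alpha * x)

def elu(x, alpha=0.1):
    return np.where(x >= 0, x, alpha * (np.exp(x) - 1.0))

def rrelu_eval(x, lower=0.1, upper=0.5):
    # Inference-time RReLU: fixed slope equal to the mean of the training range.
    return np.where(x >= 0, x, 0.5 * (lower + upper) * x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
# Fraction of outputs that are exactly zero under each activation.
zero_fraction = {f.__name__: float(np.mean(f(x) == 0.0))
                 for f in (relu, prelu, elu, rrelu_eval)}
```

On this sample input only ReLU maps negative values to exact zeros; the parameterized variants shrink them towards zero instead, which is the trade-off against dead neurons discussed above.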

Conclusion and future work
The representation of TCM plays an important role in learning internal knowledge with deep learning models. In this research, we propose TCM2Vec, a detached feature extraction deep learning approach that includes FMh2v, an unsupervised herb pre-training model combined with TCM theory, and a detached encoder for improving the applicability of deep learning methods in the TCM field. FMh2v yields a herbal relationship representation that may encode herbal interaction information. We then employ a normal distribution to model medicinal features on the basis of a large unlabeled formula dataset. Finally, deep learning models with the detached encoder are built to predict formula-efficacy relations and fine-tune the initial representation. Our experiments show that a pre-training model incorporating cross features tailored to the TCM field can provide a significant benefit. Our findings also suggest that it may be valuable to tailor the learning model to a specific field by using domain- or task-relevant theories, as with our detached encoder. However, relative to public data, formula data is a low-resource dataset. We also removed some information from the formulas, such as dosage and processing method, which limits representation performance. In the future, we will focus on more detailed research. To enrich the vector representation, information such as the processing method, the status or the dose of a herb will be considered. For deep learning applications, we will design and develop specialized models for TCM.