Dataset
The purpose of our model was to predict the disease category and recommend a registration department according to a patient's chief complaint. To verify the performance of our model, we selected 200,000 inpatients' chief complaints, together with the corresponding disease diagnosis codes and treatment departments, from a tertiary hospital, covering January 2015 to December 2018. In the electronic medical records, the disease diagnosis codes were classified according to the International Classification of Diseases (ICD)-10 (https://www.who.int/classifications/icd/en/), and only the first three characters of each ICD code were considered in this paper. After data cleaning and filtering, 198,000 records were collected, covering 130 types of disease diagnosis and 25 departments and accounting for about 80% of the inpatients. In the dataset, the maximum, minimum, and average sentence lengths were 36, 2, and 12, respectively, and the total number of distinct Chinese characters was 1456. The dataset was divided into training, validation, and test sets in a ratio of 70:15:15.
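The preparation steps above (truncating ICD codes to their three-character category and splitting 70:15:15) can be sketched as follows; the record layout and field names are illustrative assumptions, not the paper's actual pipeline:

```python
import random

def prepare_records(records, seed=42):
    """Truncate ICD-10 codes to their 3-character category and split
    the records 70:15:15 into train/validation/test sets.

    `records` is assumed to be a list of
    (chief_complaint, icd_code, department) tuples.
    """
    # Keep only the first three characters of each ICD-10 code.
    cleaned = [(text, code[:3], dept) for text, code, dept in records]
    rng = random.Random(seed)
    rng.shuffle(cleaned)
    n = len(cleaned)
    n_train, n_val = int(n * 0.70), int(n * 0.15)
    train = cleaned[:n_train]
    val = cleaned[n_train:n_train + n_val]
    test = cleaned[n_train + n_val:]
    return train, val, test
```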
Figure 2 illustrates the distribution of the number of patients across the 130 disease types, which follows a power-law distribution. The top 30 diseases account for about 50% of the patients. The disease with the largest number of patients is K80 (cholelithiasis), and the disease with the smallest number is O36 (maternal care for other known or suspected fetal problems). Figure 3 illustrates the distribution of the number of patients across the 25 departments, which also follows a power-law distribution; the three departments with the most patients (the general surgery, obstetrics, and cardiovascular departments) accounted for about 35% of all patients.
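The coverage figures above (e.g. the top 30 diseases covering about 50% of patients) correspond to the cumulative share of the most frequent classes, which can be computed with a small helper like this sketch:

```python
from collections import Counter

def top_k_coverage(labels, k):
    """Fraction of samples whose label is among the k most frequent
    classes, e.g. the share of patients covered by the top 30 diseases."""
    counts = Counter(labels)
    covered = sum(c for _, c in counts.most_common(k))
    return covered / len(labels)
```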
Baselines
- TextCNN [15]. A text classification algorithm based on CNN. It applies multiple convolution kernels of different sizes to extract key information from sentences, capturing their local correlations. TextCNN has a simple architecture and fast training speed, and has achieved state-of-the-art results on multiple datasets.
- BiLSTM [16]. The RNN is a widely applied NLP model that can process variable-length text sequences and learn long-distance dependencies from sentences. In this experiment, a single-layer bidirectional LSTM network was used to classify the input text.
- LEAM [17]. A model based on the attention mechanism. It represents text well by learning a joint embedding of words and labels in the same space. Compared with other attention-based models, LEAM needs fewer parameters, converges faster, and has good interpretability.
- Transformer [9]. A sequence processing model based on the self-attention mechanism, which can learn long-distance dependencies from sentences. It runs in parallel and is the basis of BERT and other pre-trained models.
- BERT-base [4]. The original Chinese BERT pre-trained model published by Google, which achieves state-of-the-art performance on many text classification tasks.
- BERT-wwm [18]. An updated version of BERT published by the Harbin Institute of Technology: a Chinese pre-trained model based on the whole word masking technique. Its performance is slightly better than that of the original BERT on sentence classification tasks.
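To make the TextCNN baseline concrete, here is a minimal NumPy sketch of its forward pass (embedding lookup, convolutions of several widths, max-over-time pooling, linear classifier). The weights and dimensions are placeholders for illustration, not the experiment's actual configuration:

```python
import numpy as np

def textcnn_forward(token_ids, emb, kernels, w_out, b_out):
    """Illustrative TextCNN forward pass:
    embed -> 1D convolutions of several widths -> max-over-time pooling
    -> concatenate -> linear classifier.

    emb:     (vocab, d) embedding matrix
    kernels: list of (width, d, n_filters) convolution weights
    w_out:   (total_filters, n_classes) classifier weights
    """
    x = emb[token_ids]  # (seq_len, d)
    pooled = []
    for w in kernels:
        width = w.shape[0]
        # Slide a window of `width` tokens and apply all filters (ReLU).
        feats = [
            np.maximum(0, np.tensordot(x[i:i + width], w,
                                       axes=([0, 1], [0, 1])))
            for i in range(len(x) - width + 1)
        ]
        pooled.append(np.max(feats, axis=0))  # max-over-time pooling
    h = np.concatenate(pooled)
    return h @ w_out + b_out  # class logits
```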
Implementation details
We selected the optimal parameters on the validation set through parameter tuning. The differences in experimental results under different parameters were small, indicating that the clinical dataset in this paper was insensitive to parameter settings. In addition, Chinese BERT-base tokenizes text at the character level, without the Chinese word segmentation used in traditional NLP; accordingly, word segmentation was not applied in our experiments either.
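Character-level tokenization of a chief complaint can be sketched as follows; the vocabulary mapping here is a toy assumption for illustration, not BERT's actual vocabulary:

```python
def char_tokenize(text, vocab, unk_id=1):
    """Character-level tokenization as used for Chinese BERT: each
    Chinese character is one token, with no word segmentation.
    `vocab` maps characters to ids; unknown characters map to unk_id."""
    return [vocab.get(ch, unk_id) for ch in text]
```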
The same word embedding size (64), batch size (128), and maximum sentence length (36) were adopted for the TextCNN, BiLSTM, LEAM, and Transformer models. The Adam algorithm was used for optimization. The number of epochs was not limited; training continued until the accuracy had not improved for 10 consecutive epochs. The parameters were set as follows:
TextCNN: Convolution kernels of four sizes (2, 3, 4, and 5) were used, with 128 kernels of each size. The fully connected layer contained 256 neurons. The dropout rate was 0.5, and the learning rate was 1e-4.
BiLSTM: The numbers of neurons in the LSTM hidden layer and the fully connected layer were both 128, the dropout rate was 0.2, and the learning rate was 0.001.
LEAM: The label penalty coefficient was 1.0, the convolution kernel size was 3, and the number of neurons in the hidden layer was 300. The dropout rate was 0.5, and the learning rate was 0.001.
Transformer: The numbers of encoder layers and attention heads were 4 and 8, respectively, and the number of neurons in the point-wise feed-forward network was 512. The dropout rate was 0.1, and the learning rate was 2e-5.
BERT-base: When tuning a pre-trained model, the parameter settings should be the same as those in the original BERT model. The parameters in this paper were set as follows: the maximum sentence length was 36, and the batch size was 16. The number of epochs ranged from 1 to 5, and the candidate learning rates were 5e-6, 1e-5, 2e-5, 3e-5, 4e-5, and 5e-5 [4].
The parameter settings and the corresponding tuning ranges of BERT-wwm and CHMBERT were the same as those of BERT-base.
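The fine-tuning grid described above (epochs 1 to 5, six candidate learning rates) amounts to an exhaustive search over 30 configurations, as in this sketch; `train_and_eval` stands in for an actual fine-tuning run and is a hypothetical callback, not the paper's code:

```python
def grid_search(train_and_eval,
                epochs_range=(1, 2, 3, 4, 5),
                learning_rates=(5e-6, 1e-5, 2e-5, 3e-5, 4e-5, 5e-5)):
    """Exhaustive search over the BERT fine-tuning grid.

    `train_and_eval(lr, epochs)` should fine-tune the model with the
    given settings and return the validation accuracy; the (lr, epochs)
    pair with the highest accuracy is returned.
    """
    return max(
        ((lr, ep) for lr in learning_rates for ep in epochs_range),
        key=lambda cfg: train_and_eval(*cfg),
    )
```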
Experimental results
The Accuracy and F1-score commonly used in NLP classification tasks were adopted as evaluation criteria to compare the models. The same chief complaint may correspond to different diseases; for instance, a stomachache may be caused by enteritis, appendicitis, or other diseases. Therefore, top-k prediction results were calculated when predicting the disease type, with k set to 1, 5, and 10 in these experiments. Similarly, more than one choice of first diagnosis department may be appropriate for a given chief complaint, so we obtained top-k prediction results with k = 1, 2, and 3 when predicting departments. The experimental results of the disease and department predictions of the different models are shown in Tables 1 and 2.
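The top-k accuracy used here can be computed as in this sketch (the score matrix and labels are illustrative):

```python
import numpy as np

def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label appears among the k
    highest-scoring predicted classes.

    scores: (n_samples, n_classes) array of prediction scores
    labels: sequence of true class indices, length n_samples
    """
    topk = np.argsort(scores, axis=1)[:, -k:]  # k best classes per sample
    hits = [label in row for label, row in zip(labels, topk)]
    return sum(hits) / len(labels)
```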
Table 1 Average macro accuracy and F1-score for disease prediction

Table 2 Average macro accuracy and F1-score for department prediction

Tables 1 and 2 show that the BERT-based pre-trained models performed significantly better than the other state-of-the-art models. The CHMBERT model proposed in this paper performed best among the tested models, indicating that pre-trained models have great potential in medical NLP tasks. Among the non-pre-trained models, TextCNN performed best, followed by Transformer, whereas BiLSTM and LEAM performed worst.
In the disease prediction experiment, the proposed CHMBERT model showed clear advantages in the top-1 prediction. Compared with the sub-optimal model BERT-wwm, the accuracy and F1 of CHMBERT improved by 0.16% and 0.39%, respectively. Compared with TextCNN, which performed best among the non-pre-trained models, the accuracy and F1 of CHMBERT improved by 0.9% and 1.35%, respectively. In the top-5 and top-10 predictions, the performance of CHMBERT was similar to that of the sub-optimal model and slightly better than that of TextCNN.
In the department prediction experiment, our CHMBERT model achieved the best results. Compared with the sub-optimal model, the accuracy and F1 of CHMBERT improved by 0.14% and 0.59%, respectively, in the top-1 prediction; compared with TextCNN, they improved by 0.79% and 1.74%, respectively. In the top-2 and top-3 predictions, the CHMBERT model also outperformed the BERT-wwm and TextCNN models.
Parameters discussion
We compared the performance of the CHMBERT model in disease and department prediction under different learning rates and numbers of epochs. Figure 4 shows the top-1 prediction accuracy for different numbers of epochs with the learning rate fixed at 2e-5. Figure 5 shows the top-1 prediction accuracy for different learning rates with the number of epochs set to 3.
As shown in Figs. 4 and 5, the prediction accuracy of CHMBERT was only slightly affected by these parameters. When the learning rate was fixed at 2e-5 and the number of epochs varied from 1 to 5, the differences between the maximum and minimum values of the disease and department prediction accuracies were 1.58% and 0.78%, respectively. When the number of epochs was fixed at 3, the corresponding differences under different learning rates were 1.11% and 0.51%, respectively. In general, performance was poor when the number of epochs was small (such as 1 or 2) and the learning rate was small (such as 5e-6 or 1e-5). These results indicate that 3 or 4 epochs and a learning rate of 2e-5, 3e-5, or 5e-5 are recommended.