Introduction

Symptom-based machine learning models help patients self-detect diseases through electronic devices such as smartphones or hospital robots equipped with automated question-and-answer systems [7]. Recently, several studies have improved text classification models for clinical department classification [27] and disease detection [12]. These studies used the text of symptoms together with other patient features for disease detection [17].

Dengue fever (a mosquito-borne viral disease) [18] and influenza are dangerous infectious diseases that many people contract. Their symptoms resemble those of the common cold, but both can be fatal. An estimated 3 to 5 million people become seriously ill from influenza each year [21].

Research on machine learning or deep learning for dengue and influenza falls into two groups. The first improves prediction models for forecasting the number of patients [25] or forecasting an outbreak [8] in areas or countries such as China [26], India [16], and Thailand [22]. The second improves machine learning or deep learning models for detecting dengue fever and influenza from the vital signs [6] and symptoms [1] of patients.

The Long Short-term Memory (LSTM) model is a recurrent neural network model. It is commonly used in text classification [13], time series classification [11], and time series forecasting [25].

In this research, we use the LSTM model to classify the symptoms of patients as text. To increase the performance of the classification model, the LSTM model is concatenated with a fully connected neural network that takes patient vital signs and other features as input data, including gender, body temperature, and age. Moreover, we improve our data preprocessing method by removing words that are not important for classification, which simplifies the input data.

Theoretical foundations

In this section, we describe all of the methods we used for modeling in this research.

Mutual information metric

The mutual information metric (MI) measures how well each keyword discriminates between classes. We use MI to measure the correlation between each keyword and each class. It is denoted by MI(w, c), where w is a word and c is a class, and is calculated by Eq. (1).

$$ \mathrm{MI}\left(w,c\right)=\log \frac{f_A\bullet N}{\left({f}_A+{f}_C\right)\left({f}_A+{f}_B\right)} $$
(1)

Here fA is the number of documents in class c that contain word w, fB is the number of documents not in class c that contain word w, fC is the number of documents in class c that do not contain word w, and N is the number of all documents. When fA ≥ 1, MI(w, c) takes values in the range [ − log(N), log(N)], as shown in (2) and (3).

$$ \mathrm{MI}\left(w,c\right)=\log \frac{f_A\bullet N}{\left({f}_A+{f}_C\right)\left({f}_A+{f}_B\right)}\le \log \frac{N}{\left({f}_A+{f}_B\right)}\le \log (N) $$
(2)
$$ \mathrm{MI}\left(w,c\right)=\log \frac{f_A\bullet N}{\left({f}_A+{f}_C\right)\left({f}_A+{f}_B\right)}\ge \log \frac{f_A}{\left({f}_A+{f}_C\right)}\ge \log \frac{1}{N}=-\log (N) $$
(3)

The MI of each word is defined as its MI with the class that gives the highest value, as shown in Eq. (4), where d is the number of classes.

$$ \mathrm{MI}(w)=\underset{i=1:d}{\max}\mathrm{MI}\left(w,{c}_i\right) $$
(4)

The MI is largest when fA = 1, fB = 0, and fC = 0. Words with a frequency of 1 therefore receive the highest MI and appear important for classification.
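The calculation in Eqs. (1) and (4) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the toy documents are hypothetical, f_C is taken as the number of documents in class c without the word (so that f_A + f_C is the size of class c), and a word that never occurs in a class is given MI of negative infinity.

```python
import math

def mutual_information(docs, labels, word, cls):
    """MI(w, c) as in Eq. (1); docs are lists of tokens, labels are class names."""
    n = len(docs)
    f_a = sum(1 for d, y in zip(docs, labels) if y == cls and word in d)
    f_b = sum(1 for d, y in zip(docs, labels) if y != cls and word in d)
    f_c = sum(1 for d, y in zip(docs, labels) if y == cls and word not in d)
    if f_a == 0:
        return float("-inf")  # assumption: log(0) is treated as -infinity
    return math.log(f_a * n / ((f_a + f_c) * (f_a + f_b)))

def word_mi(docs, labels, word):
    """MI(w) as in Eq. (4): the maximum of MI(w, c) over all classes."""
    return max(mutual_information(docs, labels, word, c) for c in set(labels))

# Toy, made-up documents (not taken from the hospital dataset).
docs = [["fever", "cough"], ["fever", "rash"], ["cough"], ["runny", "nose", "cough"]]
labels = ["flu", "dengue", "cold", "cold"]

for w in ("rash", "cough"):
    print(w, round(word_mi(docs, labels, w), 4))
```

In this toy example the word "rash" occurs once, in the only dengue document, so it reaches the upper bound log(N), which illustrates why words of frequency 1 score highest.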

Word embedding

Word embedding is a method for representing each word as a vector of real numbers. Word2vec [15] is a word embedding method in which words with similar meanings have neighboring vectors. The dimension of the word vectors can be set when the word2vec model is trained. If a pre-trained word2vec model is used, principal component analysis (PCA) can reduce the word vectors to the desired dimension.

Interpolation

Interpolation is a method for estimating missing data points by fitting a polynomial or another function to the known points [2]. An example of estimating a missing point of the function y = sin(x) is shown in Fig. 1.

Fig. 1 Data interpolation with linear and cubic functions
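As a small worked illustration of the linear case, the sketch below estimates a missing point of y = sin(x) from its two neighbors. The sample positions (x = 1, 2 and the missing x = 1.5) and the function name are assumptions chosen for this example, not values taken from the figure.

```python
import math

def linear_interpolate(x0, y0, x1, y1, x):
    """Estimate y at x on the straight line through (x0, y0) and (x1, y1)."""
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

# Known samples of y = sin(x) at x = 1 and x = 2; the point at x = 1.5 is missing.
estimate = linear_interpolate(1.0, math.sin(1.0), 2.0, math.sin(2.0), 1.5)
error = abs(estimate - math.sin(1.5))
print(f"estimate={estimate:.4f}, true={math.sin(1.5):.4f}, error={error:.4f}")
```

A cubic interpolant would use more neighboring points and usually follows a smooth curve such as sin(x) more closely than the straight line does.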

LSTM

Long short-term memory (LSTM) [9] is a recurrent neural network (RNN) architecture. Each input record of an LSTM model is a sequence of vectors. The structure of LSTM is shown in Fig. 2, where Xt is the input vector at time step t.

Fig. 2 LSTM model structure
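For reference, a commonly used formulation of one LSTM step is given below. It is a sketch of the widely used variant with a forget gate and may differ in detail from the structure drawn in the figure. The symbols other than Xt are assumptions introduced here: h_t is the hidden state, c_t the cell state, W and b are trainable weights and biases, σ is the logistic sigmoid, and ⊙ is element-wise multiplication.

$$ \begin{aligned} f_t &= \sigma \left({W}_f\left[{h}_{t-1},{X}_t\right]+{b}_f\right) \\ i_t &= \sigma \left({W}_i\left[{h}_{t-1},{X}_t\right]+{b}_i\right) \\ o_t &= \sigma \left({W}_o\left[{h}_{t-1},{X}_t\right]+{b}_o\right) \\ {\tilde{c}}_t &= \tanh \left({W}_c\left[{h}_{t-1},{X}_t\right]+{b}_c\right) \\ c_t &= {f}_t\odot {c}_{t-1}+{i}_t\odot {\tilde{c}}_t \\ h_t &= {o}_t\odot \tanh \left({c}_t\right) \end{aligned} $$

The forget gate f_t and input gate i_t control how much of the previous cell state is kept and how much new information is written, which is what lets the model carry information across long sequences.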

The LSTM model is used for classification or prediction of sequential input data. Several improved variants of LSTM have been proposed and applied to time series prediction and text classification, such as LSTM fully convolutional networks for time series classification [11], bidirectional LSTM for sentiment analysis [13], and medical text classification [7].

Imbalanced data problem

The imbalanced data problem arises in classification when the numbers of records in the classes differ greatly [19]. In binary classification, the class with more records is called the majority class and the other is called the minority class.

There are two popular methods for solving the imbalanced data problem:

  1. 1)

    Using undersampling or oversampling so that the training data of every class have the same number of records.

  2. 2)

    Using a loss function in the machine learning or deep learning model that increases the weight of the minority class.

In this research, we use a class-weighted cross-entropy loss function [24], given in Eq. (6), as the loss function of the LSTM model to address the imbalanced data problem. It modifies the cross-entropy loss in Eq. (5), where n is the number of records in the dataset, tk = [tk(1), tk(2), …, tk(d) ] is the target output vector of the kth record with tk(i) ∈ {0, 1} for i = 1, 2, …, d, and yk = [yk(1), yk(2), …, yk(d) ] is the model output vector for the kth record with yk(i) ∈ (0, 1) for i = 1, 2, …, d. Moreover, nk is the number of training records in the class of the kth record, and γ ∈ [0, 1] is a constant.

$$ E=-\sum \limits_{k=1}^n\sum \limits_{i=1}^d{t}_k(i)\log {y}_k(i) $$
(5)
$$ E=-\sum \limits_{k=1}^n\sum \limits_{i=1}^d{\left(\frac{1}{n_k}\right)}^{\gamma }{t}_k(i)\log {y}_k(i) $$
(6)
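The effect of the weight (1/nk)^γ can be seen in a minimal Python sketch of Eqs. (5) and (6). The function names and the toy values are hypothetical, the outer sum is taken over records k and the inner sum over classes i, and nk is counted from the one-hot targets passed in, which stands in for the training set here.

```python
import math

def cross_entropy(targets, outputs):
    """Eq. (5): sum over records k and classes i of -t_k(i) * log y_k(i)."""
    return -sum(t * math.log(y)
                for t_k, y_k in zip(targets, outputs)
                for t, y in zip(t_k, y_k))

def weighted_cross_entropy(targets, outputs, gamma):
    """Eq. (6): the term of each record k is scaled by (1 / n_k) ** gamma."""
    # Count the records per class from the one-hot target vectors.
    class_counts = [sum(col) for col in zip(*targets)]
    total = 0.0
    for t_k, y_k in zip(targets, outputs):
        n_k = class_counts[t_k.index(1)]  # size of the class of record k
        weight = (1.0 / n_k) ** gamma
        total -= weight * sum(t * math.log(y) for t, y in zip(t_k, y_k))
    return total

# Toy data: three majority-class records and one minority-class record.
targets = [[1, 0], [1, 0], [1, 0], [0, 1]]
outputs = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.4, 0.6]]

print(cross_entropy(targets, outputs))
print(weighted_cross_entropy(targets, outputs, gamma=1.0))
```

With γ = 0 the weighted loss reduces to Eq. (5); with γ = 1 each majority record counts one third as much as the single minority record in this toy example.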

Material and methods

Data description

The data used in this research come from medical records from Saraphi Hospital, Chiang Mai Province, Thailand, between 2015 and 2020 [3,4,5]. We use only records of patients diagnosed with one of three diseases: the common cold, flu, and dengue. All the attributes used in this research are listed in Table 1.

Table 1 The attributes used in this research

The distribution (average and standard deviation) of some features and the number of records for each class are shown in Table 2.

Table 2 The average, standard deviation, and number of patients for some features

From the statistical hypothesis test (t-test), it was found that:

  1. 1)

    Average age: the mean age of common cold patients was greater than the mean age of dengue and flu patients (p-value < 0.05), but the mean ages of dengue and flu patients did not differ significantly (p-value > 0.05).

  2. 2)

    Average body temperature: the mean body temperature of common cold patients was lower than that of dengue patients (p-value < 0.01), and the mean body temperature of dengue patients was lower than that of flu patients (p-value < 0.01).

Data preprocessing

In this research, the features used for classification are CHIEFCOMP, GENDER, MONTH_SERV, BTEMP, and AGE. For the numerical features (BTEMP and AGE), we use min-max normalization to scale the values to the range [0,1]. Examples of data are shown in Table 3. For MONTH_SERV, we use one-hot encoding to convert each value to a vector of 0s and 1s. The CHIEFCOMP column contains a sentence in the Thai language. We use the Python library “pythainlp” [20] for word tokenization. For example, the sentence “เป็นหวัดมีน้ำมูกไอ” (English: “Having a cold with a runny nose and cough”) is tokenized into the list of words [“เป็น”, “หวัด”, “มี”, “น้ำมูก”, “ไอ”]. Then the Python library “Gensim” [14] is used to create a word2vec model that converts the text of each record into a matrix of real numbers.

Table 3 Examples of data in our dataset
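The two numerical preprocessing steps can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and sample temperatures are made up, and MONTH_SERV is assumed to hold month numbers 1 to 12.

```python
def min_max_normalize(values):
    """Rescale a list of numbers linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # assumption: constant columns map to 0
    return [(v - lo) / (hi - lo) for v in values]

def one_hot_month(month):
    """Encode a month number (assumed 1..12) as a 12-element 0/1 vector."""
    return [1 if m == month else 0 for m in range(1, 13)]

btemp = [36.5, 37.0, 38.5, 39.5]  # made-up body temperatures in degrees Celsius
print(min_max_normalize(btemp))
print(one_hot_month(3))
```

In practice the minimum and maximum would normally be taken from the training data only and reused for the validation and testing data, so that no information leaks from the test set.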

Keywords selection

During text preprocessing for LSTM training, we removed words that were not important for classification to simplify the input data, including:

  1. 1.

    Low MI: words with a low mutual information metric (bottom 5%).

  2. 2.

    Low frequency: words with low frequency (frequency < 2). Such words have a high MI, that is, an apparently high ability for classification, but they may be typographical errors.

These words are defined as stop words, and all stop words are removed from the data. Next, we set the positions of the removed words to missing values, as shown in Fig. 3.

Fig. 3 Vectors of words in a sentence after the removal of 2 stop words
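The two selection rules and the masking step can be sketched as below. This is a hedged illustration: the function names, the example words, and the way the bottom 5% is cut (sorting by MI and taking the first 5% of the vocabulary, rounded down) are assumptions, since the text does not specify how ties or rounding are handled.

```python
def select_stop_words(mi_by_word, freq_by_word, mi_fraction=0.05, min_freq=2):
    """Return words in the bottom `mi_fraction` by MI or with frequency < min_freq."""
    ranked = sorted(mi_by_word, key=mi_by_word.get)  # ascending MI
    n_low = int(len(ranked) * mi_fraction)           # size of the bottom slice
    low_mi = set(ranked[:n_low])
    low_freq = {w for w, f in freq_by_word.items() if f < min_freq}
    return low_mi | low_freq

def mask_stop_words(tokens, stop_words):
    """Replace each stop word by None so that its position becomes a missing value."""
    return [None if t in stop_words else t for t in tokens]

# Made-up vocabulary of 20 words with increasing MI; "w7" occurs only once.
mi_by_word = {f"w{i}": i / 10 for i in range(20)}
freq_by_word = {w: 5 for w in mi_by_word}
freq_by_word["w7"] = 1

stop_words = select_stop_words(mi_by_word, freq_by_word)
print(sorted(stop_words))
print(mask_stop_words(["w0", "w3", "w7", "w9"], stop_words))
```

Keeping the masked positions as None, rather than deleting them immediately, is what allows the sentence to be either shortened or filled in afterwards.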

We use three methods to solve the missing values problem:

  1. 1.

    Cut the stop words: remove the vectors of all stop words from the sentence.

  2. 2.

    Fill with mean: replace each missing vector with the mean of the word2vec vectors of all words in the sentence, computed separately for each vector component.

  3. 3.

    Interpolation: fill each missing vector by interpolating each vector component over the word positions in the sentence.

We show an example of filling missing values for two-dimensional word2vec vectors in Fig. 4.

Fig. 4 Solving the missing values problem
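The three strategies can be sketched on a sentence of two-dimensional vectors in which None marks a removed stop word. The function names and values are hypothetical. Two choices here are assumptions rather than details given in the text: the mean is taken over the remaining (non-missing) words, and the interpolation is linear, with gaps at the start or end of a sentence copied from the nearest known vector. The sketch also assumes at least one word remains in the sentence.

```python
def cut(vectors):
    """Method 1: drop the missing positions, which shortens the sequence."""
    return [v for v in vectors if v is not None]

def fill_with_mean(vectors):
    """Method 2: replace each missing vector by the per-component mean of the rest."""
    known = cut(vectors)
    mean = [sum(col) / len(known) for col in zip(*known)]
    return [v if v is not None else list(mean) for v in vectors]

def fill_by_interpolation(vectors):
    """Method 3: linear interpolation over word positions, per component."""
    known = [i for i, v in enumerate(vectors) if v is not None]
    filled = []
    for i, v in enumerate(vectors):
        if v is not None:
            filled.append(v)
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:       # gap at the start: copy the nearest known vector
            filled.append(list(vectors[right]))
        elif right is None:    # gap at the end: copy the nearest known vector
            filled.append(list(vectors[left]))
        else:
            w = (i - left) / (right - left)
            filled.append([a + w * (b - a)
                           for a, b in zip(vectors[left], vectors[right])])
    return filled

sentence = [[0.0, 1.0], None, [2.0, 3.0], None, [4.0, 2.0]]
print(cut(sentence))
print(fill_with_mean(sentence))
print(fill_by_interpolation(sentence))
```

Cutting changes the sequence length, while the two filling methods keep every word at its original position, which preserves the distance between the remaining words.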

LSTM with fully connected neural network model

To train the models, we divide the data into three datasets: training, validation, and testing data. First, we use all of the words in the CHIEFCOMP column of the dataset to train the word2vec model. Then we divide the dataset into two parts: 80% for training and validation and 20% for testing.
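Applying an 80/20 split twice, first to separate the testing data and then to separate the validation data from the remaining records, gives overall proportions of 64% training, 16% validation, and 20% testing. The sketch below illustrates this with a hypothetical helper; a plain random shuffle is an assumption, since the text does not say whether the split was random or stratified by class.

```python
import random

def split_80_20(records, seed=0):
    """Shuffle a copy of `records` and split it into 80% and 20% parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut_point = len(shuffled) * 4 // 5  # integer arithmetic avoids rounding issues
    return shuffled[:cut_point], shuffled[cut_point:]

records = list(range(1000))                  # stand-in for the patient records
train_val, test = split_80_20(records)       # 80% training + validation, 20% testing
train, val = split_80_20(train_val, seed=1)  # 80% training, 20% validation
print(len(train), len(val), len(test))
```

For an imbalanced dataset, a stratified split that keeps the class proportions in each part is a common alternative to a plain shuffle.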

In the next step, we compute the MI of all words in the training and validation dataset, then remove the words with low MI (bottom 5%) and the words with frequency less than 2. Next, we solve the missing values problem and use the training and validation dataset to train the LSTM with the fully connected neural network model, dividing this dataset into 80% training and 20% validation data. The conceptual framework of our research is shown in Fig. 5. The softmax function in Eq. (7) is used as the activation function of the last layer of the classification model to compute the probability of each record belonging to each class, where y = [y1, y2, …, yd ] is a vector of real numbers.

$$ \mathrm{softmax}\left({y}_i\right)=\frac{\exp \left({y}_i\right)}{\sum \limits_{j=1}^d\exp \left({y}_j\right)} $$
(7)
Fig. 5 Research conceptual framework
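The softmax activation of the last layer can be sketched in plain Python using the standard definition, with the exponential in both the numerator and the denominator. The function name and input values are illustrative, and subtracting the maximum before exponentiating is a common numerical safeguard that is not mentioned in the text; it does not change the result.

```python
import math

def softmax(y):
    """Map a vector of real numbers to class probabilities that sum to 1."""
    shift = max(y)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(v - shift) for v in y]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])  # made-up outputs for three classes
print(probs)
```

The predicted class of a record is the one with the largest probability.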

Results and discussion

Performance measurement

Since the dataset in this research is imbalanced, accuracy is not a suitable measure of model performance. For this reason, we use G-mean (the geometric mean of recall) [28] to measure the performance of the models. G-mean is defined in Eq. (9), where d is the number of classes and recall(class ci) is the recall of class ci, defined in Eq. (8).

$$ \mathrm{recall}\left(\mathrm{class}\ {c}_i\right)=\frac{\mathrm{number}\ \mathrm{of}\ \mathrm{correctly}\ \mathrm{classified}\ \mathrm{records}\ \mathrm{in}\ \mathrm{class}\ {c}_i}{\mathrm{number}\ \mathrm{of}\ \mathrm{all}\ \mathrm{records}\ \mathrm{in}\ \mathrm{class}\ {c}_i} $$
(8)
$$ \mathrm{G}\text{-}\mathrm{mean}=\sqrt[d]{\prod \limits_{i=1}^d\mathrm{recall}\left(\mathrm{class}\ {c}_i\right)} $$
(9)
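A short sketch of Eqs. (8) and (9) shows why G-mean is preferred over accuracy on imbalanced data. The function names and labels below are made up for illustration.

```python
def recall_per_class(y_true, y_pred):
    """Eq. (8): fraction of the records of each class that are classified correctly."""
    recalls = {}
    for c in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == c)
        correct = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recalls[c] = correct / total
    return recalls

def g_mean(y_true, y_pred):
    """Eq. (9): the d-th root of the product of the per-class recalls."""
    recalls = recall_per_class(y_true, y_pred)
    product = 1.0
    for r in recalls.values():
        product *= r
    return product ** (1.0 / len(recalls))

y_true = ["cold"] * 8 + ["flu"] * 2
always_cold = ["cold"] * 10                              # ignores the minority class
mixed = ["cold"] * 6 + ["flu"] * 2 + ["flu", "cold"]     # some errors in both classes

print(g_mean(y_true, always_cold))
print(g_mean(y_true, mixed))
```

A model that always predicts the majority class reaches 80% accuracy on this toy data but a G-mean of 0, because the recall of the minority class is 0.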

Performance of model

The performance of all models is shown in Tables 4 and 5. Label-indicator morpheme growth (MG) [10] is a method that adds weight to the keywords with the highest MI (top 5%). The SMS spam dataset is a standard dataset for text classification [23]. The model used in this research was a single-layer LSTM and a neural network with a single hidden layer (5 hidden nodes), with the Adam optimizer in the Python library “keras”.

Table 4 Performance of models (area under the ROC curve, AUC)
Table 5 Performance of models (G-mean)

For the LSTM model, we compare an LSTM with no hidden layer and an LSTM with a single hidden layer (the size of the vector in the hidden layer is 5). In addition, for the fully connected neural network with a single hidden layer, we tried 5, 10, 15, and 20 hidden nodes. For the word2vec model, we tried vector sizes of 20, 25, and 30.

We considered our dataset in three ways. Two are binary classification datasets: 1) common cold and dengue, and 2) common cold and flu. The third is a multi-class dataset (common cold, dengue, and influenza). The SMS spam collection dataset is a standard dataset that we used to test the performance of our method; it consists of two classes, ham and spam messages.

In addition to the LSTM and the LSTM with a fully connected neural network, we also used the LSTM model with numerical features, as shown in Fig. 6, to compare the performance of the models.

Fig. 6 LSTM model structure

The results showed that the LSTM with a fully connected neural network performed better than the normal LSTM. Moreover, removing stop words increased the G-mean value on the testing data for all datasets. For the medical records dataset, the LSTM with a fully connected neural network gives the best G-mean value when words with low MI (bottom 5%) and words with low frequency (frequency < 2) are both treated as stop words. Treating only the low-frequency words (frequency < 2) as stop words reduces the training time (shown in Table 6) and increases the performance of the LSTM model. Moreover, the LSTM with the feed-forward fully connected neural network takes less time to train than the LSTM model because it converges faster.

Table 6 Time for data preprocessing, model training, and testing of each model (seconds). Run on the data science server at Chiang Mai University, Thailand (Linux VPS, 16 GB RAM, Intel Core i9 CPU, 2080 Ti 11 GB GPU)

Conclusion

This research used the LSTM model with a fully connected neural network for dengue fever and influenza detection. The text of symptoms and other features, including age, body temperature, gender, and month of service, were used as input data. The results showed that the LSTM with the fully connected neural network had higher performance than the normal LSTM model. In addition, removing unimportant keywords from the dataset also increased performance.