Deep transfer learning: a novel glucose prediction framework for new subjects with type 2 diabetes

Blood glucose (BG) prediction is an effective approach to avoid hyper- and hypoglycemia, and achieve intelligent glucose management for patients with type 1 or serious type 2 diabetes. Recent studies have tended to adopt deep learning networks to obtain improved prediction models and more accurate prediction results, which have often required significant quantities of historical continuous glucose-monitoring (CGM) data. However, for new patients with limited historical data, it is difficult to establish an acceptable deep learning network for glucose prediction. Consequently, the goal of this study was to design a novel prediction framework with instance-based and network-based deep transfer learning for cross-subject glucose prediction based on segmented CGM time series. Taking the differences between individuals into consideration, dynamic time warping (DTW) was applied to determine the source domain data that shared the greatest degree of similarity with the data of a new subject. After that, a network-based deep transfer learning method was designed with the cross-domain dataset to obtain a personalized model combined with improved generalization capability. In a case study on a clinical dataset, the proposed deep transfer learning framework, supplied with additional segmented data from other subjects, achieved more accurate glucose predictions for new subjects with type 2 diabetes.


Introduction
Diabetes mellitus (DM) is a chronic metabolic disease, which causes a series of complications and seriously affects the patient's health and daily life [1]. For patients with diabetes, it is necessary that they maintain their blood glucose concentration (BGC) within a target range (e.g., 70-180 mg/dL) [2]. Recently, continuous glucose-monitoring (CGM) systems [3][4][5] have been widely used in diabetes management to provide accurate and frequent dynamic glucose records in the form of time series data. By analyzing these data, deeper insight into glucose fluctuations could be obtained to provide valuable information for glucose management.
A glucose predictor would utilize the valuable CGM records to forecast the glucose level with a short-term time horizon and guide the diabetics to make proper insulin adjustments. Accurate glucose prediction is also vital for the early and proactive regulation of blood glucose before it drifts to undesirable levels. Therefore, numerous approaches, based on physical models or data-driven empirical models, have been proposed to predict glucose levels [6][7][8][9][10][11][12][13].
For physical models, it is usually difficult to provide a personalized description as the model parameters are usually impossible to estimate [14]. In contrast, data-driven empirical models provide a relatively easier way to estimate the time-varying relationships among the variables [8,15,16], but they have limitations in real-time prediction. The main reason is that future glucose levels are influenced by many factors such as historical trends, administered insulin, physical activity, carbohydrate intake and hormone concentration, and most of these factors are time variant and cannot be captured accurately [17,18]. Moreover, the glucose levels of different subjects may be impacted by their different lifestyles and physiological reactions [19]. As a result, various prediction strategies based on complex nonlinear data-based structures only rely on CGM or a small amount of physiological status monitoring data without involving physiological variables [20][21][22].
In recent years, with the development of deep learning networks, more and more researchers have begun to apply deep learning techniques to blood glucose prediction. Mirshekarian et al. used the long short-term memory network (LSTM) with multiple physiological data to predict blood glucose, which achieved better prediction results than methods based on support vector regression (SVR) [23]. Moreover, Li et al. proposed a convolutional recurrent neural network (CRNN) for blood glucose predictions [24]. It consisted of a multi-layer convolutional neural network (CNN) and a modified recurrent neural network (RNN) layer, in which the CNN extracted the features from multi-dimensional aligned time series data, while the modified RNN layer modeled the sequence data and provided blood glucose predictions. The experimental results showed that the network performed better than the exogenous autoregressive model (ARX), SVR and latent variable (LV) models. Aiello et al. presented a deep glucose forecasting (DGF) model that was constructed from an LSTM with a fully connected layer. That model had better prediction results than the average linear model and daily model predictor (DMP) [25]. Mosquera-Lopez et al. [26] developed a deep learning network that achieved superior prediction performance with the aid of huge amounts of blood glucose data (27,466 days). This model produced more accurate predictions than other machine learning approaches. The authors attributed the results to massive training data, a sufficiently complex network structure, and using different data volumes for model training. Therefore, it was reasonable to presume that deep learning networks tend to perform better when provided with sufficient training data.
Considering the significant differences in patients' physiology and behavior, many studies attempted to develop personalized deep learning models for better performance [27][28][29]. However, a common situation in clinical practice is that new patients do not have sufficient data for training a personalized model. Luo and Zhao [30] used transfer learning and incremental learning to improve the prediction accuracy of the ARX model, thus resolving the issue of insufficient data from new patients. However, that study was only concerned with traditional machine learning methods which could not benefit from the superior modeling performance of deep transfer learning.
Although various strategies for glucose prediction have been proposed in recent research studies, most of those were for patients with type 1 diabetes. Considering that type 2 diabetes accounts for a high proportion of the diabetic population and many patients with serious type 2 diabetes need to precisely manage their glucose levels, there arose a need for online glucose monitoring and prediction to provide early warning of hypoglycemia, as well as the selection of a personalized clinical medication and treatment plan [31,32]. However, glucose dynamics may vary substantially among patients with type 2 diabetes, due to their diverse individual specificities, pathogenic factors, therapeutic schedules, and lifestyles. Therefore, there are still many challenges for the accurate prediction of glucose levels in type 2 diabetes patients and these challenges must be appropriately addressed.
Motivated by the above considerations, a deep transfer framework to predict glucose levels of new subjects with limited historical data was proposed in this paper. By including historical segmented blood glucose data from other patients' database, a personalized deep learning model was established to achieve dynamic short-term glucose predictions for new patients with T2D. The schematic diagram is shown in Fig. 1.
As shown in Fig. 1, the proposed deep transfer framework consisted of instance-based deep transfer and network-based deep transfer. The specific design ideas can be summarized as follows: first, a pre-trained deep learning model with superior neural network generalization ability was developed with the source domain data. Then, to guarantee the greatest level of similarity between the extended target domain and the target domain, the training dataset was augmented under the minimum dynamic time warping (DTW) distance principle [33]. Finally, the pre-trained general deep learning network was fine-tuned by data from the extended target domain to achieve the personalized prediction model. In addition to inheriting the generalization ability of the pre-trained network, the personalized model also benefited from improved prediction performance in the target domain.
The contributions of this study were as follows:
1. A deep transfer learning framework was developed to establish a personalized glucose prediction model for new subjects with type 2 diabetes.
2. Instance-based deep transfer learning with the minimum DTW distance was proposed to guarantee the minimal differences between the extended target domain and the original target domain.
3. Based on the extended target domain, network-based deep transfer learning was developed to fine-tune the pre-trained general deep learning network to finally achieve the personalized prediction model.
4. The proposed framework was applied to a clinical dataset from patients with T2D, and a series of experiments were performed to verify its effectiveness and clinical acceptability.
The rest of this paper is organized as follows: Sect. 2 explains the deep transfer learning framework for glucose prediction, and introduces its related methods in detail. Section 3 describes the dataset and data pre-processing. Section 4 shows the experimental results in the clinical dataset with the proposed framework and presents necessary comparative experiments for performance analysis. Finally, Sect. 5 provides concluding comments for our study.

Deep transfer learning framework for personalized glucose prediction
Deep transfer learning is an extension of the standard transfer learning theory that addresses how to effectively transfer knowledge through a deep neural network when there is insufficient data in the target domain. It can be classified into instance-based deep transfer learning, mapping-based deep transfer learning, network-based deep transfer learning, and adversarial-based deep transfer learning [34]. The unified definition of deep transfer learning is based on standard transfer learning [35], which defines a source domain $D_S$ and a target domain $D_T$ with learning tasks $T_S$ and $T_T$, respectively. Deep transfer learning refers to deep learning improvement of the nonlinear predictive function $f_T(\cdot)$ in $D_T$ using the knowledge in $D_S$ and $T_S$.
Instance-based deep transfer learning is a branch of deep transfer learning which aims to find appropriate instances from the source domain to be used as supplements for deep learning in the target domain by a well-designed weight adjustment mechanism. As an example, let us consider TrAdaBoost, which is a typical instance-based algorithm that addresses inductive transfer learning problems [36]. This algorithm assumes that data in the source and target domains have the same features and labels but different distributions. Therefore, some data in the source domain could be valuable for learning, while other data may be of no use, or may even have a destructive effect.
A small amount of data from a new subject can be regarded as the target domain $D_T$, and glucose prediction for a new subject is equivalent to the learning task $T_T$. The data from other subjects can be regarded as the source domain $D_S$, where $D_S \neq D_T$ because of the different distributions between the subjects. Therefore, deep transfer learning can be defined here as using deep learning, with the help of data from other subjects, to complete the prediction task for new subjects. The related methods of segmented data deep transfer learning for glucose prediction are described in detail in the following sections.

Instance-based transfer with minimum DTW distance
For the glucose prediction task discussed in this paper, the source domain contained the segmented glucose measurement data from multiple subjects, and the target domain only contained the segmented glucose data of a new subject. The data in the source and target domains had the same features and labels. The limited amount of data from new subjects was not enough to train a personalized deep learning prediction model; therefore, the expansion of the target domain data through instance-based transfer was the most intuitive transfer method. However, due to the differences in the physiological state and external environment between the subjects, different subjects had different distributions of glucose measurement data. Consequently, it was necessary to extract the data from the source domain that would be beneficial for transferring into the target domain. It was generally believed that it would be beneficial to learning of the target domain if the source domain data had a higher degree of similarity to the target domain. The trend and fluctuations of a sequence can be judged by the sequence shape; hence, this study used dynamic time warping (DTW) to measure similarity. DTW is one of the most well-known pattern-matching techniques for determining the shape similarity between two time series [37]. Compared with other pattern-matching methods, DTW is a simple but effective method that has been applied in many fields [38], such as speech recognition [39,40], computer vision and process monitoring [41,42]. Based on a dynamic programming technique, DTW finds the minimal distance between two time series by shrinking or stretching the time dimension [33,43].
Given two sequences $X = (x_1, x_2, \dots, x_m)$ and $Y = (y_1, y_2, \dots, y_n)$, a distance matrix with dimension $m \times n$ was first calculated. The local distance $d(x_i, y_j)$ between the corresponding elements $x_i$ and $y_j$ was measured by a distance function, such as the squared Euclidean distance or Manhattan distance [44]. Then, the warping path $W = (w_1, w_2, \dots, w_k, \dots, w_L)$ through the distance matrix was determined by minimizing the cumulative distance, where $w_k = (i, j)$ indicated that $x_i$ was aligned with $y_j$ at step $k$ of the warping path with local distance $d(x_i, y_j)$, and $L$ was the length of the warping path, where $\max(m, n) \le L \le m + n - 1$. The warping path $W$ satisfied the following three local constraints:
(1) Boundary constraints. $w_1 = (1, 1)$ and $w_L = (m, n)$.
(2) Monotonicity. For $w_k = (i, j)$ and $w_{k+1} = (i', j')$, $i' \ge i$ and $j' \ge j$.
(3) Continuity. For $w_k = (i, j)$ and $w_{k+1} = (i', j')$, $i' - i \le 1$ and $j' - j \le 1$.
Next, the DTW distance was defined as the minimum cumulative cost over all warping paths that satisfied the above three constraints:

$$\mathrm{DTW}(X, Y) = \min_{W} \sum_{k=1}^{L} d(w_k)$$

Thus, the DTW distance was obtained as a dynamic programming solution with the following recurrence:

$$D(i, j) = d(x_i, y_j) + \min\{D(i-1, j),\ D(i, j-1),\ D(i-1, j-1)\}$$

where $D(i, j)$ represented the DTW distance between two time series of length $i$ and $j$, so that $\mathrm{DTW}(X, Y) = D(m, n)$. Finally, the DTW provided the cumulative distance to measure the similarity between the two time series. The DTW alignment is illustrated in Fig. 2, where the red line shows the optimal warping path (the W in the middle). A diagonal move indicated a match between the two series, while an expansion duplicated a point in one sequence and a contraction eliminated a point.
The DTW was used to measure the level of similarity between the glucose measurement sequences in the target and source domains, and to transfer the glucose measurement sequences in the source domain that had high similarity to the target domain to construct an extended target domain. The extended target domain constructed based on the minimum DTW distance had a high degree of similarity with the original target domain, and retained the distribution of the original target domain to the greatest extent, which was desirable for representation learning of the target domain.
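The selection step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the squared Euclidean local distance, the function names `dtw_distance` and `select_similar_sources`, and the toy sequences are all assumptions made for the example.

```python
from math import inf


def dtw_distance(x, y):
    """Cumulative DTW distance between two sequences.

    Assumption: the local distance is the squared Euclidean distance.
    """
    m, n = len(x), len(y)
    # D[i][j] is the DTW distance between the first i points of x
    # and the first j points of y; row/column 0 are boundary sentinels.
    D = [[inf] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[m][n]


def select_similar_sources(target, sources, k):
    """Return the ids of the k source sequences closest to the target."""
    ranked = sorted(sources, key=lambda sid: dtw_distance(target, sources[sid]))
    return ranked[:k]


# Toy example with hypothetical sequences (not clinical data).
target = [1.0, 2.0, 3.0, 2.0]
sources = {
    "A": [1.0, 1.0, 2.0, 3.0, 2.0],  # same shape, stretched in time
    "B": [3.0, 2.0, 1.0, 2.0],       # mirrored shape
    "C": [1.0, 2.0, 3.0, 3.0],       # similar shape, different ending
}
print(select_similar_sources(target, sources, k=2))
```

Sequence "A" has a DTW distance of zero to the target even though it is one point longer, which is the behavior that makes DTW a shape-based rather than a point-by-point similarity measure.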

Network-based transfer on glucose prediction network
In network-based deep transfer learning, the partial network pre-trained in the source domain could be reused and transferred as part of the deep neural network in the target domain [34]. During the transfer process, whether the network structure needs to be modified depends on whether the tasks are the same. For the network investigated in this study, the task was unchanged, because the glucose prediction task was performed on both the source domain and the target domain. Therefore, there was no need to modify the network structure during the transfer process, and the network structure and parameters of the pre-trained model were reused completely.
Reducing the parameters of the personalized glucose prediction model effectively shortened the update and inference time, thereby making it possible to achieve online updates and real-time predictions on mobile devices. The structure of the glucose prediction network proposed in this paper was similar to the LSTNet [45] used for time series prediction. It was mainly composed of a 1-D convolutional layer, a recurrent layer (gated recurrent unit, GRU), a batch normalization layer, and a fully connected layer, with only 427 trainable parameters as shown in Fig. 3. The 1-D convolutional layer performed preliminary feature extraction on the glucose measurement data, the recurrent layer extracted the time dependencies in the sequence, the batch normalization layer was used to improve the data distribution and training speed [46], and the fully connected layer provided the final prediction. The details of the prediction model are shown in Table 1. The inputs of this prediction model were 1-h CGM records with a sequence length of 12, and the outputs were the glucose levels a half hour or 1 h in the future.
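The input and output layout described above (12 readings in, one reading 30 or 60 min ahead out) can be illustrated with a sliding-window sketch. The function name `make_windows` and its parameters are hypothetical, and the sketch assumes a gap-free series sampled every 5 min, so that 6 steps correspond to a 30-min horizon and 12 steps to a 60-min horizon.

```python
def make_windows(cgm, input_len=12, horizon_steps=6):
    """Build (input, target) pairs from one CGM series sampled every 5 min.

    input_len=12 covers 1 h of history. horizon_steps counts samples
    after the last input point: 6 for 30 min, 12 for 60 min.
    """
    inputs, targets = [], []
    last_start = len(cgm) - input_len - horizon_steps
    for start in range(last_start + 1):
        end = start + input_len  # index just past the last input point
        inputs.append(cgm[start:end])
        targets.append(cgm[end + horizon_steps - 1])
    return inputs, targets


# Synthetic series of 30 readings, used only to show the indexing.
series = list(range(100, 130))
X, y = make_windows(series, input_len=12, horizon_steps=6)
print(len(X), X[0], y[0])
```

Under these assumptions, one day of 288 readings yields 271 training pairs for the 30-min horizon and 265 for the 60-min horizon.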
The generalization ability of a deep learning model is largely affected by the amount and diversity of the training data. Using extended target domain data selected by the minimum DTW distance achieved a high degree of shape similarity with the target domain, but at the same time it possibly led to weaker generalization ability of the prediction model. The pre-trained model had a stronger generalization ability because it was trained directly on the source domain. Subsequently, the pre-trained model was transferred to the extended target domain through network-based deep transfer learning to obtain the personalized model. The personalized model further improved the performance on the target domain while inheriting the generalization ability of the pre-trained model.

Deep transfer learning framework with segmented data
The deep transfer learning framework with segmented data is shown in Fig. 4. Initially, a deep learning model was generated for the blood glucose prediction task, and then the amount of data required for sufficient training was estimated based on the number of parameters in the model. Subsequently, DTW was used to measure the similarity between the glucose measurement sequences in the source and target domains, and enough source domain data with a high degree of similarity were transferred to the target domain to form an extended target domain. Finally, a general model was pre-trained on the source domain and then fine-tuned on the extended target domain to obtain a personalized prediction model.
The main steps for deep transfer learning with segmented data were as follows:
1. An initial deep learning model was constructed for the blood glucose prediction task and experimental methods were used to estimate the minimum number of sources $N$ required to sufficiently train the model based on the number of parameters n( ).
2. The source domain $D_S$ (CGM records from different subjects) was extracted from $N - 1$ sources with the mini-

Segmented clinical dataset
This study included 908 hospitalized patients with type 2 diabetes at the Department of Endocrinology and Metabolism, Shanghai Jiao Tong University Affiliated Sixth People's Hospital from January 2018 to the end of December 2018. Each subject wore a CGM system (Medtronic iPro2, Northridge, California) for 3-5 days, which sampled glucose once every 5 min. Capillary blood glucose was measured every 12 h by a SureStep blood glucose meter (LifeScan, Milpitas, CA, USA) to calibrate the CGM system. This study was approved by the Ethics Committee of Shanghai Jiao Tong University Affiliated Sixth People's Hospital, Shanghai, China.
To eliminate the adaptability error of subjects wearing a CGM device for the first time, only the data from the second day to the third day were extracted, so each subject had 288 × 2 valid measurement points. In this study, the first 288 glucose measurements of each subject were used as training data, and the last 288 data points were used as the testing data. The glucose measurement data from the target subject were defined as the target domain $D_T$ and the data from other subjects served as the source domain $D_S$. To differentiate the results and facilitate the drawing of the prediction-error grid analysis (PRED-EGA) grid, the units of the glucose measurement data were converted from mmol/L to mg/dL.
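The day-based split and the unit conversion can be sketched as follows. This is an illustration under stated assumptions: the record is assumed to start at the beginning of the first wearing day with no missing samples, the conversion factor of 18.016 mg/dL per mmol/L is derived from the molar mass of glucose (about 180.16 g/mol) and is not stated in the text, and the names `mmol_to_mgdl` and `split_days` are hypothetical.

```python
MGDL_PER_MMOLL = 18.016  # assumed factor, from glucose molar mass ~180.16 g/mol
POINTS_PER_DAY = 288     # 24 h * 12 readings per hour at 5-min sampling


def mmol_to_mgdl(values):
    """Convert glucose readings from mmol/L to mg/dL."""
    return [v * MGDL_PER_MMOLL for v in values]


def split_days(record):
    """Drop day 1 and return (day 2 for training, day 3 for testing).

    Assumes the record starts at the beginning of day 1 and has no gaps.
    """
    day2 = record[POINTS_PER_DAY:2 * POINTS_PER_DAY]
    day3 = record[2 * POINTS_PER_DAY:3 * POINTS_PER_DAY]
    if len(day2) < POINTS_PER_DAY or len(day3) < POINTS_PER_DAY:
        raise ValueError("record must cover at least three full days")
    return day2, day3


# Synthetic three-day record in mmol/L (constant per day for readability).
record = [5.0] * 288 + [6.0] * 288 + [7.0] * 288
train, test = split_days(mmol_to_mgdl(record))
print(len(train), len(test), round(train[0], 3), round(test[0], 3))
```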

Data pre-processing
Standardizing the original data can improve the data distribution and speed up model learning to a certain extent. The two commonly used standardization methods are min-max standardization and Z score standardization. Min-max standardization is a linear transformation of the original data that maps the data values to [0, 1]. The Z score standardization formula is shown below:

$$x^{*} = \frac{x - \mu}{\sigma}$$

where $x$ denoted the original data, $x^{*}$ denoted the standardized data, and $\mu$ and $\sigma$ were the mean and standard deviation of the original data, respectively. Z score standardization was performed according to the mean and standard deviation of the original data. The processed data had zero mean and unit variance, which was beneficial to the training of the deep learning network. Therefore, Z score standardization was selected to preprocess the blood glucose data.
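A minimal sketch of Z score standardization is given here for illustration. It assumes that the mean and standard deviation are estimated on the training data only and then reused for the testing data and for mapping predictions back to mg/dL; the text does not state which data the statistics were computed on, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def fit_zscore(train):
    """Estimate the mean and (population) standard deviation."""
    mu = mean(train)
    sigma = pstdev(train)
    if sigma == 0:
        raise ValueError("constant series cannot be standardized")
    return mu, sigma


def standardize(values, mu, sigma):
    """Map raw glucose values to Z scores."""
    return [(v - mu) / sigma for v in values]


def destandardize(values, mu, sigma):
    """Map Z scores (e.g., model outputs) back to the original units."""
    return [v * sigma + mu for v in values]


train_mgdl = [100.0, 120.0, 140.0]  # hypothetical readings in mg/dL
mu, sigma = fit_zscore(train_mgdl)
z = standardize(train_mgdl, mu, sigma)
print(z)
```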

Evaluation metrics
The efficiency and accuracy of our proposed method were demonstrated with the root mean squared error (RMSE) in mg/dL, the mean absolute error (MAE) in mg/dL, and the mean absolute relative difference (MARD) in %, calculated by:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2}$$

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i|$$

$$\mathrm{MARD} = \frac{1}{N} \sum_{i=1}^{N} \frac{|y_i - \hat{y}_i|}{y_i} \times 100\%$$

where $\hat{y}_i$ denoted the forecasting results, $y_i$ denoted the CGM measurement, and $N$ was the total number of data points. Additionally, the prediction-error grid analysis (PRED-EGA) [47,48] quantified the clinical acceptability of the predicted glucose levels and was used to evaluate our proposed method. The PRED-EGA was shown to be a reliable and robust method for the continuous glucose-error grid analysis [49], which evaluated the accuracy of the predictions in terms of both the values and derivatives of the predictions. The PRED-EGA included two interacting components: (1) point-EGA (P-EGA) for evaluating the accuracy of the predictions and (2) rate-EGA (R-EGA) for assessing the capability of the model to characterize the derivative of the measurement values.
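The three accuracy metrics can be computed as in the following sketch, which uses the standard definitions of RMSE, MAE and MARD with the CGM measurement as the reference. The function names and the two-point example are illustrative only.

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root mean squared error, in the units of the inputs (mg/dL)."""
    n = len(y_true)
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def mae(y_true, y_pred):
    """Mean absolute error, in the units of the inputs (mg/dL)."""
    n = len(y_true)
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n


def mard(y_true, y_pred):
    """Mean absolute relative difference in percent (reference: y_true)."""
    n = len(y_true)
    return 100.0 * sum(abs(t - p) / t for t, p in zip(y_true, y_pred)) / n


cgm = [100.0, 200.0]        # hypothetical CGM measurements in mg/dL
forecast = [110.0, 180.0]   # hypothetical predictions in mg/dL
print(rmse(cgm, forecast), mae(cgm, forecast), mard(cgm, forecast))
```

Because MARD divides by the reference value, the same absolute error weighs more heavily at low glucose levels than at high ones, which is why it is reported alongside RMSE and MAE.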

Deep transfer learning based on segmented blood glucose dataset
(1) Extended target domain
Before performing instance-based transfer, it is necessary to consider the size of the extended target domain. It is generally believed that the amount of data required to sufficiently train a model is proportional to its number of parameters; however, there is still no theoretical method to estimate the amount of data required. Therefore, experimental methods were used to estimate the data size required by the extended target domain.
The specific method employed in this study was to randomly select a certain number of subjects and use their training data as the testing dataset. One subject was randomly selected from the 908 clinical subjects with T2D to develop the predictive model with the training dataset. The performance of the model was evaluated based on the testing dataset. Next, the training dataset was expanded with the data of another randomly selected subject, and the model was retrained and re-evaluated. This step was repeated until the model's performance on the test dataset stabilized. In this experiment, 20 subjects were randomly selected as the test dataset, the maximum number of subjects in the expanded training dataset was also 20, and the prediction horizon (PH) of the model was 30 min. The training settings of all the experiments in this study are shown in Table 2. The validation dataset was constructed with the hold-out method, where the last 20% of the training data was set aside as the validation dataset. As the size of the training dataset increased, the performance of the model stabilized, as shown in Fig. 5.
The trends in Fig. 5 showed that the performance of the model on the test dataset stabilized when the number of subjects in the training dataset reached 12. However, considering the randomness of the training, a certain margin needed to be maintained, and the size of the extended target domain was finally determined to be a training dataset of 15 subjects.
(2) Joint transfer of instances and networks
The deep transfer learning framework proposed in this study was composed of instance-based transfer with DTW and network-based transfer. To evaluate the prediction performance of instance-based transfer with DTW, the models for 20 subjects in the target domain were independently trained on the extended target domain and evaluated separately. For comparison, 14 subjects were randomly selected from the source domain to construct an extended target domain with the target subjects, and the model training and evaluation were carried out in the same way. For the instance-based transfer with DTW, the network-based transfer was also used to obtain a personalized model which had improved generalization ability. Therefore, from the source domain with the target subjects excluded, 15 subjects were randomly selected to train a pre-trained model. As a result, all models of the subjects in the target domain shared this pre-trained model for network-based transfer.
Table 3 shows the performance of the models generated by the three transfer methods. The proposed transfer method combined instance-based deep transfer learning with minimum DTW distance and network-based deep transfer learning. The instance transfer with DTW used the minimum DTW distance to construct the extended target domain. The randomized instance transfer constructed the extended target domain at random.
The experimental results in Table 3 showed that for both PHs of 30 min and 60 min, using the minimum DTW distance for instance-based transfer improved the prediction accuracy significantly compared with using randomly selected samples. From the DTW distance calculation principle, the main factor affecting the DTW distance was the shape difference between the time series. The glucose measurement sequences that were extracted according to the minimum DTW distance with the target domain had the highest similarity versus the glucose measurement sequences that were extracted randomly. For the purposes of model learning, using glucose measurement sequences with minimum difference with the target subject made the model more sensitive to the specificity of glucose fluctuations, which would allow the model to perform better when the target subject has similar glucose fluctuations in the future.
Network-based transfer based on instance-based transfer with DTW improved the generalization ability and performance of the model. In addition, we found that the choice of the network layer for network-based transfer was crucial to the transfer results. As far as the network used in this experiment was concerned, only reusing the convolutional layer without the subsequent GRU and fully connected layer would result in negative transfer. Moreover, when the full model was fine-tuned, it was also susceptible to the influence of poorly pre-trained models and may result in negative transfer. Therefore, adopting a more reasonable network-based transfer method was very important for improving the prediction accuracy of the model. In addition to evaluating the prediction accuracy of the model, it may be more valuable to evaluate the clinical acceptability. The PRED-EGA of 20 subjects is shown in Fig. 6.
The PRED-EGA showed the clinical acceptability of the glucose estimation by the proposed method for each of the 5420 estimations from the target subjects. It was found that there was no hypoglycemia in the tested blood glucose. In the predictions for euglycemia (5183 points in total), 99.57% were accurate, 0.39% were benign, and 0.04% were errors. In the predictions for hyperglycemia (237 points in total), 86.50% were accurate, 11.39% were benign, and 2.11% were errors. This confirmed that our proposed predictive model and transfer learning method performed well on the clinical dataset. In addition, we also specifically analyzed the prediction curves of the three different deep transfer learning methods on a subject in the target domain, as shown in Fig. 7.
As shown in Fig. 7, the main source of prediction error was the rapid increase and decrease of glucose. A large prediction delay could also be observed in the model. In addition, for relatively flat glucose levels, the prediction model was prone to provide fluctuating glucose predictions, which was also a main source of prediction error. In contrast, the prediction results of the randomized instance-based transfer had too many spikes, while the prediction results of the instance-based transfer with DTW were relatively flat and the prediction performance was better. Furthermore, the use of network-based transfer improved the forecasting performance in some situations and strengthened the overall prediction results. In future work, a more appropriate deep transfer learning method may be used to improve the sensitivity and stability of the model, and further improve the performance of the deep learning model in glucose prediction.
The comparison with traditional machine learning methods is summarized in Table 4, and the prediction results for the selected subject in the target domain are shown in Fig. 8. As can be seen in Table 4, the deep transfer learning method achieved the best prediction performance for both the 30- and 60-min PH. This result indicated that the deep transfer learning method with segmented data achieved better prediction accuracy than traditional machine learning methods. It also showed that our proposed combination of network-based transfer and instance-based transfer with minimum DTW distance is a feasible deep transfer learning method for glucose prediction. It can be seen from Fig. 8 that the prediction results of the deep transfer learning method had the best performance, and the glucose fluctuations were well captured. The prediction results of the SVR method had a high degree of fit with the raw data, were smoother than those of the deep transfer learning method overall, and performed better in places with minor blood glucose fluctuations. Therefore, a hybrid method that combines deep learning with other machine learning methods has the potential to obtain better prediction performance.

Conclusions
This study proposed a deep transfer learning framework to establish a personalized deep learning model for new patients with insufficient historical data. The minimum DTW distance was used for instance-based transfer to construct the extended target domain, while the network-based transfer was applied to improve the comprehensive performance of the model in the target domain. The effectiveness of this method was verified through a series of blood glucose prediction comparison experiments using clinical data. One limitation of the method was that when the daily blood glucose level fluctuated significantly due to changes in the treatment plan or huge environmental changes in the clinical patients, the performance of the transfer model suffered due to changes in the target domain. To address this issue, the mean of daily differences (MODD) of blood glucose can be used, as it is a commonly used clinical indicator which reflects the consistency and stability of day-to-day blood glucose patterns [50]. Online updates can then be provided for blood glucose prediction methods with fast tracking ability for time-varying processes in this situation. Future work will also consider combining online updated machine learning methods with deep learning networks to obtain a more adaptable blood glucose prediction model.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.