Blood glucose prediction using multi-objective grammatical evolution: analysis of the “agnostic” and “what-if” scenarios

In this paper we investigate the benefits of applying a multi-objective approach for solving a symbolic regression problem by means of Grammatical Evolution. In particular, we extend previous work, obtaining mathematical expressions to model glucose levels in the blood of diabetic patients. Here we use a multi-objective Grammatical Evolution approach based on the NSGA-II algorithm, considering the root-mean-square error and an ad-hoc fitness function as objectives. This ad-hoc function is based on the Clarke Error Grid analysis, which is useful for showing the potential danger of mispredictions in diabetic patients. In this work, we use two datasets to analyse two different scenarios: What-if and Agnostic, the most common in daily clinical practice. In the What-if scenario, where future events are evaluated, results show that the multi-objective approach improves previous results in terms of Clarke Error Grid analysis by reducing the number of dangerous mispredictions. In the Agnostic situation, with no available information about future events, results suggest that we can obtain good predictions with only information from the previous hour for both Grammatical Evolution and Multi-Objective Grammatical Evolution.


Introduction
Symbolic Regression (SR) is one of the best-known applications of Genetic Programming (GP) and its variants, such as Grammatical Evolution (GE). The purpose of SR is to obtain closed-form equations that represent a set of data points, i.e., to optimally fit a proper mathematical function to the data. Therefore, SR can be used in many domains like data analysis, modeling, classification and identification.
Healthcare is one of the fields where the above-mentioned domains are becoming more and more important. In this regard, one of the diseases with the highest increase in prevalence is Diabetes Mellitus (DM), or simply Diabetes. 1 Diabetes is a chronic disease caused by a defect either in the production or in the action of the insulin generated by the pancreatic system, corresponding to the two main types of Diabetes: Type 1 (T1DM) and Type 2 (T2DM). The pancreas in patients with T1DM is not able to generate enough insulin to process the sugar produced after food ingestion. Hence, patients need to inject some additional artificial insulin with each meal, and sometimes between meals, to maintain healthy levels of glucose. In the case of T2DM patients, the insulin generated by the pancreas is not working properly, in a phenomenon known as insulin resistance. In advanced stages of the disease, many T2DM patients also need to inject some insulin.
Insulin injection can be performed following two different alternatives. The first one is through a Continuous Subcutaneous Insulin Infuser (CSII) device, also known as an insulin pump. This device can be programmed and adjusted to administer the desired amount of insulin in different time slots. The alternative is the Multiple Dose of Insulin (MDI) approach which consists of injecting long-acting insulin once or twice daily as a background dose, and rapid-acting insulin injections at each meal time. In both alternatives, the decisions about the amount of insulin to be injected are challenging and many factors have to be considered. Selecting the right amount of insulin is critical. If too much insulin is injected, hypoglycemia may occur, while insufficient injections cause excessively high glucose levels. In both cases, the goal of a Blood Glucose (BG) control system for a patient is to maintain BG levels within the target range most of the time, usually between 70 and 180 mg/dl [44]. It has been shown that when these values are not maintained or there is high variability then both short-term and long-term complications can emerge [13].
Control of BG in insulin-dependent patients requires predicting future glucose values to determine the amount of insulin to inject. This amount depends on many factors, of which four are the most important: (i) the glucose value at the time of the injection; (ii) the estimated amount of ingested food, usually measured in carbohydrate rations; (iii) the previously injected insulin; (iv) the estimated ratio of currently active insulin at the time of the injection. Recently, with the appearance of new smart devices, other variables can also be taken into account. Making all these estimates by hand is a complicated process that must be repeated several times every day. Fortunately, recent advances in both devices and algorithms allow automation of some parts of this control process. Different kinds of BG control strategies have been proposed [24]: manual, semi-automated [52], and automated solutions based on the Artificial Pancreas (AP) [2]. For all of them, it is extremely important to develop mathematical models or artificial intelligence systems to describe the interaction between the glucose system and the insulin control method using current measurements and stored data.
In [37], three different scenarios were defined for BG prediction: What-if, Agnostic and Inertial. The differences among them correspond to the data that are used to produce the models for prognosis. In the What-if scenario the idea is to consider future events for some of the input variables. For instance, the carbohydrates that the patient is going to eat between the current time and the prediction horizon could be used in this case. Models generated under this scenario are very useful for designing insulin or carbohydrate recommendation systems.
The Agnostic scenario only takes into account the values from the input variables that are previous to the current time. Theoretically, under this approach it would be more difficult to achieve accurate predictions, since a model trained under Agnostic conditions needs to implicitly predict events in the future for all possible values of the input variables. Intuitively, Agnostic models will be more appropriate in cases where the number of input variables is small or the number of samples is very large and varied.
Finally, under the Inertial view, models are generated using only samples from the Continuous Glucose Monitoring (CGM) system in the prediction time window. Data from other input variables cannot be used, which is an unrealistic situation since ingestion of food, exercise or stress may occur.
However, as noted by [37], the inertial model can be realistic in the case of nighttime predictions. While the patient sleeps, many of the input variables (such as carbohydrates or exercise) will take a zero value and other variables (such as insulin or stress) will not undergo significant changes. For this reason, Inertial models can be useful for the study of specific situations, such as the study of the Dawn Phenomenon [43], an early morning BG level rise, which is a special situation that appears in some diabetic patients.
In this paper, we focus only on the first two scenarios, excluding the Inertial approach.
The main objective of this work is to investigate the efficiency and benefits of multi-objective GE for BG level prediction using real data collected from patients with T1DM under insulin pump therapy with CGM. We continue with the investigation of a multi-objective approach for modeling glucose and constructing predictive models from the short-term (30 min) to the medium-term (120 min) time horizons. We use GE with two objective functions: Root Mean Squared Error (RMSE) and an adapted fitness, denoted as F CLARKE , which is based on the Clarke Error Grid (CEG) metric [8]. This paper is an extended version of the work presented at the EvoStar 2020 Conference [10] and the continuation of the machine learning research developed at the Universidad Complutense de Madrid, Adaptive and Bioinspired Systems Research Group since 2013 [18,[21][22][23][24] and [50]. The contribution 1 3 of this paper in relation to [10] is twofold: on the one hand we have extended the analysis to the Agnostic scenario using a new dataset also coming from real diabetic patients and, on the other hand, we have performed a deeper analysis of the results where we graphically compare the solutions obtained, allowing us to draw medical conclusions.
In particular, we tackle the following research questions: is the multi-objective approach able to improve single-objective predictions?. Does the multi-objective approach obtain solutions that are clinically better?. Is it always better to use more information from the past or can we obtain equivalent predictions with information from the last hour instead of the last two?
As will be seen, the development of a problem-specific fitness function does not only considerably improve the quality and the robustness of the GE algorithm as an SR tool in the What-if scenario, as we show in [10], but also allows us to obtain good Agnostic prediction models using only information from the last hour previous to the time of prediction.
The best single-objective solutions may suffer from a significant number of erroneous predictions that could lead to incorrect treatments that may be dangerous to the patient's health. Later, we will show that the multi-objective strategy improves this situation by reducing the number of wrong predictions and, therefore, the severity of incorrect treatments.
The rest of the paper is organized as follows. In Sect. 2, we review the previous work. In Sect. 3, we explain the multi-objective approach. Section 4 gives details about the dataset and discusses the results, comparing it with previous work and analyzing the contributions of the multi-objective approach. We finish the paper with the conclusions in Sect. 5.

Related work
The problem of modeling and predicting glucose levels and glucose-insulin interaction has been an intensive area of research for the last ten years. We will focus on predicting glucose levels for a forecasting horizon of up to 2 h, as an aid in the daily management of insulin. Two hours is usually the time needed to decide if the dose of insulin after a meal was correct. Hidalgo et al. proposed the application of GE to obtain customized models of patients. The proposal has been tested using in silico patients' and real patients' data [24]. The work has been extended recently in [28] and [49], where it has been shown that data augmentation and structured GE increase the quality of the prediction results.
Inspired by Hidalgo's group, Contreras et al. presented a hybrid model for predicting glucose in the medium term (120 min) for T1DM patients [12]. The system uses synthetic data generated by the UVA/PADOVA simulator [29]. Both the fitness function of the evolutionary grammar and the performance metrics use a penalty factor to take into account the physical damage caused by deviations in BG prediction according to the CEG. The authors generate four models for each patient corresponding to different phases of the day: night, breakfast, lunch, and dinner. The data obtained for the night phase are quite good; however, results with real patients' data have not been reported. In this work, we use a multi-fitness approach instead of a penalization function, which is more appropriate for a multi-objective problem.
Although more centered on classification, i.e., the prediction of a class instead of a glucose value, there are other interesting approaches. For instance, in [40], the authors developed a method for predicting postprandial hypoglycemia using a classification approach with machine learning techniques personalized for each patient. They described the process of generating a hypoglycemic prediction model by Support Vector Machine (SVM) for binary classification, trained and tested using Scikit-Learn. They use hypoglycemia risk as a feature and as a class-labeling factor. The results demonstrate acceptable performance for all patients (in terms of specificity and sensitivity) and the feasibility of predicting postprandial hypoglycemic events from a classification perspective. In [46], a dual mode adaptive basal-bolus advisor based on reinforcement learning is presented. Authors proposed an Adaptive Basal-Bolus Algorithm (ABBA) which provides a personalized recommendation for the daily insulin doses using information from the previous day. Regarding the level of customization of the models, some proposals provide models for the average case [41], and others provide personalized models for each patient. Several papers apply classical modeling techniques, resulting in models or profiles defined by linear equations with a limited set of inputs [35].
The treatment for subjects with T1DM uses rates of basal insulin delivery, insulin to carbohydrate ratios and individual correction factors, typically from observations by the endocrinologist.
However, the data collected in clinical studies with T1DM patients do not cover sufficiently long periods of time or the different situations of work, stress, etc. that characterize the patient's life. Therefore, the models generated from these data will be inaccurate [54].
There are also some models, used in AP systems or closed-loop control models, that try to emulate the action of the pancreas [14]. They are based on the assumption that it is possible to reach reasonable control with approximate models, provided that the model is related to the control objective [19]. Experimental results suggest that these approaches, due to the lack of accurate individualized models, have a significant risk of under-administration and thus the possibility of BG levels entering a zone of hyperglycemia with, eventually, long-term complications. Even worse, a major misprediction of BG may lead to the injection of an excessive amount of insulin, causing BG levels to fall into a zone of hypoglycemia which can have immediate fatal consequences for the patient's health. Our evolutionary models try to avoid both situations.
De Falco et al. [15] presented a work on GP-based induction of a glucose-dynamics model for telemedicine. The work aims to create a regression model that allows the determination of the BG value from interstitial glucose in patients with T1DM with the idea of using it in a telemedicine portal. The work is divided into two parts. In the first part, an imputation of the missing values in the database is carried out using an extrapolation technique based on the Steil-Rebrin model [45]. To make the most accurate estimate, the parameters of this equation are adjusted using an evolutionary algorithm with the RMSE as the fitness function. Since there are many more 1 3 estimated BG values than actual values, a correction factor must be applied to avoid deviations in the extraction of the model.
Recently, several approaches have solved the prediction problem by using artificial Neural Networks (NNs) with results similar in quality to the solutions based on GP and GE. There are several kinds of NNs and, due to the availability of new high-performance computing architectures, their development and applicability have increased exponentially during the last five years. Long Short-Term Memory (LSTM) networks have achieved very good performance in modeling several timedependent phenomena. This is why most of the approximations found in the literature for the prediction of BG levels are LSTM since BG data are usually obtained as a time series from CGMs. In [47], the authors make a glucose prediction using a sequential model consisting of an LSTM, a Bi-directional LSTM (Bi-LSTM), each with four units, and three fully connected layers with 8, 64 and 8 units, respectively. The authors of [1] develop a predictive model fpr BG to define future insulin therapy consisting of two stacked LSTMs working in parallel. Meijner and Persson [34] propose to exploit LSTM in a model which takes CGM values, insulin dose and carbohydrate intake as inputs. The proposal tackles only prediction horizons of 30 and 60 min. The training phase experienced initialization problems which led to local optimums. Another interesting work is [32] from an Ohio University group with more than ten years of experience in machine learning for diabetes. In [36], they propose an LSTM architecture and they improve it in [37] by proposing a novel architecture called memLSTM, which is a mixture of LSTM with a type of network called Neuronal Attention Models (NAM) or Augmented Memory Models (AMM). The work presented in [33] analyses 10 machine learning techniques combined with oversampling methods, concluding that the choice of the best algorithm depends on the glycemic range. Convolutional Recurrent Neural Networks (CRNNs) have also been explored to predict the BG level [27] for in silico patients. Almost all of these works use real and in silico patients with short-term prediction horizons (up to 60 min). The proposed architectures obtain a performance similar to the current state of the art.

Description of the problem
The work that we present here extends those works based on GE, and explores the use of a multi-objective approach, based on the Non-dominated Sorting Genetic Algorithm (NSGA-II) [16], applied to real data from diabetic patients. We are interested not only in the problem-specific, but also in the performance of the multiobjective implementation. We integrate the different possibilities of using GE for short-term and medium-term glucose prediction in diabetic patients, for the Agnostic and What-if scenarios, and for prediction horizons of 30, 60, 90 and 120 min, using information from 60 and 120 min prior to the time of prediction.
The complete workflow can be seen in Fig. 1. Real data is gathered using different electronic devices such as CGM systems, smart bands and CSII (insulin pump), as well as annotations by the patients. These data are processed to form the different data sets (combining 2 historical windows and 4 prediction horizons) that will be used in the two scenarios studied in this research. In Sects. 3.1 and 3.2 we thoroughly explain the two scenarios but we can summarize that in the What-if case the model predicts future glucose values by taking into account not only past values of the three variables (glucose, carbohydrates and insulin) but also future carbohydrate intake and insulin injections from the present until the prediction horizon. In the Agnostic scenario, the model predicts future glucose values based only on past values. For both scenarios, we train and test, using cross-validation, a GE model and a multi-objective GE model.
One of the advantages of using GE as a modeling approach is that we can tackle the different possibilities mentioned in the paragraph above using the same engine and only making changes in the grammar for each prediction horizon, scenario and information configuration. Each variable, feature or attribute to be used in the model can be easily included in the grammar. In addition, with this procedure we are able to study the contribution of each feature, or group of features, to the quality of the solution.

What-if scenario or insulin-carbohydrate recommendation
As mentioned in the introduction, the What-if scenario models may have access to input events, so we can predict the BG level after m minutes, supposing that the patient eats a meal with C grams of carbohydrates and I units of insulin are injected in t minutes from the time of prediction. In other words, our objective is to construct an insulin-carbohydrate recommendation tool. Thus, we look for predictive models that help us in evaluating those recommendations. In this scenario, models can use the following data: • Interstitial glucose using a Medtronic CGM device that gives us observations every 5 min. • Notes of estimated carbohydrate units ingested, taken by each patient. • Insulin injected using an insulin infuser device from Medtronic, which registers injections of both basal and bolus insulin every 5 min.
Once all the information has been collected, we process the data to fill in gaps using cubic splines and to match all the events to the closest timestamp in order to construct a set of matched time series, corresponding to glucose, insulin, and carbohydrate values. We also process the set of features available at the time of modeling. The approach of using the cubic spline technique to fill gaps in time series is the most widely used strategy not only in medical literature but also in physical science research [26,51]. At each time point, t, data from the previous 2 h are available for prediction. We process these data to define a set of new features as we did in [24]. To this aim, we first define two sets: The set of the sums of the units of insulin injected in window w: , and I 0−0 (t) = I(t) are the actual values at prediction time for glucose, carbohydrates, and insulin, respectively. We also define the set of prediction horizons, i.e., the sample times in future to forecast the BG, PH = {30, 60, 90, 120} min, so that Ĝ ph (t) with ph ∈ PH is the predicted BG ph minutes ahead in time, whereas I ph (t) and C ph (t) are the values of insulin and carbohydrates ph minutes ahead in time. Hence, our prediction problem can be stated as finding an expression for the predicted BG level, Ĝ sp (t) , given by equation (1), that minimizes the objective functions RMSE and F CLARKE .

Agnostic scenario or glucose prediction in absence of event information
Recently, with the appearance of new smart devices, the possibility of incorporating more predictive variables into models has been raised, in order to improve the accuracy of the models. For example, the activity bracelets on the market provide accurate information on many variables including exercise, sleep, heart rate, body temperature, caloric consumption, and more. When all this information is incorporated to the dataset it becomes much more complicated to make models of the type What-if since the number of possible combinations of the variables and assumptions involved in an event is enormous and its usefulness would be very limited. However, it is possible to produce an Agnostic-type model in which there is no access to information on future events in the prediction phase. This type of model needs to predict those events in an implicit way. For example, the model must identify fasting periods or physical exercise.
Our GE-based model generation engine allows us to make these types of models without substantial changes to our methodology. It only requires adjusting the grammars to reflect access only to the variables available. In the case of Eq. (1), we only need to eliminate access to future values, i.e., C t+H (t) and I t+H (t) , so the general equation for the models would be expressed by Eq. (2).
In this paper, we solve the Agnostic problem in relationship with the data described in [30]. This dataset includes information on BG levels from a CGM, insulin doses from an insulin pump, self-reported information and data from an activity band. More details can be found in the previous paper. We processed the data in a similar way to that described in Sect. 3.1 in order to obtain aggregated features with information every 15 min, including CGM BG level, insulin doses, carbohydrate estimates, Galvanic Skin Response (GSR), skin temperature, air temperature and acceleration. The influence of these variables in diabetic patients is documented in [9,25,48]. However, the analysis of their particular impact is left for future collaborations with medical staff, since it is beyond the scope of the multi-objective analysis made in this paper.
Although the Agnostic scenario may be considered a subset of the What-if scenario, in this paper, we consider them as different problems since we use different data.
In addition to the analysis, we compare two options regarding the information available at the time of prediction. On the one hand, we use the events from the previous 2 h to construct the models as expressed in Eq. (2). On the other hand, we use the information of only the previous hour before the time of prediction. As we will see in the experimental section, this comparison is useful for the applicability of the models, since most of the time the patient has little or no information from the past. Incorporating the information of the variables obtained from the smart devices, we will search for models by applying (3) with window size, ws ∈ {60, 120} min.
Where we incorporate the variables GSR, skin temperature, and magnitude of acceleration to those defined in equation (1): The set of skin temperatures measured sp minutes before t:

Fitness functions
Again, GE allows us to use the same configuration for both scenarios and both algorithms: single-objective (GE) and multi-objective (MO-GE). GE relies on RMSE to guide the search. RMSE is a common fitness function when adjusting data in SR problems. Equation (4) shows its definition.
In the case of MO-GE, we use a second objective function, denoted as F CLARKE , which was defined in [21]. This function follows the CEG criterion used to test the clinical significance of differences between a glucose measurement technique and venous BG reference measurements [8]. F CLARKE is defined in equation (5).
To compute w i we used the next equation: In this regard, CEG considers a grid divided into five zones (A to E) depending on the severity of the misprediction. The values that fall within zones A and B are clinically exact and/or acceptable and thus the clinical treatment will be correct. We consider A and B as a single category with no contribution to equation (5). Values in zone C can be dangerous in some situations. Although less dangerous than D and E zones, we should also try to minimize predictions in these zones, so predicted points in these zones contribute a value of 1 to equation (5). Finally, zones D and E represent potentially dangerous areas since the prediction is far from being acceptable and the suggested treatment will be different from the correct treatment. Each prediction in zone D adds 10 to equation (5), while predictions in zone E add 100. The inequalities and conditions of Eq. (6) delimit these zones according to [8]. RMSE is a standard linear regression metric that measures the raw quality of a model. Intuitively, it can be expected that the greater the RMSE the greater the F CLARKE . However, a good model in terms of RMSE can, at the same time, be dangerous from the medical point of view. For instance, an error of 20 mg/dl for an expected value of 50 mg/dl is worse than the same error for 100 mg/dl, since, in the first case, the patient will be in a hypoglycemic situation, whereas in the second case the patient is within the normal glucose range. These kinds of situations are not well identified by RMSE. On the contrary, F CLARKE is able to amplify the effect of those dangerous situations. Similarly, a good model in terms of F CLARKE , with all predictions in zones A and B could reach significant values of RMSE. Therefore, these objective functions are not fully orthogonal, but they help to identify good models in terms of both raw and medical qualities.
Previous experimentation showed poor RMSE results of F CLARKE in single-objective GE. Therefore, for the sake of space and the clarity of the discussion, we have discarded these experiments.

Multi-objective grammatical evolution
In this paper we propose a multi-objective approach to GE [42]. We use the same approach and grammars of [49], where the interested reader can find more details about applying GE for the creation of models in this scenario. As is well known, the GE method is powered by an evolutionary computation algorithm, usually an adapted implementation of a Genetic Algorithm or a Particle Swarm Optimization algorithm. There exist some other multi-objective implementations such as [20]. However, we use our own library, which is publicly available through GitHub.
Here we also use a bi-objective approach, using equations (4) and (5) as fitness functions. As an evolutionary engine, we apply NSGA-II, which is perhaps the most effective way of optimizing and searching for solutions to multi-objective problems with evolutionary computation when dealing with 2 or 3 objectives. One of the important issues for selecting a multi-objective approach is to study the fitness functions, i.e., the objectives. Although the objectives may work in the same direction, it is not desirable that both measure similar features. Fitness functions should guide the algorithm through the search space in different manners, although with a common search. This is the case for Eqs. (4) and (5), where both try to improve the 1 3 quality of the solution by simultaneously minimizing error and obtaining solutions with 100% of the predictions in zones A and B. Figure 2 represents an extract of the grammar in Backus-Naur Form (BNF) format designed to find a predictive model for future BG levels. This is a typical grammar for SR adapted to the variables and features for each diabetic patient's dataset. It is a recursive grammar and the operators are reduced to addition, subtraction and multiplication, based on the conclusions of [38] and our previous experimental experience. Despite the size of the search space being infinite because it is a recursive grammar, the algorithm is able to efficiently explore the search space.
Notice that the grammar is the tool which allows knowledge of the problem to be included in the optimization process. For instance, the grammar proposed sets the particular glucose, carbohydrate and insulin variables to be used, as well as the operands and precision factors available in order to produce model expressions. This way, the search is guided to a certain part of the solutions space.

Experimental results
We carry out two types of experiments testing the two scenarios explained in Sects. 3.1 and 3.2 : the What-if and Agnostic scenarios. Hence, we use two different datasets containing the information and variables required by each scenario. In Sect. 4.1, we describe the experiments corresponding to the What-if scenario, which deals with the same dataset of ten real patients, as explained in [10]. In Sect. 4.2, we expand the study to the Agnostic scenario, using a dataset from six new patients with data containing more information that has recently been added by using smart devices that incorporate other variables.
To find the models that will later be tested, we divided the data into two sets, training and test, using the k-fold cross-validation technique. This technique generally results in a less biased estimate of the model's performance than a simple train/ test division. After shuffling the data, the total data set is divided into k subsets (in this article k = 10). Over k iterations, the models are tested on one of the subsets and trained on the other (k-1) subsets. The final results are the union of the k iterations. We could expect that the variance across these small subsets would contribute a biased estimation but, in the literature, we can find research that does not support this idea [3]. It is also interesting to note that here each individual point (value to be predicted) consists of a BG value plus a historical window with glucose, carbohydrate and insulin data. This experimental setup is illustrated in Fig. 3.
We would like to emphasize that obtaining data from real patients is difficult because this is sensitive information that usually requires special authorisation from both the patients and the medical authorities. In addition, patients must be committed to the research since for data completion it is important to wear the devices correctly on a continuous basis, while taking notes and registering any unexpected event that could affect the data. Besides, it is usual to discard part of the data collected because of mistakes made by patients. Therefore, studies usually deal with small datasets, as shown in [7,17,31,53], with 10, 11, 7 and 7 T1DM patients, respectively, with different gender distributions.
The implementation of both GE and MO-GE algorithms is done in Java, and the code is publicly available at the GitHub repository called JECO, which stands for Java Evolutionary COmputation library [23]. As stated above, the multi-objective approach uses the NSGA-II algorithm as an optimization engine. Table 1 summarizes the configuration of the evolutionary engine for the MO-GE approach, which is based on our previous experience, and on a set of preliminary experiments where we studied the convergence of the algorithm as well as the wrapping value. In particular, little use was made of wrapping in the preliminary experiments. On the contrary, the algorithms tended to generate solutions of moderate size. Therefore, we decided to set the maximum number of wrappings to a small value, evolving some complex expressions. In order to maintain a consistent computational effort, the same configuration was used for GE, which only considers RMSE as a fitness function. Finally, for each patient, ten runs were executed with the training data. The best model obtained was later used to calculate its performance on the test data. In addition to the GE models, we have also fitted Auto-Regressive Integrated Moving Average models, ARIMA(p, d, q), to estimate glucose values. Equation (7) presents the expression of such a model where (t) is the random error at time t, and p, q, and d are integers, called the orders of the model. This model only includes glucose values and does not use exogenous variables such as insulin doses or carbohydrates.
We have tested ARIMA models of different orders in 0 < p, q ≤ 12 , with 12 being the number of samples in 1 h for Ohio patients to mimic the same time windows as the GE models. The best results have been for p = q = 5 . We disregard the integrative component, that is, d = 0 , because previous experiments show no improvement for i ≠ 0 . With every new glucose sample, the 12 model's coefficients are estimated using maximum likelihood in a 4-h time window using the last 48 samplesincluding the last one-of the univariate glucose time series, G(t). Once the model is fitted, it is warmed up using the last 24 samples, and then it estimates the glucose prediction for the four prediction horizons in PH.

What-if scenario
Ten T1DM patients have been selected for the observational study, based on conditions of good glucose control. To be included in the study, three main conditions must be met: (1) at least one year since T1DM diagnosis; (2) absence of diagnosis of major psychiatric disorder; (3) no serious breakthrough disease in the last six months. The study was approved by the Ethics Committee of the Hospital Príncipe de Asturias in Alcalá de Henares, Spain, and all patients signed a prior informed consent. Data acquisition was carried out by two nurses from the Endocrinology and Nutrition Service at the hospital. Data from patients were acquired over multiple weeks using the devices described in Sect. 3. Log entries were stored in 5-min intervals. In this dataset, we have at least 10 complete days of data for each patient. These days are not necessarily consecutive. Each log entry contains the date and time, the BG value, the amount of insulin (injected via pump), and the amount of carbohydrate intakes estimated by the patients. The population characterization is female (80%), average age 42.   horizons. CDE is the sum of the number of points in zones C, D, and E for each model. Since F CLARKE is a weighted sum of the number of points in those areas, we prefer to represent CDE because it allows us to compare how similar values of RMSE may present different numbers of dangerous mispredictions, accounted for by CDE. This representation allows the comparison of solutions in terms of both the raw quality of a model, given by the RMSE value, and the medical quality of the model, given by the number of points in dangerous regions. It is important to note that a solution with two predictions in the C zone is acceptable and much better than a solution with two prediction points in zones D or E. Figure 5 shows the solutions obtained for patients 4 and 7 with both a singleobjective GE evolution approach using the RMSE value as fitness function (red points) and the solutions obtained with the MO-GE approach (green points). As can be seen, the use of F CLARKE as an additional objective improves the fitness of the solutions, not only for the new fitness function, but also in terms of RMSE. We can also observe that all the solutions generated by GE are dominated by solutions generated by MO-GE. Moreover, the distribution of the solutions suggests that our MO-GE is more robust than the GE since most solutions it finds are close to the approximation of the Pareto front. This is observed for all four prediction horizons. Similar results were obtained with the rest of the patients in the dataset. 3 From a medical point of view, this result is a measure of the method's robustness since patient 4 is very different from patient 7 according to the descriptive statistics. Patient 7 has the worst glucose control since the patient's glucose values lie within a healthy range only for a very short time (27%) in comparison with medical recommendations ( ≥ 80% ). The average glucose level is also poor for this patient since it is the highest in the group (176.33 mg/dl). Nevertheless, we can see in Fig. 5 that GE and MO-GE reach good solutions for the four prediction horizons in both patients, regardless of the a priori difficulty of the dataset.
We would like to highlight that MO-GE is able to reach the optimum solution for patient 1, that is, the solution with both objectives equal to 0 and, hence, with neither error nor dangerous predictions.
It is also important to note that, although there are some non-dominated solutions with a very low number of points in zones CDE, those are not necessarily the best ones. For example, let us analyse the solutions from MO-GE for the 120 min horizon, patient 5, with the lowest value of CDE in the graph (the magenta colored points close to the y axis around point (71 , 7)). This solution is very precise; however, the bad predictions could be very dangerous for the patients since, as the RMSE indicates, the deviation from the correct prediction must be very high. When selecting the final model or predictor, the decision maker should take these characteristics into account. Another interesting feature that can be extracted from Fig. 6 is that the longer the prediction horizon, the worse the RMSE value, which is consistent with the intuitive idea of the difficulty of prediction for long horizons. However, this situation does not happen with CEG. Notice, for instance, that a small set of solutions for the 120 min horizon in patients 5 and 10 are located close to the CDE value of 10, which is quite small. This means that, despite those solutions having a high RMSE value (around 70), the predictions are well located in terms of the grid defined by CEG, which provides a good CDE value. Figure 6 shows the distribution of the solutions in the multi-objective space with both approaches, GE and MO-GE, for three patients: 4 1, 5 and 10. As expected, the error and the number of points in dangerous prediction zones increase with the prediction horizon. However, solutions are restricted to a limited area of the graph (maximum RMSE by maximum number of points in zones C, D and E).
In addition, it is observed that the variability of solutions for a prediction horizon of 120 min is the highest, and the variance of the data along the horizontal axis decreases as the prediction horizon gets closer. This means that good CDE values can be reached for every horizon (as well as poor ones), and this variability is reduced with the reduction of the prediction horizon. These facts prove that the two objectives are not correlated and, hence, the multi-objective approach is correct. Figure 7 presents an additional analysis where we can see the distribution of the solutions in the multi-objective space for patient 6 with both approaches, single-objective (GE) and multi-objective (MO-GE) with the four prediction horizons PH = {30, 60, 90, 120} min. Each color represents the solutions obtained with one of the folds of the 10-fold cross-validation strategy for a prediction horizon. Two questions arise from them. First, some of the folds are more difficult to solve than others, and a deeper study of the data should be done in order to improve the algorithm, for instance, by detecting some kind of pattern and determining when the MO-GE works better, as proposed in [11].
We performed a quantitative analysis of the solutions by comparing the 40 different instances (10 patients by 4 different time horizons) for both GE and MO-GE methods. For the sake of space we do not include all the plots. However, in all cases, solutions obtained with the MO-GE method dominate solutions obtained with GE. Table 2 shows the aggregated results for both the single-objective and multiobjective approaches, GE and MO-GE, respectively. It also shows the results obtained with the ARIMA approach. For each time horizon, the percentage of predictions in zones A and B (A + B column), C, D and E are depicted. Since we try to maximize the points in zones A + B and minimize the points in dangerous zones C, D and E, the bold figures highlight the best method. That is, the figures reported in bold in column A + B point out the method that has obtained the maximum value for each time horizon, while the figures reported in bold in columns C, D and E point out the method that has obtained the minimum value for each time horizon. Again, the MO-GE algorithm reaches the best performance, reducing predictions in the most dangerous zones D and E for short-term predictions (30 and 60 min). For medium-term predictions (90 and 120 min), the sum of D and E points are very similar, on average. However, MO-GE finds better solutions considering the distribution of all points in the different zones of the CEG. Results were tested for statistical significance and we found significant differences in the number of points in zones D and E, where the multi-objective approach reduces the most dangerous predictions. With regard to the technique that has a larger percentage of points in the A + B zone, we found that MO-GE is placed first for short-term horizons, whereas GE achieves better results for

Agnostic scenario
The six patients of the dataset used in the Agnostic scenario were selected from the OhioT1DM dataset for Blood Glucose Level Prediction: update 2020 [30]. This dataset contains twelve patients with T1DM and was first released in 2018 with half its current size, containing data for only six patients. We selected the new patients incorporated in this update because the sensor band used in these new patients (Empatica Embrace) registers some of the physiological variables with more precision than the sensor band used in the old dataset (Basic Peak). To protect the data contributors and to ensure that the data be used only for research purposes, we had to sign a Data Use Agreement with the University of Ohio before using the dataset in our research. Data were acquired over multiple weeks using insulin pump therapy with CGM. Patients wore Medtronic 530G or 630G insulin pumps and used the Medtronic Enlite CGM system. They also wore the Empatica Embrace device, which reported life-event data via a custom smartphone app and provided physiological data from a fitness band. Log entries were stored in 5-min intervals. In this dataset we have at least 53 complete days of data for each patient. These days are not necessarily consecutive nor the same days for all the patients. Each log entry contains the date and time, the BG value, the amount of insulin and the amount of carbohydrate intake as estimated by the patients. Additionally, the dataset includes: self-reported meal times with carbohydrate estimates; self-reported times of exercise, sleep, work, stress, and illness; and data from the Empatica Embrace band, which includes 1-min aggregations of GSR, skin temperature, and magnitude of acceleration. Both bands indicated the time they detected that the wearer was asleep, and this information is included when available. However, not all data contributors wore their sensor bands overnight. The population in this case has 83% of male patients with an average age between 33.33 ± 16.33 and 53.33 ± 16.33 years. The average number of days with data is 56.33 ± 2.06. Figure 8 shows the glucose level of the patients from this dataset in the same way as in the What-if scenario.
For this dataset, we construct models with GE and MO-GE for the four prediction horizons PH = {30, 60, 90, 120} min. We also compare models with access to the information of WS = {60, 120} min before the time of predictions (see equation 3). In Figs. 9 and 10 , we analyze the differences in the multi-objective approach (MO-GE) when compared to the single-objective (GE) for patients 567 and 584, taking into account the WS values for each figure. As in the previous section, these figures represent solutions in the multi-objective space, and each point represents a solution referenced by its coordinates (RMSE, CDE).
As can be seen, there are some cases where solutions obtained with historical values of ws = 60 min with GE are dominated by solutions generated with MO-GE, as seen in Fig. 9 for P584 ws = 60 and ph = 30 min. Besides that, there are other cases where solutions obtained with MO-GE are dominated by solutions generated with GE, as seen in Fig. 9 for P567 with ws = 60 and ph = 120 min. The solutions are similar when using historical values of ws = 120 min. Figure 11 shows the results for patient 552 for both historical values. In this case, we find solutions in the Pareto front for both historical values and the different time horizons, so there is no dominance between methods. Similar results were obtained with the rest of the patients in the dataset.
As  Table 3 shows the aggregated results for ws = 120 min for the single-objective (GE), the multi-objective (MO-GE) and the ARIMA approaches. For each time horizon, the percentage of predictions in zones A and B (A + B column), C, D and E are depicted and the bold figures highlight the best method. As in the What-if scenario, the MO-GE algorithm reaches the best performance (in all horizons except for 30 min, where ARIMA gets the best results) and returns the best global solutions.
In a similar fashion, MO-GE achieves the lowest percentage of points in zone E for all horizons. But if we look at the joint zone D + E, we find a different situation: MO-GE is only placed first in the 60 min horizon, GE is the best for the mediumterm (90 and 120 min) and the winner for the very short-term (30 min) is ARIMA. It is interesting to analyze the complexity of the solutions in terms of number of parameters and length of the solutions. Figure 12 shows the results obtained for both the What-if and Agnostic scenarios. The figure represents the average RMSE for each number of parameters found in the solutions for both the GE and MO-GE  In order to study how the different historical values WS = {60, 120} min contributed to the models, a deeper analysis would be required to assess the statistical significance of the results. To carry out this task, we first created density plots using a Kernel Density Estimation (KDE) for the distribution of the samples. The objective is to visualize whether the data meets the conditions for a parametric test, which  is not the case. Figure 13 shows the results obtained with GE and MO-GE for the RMSE objective in the Agnostic scenario results. Using Gaussian distribution, the variance is the same for all the cases. Data distribution is non-unimodal and a nonparametric test is necessary. Similar results have been obtained with CDE, but are not shown for the sake of space. All the plots were obtained following the method explained in [5]. Then, we followed the Bayesian model of [4,5] based on the Plackett-Luce distribution over historical values and time horizons, taking into account  Figure 14 shows the probability of being the best method, denoted as probability of winning, and its standard deviation for the results obtained with GE and RMSE as objective functions. First, it can be seen that the prediction horizon ph = 30 min is the best, since both configurations with this horizon reach the highest probabilities of winning. Also, it can be seen that the historical value of ws = 60 min has the highest probability. Even so, as the confidence interval overlaps with the results obtained with historical value of ws = 120 min, there is no statistical evidence that one method is better than the other. All the intervals for the rest of the time horizons overlap in the same way, and similar results have been obtained for the rest of the cases.
All the experiments were performed on an Intel(R) Core (TM) i7-7700CPU at 3.60 GHz with 16 GB RAM Memory on Windows 10. Experiments were run with 8 threads in parallel to benefit from all the core-threads of the computer without affecting performance and no other task running at the same time. The average time to obtain a model with GE is 6.51 min and 3.10 h for the MO-GE approach, almost 29 times slower.

Conclusions
In this paper, which extends the work presented at the EvoStar 2020 Conference [10], we investigate the benefits and drawbacks of a multi-objective implementation of GE in the generation of models for BG prediction in diabetes patients. In particular, we implement a multi-objective GE based on the classic NSGA-II algorithm, which is guided by two fitness functions: RMSE and F CLARKE , which is a function designed following the CEG metric.
We have produced experimental evidence in two different scenarios: on the one hand, the What-if scenario, which takes into account future input data from the  [6] to analyse the GE results with RMSE as objective function for the Agnostic scenario prediction horizon in order to predict glucose values. Data from ten real diabetic patients were studied in this scenario, as in [10]. On the other hand, we have studied the Agnostic scenario, where no information about future events is available. In this case, data from six real patients from the OhioT1DM dataset [30] were studied.
Our results show that the multi-objective approach produces better models, reducing the number of predictions in the most dangerous zones of the CEG metric for both scenarios. These results are achieved due to the fact that the multi-objective approach has a better ability than the single-objective approach to traverse the different areas of the search space defined by the F CLARKE objective function. This is an important conclusion, since medical criteria is included in this function, which penalizes the models' most dangerous mispredictions. In addition, we have also found that GE is able to obtain good results in terms of CEG for both scenarios, despite it not being considered an objective function. However, the multi-objective approach can be considered safer since the CEG metric is explicitly included, and a decision maker could examine non-dominated models and decide which one best fits a patient.
In addition to the new experimental results, we have performed a statistical analysis of the results for the Agnostic scenario. According to our tests, there is no statistical evidence of significant difference between using historical values of 60 and 120 min utes in terms of better models. Hence, Using historical values of 60 min is recommended since the complexity of the model is lower and the execution time of the algorithm is shorter, requiring only half of the data.
We are developing a framework, running in the cloud, to generate and deploy models via glucmodel.ucm.es and glucnet.ucm.es. This way, models generated with the methodology presented in this paper are directly applicable to daily clinical practice since mobile and web apps will have access to the predictions in the cloud, helping the user to decide the dose of insulin and other future actions. As we previously explained, MO-GE generated models for What-if scenarios are practically useful for designing insulin or carbohydrate recommendation systems. Moreover, our models are useful for testing the potential modifications of treatments without risking the patients' health. Agnostic models are useful when more smart devices are available. In particular we are working on a smart watch application that generates alarm signals when the patient is in at-risk situations.
Our procedures and models have been evaluated by the medical staff collaborating on the project and they have concluded that these can be useful for correcting and improving patients' lifestyles and therapies. Subject to further clinical validation and regulatory approval, the applications we are developing have the potential to be useful for the daily management of diabetes in order to improve glycaemic control and increase patients' quality of life and autonomy.
Our technique has the potential to be applied to other areas of medicine where a similar set of physical variables are available for measurement. For example, alerts for potentially dangerous heart rates can be generated by developing models based on the historical values measured by smart devices.
By applying our modelling technique, we can obtain white box models that can potentially be interpreted in terms of the variables used although perhaps not totally explainable in terms of physiological aspects of the body, the advantage being that we do not need an initial model and the search is not limited to previously adopted models.
Our current work is focused on the refinement of the models. Datasets from real patients include many different input variables whose influence on the models changes. Therefore, an analysis of each variable's contribution could help with the creation of more precise models, even for different times of day (e.g. morning, afternoon and night).