1 Introduction

Modern technologies provide sustainable and feasible solutions to many real-world problems. One area where these technologies have provided solutions in recent years is agriculture. Precision agriculture applies innovative technologies to the agricultural world to reduce costs, increase profit and achieve sustainability [1]. A comprehensive review of the state-of-the-art use of artificial intelligence (AI) in smart greenhouses is provided in [2]. That review focused on optimizing crop yields, reducing water consumption, fertilizer use, diseases and pests, and improving agricultural sustainability. It discusses the extent to which various AI technologies have been successfully applied in an agricultural context and the options for optimizing their usability.

Among the challenges facing precision agriculture is the adaptation of processes to climate change [3]. To monitor crop status and cope with the sudden weather changes that occur mainly in semi-arid climates, farmers use technologies such as the Internet of Things (IoT) to monitor their plots and/or greenhouses [4, 5]. The data generated by these systems also feed decision support systems that perform intelligent and automatic actions on the plots. Leading examples include climate control in greenhouses [6] and frost prevention in fruit orchards through smart irrigation [7].

Although decision support systems have numerous advantages and can make decisions in anticipation of future climatic conditions, they have the disadvantage of requiring local models to achieve high accuracy in predicting climate variables [8, 9]. This disadvantage translates into the need for historical data at the location of the plot to train and create an accurate model according to the farmer’s needs. In practice, this means installing the IoT system to collect data but not being able to use the prediction system reliably until sufficient historical data are available to create the prediction model. In [10], the authors review four families of bio-inspired intelligent algorithms used for agricultural applications, such as ecological, swarm intelligence-based, ecology-based, and multi-objective-based algorithms. They observed that no universal algorithm could perform multiple functions on farms; therefore, different algorithms were designed according to the specific functions to be performed.

Despite being in the era of Big Data, there is still a lack of quality data to address local problems such as the one mentioned above [11]. Recently, AI techniques have emerged that can generate artificial data of equal or higher quality than the original data, thus alleviating the problem of the amount of data needed to train local models [12]. Among these techniques, generative adversarial networks (GANs), deep artificial neural networks capable of generating artificial data [13], have obtained interesting results in different applications, including image processing [14], speech recognition [15] and others [16].

Within the field of precision agriculture, GANs have recently been applied to image processing tasks such as image augmentation [17, 18] and other tasks within computer vision [19]. However, to the best of our knowledge, synthetic data generation has not been applied to time series in precision agriculture for climate control. In this study, we propose and evaluate synthetic data generation strategies to increase the accuracy of forecasting models for greenhouse climate control.

Greenhouses are agricultural structures that must be tightly controlled to avoid extreme weather conditions and achieve high crop yields [20]. Therefore, farmers are increasingly installing greenhouses controlled by IoT systems to monitor their crops in real time. However, using these data to generate a greenhouse climate model that allows intelligent and automatic control, reducing the resources used while increasing crop production, is challenging. A historical dataset for training is therefore crucial for developing such a predictive model, but these data are not available for the specific location where the greenhouse is installed until the IoT system starts operating. To solve this data problem, this study proposes the creation of synthetic greenhouse data using GAN techniques in order to design a prediction system for climatic variables, focusing specifically on temperature, as it is one of the most influential monitored variables [21]. The main contributions of this study are:

  • Creation of synthetic datasets using GAN techniques, considering different time granularities.

  • Study of the best prediction technique using neural networks to predict the temperature of a greenhouse, considering various granularities.

  • Analysis and comparison of the different models created with both synthetic and original data, as well as with the fusion of both types of data.

The remainder of the paper is organized as follows. Section 2 summarizes state-of-the-art related studies regarding synthetic data generation for time series. Section 3 describes the proposed GAN technique for creating synthetic time series data, as well as the techniques used for evaluating such synthetic data, including the description of the data and the evaluation metrics used for the assessment. Section 4 presents the results, analysis and discussion. Section 5 highlights the conclusions and directions for future work.

2 Related works

Data collection and capture is one of the major features of an open and well-served society. Innovative technologies allow us to capture, analyze and merge data from a variety of sources. However, data are not always accessible, whether because of privacy concerns or because there is no local data collection system for a given problem [22]. In this situation, new AI technologies provide tools and techniques capable of creating synthetic data. Synthetic data are a simulation of ground truth data that provide a greater amount of information, enabling more robust and accurate techniques [23]. When creating synthetic data, it is important to consider the type of data to be created. The creation of synthetic image data is useful and widely used for health problems [24] or disease detection in crops [25]. However, the need for larger datasets is not exclusive to the world of image processing: any context that requires data for ad-hoc training also requires large datasets, whether in IoT settings (where time series data predominate) or open contexts (where tabular data predominate). In [26], the authors review the role of IoT devices in smart greenhouses and precision agriculture, analyzing variables such as the cost of agricultural production, environmental conservation, ecological degradation and sustainability. They show how the economic benefits of using IoT applications in smart greenhouses translate into long-term benefits in commercial agriculture.

Focusing on the generation of synthetic time series data, methods based on long short-term memory (LSTM) networks are widely used. In [27], an LSTM-based method for completing synthetic well logs from existing log data was established. This method allowed synthetic logs to be generated from input log datasets at no additional cost, considering trend variation and context information. Furthermore, combining a standard LSTM with a cascade system was proposed, demonstrating that this method gives better results than traditional neural network methods; the cascade system improved on a stand-alone LSTM network, providing an accurate and cost-effective way to generate synthetic well logs.

Another of the most widely used techniques for synthetic data generation in recent years is GANs [28]. GANs have been widely used on time series to detect anomalies, in both univariate [29,30,31] and multivariate models [32]. This scheme is common in unsupervised learning, where anomaly detection is of particular importance for class labeling. Existing work on the synthetic generation of time series data is not focused on agriculture; it consists of general works where techniques are proposed and evaluated on benchmarks, or of work focused on other areas. Yoon et al. [13] proposed a framework for the generation of synthetic time series data in which supervised and unsupervised techniques are combined. Specifically, the authors propose an unsupervised GAN with supervised training using autoregressive models.

However, in agriculture, time series GANs are rarely used. Some studies have used agricultural data as benchmark data [33, 34], but to the best of our knowledge, there are no publications that focus on solving precision agriculture problems using GANs. In this study, the usefulness of synthetic data is investigated by assessing whether they preserve the distribution of individual attributes, the accuracy of the ML models and pairwise correlations.

3 Materials and methods

This section presents the datasets used and their characteristics. The synthetic data generation model is then introduced, followed by the AI models used to validate the effectiveness of the synthetic data. Finally, the different training strategies followed to achieve the objective are presented.

3.1 Dataset

The creation of synthetic data must start from a ground truth dataset from the particular domain for which synthetic data will be generated. In this case, the actual data are obtained from an operational greenhouse located in a semi-arid region of south-eastern Spain (Murcia). The ground truth data are obtained from an IoT infrastructure that measures the inside temperature (°C) of this greenhouse, which has been in continuous operation since 2018. This infrastructure records data every 5 minutes, which are then aggregated into 15-minute, 30-minute and 60-minute series by computing the standard average.
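For illustration, the aggregation step can be reproduced with a standard resampling operation, as in the following sketch (the file and column names are assumptions):

```python
import pandas as pd

# Hedged sketch: average the 5-minute readings into 15-, 30- and 60-minute series.
# "greenhouse_5min.csv", "timestamp" and "temperature" are placeholder names.
raw = pd.read_csv("greenhouse_5min.csv", parse_dates=["timestamp"], index_col="timestamp")

datasets = {
    freq: raw["temperature"].resample(freq).mean()
    for freq in ("15min", "30min", "60min")
}
```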

Because the greenhouse is located in a semi-arid region, the thermal differences between summer and winter are remarkable; therefore, the ground truth data have also been divided into winter and summer periods. Table 1 shows the ground truth datasets we created for evaluation purposes, including the starting and ending date of the data and the total number of values available. Dataset names ending with W indicate that the training data end in winter, and names ending with S indicate that the training data end in summer.

Table 1 Description of ground truth dataset

Fig. 1 Architecture of the DoppelGANger used for the synthetic data generation

3.2 Synthetic data generation using GANs

For the generation of synthetic data, this study used DoppelGANger, a GAN architecture for sequential data proposed in [35]. Figure 1 shows the GAN architecture used, which builds on the established architecture of strawman GANs for time series generation. It uses recurrent neural networks (RNNs) to generate synthetic time series data. The generative part of DoppelGANger is based on a layer of LSTM cells with 100 units, following a batch generation strategy. Therefore, in each pass, the model generates S consecutive records of the synthetic time series instead of a single one, as most traditional approaches do (e.g., \(R_1, R_2,..,R_S\) in Fig. 1). According to the authors, this better captures the temporal correlation of long series and reduces the number of passes required to generate the synthetic data. Furthermore, the GAN also includes a normalization mechanism for each input time series to tackle the well-known mode-collapse problem of many GAN models. The discriminator, a multilayer perceptron (MLP) with up to five layers of 200 neurons each followed by a ReLU activation function, uses the Wasserstein loss to report the differences between the ground truth and the fake data.

3.3 Deep Learning models

To assess the impact of ground truth and synthetic time series on forecasting accuracy, four deep learning models have been considered: (1) MLP, (2) CNN, (3) LSTM and (4) a combination of CNN and LSTM. Each is described below, and a minimal implementation sketch is provided after the list.

  • MultiLayer Perceptron (MLP): The multilayer perceptron is an artificial neural network made up of multiple layers that form a directed graph through the connections between the neurons of those layers. This neural network attempts to simulate the biological behavior of neurons. The MLP can solve non-linearly separable problems because each neuron, in addition to its inputs, has a non-linear activation function. The MLP is trained with the backpropagation method, which adjusts the weights of the network connections to minimize the prediction error between the output produced by the network and the desired output. Layers can be classified into three types: the input layer comprises the neurons that receive the data, and no computation occurs in these neurons; hidden layers, which can be as numerous as necessary depending on the complexity of the data, comprise neurons whose input comes from previous layers and whose output is passed on to subsequent layers; finally, the output layer comprises neurons whose values correspond to the outputs of the network. In this study, a three-layer MLP comprising input, hidden and output layers is used. The first receives the input features; the hidden layer is where the inputs are processed so that the output layer generates the output of the MLP. The hidden layer learns complex relationships between the input and the output thanks to the activation functions of its neurons [36].

  • Convolutional Neural Network (CNN): Convolutional neural networks are a type of supervised-learning artificial neural network whose layers mimic the behavior of the visual cortex to identify different features in the inputs. These layers perform operations that modify the data in order to learn their particular characteristics. The three most common layers are convolution, activation (ReLU) and pooling. The convolutional layer applies a set of convolutional filters to the input data, where each filter activates different features. The rectified linear unit keeps positive values and sets negative values to zero, allowing faster and more efficient training; this step is also known as activation, since only activated features proceed to the next layer. The pooling layer simplifies the output through non-linear downsampling, which reduces the number of parameters the network must learn. These operations are repeated over tens or hundreds of layers, each of which learns to identify different features. After learning features in various layers, the architecture of a CNN moves on to classification: the penultimate layer is fully connected and generates a K-dimensional vector, and the final layer uses a classification layer to provide the final output. The difference between a CNN and a traditional neural network is that a CNN has shared weights and bias values, which are the same for all hidden neurons in a given layer. Although convolutional neural network models are most associated with image classification, they are also used in other applications and domains, such as regression, where they can be applied to time series by transforming the data to fit the input of the convolutional network [37].

  • Long Short-Term Memory (LSTM): The LSTM model has a recurrent neural architecture with state memory, which gives it the advantage of long-term memory; it is therefore widely used for time series. LSTM is an evolution of standard recurrent neural networks, used in machine learning problems where time is involved, because its architecture of cells and loops allows information to be transmitted and recalled across different steps. The LSTM architecture allows information to be stored over long time intervals. This is because the memory cells of the network comprise several sigmoid gating layers (instead of a single layer as in usual recurrent networks) that allow information to be discarded from or added to the main information line of the network, controlled by a hyperbolic tangent function. Information passes from one cell to another, first going through a sigmoid layer called the forget gate layer, which compares input and output and returns a value between 0 and 1: if it is 1, the information is kept; if it is 0, it is discarded. The next step comprises a second sigmoid layer and a hyperbolic tangent layer, used to decide which new information will be stored in the cell: the sigmoid layer, called the input gate layer, decides which values will be updated, and the hyperbolic tangent layer creates a vector of candidate values to be added to the state. The last step is a sigmoid layer that decides what the output will be, followed by a hyperbolic tangent layer that determines which values go to the network output according to the sign by which they are multiplied [38].

  • Convolutional Neural Network + Long Short-Term Memory (CNN+LSTM): This model, known as ConvLSTM, is a DL model that combines a CNN and an LSTM network. Its architecture shares parts of the CNN and LSTM architectures, with differences at the connection point: the fully connected end layer of the CNN is replaced by the input layer of an LSTM. Thus, the LSTM keeps its complete architecture, described above, while the CNN modifies its last layer. The CNN network therefore automatically extracts the input features, while the LSTM network produces the regression results. This combination offers the benefits of both models, creating a robust model for time series problems [39].
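For concreteness, a minimal sketch of these four architectures in Keras is shown below. The layer sizes and builder functions are illustrative assumptions; the actual hyperparameters used in the evaluation are those listed in Table 4.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Illustrative builders (layer sizes are assumptions, not the paper's exact settings).
def build_mlp(n_steps):
    return models.Sequential([
        layers.Input(shape=(n_steps,)),
        layers.Dense(64, activation="relu"),  # hidden layer
        layers.Dense(1),                      # temperature forecast
    ])

def build_cnn(n_steps):
    return models.Sequential([
        layers.Input(shape=(n_steps, 1)),
        layers.Conv1D(32, kernel_size=3, activation="relu"),
        layers.MaxPooling1D(2),
        layers.Flatten(),
        layers.Dense(1),
    ])

def build_lstm(n_steps):
    return models.Sequential([
        layers.Input(shape=(n_steps, 1)),
        layers.LSTM(64),
        layers.Dense(1),
    ])

def build_cnn_lstm(n_steps):
    # CNN feature extractor feeding an LSTM regressor
    return models.Sequential([
        layers.Input(shape=(n_steps, 1)),
        layers.Conv1D(32, kernel_size=3, activation="relu"),
        layers.MaxPooling1D(2),
        layers.LSTM(64),
        layers.Dense(1),
    ])

model = build_cnn_lstm(n_steps=96)  # 96 steps = one day at 15-minute granularity
model.compile(optimizer="adam", loss="mse", metrics=["mae"])
```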

3.4 Preparation of datasets for training and testing

To accurately assess the impact of the synthetically generated data, five training and testing strategies are proposed to assess the performance of the ML models previously presented. The first strategy (that is, the Ground truth dataset) is based only on the ground truth dataset (see Section 3.1). This dataset is divided into two parts: (1) the training dataset, comprising all the data except the last day, and (2) the test dataset, comprising the last day of available data. As these are time series data, it is not possible to perform cross-validation or to validate with any data other than the latest values of the time series, because time series require preserving the order of and dependence between the data.

The second strategy (namely, Synthetic dataset) relies only on the synthetic data generated with the GAN model previously presented. Here, (1) the data used for training are the generated synthetic data, and (2) the data used for testing are the last day of the ground truth time series. The synthetic evaluation data are discarded and replaced by evaluation data taken from the ground truth dataset, so the impact of the synthetic data on a real scenario can be rigorously evaluated.

The third strategy (namely, Synthetic + Ground truth dataset) combines synthetic and ground truth data. The ground truth dataset is extended by prepending synthetic data at the beginning of the time series, thus increasing the size of the training dataset. The models are then trained using the entire extended dataset, removing the last day, which is reserved for testing.

The fourth strategy (namely, Synthetic + Ground truth with reinforcement learning) is inspired by reinforcement learning. It also combines synthetic and ground truth data but, here, the model is first trained using only synthetic data. Once trained, the model is re-trained using the ground truth data. The rationale is that the greenhouse will be in continuous operation, so data will be generated over time and can be used to increase the performance of the models. As in the other strategies, the last ground truth day is used to evaluate accuracy.

The fifth strategy (Shuffled synthetic + Ground truth dataset) also uses the synthetic and ground truth datasets. It is similar to the third strategy, but the synthetic dataset is shuffled before being concatenated at the beginning of the ground truth dataset. As in the previous strategies, the last day of the ground truth dataset is used for testing. This strategy is used to verify that a time series generated according to a criterion is actually needed and that introducing merely random data would not be valid.
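The following sketch illustrates how these five strategies could be assembled from the ground truth and synthetic series; the function and variable names are assumptions, and the fourth strategy is represented as two consecutive training phases (fit on synthetic data, then fine-tune on ground truth data).

```python
import numpy as np

# ground_truth and synthetic are 1-D arrays of temperatures at the same sampling rate;
# steps_per_day is 96, 48 or 24 depending on the granularity.
def build_strategies(ground_truth, synthetic, steps_per_day, seed=0):
    rng = np.random.default_rng(seed)
    train_gt, test = ground_truth[:-steps_per_day], ground_truth[-steps_per_day:]
    return {
        # each entry: (list of training phases, test set); two phases = train then fine-tune
        "ground_truth":               ([train_gt], test),
        "synthetic":                  ([synthetic], test),
        "synthetic_plus_gt":          ([np.concatenate([synthetic, train_gt])], test),
        "synthetic_then_gt":          ([synthetic, train_gt], test),  # "reinforcement"-style
        "shuffled_synthetic_plus_gt": ([np.concatenate([rng.permutation(synthetic), train_gt])], test),
    }
```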

4 Evaluation and discussion

This study considers two dimensions of the problem: (1) the use of GANs for synthetic data generation (time series data) and (2) the impact on the accuracy of AI models depending on whether ground truth or synthetic data are used.

4.1 Exploratory data analysis

All the hyperparameters used for the GAN model are specified and described in the following list:

  • Max sequence length: Length of the time series sequences; variable-length sequences are not supported, so all training and generated data have the same sequence length. Used value is: length of the time series for one day (96, 48 or 24), depending on the dataset.

  • Sample length: Number of time series steps generated by each LSTM cell in DGAN; it must be a divisor of max_sequence_len. Used value is: length of the time series for one day (96, 48 or 24), depending on the dataset.

  • Batch size: Number of examples used in batches, for both training and generation. Used value is: min(1000, length of the dataset).

  • Apply feature scaling: Scale each continuous variable to [0,1] or [-1,1] (based on normalization param) before training and rescale to original range during generation. Used value is: True.

  • Apply example scaling: Compute the midpoint and half-range (equivalent to min/max) for each time series variable and include these as additional generated attributes; this provides better support for time series with highly variable ranges. Used value is: False.

  • Use attribute discriminator: Use a separate discriminator only on attributes, which helps DGAN match attribute distributions. Used value is: False.

  • Generator learning rate: Learning rate for Adam optimizer. Used value is: 0.0001.

  • Discriminator learning rate: Learning rate for Adam optimizer. Used value is: 0.0001.

  • Epochs: Number of epochs to train model. Used value is: 100000.
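As an illustrative sketch, the configuration above maps onto the open-source DoppelGANger implementation available in gretel-synthetics roughly as follows. The training array below is a placeholder, and parameter names may differ across library versions.

```python
# Hedged sketch assuming the DoppelGANger (DGAN) implementation from gretel-synthetics.
import numpy as np
from gretel_synthetics.timeseries_dgan.config import DGANConfig
from gretel_synthetics.timeseries_dgan.dgan import DGAN

steps_per_day = 96                                  # 15-min dataset (48 or 24 for 30/60 min)
# Placeholder for the real daily temperature windows, shape (n_days, steps_per_day, 1)
train_features = np.random.rand(500, steps_per_day, 1).astype("float32")

config = DGANConfig(
    max_sequence_len=steps_per_day,
    sample_len=steps_per_day,
    batch_size=min(1000, len(train_features)),
    apply_feature_scaling=True,
    apply_example_scaling=False,
    use_attribute_discriminator=False,
    generator_learning_rate=1e-4,
    discriminator_learning_rate=1e-4,
    epochs=100_000,
)

model = DGAN(config)
model.train_numpy(features=train_features)
_, synthetic_days = model.generate_numpy(1000)      # 1000 synthetic days of temperatures
```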

Table 2 Comparison of ground truth and synthetic temperature time series distribution

Fig. 2 Box plot comparing ground truth and synthetic data distributions according to sampling frequency

Table 2 shows the main statistical values of the ground truth time series sampled every 15, 30 and 60 minutes during two and a half years together with the same descriptive statistics of the synthetic series over 288, 144 and 72 years.

Most of these are the usual statistical values. In particular, the standard error of the mean (SEM) measures how much discrepancy is likely between a sample’s mean and the population mean. Kurtosis is the degree of peakedness of a distribution; if the value is close to 0, a normal distribution is often assumed. Skewness is usually described as a measure of a dataset’s symmetry; with a value between -0.5 and 0.5, the data are fairly symmetrical. The skewness and kurtosis statistics provide little information beyond that already given by the measures of location and dispersion, but they offer another element of comparison in the last column. The root-mean-square error (RMSE) is a frequently used measure of the differences between values, in our case between ground truth and synthetic values.
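The following sketch outlines how these descriptive statistics could be computed with standard tools; the two series are placeholders, and the RMSE over the paired statistics is one possible reading of the last column of Table 2.

```python
import numpy as np
from scipy import stats

# Hedged sketch: descriptive statistics comparing ground truth and synthetic series.
def describe(series):
    return {
        "mean": float(np.mean(series)),
        "SEM": float(stats.sem(series)),            # standard error of the mean
        "std": float(np.std(series, ddof=1)),
        "kurtosis": float(stats.kurtosis(series)),  # ~0 suggests a normal-like distribution
        "skewness": float(stats.skew(series)),
    }

ground_truth = np.random.normal(20, 5, 10_000)  # placeholders for the real temperature series
synthetic = np.random.normal(20, 5, 10_000)

gt_stats, syn_stats = describe(ground_truth), describe(synthetic)
# RMSE over the paired descriptive statistics (one possible reading of Table 2's last column)
rmse = np.sqrt(np.mean((np.array(list(gt_stats.values())) - np.array(list(syn_stats.values()))) ** 2))
```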

Fig. 3 Kernel density function comparing ground truth and synthetic data sets according to sampling frequency

As can be observed, the RMSE, calculated from the ground truth and synthetic columns of each sampling rate, is notably small for all the statistical measures shown. In addition, we can check the standardised mean difference (SMD), defined in (1), which measures the difference in means between the ground truth and synthetic time series. Normally, a value of less than 0.1 is considered a “small” difference.

Fig. 4 Q-Q plot comparing ground truth and synthetic data sets according to sampling frequency

Table 2 shows a notable statistical similarity between the ground truth and synthetic values, especially considering how many years of data are artificially generated. Visualizing the distribution of the time series helps identify possible numerical anomalies, such as outliers, that could produce similar statistical values for different distributions. That is why these conclusions must be visually corroborated by looking at the box-and-whisker diagram shown in Fig. 2, the kernel density function shown in Fig. 3 and the three Q-Q plots shown in Fig. 4, which compare the ground truth (line) and synthetic (points) data for the three sampling rates.

$$\begin{aligned} SMD=\frac{\mid \bar{x_{1}} - \bar{x_{2}} \mid }{\sqrt{\frac{(s_{1}^{2}+s_{2}^{2})}{2}}} \end{aligned}$$
(1)
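A direct transcription of (1) is straightforward; the sketch below assumes two aligned 1-D arrays.

```python
import numpy as np

# Standardized mean difference as defined in (1); values below 0.1 are usually
# interpreted as a "small" difference between the two series.
def smd(x1, x2):
    pooled_sd = np.sqrt((np.var(x1, ddof=1) + np.var(x2, ddof=1)) / 2)
    return abs(np.mean(x1) - np.mean(x2)) / pooled_sd
```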
Fig. 5 Comparison of the same week of the three sampling rates with respect to their corresponding generated time series

To corroborate the conclusion that the generated synthetic time series will be useful for enriching the training of predictive models with the tens of thousands of samples that are lacking in reality, the three sets of generated series are compared on the timeline. Figure 5 shows a comparison of one week sampled every 15, 30 and 60 minutes between the ground truth and synthetic datasets.

Table 3 Average of correlations between ground truth and synthetic data by time period and sampling frequency

Visually, the synthetic time series adjusts to the periodicity of each actual day. It is not perfect, but significant correlations between each pair of ground truth and synthetic datasets are reported. However, these correlations are not statistically significant when analyzed month-to-month or year-to-year (see Table 3). A priori, this is not a problem for the intended use of the synthetic results to improve deep-learning-based prediction models, because the objective is to predict over a nearby time period.
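The per-period correlations averaged in Table 3 can be sketched as follows; the two series below are placeholders standing in for the aligned ground truth and synthetic temperatures.

```python
import numpy as np
import pandas as pd

# Hedged sketch: Pearson correlation between ground truth and synthetic series, per month.
idx = pd.date_range("2019-01-01", periods=96 * 365, freq="15min")
df = pd.DataFrame(
    {"ground_truth": np.random.rand(len(idx)), "synthetic": np.random.rand(len(idx))},
    index=idx,
)

monthly_corr = df.groupby(idx.to_period("M")).apply(
    lambda g: g["ground_truth"].corr(g["synthetic"])
)
print(monthly_corr.mean())  # average month-to-month correlation
```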

In the following sections, this hypothesis is validated, i.e., it is shown that the generated data improve the training results of the proposed predictive models.

Table 4 Hyperparameters used for each model. (-) indicates that the model does not use that parameter

4.2 Model evaluation

Table 4 shows the models and hyperparameters used for assessment purposes.

The results of each model described in Section 3.3 using the above parameters are presented next. We used three metrics to perform the evaluation: the mean absolute error (MAE), the root mean squared error (RMSE) and the coefficient of determination (\(R^2\)). These are some of the most common metrics used to measure accuracy for continuous variables. MAE and RMSE are suitable for model comparisons as they express the average model prediction error in units of the variable of interest. Their definitions are as follows:

$$\begin{aligned} MAE= & {} \frac{1}{n}{\sum \limits _{i = 1}^n {|y_{i} - \hat{y}_{i}|} }\\ RMSE= & {} \sqrt{\frac{1}{n}{\sum \limits _{i = 1}^n {(y_{i} - \hat{y}_{i})^{2}} }}\\ R^2= & {} 1 - \frac{\sum \limits _{i = 1}^n {(y_{i} - \hat{y}_{i})^{2}}}{\sum \limits _{i = 1}^n {(y_{i} - \bar{y})^{2}}} \end{aligned}$$

where \(y_i\) is the real (ground truth) value of the climatological variable, \(\hat{y}_i\) is the predicted value, \(\bar{y}\) is the mean of the ground truth values and n is the number of observations.
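These metrics can be computed with standard library functions, as in the following sketch (the inputs are placeholders for the observed and predicted greenhouse temperatures):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Evaluation metrics as used in Tables 5-8; y_true / y_pred are 1-D arrays.
def evaluate(y_true, y_pred):
    return {
        "MAE": mean_absolute_error(y_true, y_pred),
        "RMSE": float(np.sqrt(mean_squared_error(y_true, y_pred))),
        "R2": r2_score(y_true, y_pred),
    }
```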

Table 5 shows the values of the metrics for the MLP for the five training strategies described in Section 3.4. As seen, the strategy following a reinforcement learning approach achieved the best scores in most metrics and time horizons. This is especially remarkable for the datasets with a time frequency of 15 minutes (GreenHouse-15m-W and GreenHouse-15m-S). Furthermore, such a reinforcement approach provided more accurate MLP models than those solely relying on ground truth data. The \(R^2\) of the former approach was 0.936 for GreenHouse-15m-S, whereas the score of the latter strategy was only 0.644, given a 12-h time horizon. Similar behavior was observed for the 24-h period with the same dataset (0.957 vs 0.835 \(R^2\)). The strategy using a shuffled version of the synthetic time series achieved larger errors than the one combining the time series as directly generated by the GAN. Concerning the sensitivity of the results, the accuracy of the MLPs trained following the synthetic or the synthetic + ground truth policies seems to decrease slightly as the sampling interval increases up to 60 min. For example, the R\(^2\) score of the synthetic-dataset MLP was 0.913 and 0.886 for the 15- and 30-min intervals of the summer dataset, but it dropped to 0.749 when the interval was set to 60 min. However, this pattern is not observed in the other policies in Table 5.

Table 5 Results of the MLP technique using ground truth, synthetic, a combination of ground truth + synthetic, ground truth + synthetic with reinforcement learning and shuffled synthetic + ground truth datasets

Table 6 shows the results obtained with the CNN model. Here, the three strategies that incorporated synthetic data during the training stage improved the results compared with the one relying solely on ground truth data. The combination of synthetic and ground truth data achieved the best scores for all metrics and time horizons for the GreenHouse-15m-W feed. Similar behavior was observed for GreenHouse-30m-W. However, when the interval increased to 60 min in the winter feed (GreenHouse-60m-W), the reinforcement learning strategy or the use of synthetic data alone provided better results. The summer datasets showed a slightly different pattern. The CNN models trained with the synthetic or the reinforcement-learning strategies were more accurate for the 30-min interval (GreenHouse-30m-S dataset), but the combination of synthetic and ground truth data provided the most accurate CNN model for the 15-min and 60-min intervals. This reveals that combining synthetic with ground truth data improved the training of the CNN at high time frequencies (15 min), whereas for lower frequencies the other two synthetic-based approaches were also suitable. In terms of sensitivity, the models following the ground truth or shuffled synthetic + ground truth approaches improve their results as the sampling interval increases from 15 min to 60 min. However, the other three approaches follow the opposite trend, with a slight accuracy improvement when the sampling interval decreases (e.g., the R\(^2\) score of the CNN with the synthetic + ground truth approach moved from 0.798 to 0.869 when the sampling interval of the summer feed decreased from 60 to 30 min). This suggests that, for the CNN model, the combination of synthetic and real data is best suited to time series with intervals below 30 min.

Table 6 Results of the CNN technique using ground truth, synthetic, a combination of ground truth + synthetic, ground truth + synthetic with reinforcement learning and shuffled synthetic + ground truth datasets
Table 7 Results of the LSTM technique using ground truth, synthetic, a combination of ground truth + synthetic, ground truth + synthetic with reinforcement learning and shuffled synthetic + ground truth datasets
Table 8 Results of the CNN+LSTM technique using ground truth, synthetic, a combination of ground truth + synthetic, ground truth + synthetic with reinforcement learning and shuffled synthetic + ground truth datasets

Table 7 summarizes the evaluation of the LSTM model. The three synthetic-based training strategies outperformed the approach that only used ground truth data for most metrics, time horizons and datasets. For example, the RMSE of the LSTM trained only with ground truth data was 6.358 for the GreenHouse-15m-S dataset with a 24-h time horizon, whereas the same model trained with synthetic data achieved a much lower RMSE of 3.829. Furthermore, the LSTM model exhibited differences in accuracy depending on the time frequency, as already observed with the CNN model. Table 7 shows that the reinforcement-learning approach allowed the LSTM model to improve its accuracy for most of the datasets with low time frequencies (GreenHouse-30m-S, GreenHouse-60m-W and GreenHouse-60m-S). The approach relying solely on synthetic data to train the model generated more accurate predictions for datasets with higher time frequencies (i.e., GreenHouse-15m-W and GreenHouse-15m-S), at least for the 12-h time horizon. The training strategy based on a shuffled version of the synthetic time series achieved larger RMSE and MAE values than the three strategies using the original synthetic time series, as well as than the LSTM model trained only with ground truth data. Table 7 also shows that all the models trained with the four policies including ground truth data were sensitive to the frequency of the input time series: their R\(^2\) scores increased when the sampling interval of the time series moved from 30 to 60 min. In contrast, a different behavior was observed for the LSTM trained solely with synthetic data, whose most accurate results were obtained when the sampling interval of the input time series was set to 15 min.

Last, Table 8 contains the evaluation results of the CNN+LSTM model. The three training alternatives that used synthetic time series improved the results compared with the one based solely on ground truth data. Furthermore, the strategy combining ground truth with synthetic data achieved the best results, especially for the 15- and 30-min datasets. For example, the RMSE of the model trained with this strategy was 0.932 for a 12-h prediction on GreenHouse-30m-W, a lower error than the one obtained by the variant trained only with ground truth data (i.e., 1.645). The CNN+LSTM model trained only with synthetic data achieved the best results for the two datasets with a 60-min interval. Unlike the previous models, the reinforcement-learning strategy performed slightly worse than the other alternatives. Moreover, the training using shuffled synthetic data achieved slightly higher errors than the other four alternatives in most cases. Regarding sensitivity, the CNN+LSTM variants obtained better scores with the 24-h time horizon than with the 12-h configuration. Furthermore, the CNN+LSTM trained solely with ground truth data obtained better results for the summer than for the winter feeds in terms of its R\(^2\) score (e.g., 0.928 vs 0.945 for the 60-min dataset with a 24-h prediction horizon according to Table 8). This seasonal sensitivity was also observed in the other four policies incorporating synthetic data.

Common patterns emerge across the results of the four evaluated models: 1) training the forecasting algorithms with the synthetic time series improved their prediction capabilities compared with the alternative of relying only on ground truth data; 2) using a shuffled version of the synthetic data provided no meaningful improvement over the models trained with just ground truth data; and 3) the strategy combining ground truth with synthetic data provided the most robust models for 15-min and 30-min intervals, at least for the CNN and LSTM variants, whereas for larger intervals the reinforcement learning strategy provided more reliable predictors.

Evaluating the strategies has also revealed a sensitivity of the models to the frequency and season of the input time series. However, how these two factors affect the accuracy of the predictors strongly varies across models and training strategies with no global sensitivity pattern. Although the MLP and CNN with ground truth data performed better in the winter season, the other alternatives with synthetic data seem to provide better results in the summer time series. However, the CNN and CNN+LSTM alternatives do not follow such seasonal trends and show slightly better results in summer than in winter, regardless of the particular training strategy used to compose the predictor.

This has important implications in operational terms, as it is necessary to consider the season and the frequency of the time series in order to select a training strategy and a predictive algorithm. For example, in greenhouse settings where the summer season is the most important part of the year, the evaluation showed that a CNN or CNN+LSTM instance trained with the synthetic + ground truth policy would be the most suitable configuration. The evaluation showed, for example, that the RMSE of the CNN+LSTM model trained solely with ground truth data was above 3.00 for all the summer feeds (Table 8), whereas the CNN+LSTM fed with synthetic and ground truth data was below 2.42 for the same summer feeds.

These findings confirm the main hypothesis of this work: using coherent synthetic time series to enlarge the training sets of a forecasting model helps to improve its final accuracy. Furthermore, the shuffled-series experiment shows that this improvement does not occur merely because more data are added to the training corpus, but because the synthetic series actually behaves in a manner similar to the target one.

5 Conclusion and future work

Precision agriculture is moving from tele-control systems to intelligent control systems by exploiting the data generated by IoT systems for more sustainable and efficient crop management. This transition requires substantial amounts of reliable and ready-to-use data from the moment the system is deployed in order to train ML/DL models that meet expectations.

In this context, this novel study shows the reliability and suitability of using synthetic time series to expand the training corpus of deep-learning forecasting algorithms. The goal of these algorithms is to predict the internal temperature of greenhouses in order to anticipate actions that keep this temperature within a suitable range. Five training strategies have been defined to optimally fuse ground truth and synthetic data.

The models trained with some of these fusion strategies outperformed the alternative models trained solely with the raw measurements from the temperature sensors, considering different time frequencies, evaluation metrics and time horizons. The evaluated metrics were affected by the frequency of the target time series and the season under consideration (winter or summer). This calls for a careful procedure to select the model and the training strategy based on the period of the year under study and on the frequency and data curation applied to the input sequences.

This work opens a novel and promising research line for studying the most suitable training strategies for combining raw and synthetic time series in the development of smart greenhouses. Future work will focus on: 1) developing other combinations of ground truth and synthetic data to further improve the prediction of AI/ML models; 2) using other synthetic data generation techniques and evaluating their effectiveness; 3) applying transfer learning to time series models for synthetic data generation; 4) generating synthetic data and AI models in their multivariate version, considering all the variables that exist in a greenhouse; and 5) applying synthetic data generation methods and AI models in contexts other than precision agriculture in greenhouses.