Introduction

It is well known that water is an essential resource for economic development, for food production, for healthy ecosystems and, in short, for the survival of living beings. However, water availability is becoming increasingly limited due to the rapid growth of the world population. Adequate management, understood as the planning, development and distribution of resources, is essential to optimize the use of water.

The companies and entities managing the supply of drinking water aim to meet consumer demand every day with the greatest possible efficiency. However, in many cases the operation merely covers the instantaneous water demand, without using any advanced technique for predicting consumption. Under these conditions, being able to predict the short-term demand pattern (1-48 h) in advance is a valuable tool for optimizing the management of water reserves and the use of associated equipment. For example, it could allow the schedules of the supply pumps to be planned to take advantage of the periods with cheaper tariffs. Several authors have quantified how operation planning based on demand prediction can lead to energy cost savings, in many cases in excess of 18% (Cembrano et al. 2000; Salomons et al. 2017; Kang et al. 2014).

In this context, the analysis of water-related time series and, in particular, their prediction, is a tool that can help improve the management of the integral water cycle for drinking water supply or crop irrigation, waste water generation, natural sources, etc. Within this field, this work focuses on the analysis and prediction of drinking water consumption in Murcia, a city located in southeastern Spain.

In general, the analysis of time series has two main objectives: to identify the nature of the phenomenon represented by the time series and to predict its future values. Forecasting techniques based on time series models (Trull et al. 2019, 2020) have been widely developed and applied in very diverse disciplines, such as economics, meteorology, medicine or resource management.

In recent decades, machine learning techniques (Talavera-Llames et al. 2016, 2019) from the field of artificial intelligence have been successfully applied to forecasting problems, in particular artificial neural networks (ANN) (Rana et al. 2014; Lin et al. 2019). In recent years, the enormous amount of time series measurements collected from smart devices has made deep learning necessary, thus giving rise to deep neural networks (DNN) (Torres et al. 2018, 2019).

In this work, we propose a DNN to predict water demand. First, the dataset, composed of water consumption measurements collected every 10 min, is analyzed. Then, the methodology based on a deep feed forward network is presented, together with an improved and robust way to evaluate the learning of a time series model while preserving its temporal order. Exhaustive experimentation using real-world water consumption data is reported, obtaining an error of approximately 3%. Finally, a comparison is performed against k nearest neighbors, random forest, extreme gradient boosting, a classical time series model and two persistence models based on the real values of the previous day or week.

The main contributions of this work can be summarized as follows:

  1. A deep learning model specifically designed for water consumption forecasting.
  2. A robust way to evaluate the learning of a time series model preserving its temporal order.
  3. Analysis of the behaviour of the water consumption in a city of Spain.
  4. Reported error results of 3% for the real-world water consumption.
  5. Comparison of prediction accuracy with other state-of-the-art forecasting methods and a statistical test in order to validate the results.

The rest of the paper is structured as follows: previous research related to the paper’s topics is presented in Sect. 2. Section 3 defines the forecasting problem to be solved. Section 4 summarizes the main characteristics of the water consumption of Murcia and describes how the proposed algorithm works and the methodology used to evaluate a time series model. The experimental setting and the results obtained are shown in Sect. 5. Finally, the conclusions and future work are provided in Sect. 6.

State of the art

This section reviews recently published works related to water demand forecasting.

Classical time series models have proven to be competitive for water consumption forecasting problems. Anele et al. (2017) obtained predictions of water consumption in southwest Spain by means of AR, MA, ARMA and ARMAX models using water consumption data combined with meteorological information. Lee and Derrible (2020) applied regression techniques, namely linear, Lasso and Bayesian, to predict daily water consumption using demographic data and housing information. The regression models were compared with other widely used machine learning techniques such as gradient boosting (GB).

The prediction of water demand using machine learning techniques has been intensively studied in recent years due to the increasing availability of large amounts of data. Tree-based techniques have been one of the most widely used approaches. Nunes-Carvalho et al. (2021) trained a random forest (RF) model, among others, using socio-demographic information and historical water consumption data, to predict water demand patterns in the city of Fortaleza, Brazil. Bolorinos et al. (2020) trained an RF model for detecting changes in consumption. An RF was also used to forecast daily consumption in southwest China in Chen et al. (2017). A tree-based model, namely a GB, was proposed by Xenochristou et al. (2020) to predict water demand at different scales and to compare the results obtained for each of them. In Villarin and Rodriguez-Galiano (2019), the authors compared the performance of classification and regression trees (CART) and RF to forecast time series of water demand in the city of Seville in Spain.

Several works published in recent years proposed the support vector regression (SVR) method to obtain accurate predictions of water consumption. Candelieri (2017) designed a model based on SVR to predict hourly water demand using two different data sources in order to optimize pumping operations and to detect anomalies. A least squares SVR was also applied to predict residential, industrial and commercial water demand in the city of Bogotá, Colombia in Peña-Guzmán et al. (2016). Different machine learning models for forecasting water demand were compared in Herrera et al. (2010) using data from an urban area in a city in southeastern Spain. In particular, artificial neural networks, projection pursuit regression, multivariate adaptive regression splines, random forests and SVR were tested, with the SVR method obtaining the best results.

Recently, many neural network architectures have also been proposed for water consumption forecasting. Ghiassi et al. (2017) used two neural networks and a model based on nearest neighbors for daily, weekly and monthly forecasting of water demand in the city of Tehran, obtaining highly competitive results. A deep belief network was used by the authors in Xu et al. (2019) for the prediction of hourly water demand. In Mouatadid and Adamowski (2017) the performance of various machine learning methods was evaluated to forecast urban water demand for one day and three days ahead, with the extreme learning machine (ELM) model having the lowest prediction error.

Finally, ensemble models are increasingly popular because they tend to achieve better results than a stand-alone method. A technique based on stacking models, including artificial neural networks and deep learning architectures, to predict daily water demand using real data from the United Kingdom was proposed in Xenochristou and Kapelan (2020). Ambrosio et al. (2019) combined different models, including the multilayer perceptron, for water demand prediction. A weighted strategy that gathers the advantages of different machine learning techniques such as neural networks, random forests, support vector machines and k-nearest neighbors was suggested in Antunes et al. (2018) and compared with an autoregressive integrated moving average (ARIMA) model.

In addition to forecasting tasks, other studies have also been carried out to analyze water consumption. For instance, the authors in Coelho et al. (2017) proposed a metaheuristic based on deep learning and graphics processing units (GPU) to analyze time series of water consumption in big data environments. Clustering techniques have also been applied to water consumption data. The application of a mixture of non-homogeneous hidden Markov models to cluster time series that share the same transition dynamics was proposed in Leyli-Abadi et al. (2018). A similar study was carried out in Padulano and Giudice (2018), where first clustering and then classification techniques were applied to data from consumption meters in a household in Soccavo, in the city of Naples (Italy).

A correct selection of the predictive variables is important, since a large number of features does not always lead to a significant improvement in the results. Some authors have used climatological, population or even urban mobility data as predictive variables, in addition to the previous values of water consumption (Smolak et al. 2020).

In summary, water demand forecasts have been made in many different geographic areas and population centers. Antunes et al. (2018) obtained forecasts of the water demand for two cities in Portugal. Tiwari et al. (2016), as well as Bougadis et al. (2005) and Bata et al. (2020), used several population centers in Canada. Smolak et al. (2020) provided predictions of water consumption for several towns in Poland and Duerr et al. (2018) for several towns in Florida (USA). Pacchin et al. (2019) and Gagliardi et al. (2017) carried out a comparison of different prediction techniques applied to some places in Italy. Ren and Li (2016) obtained consumption predictions for the city of Shanghai in China. Other works addressed water prediction at the level of individual users, such as households or certain businesses and industries (Rahim et al. 2019; Farah et al. 2019).

With respect to prediction horizons, although most predictions are made for the short term, with 24 h being the most common horizon, several authors made forecasts for longer periods, such as weeks, months, or even years (Bata et al. 2020; Tian and Xue 2017).

Although the previously cited works differ significantly in their models, and in some cases even in their scope, a summary of results is provided in order to offer a general overview of their performance. Antunes et al. (2018) obtained a mean absolute percentage error (MAPE) between 8.3% and 17.6% for the next 24 h using an ensemble of models. Recently, Bata et al. (2020) obtained a MAPE of 12.3% for day-ahead water forecasting and Smolak et al. (2020) a MAPE of 9.6%.

After a thorough review of the previously published works, it can be concluded that machine learning techniques have generally provided better results than classical techniques, but also that no single machine learning model is the most appropriate for all cases. Conversely, several works (Makridakis et al. 2018) concluded that classical prediction methods may perform better than those based on machine learning for certain time series. These two points reinforce the idea that each case must be analyzed with its particularities.

Problem description

In this paper we analyze and predict the short-term demand for drinking water in the city of Murcia, located in southeastern Spain, one of the areas of Europe under the greatest water stress. These predictions could later be used for two purposes: the optimization of water management and the detection of anomalies.

The goal of time series analysis is to obtain mathematical models that explain the behavior observed in a time series and that can be applied to the prediction of future values. To do this, we propose to develop a machine learning model, based on deep neural networks, that forecasts the drinking water demand in the city of Murcia as accurately as possible in the short term, namely, four hours. A prediction horizon of four hours has been chosen because it is sufficient time to plan some of the main tasks carried out every day in the management systems of a city of these characteristics. As the samples are acquired every 10 min, predicting the water consumption for the next 4 h requires a multi-step prediction. Therefore, the model will provide 24 values in each run. In addition, the computations required to obtain the prediction must be performed every 10 min.

Finally, these predictions would then be used to optimize the operation and to detect anomalous consumption patterns due to breakdowns in the distribution network.
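The multi-step setting described above can be illustrated with a minimal sketch. It assumes that each training example pairs a fixed number of past samples with the next 24 samples; the function name `make_windows` and the `window` argument are hypothetical and only serve the illustration, and the series used is synthetic.

```python
SAMPLE_MINUTES = 10
HORIZON_HOURS = 4
HORIZON = HORIZON_HOURS * 60 // SAMPLE_MINUTES  # 24 values per forecast


def make_windows(series, window, horizon=HORIZON):
    """Pair `window` past values with the `horizon` values that follow them."""
    inputs, targets = [], []
    for start in range(len(series) - window - horizon + 1):
        inputs.append(series[start:start + window])
        targets.append(series[start + window:start + window + horizon])
    return inputs, targets


# Synthetic series used only to show the shapes involved.
series = list(range(100))
X, Y = make_windows(series, window=6)
```

Each row of `X` would feed the input layer and each row of `Y` would be the 24-value target; in operation, a new input window would be formed every 10 min.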

Proposed methodology

Data

The city of Murcia is located in the region of Murcia in southeast Spain and has a population of 453258 inhabitants, with an average annual growth of 0.8% in the last five years. It is the seventh largest city in Spain in terms of population and geographically includes 52 districts covering an area of 882 \(\text {km}^2\), as shown in Fig. 1. The network of distribution pipes managed by the municipal water company of Murcia reaches 2203 km, and the consumption of drinking water per inhabitant is approximately 185 litres per day. Water is used mainly for domestic, industrial, service and garden irrigation purposes.

Fig. 1 Location of the city of Murcia in the region of Murcia

The drinking water consumption data consist of measurements in cubic meters per hour (\(\text {m}^3/\text {h}\)) collected by the supervisory control and data acquisition (SCADA) system of the company that manages water in Murcia. The data are recorded every 10 min from January 1, 2019 to June 30, 2020. In short, the starting dataset is composed of 78773 samples and a summary of the main statistical values is shown in Table 1.

Table 1 Descriptive statistics of the water consumption time series

Figure 2 shows the water consumption in Murcia from January 2019 to December 2019 divided into quarters. High seasonality can be observed, and consumption remains at stable values for most of the year, although it was significantly reduced during the summer period of 2019. This is possibly caused by the decline in the city’s population during the holiday period.

Fig. 2 Water consumption from January 2019 to December 2019 divided into quarters

Figure 3 presents the values of water consumption for the week comprising the days from Monday 21 to Sunday 27 January 2019. It can be noted that working days from Monday to Friday show a similar pattern. However, weekends and holidays have a different consumption pattern related to the change in activity and schedules.

Fig. 3 Water consumption for a particular week

Figure 4 depicts the water demand for one working day. It follows a pattern according to the activity and habits of the day, that is, consumption is very low in the early morning, with a peak in the early afternoon that decreases, and increases again in the early evening.

Fig. 4 Water consumption for a particular working day

Figure 5 shows the water consumption for several weeks. It is not always known in detail what causes the variations that are observed, for example, between working days. The causes are likely to be very varied, ranging from breakdowns to specific demands from large consumers or large events in the city. It should be noted that similar patterns occur in other utilities required by our society such as electricity demand (Galicia et al. 2018; Troncoso et al. 2004) or transportation (Yasdi 1999).

Fig. 5 Water consumption for several weeks

Deep neural networks

There are currently a large number of DNN architectures such as feed forward, convolutional or recurrent networks, each specially designed for a particular type of application or data. A full survey of deep learning for time series forecasting can be found in Torres et al. (2021).

In this work, a Deep Feed Forward Neural Network (DFFNN) has been designed for water consumption forecasting. Its main advantages are its ability to learn both linear and nonlinear relationships present in the time series, the possibility of making multi-step and multivariate predictions, and the fact that it needs fewer modeling assumptions than other techniques. On the other hand, deep learning techniques also have a number of drawbacks, such as poorly interpretable models and a high number of hyper-parameters. Although the explainability or interpretability of the models may be very relevant in other types of applications such as medicine or finance, among others, it is not as relevant for water consumption forecasting.

DFFNN, also called multi-layer perceptron, arose due to the inability of single-layer neural networks to learn certain functions. The architecture of a DFFNN is composed of an input layer, an output layer and different hidden layers as shown in Fig. 6. In addition, each hidden layer has a certain number of neurons to be determined.

Fig. 6 Basic architecture of a DFFNN for time series forecasting

The relationships between the neurons of two consecutive layers are modelled by weights, which are calculated during the training phase of the network. In particular, the weights are computed by minimizing a cost function by means of gradient descent optimization methods, with the back-propagation algorithm used to calculate the gradient of the cost function. Once the weights are computed, the values of the output neurons of the network are obtained using a feed forward process defined by the following equation:

$$\begin{aligned} a^l = g(W_a^la^{l-1}+b_a^l) \end{aligned}$$
(1)

where \(a^l\) are the activation values in the l-th layer, that is, a vector composed of the values of the neurons of the l-th layer, \(W_a^l\) and \(b_a^l\) are the weights and biases corresponding to the l-th layer, and g is the activation function. Therefore, the \(a^l\) values are computed using the activation values of the \((l-1)\)-th layer, \(a^{l-1}\), as input. In time series forecasting, the rectified linear unit function (ReLU) is commonly used as the activation function for all layers except the output layer, which produces the predicted values and generally uses the hyperbolic tangent function (tanh).
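As an illustration of Eq. (1), the following minimal sketch applies the feed forward process layer by layer, with ReLU in the hidden layer and tanh in the output layer. The weights, biases and layer sizes are arbitrary illustrative values, not those of the trained model, and the helper names `dense` and `forward` are hypothetical.

```python
import math


def relu(v):
    return [max(0.0, x) for x in v]


def tanh(v):
    return [math.tanh(x) for x in v]


def dense(W, a_prev, b, g):
    """One application of Eq. (1): a^l = g(W^l a^{l-1} + b^l)."""
    z = [sum(w * a for w, a in zip(row, a_prev)) + bias
         for row, bias in zip(W, b)]
    return g(z)


def forward(x, layers):
    """Propagate the input through every (W, b, g) layer in order."""
    a = x
    for W, b, g in layers:
        a = dense(W, a, b, g)
    return a


# Illustrative network: 2 inputs -> 2 hidden neurons (ReLU) -> 1 output (tanh).
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0], relu),
    ([[1.0, 1.0]], [0.0], tanh),
]
y = forward([2.0, 1.0], layers)
```

Because tanh is bounded in \([-1,1]\), an output layer of this kind presupposes targets scaled to that range.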

For all network architectures, the values of some hyper-parameters have to be chosen in advance. These hyper-parameters, such as the number of layers and the number of neurons, define the network architecture, and other hyper-parameters, such as the learning rate, the momentum, number of iterations or minibatch size, among others, have a great influence on the convergence of the gradient descent methods. The optimal choice of these hyper-parameters is important as these values greatly influence the prediction results obtained by the network. The hyper-parameters will be discussed in more detail in Sect. 5.3.

Model evaluation

Classical techniques for the selection and evaluation of machine learning models have limitations when applied to time series forecasting. The hold-out technique with a single training and test set involves arbitrarily selecting a test set, which corresponds only to the final temporal range of the available values of the time series. The resulting error measure may therefore not be very representative of the model’s predictive performance at any other timestamp of the time series. The classical k-fold cross-validation, in turn, does not respect the temporal order of the samples, an essential feature in time series (Bergmeir and Benítez 2012).

In this work, a nested cross-validation technique is used (Varma and Simon 2006). With this evaluation technique, the water consumption time series is studied in different time ranges, repeating the training and testing process for each of these ranges. Finally, a more robust and representative final error is obtained. This error is the average of the errors obtained for each aforementioned time range, as is depicted in Fig. 7. For the proposed DFFNN model, the re-training process is repeated for 10 different periods, using the datasets composed of the first 6, 10, 11, 12, 13, 14, 15, 16, 17 and 18 months, respectively.

Fig. 7 Nested cross-validation procedure for model evaluation (adaptation of Cochrane et al. 2021)

The proposed model was periodically re-trained with all available data, as the DFFNN model can obtain better results using a larger amount of data. Thus, a growing window strategy is applied instead of the typical sliding window, as shown in Fig. 8. The historical window of values used for each forecast determines the number of neurons in the input layer of the DFFNN model and is one of the parameters to be optimized. In this work, the data are distributed among the training, validation and test sets in proportions of 60%, 15% and 25%, respectively.

Fig. 8 Growing window versus sliding window
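The growing window evaluation can be sketched as follows. This is only an illustration under two stated assumptions that are not taken from the text: months are approximated as 30 days, and the 60%/15%/25% split is applied chronologically inside each growing period. All function names are hypothetical.

```python
SAMPLES_PER_DAY = 144                            # one sample every 10 min
APPROX_SAMPLES_PER_MONTH = 30 * SAMPLES_PER_DAY  # assumption: 30-day months
TRAINING_MONTHS = [6, 10, 11, 12, 13, 14, 15, 16, 17, 18]  # periods listed in the text


def chronological_split(n_samples):
    """Split indices 0..n_samples-1 into 60% train, 15% validation, 25% test, in time order."""
    train_end = n_samples * 60 // 100
    val_end = n_samples * 75 // 100
    return range(0, train_end), range(train_end, val_end), range(val_end, n_samples)


def growing_window_folds(months=TRAINING_MONTHS):
    """Yield one split per period; every period starts at the first sample."""
    for m in months:
        yield m, chronological_split(m * APPROX_SAMPLES_PER_MONTH)


def nested_cv_error(fold_errors):
    """Final error: the average of the errors obtained in each period."""
    return sum(fold_errors) / len(fold_errors)


folds = list(growing_window_folds())
```

Since every fold starts at index 0 and only its end moves forward, no sample is ever predicted by a model trained on later data.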

Results

Quality measures

Four well-established metrics in the context of time series have been chosen in order to evaluate the performance of the DFFNN model proposed in this work.

The mean absolute percentage error (MAPE) is a relative error expressed as a percentage. It is used as a guideline to measure the goodness of the prediction method when compared with other models:

$$\begin{aligned} \text {MAPE} (\%)= \frac{100}{n}\sum _{t=1}^{n}\frac{|y_t - \widehat{y}_t|}{y_t} \end{aligned}$$
(2)

The mean absolute error (MAE), expressed in \(\text {m}^3/\text {h}\), indicates the average deviation between actual and predicted values:

$$\begin{aligned} \text {MAE} = \frac{1}{n}\sum _{t=1}^{n}|y_t - \widehat{y}_t| \end{aligned}$$
(3)

The root mean squared error (RMSE), expressed in \(\text {m}^3/\text {h}\), is the square root of the average of squared differences between predicted and actual values. By using the squared values, all of them are forced to have a positive value and the errors of greater magnitude have, proportionally, a higher weight in the result.

$$\begin{aligned} \text {RMSE} = \sqrt{\frac{1}{n}\sum _{t=1}^{n}|y_t - \widehat{y}_t|^2} \end{aligned}$$
(4)

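Finally, the coefficient of determination \(R^2\) provides a measure of the accuracy with which predictions match actual values. A value of 1 indicates a perfect fit and a value close to 0 a poor fit; as defined in Eq. (5), \(R^2\) can also become negative when the predictions fit worse than the mean of the series.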

$$\begin{aligned} R^2 = 1-\frac{\sum _{t=1}^{n} (y_t - \widehat{y}_t)^2}{\sum _{t=1}^{n}(y_t - \bar{y})^2} \end{aligned}$$
(5)

For all the equations above, \(y_t\) represents the actual value of the time series, \(\hat{y}_t\) represents the forecasted value, n represents the number of points included in the prediction and \(\bar{y}\) denotes the mean of the time series values.
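Equations (2) to (5) translate directly into code. The sketch below is a plain transcription of the four formulas; the sample values are made up for illustration and are not taken from the water consumption data.

```python
import math


def mape(y, y_hat):
    """Eq. (2): mean absolute percentage error, in percent."""
    return 100.0 / len(y) * sum(abs(a - p) / a for a, p in zip(y, y_hat))


def mae(y, y_hat):
    """Eq. (3): mean absolute error, in the units of the series."""
    return sum(abs(a - p) for a, p in zip(y, y_hat)) / len(y)


def rmse(y, y_hat):
    """Eq. (4): root mean squared error, in the units of the series."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(y, y_hat)) / len(y))


def r2(y, y_hat):
    """Eq. (5): coefficient of determination."""
    mean = sum(y) / len(y)
    ss_res = sum((a - p) ** 2 for a, p in zip(y, y_hat))
    ss_tot = sum((a - mean) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


# Made-up actual and predicted values (m^3/h) for illustration only.
y = [100.0, 200.0, 300.0, 400.0]
y_hat = [110.0, 190.0, 300.0, 420.0]
scores = {"MAPE": mape(y, y_hat), "MAE": mae(y, y_hat),
          "RMSE": rmse(y, y_hat), "R2": r2(y, y_hat)}
```

Note that Eq. (2) divides by the actual value, so it is only defined when the series contains no zeros.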

Preprocessing

The quality of the input data is essential for any deep learning model to obtain accurate predictions. Therefore, an analysis of the performance of the DFFNN when applying different preprocessing techniques has been carried out.

The time series has a total of 33 missing values and no values equal to 0, null, or negative. In order to determine which technique is the most appropriate for the imputation of missing values, some values from the time series have been randomly removed. These removed values have then been imputed using different methods such as forward fill, backward fill, linear interpolation, linear fill, cubic fill, mean of k nearest neighbors and seasonal mean. Then, the mean square error (MSE) is computed on a training set and the method providing the lowest MSE is selected. The best results have been obtained using linear interpolation.
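The selection procedure for the imputation method can be sketched as below. Only two of the listed methods are implemented, the series is a synthetic ramp rather than the real data, and the function names and the choice of 10 removed points are assumptions made for the illustration.

```python
import random


def forward_fill(values):
    """Replace each None with the last observed value."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


def linear_interpolation(values):
    """Fill interior None gaps on the straight line between known neighbours."""
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        for i in range(left + 1, right):
            frac = (i - left) / (right - left)
            filled[i] = values[left] + frac * (values[right] - values[left])
    return filled


def imputation_mse(series, method, n_removed, seed=0):
    """Remove random interior points, impute them, and score against the truth."""
    rng = random.Random(seed)
    # The endpoints are kept so that both methods can fill every gap.
    removed = set(rng.sample(range(1, len(series) - 1), n_removed))
    masked = [None if i in removed else v for i, v in enumerate(series)]
    filled = method(masked)
    return sum((filled[i] - series[i]) ** 2 for i in removed) / n_removed


series = [float(t) for t in range(50)]  # synthetic ramp, for illustration only
scores = {m.__name__: imputation_mse(series, m, n_removed=10)
          for m in (forward_fill, linear_interpolation)}
best_method = min(scores, key=scores.get)
```

On a real series the ranking depends on the data; the synthetic ramp simply favours linear interpolation by construction.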

In a time series, the presence of statistically anomalous values or outliers is common. Some outliers are simply due to errors in the system for measuring and recording the water consumption data. However, other outliers may be caused by real variations in consumption, such as undesired one-off situations (breakdowns in the transport and distribution networks) or occasional demands from large consumers (municipal swimming pools, industries, large events, etc.), which cannot always be known in advance. For model learning, it is recommended to keep the outliers corresponding to high consumption that may occur periodically and that are not caused by failures. However, since both types of outliers are indistinguishable and our DFFNN model has a significant tolerance to the presence of these anomalous values, no special treatment of outliers has been applied.

The time series includes consumption values ranging from 301 to 2871 \(\text {m}^3/\text {h}\). It is known that the gradient descent technique used by the DFFNN model in the training phase works better if the variables are in a smaller range, as it is able to converge more quickly to its solution. The effect of standardization and of scaling transformations to the ranges [0, 1] and \([-1,1]\) has been tested. It was observed that scaling to the range \([-1,1]\) provided the best results.
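A minimal sketch of the scaling to \([-1,1]\) and of its inverse, which is needed to express the predictions in \(\text {m}^3/\text {h}\) again, is shown below. The function names are hypothetical; the bounds 301 and 2871 are the series minimum and maximum quoted above, used here as if they were the fitted bounds.

```python
def scale(x, lo, hi, a=-1.0, b=1.0):
    """Map x from [lo, hi] linearly onto [a, b]."""
    return a + (x - lo) * (b - a) / (hi - lo)


def unscale(z, lo, hi, a=-1.0, b=1.0):
    """Inverse of `scale`: map z from [a, b] back onto [lo, hi]."""
    return lo + (z - a) * (hi - lo) / (b - a)


lo, hi = 301.0, 2871.0  # minimum and maximum consumption quoted in the text
scaled_min = scale(301.0, lo, hi)
scaled_max = scale(2871.0, lo, hi)
scaled_mid = scale(1586.0, lo, hi)
restored = unscale(scaled_mid, lo, hi)
```

In practice the bounds would usually be computed from the training set only, so that no information from the validation or test sets leaks into the transformation.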

Finally, transformations have been performed to make the time series of water consumption stationary, without obtaining any improvement in the accuracy of the predictions.

Table 2 shows a summary of the different techniques applied to preprocess the water consumption data regarding missing values, feature scaling and transformation to stationary time series and the technique selected according to the lowest mean square error.

Table 2 Summary of preprocessing techniques

Hyper-parameters

Most machine learning algorithms require the selection of several parameters that are not directly learned by the model. These are called hyper-parameters. The hyper-parameters to be optimized for the DFFNN model are shown in Table 3. With the objective of minimizing the MSE, a grid search strategy was used to find the best values for the hyper-parameters. Once the best values have been obtained from the different possible combinations, the final model is trained. For the rest of the methods used in the comparison, all hyper-parameters were optimized following the same grid search strategy. The search ranges were set to those most widely used in the literature.

Table 3 Hyper-parameter search for DFFNN model
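The grid search strategy can be sketched as an exhaustive loop over all combinations. The grid values below are hypothetical and do not reproduce Table 3, and `toy_validation_mse` is a stand-in for training the network and measuring its validation MSE.

```python
from itertools import product


def grid_search(grid, validation_mse):
    """Evaluate every combination in `grid` and keep the one with the lowest MSE."""
    names = list(grid)
    best_params, best_score = None, float("inf")
    for combo in product(*(grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = validation_mse(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical search space, for illustration only.
grid = {
    "hidden_layers": [1, 2, 3],
    "neurons": [16, 32, 64],
    "learning_rate": [0.1, 0.01, 0.001],
}


def toy_validation_mse(params):
    """Stand-in objective; a real run would train the DFFNN and return its validation MSE."""
    return ((params["hidden_layers"] - 2) ** 2
            + (params["neurons"] - 32) ** 2
            + abs(params["learning_rate"] - 0.01))


best_params, best_score = grid_search(grid, toy_validation_mse)
```

The cost grows with the product of the list lengths (27 combinations here), which is why the search ranges are usually kept small.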

Analysis of results

Figure 9 illustrates a comparison between the original and predicted values by the DFFNN. It can be seen how the actual and predicted values are quite similar and how the forecast has been able to capture the seasonal component of the original series and differentiate the behavior between weekdays and weekends.

Fig. 9 Actual vs predicted values

Furthermore, the evolution of the MSE loss function in the training and validation phases indicates that the model obtained does not have significant overfitting or underfitting, as illustrated in Fig. 10.

Fig. 10 Evolution of the loss function versus the number of epochs for the training and validation sets

Figure 11 shows the correlation between the actual values and forecasted values obtained by the DFFNN model for the test set. An \(R^2\) value of 0.987 is displayed, showing how good the predictions are.

Fig. 11 Correlation between actual and predicted values for the test set

Table 4 presents the largest errors obtained by the DFFNN model for the test set, ordered from largest to smallest. It can be observed that three errors correspond to the early morning of 24 March 2020. However, water consumption was higher than usual for those hours, possibly due to a breakdown or some other incident, as shown in Fig. 12.

Table 4 Maximum absolute errors for the DFFNN model

Fig. 12 Actual vs predictions for the days from March 22 to March 25, 2020

The residuals are the difference between the time series and the predictions obtained by the forecasting model for the training set. Uncorrelated residuals with a mean of zero indicate that the forecasting method is able to model most of the information available in the original data. This does not ensure that the model performs well when predicting the test set, but it suggests that there is little room for improvement with the available information. On the other hand, even if these conditions are not met, the model can still provide predictions that satisfy the expectations according to the error metrics, depending on the application under study. Figure 13 shows the residual errors obtained by the DFFNN model. From the autocorrelation function, it can be observed that the residuals of the predictions behave like white noise. Most of the autocorrelation values are low, below the 95% (solid line) and 99% (dotted line) confidence bands.

Fig. 13 Autocorrelation function of the residuals for the DFFNN model
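The residual diagnosis can be sketched by computing the sample autocorrelation function and comparing it with the usual large-sample white noise bands, \(\pm 1.96/\sqrt{n}\) for 95% and \(\pm 2.576/\sqrt{n}\) for 99%. The band formula is the standard approximation and is an assumption about how the bands in the figure were drawn; the residuals below are synthetic and deliberately not white noise, so that the check has something to flag.

```python
import math


def autocorrelation(x, max_lag):
    """Sample autocorrelation of x for lags 0..max_lag."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    denom = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
            for k in range(max_lag + 1)]


def white_noise_band(n, z):
    """Half-width of the approximate confidence band for white noise."""
    return z / math.sqrt(n)


# Synthetic alternating residuals: strongly autocorrelated on purpose.
residuals = [1.0 if t % 2 == 0 else -1.0 for t in range(100)]
acf = autocorrelation(residuals, max_lag=5)
band_95 = white_noise_band(len(residuals), 1.96)
band_99 = white_noise_band(len(residuals), 2.576)
significant_lags = [k for k in range(1, len(acf)) if abs(acf[k]) > band_95]
```

For residuals that behave like white noise, only a small fraction of lags would be expected to fall outside the 95% band.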

Comparison with benchmarking methods

In order to compare the performance of the proposed DFFNN model with other possible forecasting techniques, six methods are considered: K Nearest Neighbors (KNN), Random Forest (RF), Extreme Gradient Boosting (XGBoost), a Seasonal Autoregressive Integrated Moving Average (SARIMA) model and two baseline models.

The KNN has been successfully applied to obtain predictions of energy consumption in recent years. The prediction is a weighted linear combination of the time series values that follow, in time order, the nearest neighbors, where the weights depend on the distance of the neighbors to the past values (Talavera-Llames et al. 2019). In this work, the Manhattan distance has been used to find the neighbors and a single nearest neighbor has been considered.
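With a single neighbor, the weighted combination reduces to copying the values that followed the most similar past window. The sketch below illustrates this idea under the Manhattan distance; it is a simplified reading of the description, not the implementation of Talavera-Llames et al. (2019), and the function names and toy series are hypothetical.

```python
def manhattan(a, b):
    """Manhattan (L1) distance between two equally long windows."""
    return sum(abs(x - y) for x, y in zip(a, b))


def nearest_neighbor_forecast(history, window, horizon):
    """Forecast `horizon` values by copying what followed the closest past window."""
    query = history[-window:]
    best_start, best_dist = None, float("inf")
    # Candidates must be followed by `horizon` known values, which also
    # excludes the query window itself.
    for start in range(len(history) - window - horizon + 1):
        dist = manhattan(history[start:start + window], query)
        if dist < best_dist:
            best_start, best_dist = start, dist
    following = best_start + window
    return history[following:following + horizon]


# Toy periodic series ending part-way through a cycle.
history = [0, 1, 2, 3, 4] * 6 + [0, 1, 2]
forecast = nearest_neighbor_forecast(history, window=3, horizon=2)
```

For the water demand problem, `horizon` would be 24 and `window` would be the historical window length selected during tuning.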

RF and XGBoost are two methods based on ensembles of trees, but their training processes are very different. XGBoost trains one tree at a time, while RF can train multiple trees in parallel. After extensive experimentation, 18 trees of maximum depth 14 have been used for RF and 200 trees of maximum depth 2 for XGBoost.

The baseline models are based on a persistence algorithm, i.e., the prediction for a future time instant takes the value observed at a previous instant, which exploits the high correlation between them. In some works, this approximation is also known as seasonal naive (Livera et al. 2011). From the consumption patterns and correlation plots, a similarity between the measurement at instant t and the same instant of the previous day or week can be seen. Mathematically, the prediction is computed as follows:

$$\begin{aligned} \hat{y}_t= & {} y_{t-144} \end{aligned}$$
(6)
$$\begin{aligned} \hat{y}_t= & {} y_{t-1008} \end{aligned}$$
(7)

where \(\hat{y}_t\) denotes the predicted value at instant t and \(y_{t-144}\) and \(y_{t-1008}\) the actual values of the time series at the same time instant of the previous day or week, respectively.
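The lags in Eqs. (6) and (7) follow from the 10 min sampling period: 144 samples per day and 1008 per week. A minimal sketch of both baselines, with a hypothetical function name and a synthetic series, is:

```python
SAMPLES_PER_DAY = 24 * 60 // 10          # 144 samples of 10 min
SAMPLES_PER_WEEK = 7 * SAMPLES_PER_DAY   # 1008 samples of 10 min


def persistence_forecast(series, t, lag):
    """Eqs. (6) and (7): the prediction at instant t is the value observed `lag` samples earlier."""
    if t - lag < 0:
        raise ValueError("not enough history for this lag")
    return series[t - lag]


# Synthetic series whose value equals its index, for illustration only.
series = list(range(2000))
previous_day = persistence_forecast(series, 1500, SAMPLES_PER_DAY)
previous_week = persistence_forecast(series, 1500, SAMPLES_PER_WEEK)
```

Since the forecast horizon is only 24 samples, both lags always point to values that are already known when the prediction is made.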

On the other hand, the performance of the DFFNN model has been compared with a classical time series model, in particular, the SARIMA model. This model has been successfully used in a large number of practical problems and offers a high interpretability of the results, being also able to obtain well-defined confidence intervals for the predictions (Arunraj et al. 2016). Its main disadvantage is that it can only extract the linear relationships present in the time series. SARIMA is an extension of the ARIMA model for univariate time series, which also includes a seasonal component. For this reason, SARIMA is of special interest for time series that exhibit periodic characteristics, such as the time series of water consumption. The SARIMA model has 7 hyperparameters: p, d and q for the autoregressive, differencing and moving average components, respectively, P, D and Q for these same components of the seasonal part, and finally, a value m giving the number of samples in a single seasonal period. As in the case of the DFFNN model, a grid search has been used to find the best SARIMA model configuration. The metric used has been the Akaike information criterion (AIC), which allows the performance of different statistical models to be compared. The AIC value is lower when the model output is more similar to the data, but it also adds a penalty term depending on the number of parameters in the model in order to avoid overfitting. Therefore, a lower value of the AIC indicates a better model fit.
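For reference, the AIC is commonly written as follows, where k (the number of estimated model parameters) and \(\hat{L}\) (the maximized value of the likelihood function of the model) are symbols introduced here and not taken from the original text:

$$\begin{aligned} \text {AIC} = 2k - 2\ln (\hat{L}) \end{aligned}$$

The first term is the penalty for model complexity and the second rewards goodness of fit, which is why the configuration with the lowest AIC is preferred.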

Table 5 shows the optimal values for the hyperparameters of the SARIMA model.

Table 5 Hyperparameter search for SARIMA model

Figure 14 shows the prediction made by the SARIMA model for the week of October 8–13, 2019, including the 95% confidence interval (shaded in grey colour). A certain similarity with the real series can be observed, but the error is significant at some time points. Even so, the model has been able to capture a good part of the seasonality of the water consumption.

Fig. 14 Actual versus predicted values by the SARIMA model for the week from October 8 to October 13, 2019

For the SARIMA model, the mean of the residuals is practically zero, but the residuals show significant correlations, as can be seen in the correlogram and histogram in Fig. 15. This indicates that the model has not captured all the structure present in the series, so very accurate predictions cannot be expected from it.

Fig. 15 Diagnosis of the residuals for the SARIMA model

Table 6 shows the average MAPE, MAE, RMSE and \(R^2\) obtained when predicting the test set over a total of 10 runs. The DFFNN model provides the best performance. The second best method is the RF, although its error is 0.7% above that of the DFFNN. The persistence models have the advantage of great simplicity, although they obtain greater errors than the DFFNN model and all other methods except the SARIMA model. The SARIMA model does not improve on the performance of the persistence models, which confirms, once again, that a more complex model does not necessarily give better results. In addition, the predictions obtained by the DFFNN model vary within a small interval, as the standard deviation is low.

Table 6 Errors for DFFNN, KNN, RF, XGBoost, SARIMA and persistence models for the test set
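For reference, the four error measures reported in Table 6 can be computed as in the following sketch. It uses the usual textbook definitions, which are assumed here and may differ in detail from the ones adopted in this work; the function name `forecast_errors` and the sample values are hypothetical.

```python
import math
from typing import Dict, Sequence


def forecast_errors(actual: Sequence[float], predicted: Sequence[float]) -> Dict[str, float]:
    """Compute MAE, MAPE (in %), RMSE and R^2 for paired actual/predicted values.

    Assumes that no actual value is zero (needed for MAPE) and that the
    actual values are not all identical (needed for R^2).
    """
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    ss_res = sum(e * e for e in errors)              # residual sum of squares
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return {
        "MAE": sum(abs(e) for e in errors) / n,
        "MAPE": 100.0 * sum(abs(e / a) for e, a in zip(errors, actual)) / n,
        "RMSE": math.sqrt(ss_res / n),
        "R2": 1.0 - ss_res / ss_tot,
    }


# Illustrative values only, not taken from the dataset.
scores = forecast_errors([100.0, 200.0, 300.0], [110.0, 190.0, 300.0])
```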

In order to increase the confidence in the results, a statistical significance test has been used, in particular, the Wilcoxon test (García et al. 2010). The Wilcoxon test is nonparametric, i.e., it does not assume a specific distribution of the data, and it is suitable for use with paired results. The null hypothesis of the Wilcoxon test is that the results being compared come from the same population and, therefore, have the same statistical parameters. In this study, a value of 0.05 has been considered as the level of significance \(\alpha \). If the p-value obtained from the test set is less than \(\alpha \), it can be concluded that the distributions of the results are different and, therefore, that the observed differences are not random, i.e., the differences between forecasting methods are statistically significant. Table 7 shows the p-values obtained for the MAPE in the Wilcoxon test. The p-values have been adjusted using the Holm procedure. Similar results were obtained for the MAE, RMSE and \(R^2\). Note that, since multiple testing has been applied, the Bonferroni correction is necessary: a difference is statistically significant if the p-value is less than \(0.05/21 \approx 0.0024\), as 21 comparisons of paired samples are made. It can be observed that the DFFNN presents significant differences with all forecasting methods according to the p-values. KNN does not present significant differences with XGBoost or the 1-week based baseline, nor does SARIMA with the 1-day based baseline.

Table 7 Statistical tests for the MAPE for all algorithms
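The multiple-testing arithmetic mentioned above can be sketched as follows. The sketch covers only the Bonferroni threshold for all pairwise comparisons of 7 methods and the standard Holm step-down adjustment; the Wilcoxon test itself is not reproduced here. The function names and the sample p-values are assumptions made for the example and are not taken from Table 7.

```python
from typing import List, Sequence


def bonferroni_threshold(alpha: float, n_methods: int) -> float:
    """Per-comparison significance level for all pairwise comparisons."""
    n_comparisons = n_methods * (n_methods - 1) // 2  # 7 methods -> 21 pairs
    return alpha / n_comparisons


def holm_adjust(p_values: Sequence[float]) -> List[float]:
    """Holm step-down adjustment, returned in the original order.

    The i-th smallest p-value (i = 0, 1, ...) is multiplied by (m - i),
    capped at 1, and forced to be no smaller than the previous adjusted value.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


threshold = bonferroni_threshold(0.05, 7)
adjusted = holm_adjust([0.01, 0.04, 0.03])  # illustrative p-values only
```

Holm-adjusted p-values are compared directly with \(\alpha \), whereas under the Bonferroni correction the unadjusted p-values are compared with the reduced threshold.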

Application: Anomaly detection

The predictions obtained with the DFFNN can be used to detect anomalies in water consumption. The methodology consists of identifying the values of the time series that differ significantly from the prediction made by the DFFNN. For this purpose, a band is defined by a lower and an upper margin around the prediction obtained by the DFFNN. When the values of the original series fall outside this band, they are flagged as possible anomalous values or outliers.

Figure 16 shows the prediction of the DFFNN model along with an upper and lower margin of 15%. This methodology flags the water consumption values recorded in the early morning of March 24, 2020 as possible outliers, as also shown in Fig. 12.
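The band-based rule can be sketched as follows. This is a minimal sketch that assumes the 15% margin is applied multiplicatively to each predicted value and that the predictions are positive; the function name `flag_anomalies` and the sample values are hypothetical.

```python
from typing import List, Sequence


def flag_anomalies(
    actual: Sequence[float],
    predicted: Sequence[float],
    margin: float = 0.15,
) -> List[int]:
    """Return the indices where the actual value lies outside the band
    [predicted * (1 - margin), predicted * (1 + margin)].

    Assumes positive predictions, as expected for water consumption.
    """
    if len(actual) != len(predicted):
        raise ValueError("inputs must have the same length")
    flagged = []
    for t, (y, y_hat) in enumerate(zip(actual, predicted)):
        lower = y_hat * (1.0 - margin)
        upper = y_hat * (1.0 + margin)
        if y < lower or y > upper:
            flagged.append(t)
    return flagged


# Illustrative values only: the second and third samples leave the band.
outliers = flag_anomalies([100.0, 116.0, 84.0, 110.0], [100.0, 100.0, 100.0, 100.0])
```

A wider margin reduces false alarms at the cost of missing smaller deviations, so its value should be chosen according to the typical prediction error.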

Fig. 16 Detection of anomalous water consumption

Conclusions

In this paper, the DFFNN, a deep learning approach based on feed-forward neural networks, has been proposed to forecast water consumption in the short term. A grid search has been carried out in order to tune the multiple hyperparameters involved in the performance of the DFFNN, and an evaluation methodology based on growing windows has been introduced in order to preserve the temporal order of the time series. Prediction results have been reported using a dataset of water consumption in the city of Murcia in Spain. The proposed DFFNN method has been evaluated according to the MAE, MAPE, RMSE and \(R^2\), yielding an average error close to 3%. The comparison results show that the DFFNN model significantly improves the forecasting performance compared with the KNN, RF, XGBoost and SARIMA methods and with two persistence models. The statistical significance of the results of the DFFNN model has been assessed through the Wilcoxon signed-rank test, showing p-values smaller than 0.05 for all the paired comparisons involving the DFFNN.

Future work will be directed towards developing other types of deep neural networks, applying transfer learning from other fields such as electricity consumption, and making predictions for medium- and long-term horizons.