The traditional statistical and benchmark methods presented in Sect. 9.1 often assume some relatively simple relationship between the dependent and independent variables, be that linear trends, particular seasonalities or autoregressive behaviours. They have performed successfully for load forecasting, producing accurate forecasts even with small amounts of data, and can easily be interpreted by practitioners. However, the methods described in Sect. 9.1 may be less suitable for modelling more complex and highly nonlinear relationships. As data has become more ubiquitous due to increased monitoring, machine learning methods are becoming increasingly common as they can find complicated and subtle patterns in the data.

Recall Eqs. 5.27 and 5.28, which defined the functional forms of the 1-step and m-step ahead forecasting problems. They describe the relationship of the load m steps ahead, \({L}_{n+m}\), for forecast origin at time step n, with autoregressive features \(L_1, \ldots , L_n\), explanatory features \(Z_1, \ldots , Z_k\) and a function f. As explained in Sect. 4.2, this function f can be learned from training data, i.e., the load forecasting task can be modelled as a supervised learning task, where a machine learning model is trained to learn the possibly complex relationship of the load with some features. As the load is typically expressed as a numeric value, load forecasting is in most cases a regression problem.

The following sections introduce a few popular machine learning methods that can be used for time series forecasting. Section 10.1 introduces k-nearest neighbour regression (k-NN), a relatively simple model that can, together with multiple linear regression, function as a good benchmark model for datasets which are not too large. Support vector regression (Sect. 10.2) was popular in the early 2000s as it can model nonlinear relationships and provide accurate forecasts, although only on relatively small datasets. Tree-based ensemble models like random forest regression and gradient-boosted regression trees (Sect. 10.3) are powerful, robust models that often perform very well on structured data and are therefore strong contenders for many practical time series problems, even with complex relationships between many independent variables. They also scale well to many observations.

However, as in many other domains, artificial neural networks have become increasingly popular, including for time series tasks and, in particular, load forecasting. While regular feed-forward neural networks are relatively capable, recurrent neural networks and their more sophisticated deep variants like the long short-term memory (LSTM) and gated recurrent unit (GRU) have also been successful for time series tasks since they are able to model the autoregressive relationships (Sect. 10.5). More recently, convolutional neural networks (CNN) have also provided state-of-the-art results and been used in favour of recurrent architectures as they can be trained more efficiently. This makes such architectures interesting, especially for large time series data sets, e.g., when training on smart meter data for many consumers and distribution-level networks.

This book will only briefly give an overview of more recent developments like transformer networks and specifically designed neural network architectures that have shown promising results. However, as those are most relevant in research, we omit the details and refer the interested reader to some of the core literature.

We note that the machine learning models mentioned above can be used for both regression and classification tasks. However, load forecasting is most typically a regression task, and therefore their functionality is explained within the regression context, which may differ from descriptions found elsewhere. For instance, in k-nearest neighbours or random forests, the predictions are averaged in the regression case, whereas majority voting may be used in the classification case. In artificial neural networks, the difference lies in the loss function (i.e., mean square error in regression vs cross entropy loss in classification) and the activation function of the final layer (i.e., linear activation in regression vs the softmax function in classification).

10.1 k-Nearest Neighbour Regression

k-nearest neighbour regression (k-NN) and multiple linear regression (see Sect. 9.3) are often considered the two simplest supervised learning methods. Linear regression can be considered a high bias model as it places strong assumptions on the linear relationship of the variables and the distributions of residuals. In contrast, k-NN makes no parametric assumptions and is therefore considered a low bias model. However, it is worth noting that the level of bias depends on the data, the choice of parameters (and hyperparameters), and how well a model captures the underlying relationships.

The basic algorithm makes a prediction by finding the k instances in the training data set that are most similar to the instance for which the prediction is made. So consider a training data set \(\mathcal {X}=\left\{ (\textbf{x}_1,\textbf{y}_1), (\textbf{x}_2,\textbf{y}_2), \ldots , (\textbf{x}_N,\textbf{y}_N) \right\} \) and a new instance \(\textbf{x}'\) for which a prediction should be made. Then k-NN looks for the k instances \((\textbf{x}''_1,\textbf{y}''_1), (\textbf{x}''_2,\textbf{y}''_2), \ldots , (\textbf{x}''_k,\textbf{y}''_k) \in \mathcal {X}\), where the \(\textbf{x}''_i\) are most similar to \(\textbf{x}'\) according to some distance measure. In a regression problem, the prediction \(\hat{\textbf{y}}\) is the average of \(\textbf{y}''_1, \textbf{y}''_2, \dots , \textbf{y}''_k\), while in classification it is the majority vote. The fact that the distance measure and the aggregation function can be chosen flexibly, depending on the circumstances and the application, makes k-NN an extremely versatile method.
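To make these two steps concrete, below is a minimal NumPy sketch of k-NN regression (an illustration, not a reference implementation); the synthetic training data and the choice of the Euclidean distance are assumptions made for the example.

```python
import numpy as np

def knn_regress(X_train, y_train, x_new, k=5):
    """Predict the target of x_new as the mean of the k nearest targets."""
    # Euclidean distance between x_new and every training instance
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances, i.e. the k nearest neighbours
    nearest = np.argsort(dists)[:k]
    # Regression: aggregate the neighbours' targets by the arithmetic mean
    return y_train[nearest].mean()

# Illustrative usage with synthetic data
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))   # 100 instances, 3 features
y_train = X_train.sum(axis=1) + rng.normal(scale=0.1, size=100)
print(knn_regress(X_train, y_train, rng.normal(size=3), k=5))
```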

Fig. 10.1
figure 1

A simple illustration of k-Nearest Neighbours. The k closest points defined by the red crosses are averaged to produce the prediction (black cross)

In a naïve implementation, a prediction is made by calculating the distance between the test data and all the training points. The k training points closest to the test data according to the similarity measure are then selected. However, in practice, efficient data structures like ball trees make it unnecessary to compare the test data to all training data. Note that some of these techniques require certain properties of the distance measure, like some of the metric properties (see the discussion below). A simple example of a k-NN for an energy profile is shown in Fig. 10.1. Here, similarity is how close the points are within the day; the k points used to estimate one period (black cross) lie between the vertical blue lines.
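In practice, library implementations handle this neighbour search. A hedged example using scikit-learn (assuming it is available), where the ball-tree structure avoids the brute-force comparison; `X_test` is assumed to hold new instances:

```python
from sklearn.neighbors import KNeighborsRegressor

# Ball trees speed up the neighbour search for metric distance measures
model = KNeighborsRegressor(n_neighbors=5, algorithm="ball_tree")
model.fit(X_train, y_train)     # training data as in the sketch above
y_pred = model.predict(X_test)  # X_test: new instances (assumed given)
```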

Fig. 10.2
figure 2

Effect of the value of k on the k-nearest neighbour estimate

The parameter k is the most important hyperparameter to tune; it controls the under- and over-fitting of the model. Choosing k too small may cause overfitting, since the prediction is based on only a few data points. If k is too large, the estimate is based on too many observations and may therefore underfit. An illustration of the effect of this hyperparameter is shown in Fig. 10.2 for different values of k. The larger the k, the smoother the fit, but also the higher the bias; notice also that the peaks are less well approximated.

As the algorithm relies on a distance metric, it is important to normalise the data, as otherwise the results may depend on the scale of the features (e.g. producing different predictions if the temperature is in \(^{\circ }\)C or \(^{\circ }\)F, or the load in W or kW). Therefore, the choice of the normalising procedure is, especially for k-NN, an important design choice. For a discussion on normalisation techniques, see Sect. 6.1.3.
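As a brief, hedged illustration, scikit-learn's pipeline utilities can couple normalisation and k-NN so that the scaler is fit on the training data only:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor

# Standardise each feature to zero mean and unit variance before k-NN,
# so a feature's scale (e.g. °C vs °F) does not dominate the distance
model = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=10))
model.fit(X_train, y_train)
```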

As discussed above, there are two main steps of k-nearest neighbour regression: determining the most similar instances and combining the corresponding targets. Both steps can be seen as design decisions of the algorithm and varied for specific applications. For the first step, it is often useful to explore the usage of different distance measures. As a default, k-NN uses the Euclidean distance and combines the selected targets using the arithmetic mean. The Euclidean distance is defined as the square root of the sum of squared differences of the elements, i.e., the 2-norm introduced in Sect. 7.1. The arithmetic mean is a natural choice for the Euclidean distance since, for a finite sample, it minimises the sum of squared distances (Sect. 8.2.1). However, in certain applications, the medoid, a representative from the sample which has the minimal sum of squared distances to all the other points, can be a reasonable alternative.

The Euclidean distance is a lockstep or “point-wise” distance measure, since it measures the distance between individual elements of the input sequences before aggregating them. In time series, this means that the evaluation is performed by matching values at the same time step. In contrast, the group of so-called elastic distance measures works by first aligning the time series in the temporal domain so that the overall cost of this alignment, in terms of some cost function, is minimal. This property can be useful when working with load profiles in the low-voltage grid that exhibit high volatility, to avoid the double-penalty effect (see Sect. 13.3 for a special elastic distance measure).

k-NN allows the use of arbitrary distance measures for finding the most similar neighbours. However, if the distance is not a metric, this search may need to resort to brute force and hence may not scale well with larger datasets. In order for a distance measure to be a metric, it must have the following properties:

  • Non-negativity: \(\text {D}(\textbf{X},\textbf{Y}) \ge 0\)

  • Identity of indiscernibles: \(\text {D}(\textbf{X},\textbf{Y}) = 0 \Leftrightarrow \textbf{X} = \textbf{Y}\)

  • Symmetry: \(\text {D}(\textbf{X},\textbf{Y}) = \text {D}(\textbf{Y},\textbf{X})\)

  • Triangle inequality: \(\text {D}(\textbf{X},\textbf{Y}) \le \text {D}(\textbf{X},\textbf{Z}) + \text {D}(\textbf{Z},\textbf{Y})\)

Recall that the p-norms in Sect. 7.1 are all metrics.

Many algorithmic improvements to speed up similarity search rely on metric properties, most importantly the triangle inequality. Computing a sample mean for an arbitrary metric is often intractable. Hence, approaches resort to approximate solutions or use the medoid instead of the sample mean. For large datasets, subsets of the training set may be used to reduce computational costs.

The most popular elastic distance measure is dynamic time warping (DTW). It was first introduced for speech recognition and has been shown to perform well on many datasets.Footnote 1 It is considered an elastic measure, as it finds an optimal alignment between two time series by stretching or “warping” them, minimising the total distance between the aligned points. Figure 10.4 shows such an optimal alignment of two time series \(\textbf{X}\) and \(\textbf{Y}\). It maps the first peak of the top profile to the peak of the same height in the bottom profile. Then the second smaller peak is aligned with the peak of the same height occurring later. In contrast, Fig. 10.3 shows the “point-wise” Euclidean distance.

DTW can be recursively defined by:

$$\begin{aligned} \text {DTW}(X_{:i}, Y_{:j}) = \text {D}(X_i, Y_j) + \min \left\{ \begin{array}{l} \text {DTW}\left( X_{:i-1}, Y_{:j-1}\right) , \\ \text {DTW}\left( X_{:i}, Y_{:j-1}\right) , \\ \text {DTW}\left( X_{:i-1}, Y_{:j}\right) \end{array} \right\} \end{aligned}$$
(10.1)

Then the DTW distance between time series \(\textbf{X}\) and \(\textbf{Y}\) is

$$\begin{aligned} \text {DTW}(\textbf{X}, \textbf{Y})=\text {DTW}(X_{:T}, Y_{:T}). \end{aligned}$$
(10.2)

As mentioned above, \(\text {D}\) can be any distance function but is generally the Euclidean distance. A naive recursive implementation would lead to exponential run time. The most popular deterministic approach is an algorithm based on dynamic programming with quadratic run time, i.e., it scales quadratically with the length of the input. The optimal solution of the DTW algorithm can be represented by the warping path, the path along the cost matrix that contains the cost of each pair of aligned points \(X_i\) and \(Y_j\) (i.e. the entry in the \(i^{th}\) row and \(j^{th}\) column of the matrix is the distance between the elements \(X_i\) and \(Y_j\)). Figure 10.6 (left) shows the cost matrix for aligning each element of the vectors \({\textbf {X}}\) and \({\textbf {Y}}\) shown in Figs. 10.3, 10.4 and 10.5. The black line shows the warping path.
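The following is a minimal dynamic-programming sketch of Eq. (10.1) in NumPy (an illustration, not the implementation behind the figures); the absolute difference is used as the point-wise distance D, and the optional `window` argument implements the Sakoe-Chiba constraint discussed below:

```python
import numpy as np

def dtw(x, y, window=None):
    """Dynamic-programming DTW between 1-D series x and y.

    If window is given, cells with |i - j| > window are excluded,
    which yields the constrained variant cDTW (Sakoe-Chiba band).
    """
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)   # cost matrix, padded borders
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        lo, hi = 1, m
        if window is not None:
            lo, hi = max(1, i - window), min(m, i + window)
        for j in range(lo, hi + 1):
            d = abs(x[i - 1] - y[j - 1])     # point-wise distance D
            # Eq. (10.1): extend the cheapest of the three predecessors
            cost[i, j] = d + min(cost[i - 1, j - 1],   # match
                                 cost[i, j - 1],       # shift in y
                                 cost[i - 1, j])       # shift in x
    return cost[n, m]

x = np.array([0.0, 1.0, 3.0, 1.0, 0.0])
y = np.array([0.0, 0.0, 1.0, 3.0, 1.0])
print(dtw(x, y), dtw(x, y, window=1))        # unconstrained vs banded
```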

Fig. 10.3
figure 3

Euclidean distance, no alignment, \(\text {ED}({\textbf {X}}, {\textbf {Y}})=7.68\)

Fig. 10.4
figure 4

Optimal alignment for DTW distance, \(\text {DTW}({\textbf {X}}, {\textbf {Y}})=3.00\)

Fig. 10.5
figure 5

Optimal alignment for cDTW distance with \(c=3\), \(\text {cDTW}({\textbf {X}}, {\textbf {Y}}; 3)=3.32\)

Fig. 10.6
figure 6

Cost matrix for aligning \({\textbf {X}}\) and \({\textbf {Y}}\) with the optimal warping path (left), and the cost matrix constrained by the Sakoe-Chiba band with its warping path (right)

As DTW is popular, many adjustments have been proposed. The most common adaptation is the introduction of a constraint that limits the alignment in the cost matrix to be within some radius r of the diagonal, often referred to as the Sakoe-Chiba band. This version is referred to as constrained DTW, or cDTW. Figure 10.6 (right) shows the constrained cost matrix with \(r=3\) and the resulting warping path. Figure 10.5 shows the associated optimal alignment of the constrained DTW. In this case, the early peaks are aligned, instead of the smaller peak being aligned to the later one of the same height. This results in a slightly larger distance in the example given (\(\text {cDTW}(\textbf{X}, \textbf{Y})=3.32\) versus \(\text {DTW}(\textbf{X}, \textbf{Y})=3.00\)). Due to its popularity and consistent effectiveness, DTW is often a default choice of benchmark for many problems. In particular, DTW may be one choice for household level forecasts, as will be illustrated in more detail in Sect. 13.3.

10.2 Support Vector Regression

Support Vector Regression (SVR) is a popular machine learning method used for time-series prediction. To begin, consider a time series \(L_1, L_2, \ldots \) and n explanatory time series variables \(X_{1,t}, X_{2, t}, \ldots , X_{n, t}\) at each time step t which are related to the load, \(L_t\), via a simple multiple linear model, as introduced in Sect. 9.3, described by

$$\begin{aligned} \hat{L}_{N+1} = \sum _{k=1}^{n} \beta _k X_{k,N+1}+b, \end{aligned}$$
(10.3)

for some constant b. For simplifying the notation write this in the matrix-vector form

$$\begin{aligned} \hat{L}_{N+1} = \boldsymbol{\beta }^T \textbf{X}_{N+1} + b, \end{aligned}$$
(10.4)

where \(\textbf{X}_{t} = (X_{1,t}, X_{2, t}, \ldots , X_{n, t})^T\) and \(\boldsymbol{\beta }= (\beta _1, \ldots , \beta _n)^T\) are the vectors of explanatory variables and regression parameters respectively.

A standard way to find the relevant parameters, \(\boldsymbol{\beta }\), is to minimise the least squares difference between the model and the observations, i.e.

$$\begin{aligned} \hat{\boldsymbol{\beta }} = {\arg \min }_{\boldsymbol{\beta }\in \mathcal {B}} \sum _{t=1}^N (L_{t} - \boldsymbol{\beta }^T \textbf{X}_{t}-b)^2. \end{aligned}$$
(10.5)

This can be extended to a regularised form such as LASSO or ridge regression, as shown in Sect. 8.2.4, where an additional term \(||\boldsymbol{\beta }||_p\) is added to minimise the overall size of the coefficients.

In contrast, support vector regression (SVR) fits the linear model such that all observations have errors within a certain threshold, \(\epsilon \ge 0\). This can be expressed as

$$\begin{aligned} -\epsilon \le L_{t} - \sum _{k=1}^{n} \beta _k X_{k, t} -b \le \epsilon . \end{aligned}$$
(10.6)

The aim of SVR is to minimise the parameter size

$$\begin{aligned} \frac{1}{2}||\boldsymbol{\beta }||^2_2, \end{aligned}$$
(10.7)

subject to the constraint (10.6). So in SVR the different sizes of the errors don’t matter as long as they are within a certain threshold. The optimisation in Eq. (10.7) maximises the ‘flatness’ of the model, i.e., minimises its complexity, and is comparable to the approach of ridge regularisation for linear least squares models (see Sect. 8.2.4).

Fig. 10.7
figure 7

Example of \(\epsilon \) threshold region for model (black line), threshold bounds (dashed line) and observations (red crosses)

To illustrate support vector regression, consider a simple 1-dimensional example for a particular \(\epsilon \)-precision value as shown in Fig. 10.7. In some cases there may be no points which can be approximated with \(\epsilon \) precision, or an allowance for larger errors may be desired. In these cases slack variables, \(\xi \), can be added to make the problem feasible and accept larger errors. In this updated form the aim is to minimise the cost function:

$$\begin{aligned} \frac{1}{2}||\boldsymbol{\beta }||^2_2 + C \sum _{t=1}^N(\xi _t + \xi _{t}^*), \end{aligned}$$
(10.8)

with respect to \(\boldsymbol{\beta }\),

$$\begin{aligned} \text {Subject to} {\left\{ \begin{array}{ll} L_{t} - \sum _{k=1}^{n} \beta _k X_{k,t} - b \le \epsilon + \xi _t, &{} \\ \sum _{k=1}^{n} \beta _k X_{k,t}+b - L_{t} \le \epsilon + \xi ^*_t, &{} \\ \xi _t, \xi _t^* \ge 0. \end{array}\right. } \end{aligned}$$
(10.9)

The constant \(C >0\) is a trade-off between maximising the flatness and minimising the allowable deviation beyond \(\epsilon \). Both C and \(\epsilon \) must be found to implement SVR for a linear model. The optimal parameters can be found via cross-validation by testing a variety of values over the validation set (Sect. 8.1.3).
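As a hedged sketch, scikit-learn provides a linear SVR whose C and \(\epsilon \) can be tuned on a validation set; the parameter values below are illustrative only:

```python
from sklearn.svm import LinearSVR

# epsilon: half-width of the insensitive tube; C: trade-off constant
model = LinearSVR(epsilon=0.1, C=1.0, max_iter=10_000)
model.fit(X_train, y_train)     # training data assumed prepared as before
y_pred = model.predict(X_test)
```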

Often the linear SVR problem is solved more easily in its dual form, in which case the forecast model can be shown to be of the form

$$\begin{aligned} \hat{L}_{N+1}=f(\textbf{X}) = \sum _{t=1}^N \alpha _t <\textbf{X}_{t}, \textbf{X}> + b, \end{aligned}$$
(10.10)

where \(<,>\) represents an inner-product function (for example the dot product) and \(\alpha _t\) are coefficients derived from the Lagrange multipliers of the optimisation (see [1] for more details).

An advantage of SVR is that it can also be extended to nonlinear regressions. This is achieved by mapping the input features to a higher dimensional space using a transformation function, \(\Phi \). In the nonlinear case, the following transformed multiple linear equation is considered

$$\begin{aligned} \hat{L}_{N+1} = \boldsymbol{\beta }^T \Phi (\textbf{X}_{N+1}) + b. \end{aligned}$$
(10.11)

Solving this problem in practice only requires knowing the kernel function \(K(\textbf{X}_i, \textbf{X}_j) = <\Phi (\textbf{X}_{i}), \Phi (\textbf{X}_{j})>\), where \(<,>\) represents an inner product as before (see references in Appendix D for more details). As with the linear form, the final forecast model can be written in the dual form

$$\begin{aligned} \hat{L}_{N+1}=f(\textbf{X}) = \sum _{t=1}^N \alpha _t K(\textbf{X}_{t}, \textbf{X}) + b. \end{aligned}$$
(10.12)

There are several kernels that can be chosen. Some of the most popular are the Gaussian Radial Basis Function (RBF) given by

$$\begin{aligned} K(\textbf{X}_i, \textbf{X}_j) = \exp \left( -\gamma ||\textbf{X}_i -\textbf{X}_j||^2 \right) , \end{aligned}$$
(10.13)

and the polynomial given by

$$\begin{aligned} K(\textbf{X}_i, \textbf{X}_j) = (1+<\textbf{X}_i, \textbf{X}_j>)^p, \end{aligned}$$
(10.14)

where p is the order of the polynomial. The larger the order, the more flexibility in the regression fit. As usual, choosing the best model and parameters is achieved by comparison on the validation set. The power of the kernel method is that a nonlinear problem has essentially been transformed into a linear problem, simply by transforming the original variables.
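A hedged example of selecting the kernel and its parameters via cross-validation with scikit-learn; the grid values are illustrative assumptions:

```python
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

param_grid = {
    "kernel": ["rbf", "poly"],
    "C": [0.1, 1.0, 10.0],        # trade-off constant in Eq. (10.8)
    "epsilon": [0.01, 0.1],       # width of the insensitive region
    "gamma": ["scale", 0.1],      # RBF coefficient in Eq. (10.13)
}
search = GridSearchCV(SVR(), param_grid, cv=5,
                      scoring="neg_mean_squared_error")
search.fit(X_train, y_train)
print(search.best_params_)        # best combination found on the folds
```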

10.3 Tree-Based Regression Methods

10.3.1 Decision Tree Regression

Used on their own, decision trees are not particularly accurate and have limited usefulness for forecasting. However, when multiple decision trees are taken together they produce some of the most powerful and accurate machine learning models. This includes random forest (see Sect. 10.3.2), bagging methods and gradient boosted decision trees (see Sect. 10.3.3). Decision trees can be used either to classify discrete/categorical data, or to regress on continuous data. The latter is of most interest for load forecasting and is discussed in this and the next couple of sections.

As in Sect. 10.4, the aim is to learn a function \(f:\textbf{X} \longrightarrow \mathbb {R}\) based on M inputs \((X_{1, t}, X_{2, t}, \ldots , X_{M, t})^T \in \textbf{X}\) at time t which predicts the load \(L_{N+1}\) at time \(t=N+1\),

$$\begin{aligned} \hat{L}_{N+1|N} = f(\{X_{1, t}, \ldots , X_{M, t}\}, \boldsymbol{\beta }), \end{aligned}$$
(10.15)

where \(\boldsymbol{\beta }\) denotes the parameters necessary for defining the decision tree. As in previous sections, the inputs \(X_{1, t}, X_{2, t}, \ldots , X_{M, t}\) are quite general and can include, for example, historical loads \(L_{t}\) for \(t\le N\), or other explanatory variables such as temperature forecasts.

A decision tree defines a function by splitting the training observations in the domain, \(\textbf{X}\), into disjoint, non-overlapping subsets. The function is simply the average value of the historical observations of the dependent variable (in this case the load \(L_t\)) within each disjoint subset. A partition of a 2D variable space into four disjoint sets is illustrated in Fig. 10.8, which presents the types of splits that decision trees can produce. The algorithm starts with the full domain, in this case the square area \([-1, 1] \times [0, 2]\), shown in the plot with the black boundary. A split of one of the variables is made which optimises the split of the domain according to some criterion (this will be investigated in detail later, but for regression this could be, for example, the split which maximally reduces the mean squared error (MSE) (see Chap. 7) between the observations and the model). In this illustrative example the best split is to cut at \(x = 0\) (represented by the red line). In the next iteration, the process is repeated to find the next best split on the two sections just produced. This turns out to be the horizontal line \(y=0.5\) shown by the yellow line. In the final iteration a cut at \(x=0.6\) is chosen, given by the purple line. The splitting can be written as a tree, with each split of the tree representing another partition of the domain. The corresponding decision tree representing the domain partitions in Fig. 10.8 is shown in Fig. 10.9. The final nodes, labelled \(C1, \ldots , C4\), are the end, or leaf, nodes and represent the final partitions of the domain.

Fig. 10.8
figure 8

An illustration of how a decision tree may split a 2 variable domain into a disjoint ‘optimal’ partition

Fig. 10.9
figure 9

The decision tree which produces the partition in Fig. 10.8. Each split in the tree represents a split in the domain

Now consider a series of pairs \((L_{t+1}, \textbf{X}_t)\) (\(t=1,\ldots , N\)) of dependent and independent variables with \(\textbf{X}_t=(X_{1, t}, X_{2, t}, \ldots , X_{M, t})\) a set of M attributes at time t, and \(L_{t+1}\) an observation at the next timestep, \(t+1\). Note the assumption here is that the observations are continuous real-valued variables (discrete/categorical variables are not usually considered in load forecasting applications but can also be included, e.g. see the dummy variables in Sect. 6.2.6). Given a partition of the domain of \(\textbf{X}\) into P disjoint sets, denoted by \(C_1, C_2, \ldots , C_P\), each of which contains \(N_1, N_2, \ldots , N_P\) points respectively, define a piecewise function \(f_P()\) (where the subscript P indicates the dependence on the partition) which is constant on each disjoint set and can be written as

$$\begin{aligned} f_P(\textbf{X})=\sum _{p=1}^P \alpha _p \chi _p(\textbf{X}), \end{aligned}$$
(10.16)

where \(\chi _p()\) is a characteristic (or indicator) function defined by

$$\chi _p(\textbf{X}) = {\left\{ \begin{array}{ll} 1, &{} \text {if }\textbf{X}~\text {is in set }C_p \\ 0, &{} \text {otherwise} \end{array}\right. }. $$

and

$$\begin{aligned} \alpha _p = \frac{1}{N_p}\sum _{t=1}^{N} L_t \chi _p(\textbf{X}_t), \end{aligned}$$
(10.17)

is the average of the dependent variables whose corresponding independent variables are within the partition set \(C_p\). Hence for each partition defined by the decision tree a corresponding piecewise function can be defined. This is known as decision tree regression. The cost function used to define the splits in a decision tree regression is often the mean square error between the estimate, defined by Eq. (10.16), and the observations, given by

$$\begin{aligned} MSE = \frac{1}{N} \sum _{t=1}^N (L_t-f_P(\textbf{X}_t))^2. \end{aligned}$$
(10.18)

Of course the decision tree can continue splitting into smaller partitions and reduce the MSE until each set only contains a single observation. However, this would likely lead to overfitting (Sect. 8.1.2) of the regression tree and hence poor forecast estimates. Instead the fit can be controlled by calibrating a number of parameters of the decision tree or defining a stopping criterion. There are several different parameters, or combinations of parameters, which could be chosen in order to optimise the generalisability of the regression tree; some of the most common are:

  • Fixing a minimum number of observations \(\min _{p \in {1, \ldots , P}}{N_p}\) in each leaf node.

  • Stopping when the MSE decreases less than some threshold, \(\tau \), when an additional split is added.

  • Fixing a maximum number of branch nodes (i.e. maximum value of partitions P).

  • Maximum depth of the tree (i.e. maximum number of splits).

The values for these parameters are typically chosen using cross-validation (see Sect. 8.1.3), where a variety of different models are trained with different parameters on the training set and the best models are chosen based on how they perform on a validation set.

Fig. 10.10
figure 10

A set of noisy observations is generated from the curve used to illustrate the decision tree regression

Fig. 10.11
figure 11

Two regression trees fit to the noisy observations together with the original curve. The regression trees use a different minimum number of observations in the leaf nodes. In this case 10 (grey curve) and 2 (red dashed curve)

To illustrate the process for generating a regression tree, consider a simple 1D case as shown in Fig. 10.10. Observations are generated by sampling 40 points from the curve and adding a small amount of noise. Two different regression trees are generated using different choices for the minimum number of observations, \(\min _{p \in {1, \ldots , P}}{N_p}\), in each leaf node, in this case 10 and 2. These are trained on the noisy observations to produce two functions given by Eq. (10.16). The graphs for the final functions for these two regression trees are shown in Fig. 10.11 together with the original curve. Notice the regression trees have finer resolution, and match better, where the observations are more densely packed. In particular, it should be noted that the ends of the function (at \(x<2\) and \(x>8\)) are not accurately estimated. Like many machine learning techniques, the estimates may not accurately extrapolate to points outside the domain of the observations. This can make such methods difficult to apply outside of the training data in forecasting applications.
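A hedged sketch of this kind of experiment with scikit-learn; the curve and noise level are illustrative, not the exact data behind Fig. 10.11:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(1, 10, size=40)).reshape(-1, 1)  # 40 sample points
y = np.sin(x).ravel() + rng.normal(scale=0.1, size=40)   # noisy curve

# Two trees differing only in the minimum observations per leaf node
tree_coarse = DecisionTreeRegressor(min_samples_leaf=10).fit(x, y)
tree_fine = DecisionTreeRegressor(min_samples_leaf=2).fit(x, y)

grid = np.linspace(1, 10, 200).reshape(-1, 1)
f_coarse = tree_coarse.predict(grid)   # smoother, higher-bias fit
f_fine = tree_fine.predict(grid)       # follows the noise more closely
```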

10.3.2 Random Forest Regression

As mentioned in Sect. 10.3.1, decision trees are often not useful as time series forecasting models and typically produce a model with high variance (see Sect. 8.1.2). However, their power comes from being used as building blocks for other, more powerful methods. One of the most common of such methods is random forest regression (RFR), described in this section.

The basic premise of RFR is to generate many regression trees, each applied to a random sample (usually sampled with replacement) of the observations. Further, unlike normal regression trees, each tree only splits on a subset of the variables/features. The final RFR is then an average of the regression functions across all trees generated. By only using a sample of the features in each split, the algorithm prevents the regression from over-relying on strong predictors, which would produce correlated trees. Thus each of the regression trees (also called weak learners) focuses on different input features. Random forest is an ensemble technique because it considers an ‘ensemble’ of weak learners to produce a single strong learner.

Fig. 10.12
figure 12

Random forest regression fit to the data from the example in Sect. 10.3.1 generated from 100 regression trees. Also shown in black is the original data from which the observations were generated

Now consider the example from Sect. 10.3.1, with the observations used to train the regression trees given in Fig. 10.10. A random forest regression applied to this data using 100 regression trees is shown in Fig. 10.12. Notice that, in comparison to the individual regression trees shown in Fig. 10.11, the RFR fit is much more accurate as well as much more continuous. This is because, by randomly sampling the training data and also the variables used in each split, the RFR finds a balance between generalising the function and not overfitting; in other words, random forests often produce a model with a good bias-variance trade-off (see Sect. 8.1.2). This fit would be even smoother if more trees were used on more data.

There are many different parameters in the random forest that can be optimised via cross-validation (see Sect. 8.1.3) with some of the most important being:

  1. 1.

    Number of trees. The more trees, the more accurate the model. However, this affects how long it takes to generate the estimates.

  2. 2.

    How many variables/features to select at each node split. For regression a common approach is to select a third of the attributes at each node split. It is best to not use too many variables to avoid overfitting.

  3. 3.

    Minimum number of observations in the terminal/leaf nodes.

The idea of cross-validation is to train random forests with different selections of the above parameters and choose the combination which gives the minimum MSE (10.18) on the validation set.

A useful property of random forests is the feature importance tool which can look across all trees to assess the importance of each feature. This can be achieved because not all trees use all variables. Hence a comparison can be made between the improvement produced when a feature is included in a model versus when it is not used. For regression, a measure is made of how much the feature reduces the variance. Averaging this across trees gives the final importance of each feature and also helps to interpret the strongest drivers for accurately predicting the outputs.
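A hedged example combining both ideas with scikit-learn (parameter values are illustrative); `max_features=1/3` mirrors the common third-of-the-attributes rule mentioned above:

```python
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(
    n_estimators=100,     # number of trees in the ensemble
    max_features=1 / 3,   # share of features considered at each split
    min_samples_leaf=5,   # minimum observations per leaf node
)
model.fit(X_train, y_train)
# Variance-reduction based importance score of each input feature
print(model.feature_importances_)
```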

Random forests are popular because they are easy to implement while maintaining a good bias-variance trade-off. They also have a number of other advantages. They can handle thousands of input variables without overfitting and can be used together with the feature importance to perform feature selection. However, their main disadvantage for use in time series forecasting is that they are not very effective at predicting on out-of-sample data. To illustrate, consider the example in Fig. 10.12. Any estimates outside of the observed domain [1, 10] have the same fixed constant values and are unlikely to be accurate.

10.3.3 Gradient-Boosted Regression Trees

The former section introduced random forest regression, a powerful prediction model that uses an ensemble of simple decision tree models to produce an accurate forecast. Gradient-boosted regression trees (GBRT) are also an ensemble technique, using the same basic idea of combining weak learners to create an accurate strong learner.

There are several related models, but most are variations of the Gradient Boosting Machine (GBM) introduced in [2], also referred to as Multiple Additive Regression Trees (MART) and the Generalised Boosting Model. Within random forests, simple decision tree regression models are trained in parallel, and their predictions are combined, e.g., through averaging. In contrast, with GBRT, the base learners are trained in sequence, each trained to reduce the remaining errors, i.e., the residual series, of the prior iteration. In other words, the main idea of gradient-boosting models is to iteratively improve the model by training new learners that explicitly improve on the current predictions according to some loss function. The optimisation process is guided by the loss function’s gradient. In a regression problem, like load forecasting, the loss is typically the mean squared error (Eq. (10.18) in Sect. 10.3.1), while in classification tasks it is the cross entropy. However, gradient boosting is general enough to minimise arbitrary differentiable loss functions. This makes it applicable also to more complicated tasks like predicting quantiles by minimising the quantile loss (see Sect. 7.2).
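For instance, scikit-learn's gradient boosting can minimise the quantile loss directly; a hedged sketch with an illustrative quantile level:

```python
from sklearn.ensemble import GradientBoostingRegressor

# Minimise the quantile (pinball) loss for the 90th percentile
model = GradientBoostingRegressor(loss="quantile", alpha=0.9)
model.fit(X_train, y_train)
q90 = model.predict(X_test)   # an estimate of the 0.9 quantile of the load
```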

More precisely, the main objective of gradient boosting is to find the prediction via a weighted sum of weak prediction models which can be represented as

$$ f(\textbf{X}) = \sum _{i=1}^M {\gamma _i h_i(\textbf{X})}. $$

Finding the optimal weights \(\gamma _i\) and weak learner functions \(h_i\) is, in general, a computationally infeasible optimisation problem. Hence, gradient-boosting finds a solution iteratively, with the aim of improving the model over M stages. At each stage \(i = 1, 2, \ldots , M\), the aim is to find a model \(f_{i}\) that produces an improvement compared to the model of the previous iteration, \(f_{i-1}\), by adding a new estimator \(h_i\), i.e.,

$$ f_{i}(\textbf{X}) = f_{i-1}(\textbf{X}) + \alpha \gamma _i h_{i}(\textbf{X}). $$

Here, \(\alpha \) is a constant step-size or learning rate (Sect. 4.3). The gradient-boosting process is guided by the direction of the steepest descent of the loss function (also Sect. 4.3). As an example, consider the aim of minimising the mean square error between the ground truth and the previous iteration's prediction \(f_{i-1}\),

$$ L_{MSE} = \frac{1}{2} \left( \textbf{L} - f_{i-1}(\textbf{X}) \right) ^2 $$

Note that the constant factor \(\frac{1}{2}\) is introduced for convenience (and without loss of generality) to express the loss function’s derivative as:

$$ \frac{\partial L_{MSE}}{\partial f_{i-1}(\textbf{X})} = \textbf{L} - f_{i-1}(\textbf{X}) $$

Define \(h_i(\textbf{X})\) as this derivative of the loss function:

$$ h_{i}(\textbf{X}) = \textbf{L} - f_{i-1}(\textbf{X}) $$

In the case of GBRT, regression trees are used to model the function \(h_i\). Observe that in the case of the mean square loss, this means the regression tree \(h_i\) is fit to the residuals of the previous iteration's forecast. The weight \(\gamma _i\) is determined by solving the optimisation problem obtained by plugging the previous iteration's forecast \(f_{i-1}\) and the current iteration's residual-fitted model \(h_i\) into the mean square error loss function:

$$ \gamma _i = \mathop {\mathrm {arg\,min}}\limits _\gamma \frac{1}{2} \left( \textbf{L} - \left( f_{i-1}(\textbf{X}) + \gamma h_i(\textbf{X}) \right) \right) ^2 $$

The details of solving this optimisation are not part of this book but additional reading is referenced in Appendix D. Generally, the optimisation implemented within gradient-boosting is related to gradient descent (recall Sect. 4.3). The process can be derived similarly for loss functions other than the mean square error, but this discussion is not explored in this book.

GBRT and its variants typically have two types of hyperparameters: ones related to the above-mentioned iterative gradient boosting optimisation process and ones related to the regression trees. One of the most important hyperparameters is the number of regression trees. Similarly to random forests, the depth of each individual tree (sometimes indirectly controlled by a parameter enforcing a lower bound on the number of samples in a leaf) is also relevant. In terms of the gradient boosting optimisation process, the most important hyperparameter is the learning rate \(\alpha \), in the case of gradient-boosting also referred to as shrinkage. It determines how much each newly added tree contributes to the model. Smaller values make the model more robust against the influence of specific individual trees, i.e., allow the model to generalise better and avoid overfitting. But a small learning rate requires a larger number of trees to converge to the optimal value and hence is more computationally expensive.
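These hyperparameters appear, for example, in scikit-learn's implementation; a hedged sketch with illustrative values:

```python
from sklearn.ensemble import GradientBoostingRegressor

model = GradientBoostingRegressor(
    n_estimators=500,    # number of regression trees (boosting stages)
    learning_rate=0.05,  # shrinkage: contribution of each new tree
    max_depth=3,         # depth of each individual tree
    subsample=0.8,       # share of instances per tree (regularisation)
)
model.fit(X_train, y_train)
```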

Fig. 10.13
figure 13

Deviance plot of the relationship of the number of trees and the train and generalisation error as a diagnostic tool

As introduced in Sect. 8.2.5, an important diagnostic tool to evaluate these hyperparameters is the so-called deviance plot, which shows the training and testing error as a function of the number of trees. Figure 10.13 shows such a plot. The error on the training set decreases rapidly, then gradually slows down, but continues to decrease as further trees are added. In contrast, the error on the test set also decreases, but after slowing down and reaching a minimum it begins to increase again. This increasing gap between training and test error indicates overfitting of the model, and the ideal stopping point is determined by the learning rate and the number of trees. Compare this plot to the more general version of the plot, Fig. 8.5, introduced in Sect. 8.2.
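Such a plot can be produced, for example, from scikit-learn's staged predictions; a hedged sketch assuming the model and data splits from above:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Error after each boosting stage, on the training and test data
train_err = [mean_squared_error(y_train, p)
             for p in model.staged_predict(X_train)]
test_err = [mean_squared_error(y_test, p)
            for p in model.staged_predict(X_test)]
best_n_trees = int(np.argmin(test_err)) + 1  # minimum of the test curve
```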

Gradient boosting has a high model capacity and is hence prone to overfitting. Therefore other regularisation parameters may need to be tuned using cross-validation. Similar to random forest regression, regularisation can be implicitly introduced by fitting models only on a subset of the features and instances, i.e., through subsampling. This can be controlled through parameters that limit the number of features and the share of instances used. Different gradient boosting implementations may provide additional explicit mechanisms to prevent overfitting, which usually introduce more hyperparameters. Popular choices are, for instance, L1 and L2 regularisation on the weights (see Sect. 8.2.4 on regularisation) and early stopping, which stops training if the loss has not improved by more than a certain threshold after a certain number of iterations (Sect. 8.2.5).

Note that gradient boosting is a general approach that can also be used with other base learners. However, it is most popular with decision and regression trees because they are relatively simple and efficient to train, hence not prone to overfitting as a base learner, but can still model non-linear relationships with interactions between the features. Due to the good performance of the approach on tabular data, many related versions and implementations of the general GBM algorithm [2] have been introduced. See Appendix D for additional reading and the most popular implementations of the gradient-boosting framework. While the large number of hyperparameters may seem discouraging, gradient boosting methods are among the most powerful methods for accurate predictions on tabular data and, therefore, also in load forecasting. Unfortunately, the forecast accuracy comes at the cost of limited model transparency. As will be discussed in Sect. 10.6, tree-based methods like random forests and gradient boosting provide scores to assess feature importance. However, these should be seen merely as an indicator; the methods don’t provide any understanding of the actual effect size of specific variables or their significance, in contrast to methods such as linear regression (cf. Sect. 9.3).

10.4 Artificial Neural Networks

This section introduces artificial neural networks (ANN), a machine learning technique loosely inspired by biological neural networks, the building blocks of animal and human brains. ANNs consist of a collection of connected artificial neurons, and, like neurons connected by synapses in the brain, each artificial neuron can send a signal to neighbouring, connected neurons. As in biological neural networks, “learning” is achieved by adjusting the connections between the neurons. However, the detailed mechanisms, such as the learning mechanism itself (i.e., backpropagation) or the representation as real numbers, are quite different from the biological role model (which is itself not yet fully understood). Nevertheless, ANNs are a powerful machine learning method applied in numerous applications, from forecasting to image recognition, and many advancements are rapidly being developed. This section introduces the standard form, the feed-forward network, and an adaptation designed to handle sequential data called recurrent neural networks.

Fig. 10.14
figure 14

A simple artificial neuron or cell

10.4.1 Feed-Forward Neural Networks

The simplest building block of ANNs is the artificial neuron. It is often referred to as a node, a unit or a cell of an ANN. A neural network with one artificial neuron and no hidden layers is called a perceptron. The perceptron can be used as a supervised learning algorithm for classification or regression. Figure 10.14 illustrates how the perceptron, a single artificial neuron, can be used to forecast the load \(L_{t}\) at time t based on n input variables \(X_{1,t}, \ldots , X_{n,t}\) corresponding to the same time t. The collective n inputs can be denoted as the vector \(\mathbf {X_t}\).

The artificial neuron must train the function \(h(\mathbf {X_t})\) so that when it operates on the inputs, \(\mathbf {X_t}\), it produces an accurate estimate of the final output. The output of an artificial neuron is called an activation. To compute the activation, the inputs are linearly combined and passed into an activation function g to produce the output signal (the activation):

$$\begin{aligned} \hat{L}_{N+1} = h(\textbf{X}) = g\left( \sum _{k=1}^{n} \beta _k X_{k,N+1}\right) = g\left( \boldsymbol{\beta }^T \textbf{X}_{N+1}\right) \end{aligned}$$
(10.19)
Fig. 10.15
figure 15

Comparison of popular activation functions (black) with their respective derivatives (red)

Note that technically a constant bias term is also added, but this is omitted here to improve readability. This can be achieved by concatenating a variable \(X_{0,t}=1\) to the input vector \(\mathbf {X_t}\).

In a perceptron, if the activation function, g, is ignored, this is just a multiple linear regression (compare the Eq. (10.19) with Eqs. (9.4) and (9.5)). However, the activation function introduces nonlinearity and increases the flexibility of the model compared to a simple linear regression. There are many choices for the activation function. Popular choices are the sigmoid function, hyperbolic tangent (tanh) and, more recently, versions of the ReLU function. Figure 10.15 shows some popular activation functions and their corresponding derivatives.

The sigmoid function is a nonlinear transformation function that maps values to the unit interval [0, 1]. This makes it a popular choice in neural networks since it can be directly used in the output layer of binary classifiers, as its output can be interpreted as a probability of being a member of one of the categories (Sect. 3.1). The sigmoid function is defined as:

$$\begin{aligned} \sigma (z) = \frac{1}{1+ \textrm{e}^{-z}}, \end{aligned}$$
(10.20)

However, it is comparatively computationally expensive and can cause training stability issues when used within a neural network model. This function is now generally only used for binary classification problems. A more generalised form of the sigmoid, the softmax function, is used in multi-class classification.

The hyperbolic tangent (tanh)  is another popular choice as a neural network activation function defined as

$$\begin{aligned} \textrm{tanh}(z) = \frac{(\textrm{e}^z - \textrm{e}^{-z})}{(\textrm{e}^z + \textrm{e}^{-z})}. \end{aligned}$$
(10.21)

It is of a similar S-shape as the sigmoid function but is defined between -1 and 1 and maps negative values to negative outputs and zero inputs to zero. However, it has similar stability issues as the sigmoid function and is hence seldom used in modern network architectures.

The rectified linear unit (ReLU) is very efficient to compute and does not lead to the same stability issues as the sigmoid and tanh functions, namely the vanishing gradient problem (see Sect. 10.4.2). This has made it the default activation function in many deep neural networks. It maps values below zero to zero but is equal to the input itself when it is greater or equal to zero, i.e. a linear activation, and is defined as:

$$\begin{aligned} \textrm{ReLU}(z) = \textrm{max}(0, z). \end{aligned}$$
(10.22)

The function and its derivative are both monotonic.

The fact that ReLU maps values below zero to zero and does not map larger activations to smaller numbers leads to different stability issues, namely the dying ReLU where many activations are zero, or exploding activations when repeated activation leads to increasingly larger values. An adjusted version, the leaky ReLU, attempts to solve the dying ReLU problem and can pose as an alternative when ReLU produces stability issues in model training. It introduces a positive parameter \(\alpha \) as a multiplier for negative values and is defined as:

$$\begin{aligned} \mathrm {Leaky~ReLU}_{\alpha }(z) = \textrm{max}(\alpha \cdot z, z). \end{aligned}$$
(10.23)
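The four activation functions can be written compactly in NumPy; a short sketch:

```python
import numpy as np

def sigmoid(z):                   # Eq. (10.20): maps to (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):                      # Eq. (10.21): maps to (-1, 1)
    return np.tanh(z)

def relu(z):                      # Eq. (10.22): zero below 0, identity above
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):    # Eq. (10.23): small slope below 0
    return np.maximum(alpha * z, z)
```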

In the case of the perceptron, the activation of the neuron represents the final prediction \(\hat{L}_{t}\). However, the true power of neural networks comes from stacking several layers of neurons, where the activation of one layer can then be passed to the next layer in the network, enabling later layers to use the prediction of earlier layers as a feature. This is how neural networks can learn increasingly abstract representations of simpler features.

Feed-forward networks  are the simplest form of multi-layer neural networks. In a feed-forward neural network, each node in a layer is connected to all the nodes in the previous and successive layers (therefore also referred to as a fully-connected neural network). They are also referred to as multi-layer perceptrons or as vanilla neural networks, as in “vanilla” being the plainest and standard kind of ice cream.Footnote 2

Fig. 10.16
figure 16

The directed graph structure of a feed-forward neural network

The input layer has one node per feature in the dataset. The output layer has one node per target variable (in multivariate regression) or class (in classification). The layers in between are referred to as hidden layers, here with l hidden neurons. Figure 10.16 shows this basic structure with one hidden layer.

In the context of load forecasting, the ANN is used to forecast future load. In the example shown, the output consists of m values, which in the application of this book would normally be an estimate of the demand for m steps ahead, i.e., the load \(L_{N+1}, \dots , L_{N+m}\). To achieve this prediction, it takes several features as input. In the case of load forecasting, this could, for instance, be past values of the time series itself as well as some past (and possibly forecasted) explanatory variables, e.g., the outside temperature.

To understand how ANNs work, consider trying to accurately predict the load \(L_{N+1}\) at time \(t=N+1\) using n inputs \(X_{1, N+1}, X_{2, N+1}, \ldots , X_{n, N+1}\) by training a model f using the ANN framework.

In other words, the aim is to model the following relationship

$$\begin{aligned} \hat{L}_{N+1|N} = f(\{X_{1, N+1}, \ldots , X_{n, N+1}\}, \boldsymbol{\beta }), \end{aligned}$$
(10.24)

where \(\boldsymbol{\beta }\) are the weights of the ANN (which will be described below). As in previous cases, the inputs can be historical loads or other explanatory variables. In the context of time series forecasting, the inputs \(X_{k, N+1}\) typically include the prior values of the target, \(L_{{N}},L_{{N-1}},L_{{N-2}},\ldots ,L_{{N-W}}\), to model the autocorrelation with past values of the time series. In the context of ANNs, the number of historical values up to W is sometimes referred to as the receptive field, mirroring the biological analogue. Additionally, the features typically also include some other external features related to time step \(t=N+1\) (cf. Sect. 6.2 for more on typical features). ANNs that include past values of the load \(\textbf{L}\) and the external values \(\textbf{X}\) are sometimes referred to as a nonlinear autoregressive exogenous model (NARX).

For simplicity, the following discussion considers only one hidden layer with l nodes as in Fig. 10.16, the extension of the algorithm to further layers is analogous. The activations of the hidden layer can be calculated similarly to the activation of the individual perceptron (see Eq. (10.19)) but extended to each neuron i of the layer:

$$\begin{aligned} h_i(\textbf{X}) = g\left( \sum _{k=1}^{n} \beta ^{(1)}_{k,i} X_{k,N+1}\right) \end{aligned}$$
(10.25)

Then we can write the definition of the full neural network as:

$$\begin{aligned} \hat{L}_{N+j} = g_o\left( \sum _{i=1}^{l} \beta ^{(2)}_{i,j} h_i(\textbf{X}) \right) , \end{aligned}$$
(10.26)

where \(g_o\) is the activation function applied to the linear summation of the outputs from the hidden layer.

A neural network with a single hidden layer with a large number of units has the ability to approximate very complex functions. To use a neural network for prediction, one needs to determine the optimal values for the weights in the graph, here the ones connecting the input to the hidden layer, \(\beta ^{(1)}_{k,i}\), and the hidden layer to the outputs, \(\beta ^{(2)}_{i,j}\). As with supervised learning more generally (see Sect. 4.2.1), the aim is to find the parameters of the model which minimise a loss function, often denoted as J. In regression tasks, as encountered in load forecasting, this is most typically the mean squared error (MSE), as defined in Eq. 10.18 for decision trees. For an ANN the MSE loss function can be written as a function of the current weights of the neural network \(\boldsymbol{\beta }\), using Eq. (10.24), and some known ground truth \({\textbf {L}}\):

$$\begin{aligned} J_{MSE}({\textbf {L}}, \boldsymbol{\beta }) = \frac{1}{N} \sum _{t=1}^N (L_t-f(\textbf{X}_t, \boldsymbol{\beta }))^2. \end{aligned}$$
(10.27)

In the process of finding the optimal weights, this ground truth is the training data. The process of training the neural network starts with random initial values for the weights. Then batches of instances of the training set are passed into the neural network and all the layers’ activations are calculated. The loss function and its gradient are computed. The weights are updated in the direction of the negative gradient in order to minimise the loss. Since the updates are calculated layer-wise, propagating backwards from the output layer to the input layer, this process is sometimes called back-propagation.

Whereas in linear regression this optimisation can be done in closed form based on the whole dataset, or through a simple least squares regression (Sect. 8.2.1), the task of finding the optimal weights in neural networks is more complex. Recall from Sect. 4.3 that this loss function is typically non-convex, i.e., it can have multiple local optima and saddle points. The weights are therefore adjusted using an optimiser as described in Sect. 4.3.

From the description above, there are a number of different choices in designing the ANN, i.e., choosing its hyperparameters, including

  • The number of nodes per hidden layer,

  • The number of hidden layers,

  • The choice of activation function,

  • The choice of optimiser and its hyperparameters.

Increasing the number of layers and nodes increases the number of parameters in the system and increases the chances of overtraining the model. This can be avoided by the same techniques as discussed in Sects. 8.1.3 and 8.2. One option is to choose the correct parameters and functions via cross-validation techniques, as discussed in Sect. 8.1.3, in which several models are trained with different combinations of the number of nodes and layers. The trained models can then be compared to each other based on their performance on the validation set. This process can be expensive, especially when training many models. An alternative method is to use regularisation, as demonstrated in Sect. 8.2.4. These methods involve adding a penalty to the cost function proportional to the size of the weights, which encourages the parameters to stay small (hence reducing the complexity of the ANN). Another method to prevent overtraining is early stopping (Sect. 8.2.5), which stops the algorithm early to prevent the ANN from training too close to the noise in the data set. The iteration at which to stop can also be decided using cross-validation.

The activation functions depend on the application. For hidden layers, as discussed, common functions are the sigmoid or the tanh function. In deep neural networks the rectified linear unit (ReLU) is the most popular choice. For the output layer the choice is determined by the type of problem. In binary classification the sigmoid function is used and in multi-class classification the softmax function; the loss function is then the cross entropy loss. In regression the last layer is linear (i.e., no activation) and the loss is the mean squared error. For the optimiser and its hyperparameters see Sect. 4.3 for popular choices.
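As a hedged sketch (not a reference implementation), a small feed-forward network for m-step ahead load forecasting in Keras; the layer sizes, optimiser and training settings are illustrative, and `y_train` is assumed to hold the next m loads per instance:

```python
import tensorflow as tf

n, m = 24, 4   # illustrative: 24 input features, 4 forecast steps

model = tf.keras.Sequential([
    tf.keras.Input(shape=(n,)),
    tf.keras.layers.Dense(64, activation="relu"),  # hidden layer
    tf.keras.layers.Dense(m),                      # linear output layer
])
model.compile(optimizer="adam", loss="mse")        # MSE loss, Eq. (10.27)
model.fit(X_train, y_train, epochs=50, validation_split=0.2,
          callbacks=[tf.keras.callbacks.EarlyStopping(patience=5)])
```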

10.4.2 Recurrent Neural Networks

The machine learning models in the last sections have mainly concentrated on fixed-length input data. When including past observations to compute the functional form as in Eq. (10.15), a window length W or receptive field must be specified to determine how many past values to include. This is because all the regression models considered are designed to handle tabular data, i.e., datasets of fixed-size input vectors. Further, the algorithms generally do not assume any structure over the columns, i.e., in a structured dataset, the order should not matter.Footnote 3 Given the fixed length, one cannot efficiently model dependencies that require a specific order of the columns, which is often the case for sequential data due to autocorrelation. Further, if one chooses a large W, one needs a lot of data to be able to model dependencies that exist at very different time scales.

However, when dealing with time series and other sequential data, e.g., DNA sequences, video analysis, sound patterns and language, it may make sense to be less restrictive on the length of input values to model both long and short-term dependencies. Instead of specifying the length of the input, i.e., the receptive field that should be considered, the model needs to learn the relevant length.Footnote 4

Recall the architecture of a feed-forward neural network (cf. Fig. 10.16). The network consists of fully-connected layers, and every node is connected to every node of the next layer. The structure of the network can be represented by a directed acyclic graph. Recall that in NARX sequential data is added in the form of the lagged values of the load \(\textbf{L}\) and some external features \(\textbf{X}\). Despite the underlying data potentially being sequences of arbitrary length, the input \(X_{1}, \ldots , X_{N}\) is required to be of a fixed dimension. As discussed before, this is because the fully-connected neural network can only compute the function \(f(X_{1}, X_{2}, \ldots , X_{N})\) on these fixed-length inputs. However, how could a more flexible \(f(X_{1}, X_{2}, \ldots , X_{N})\) be calculated for variable values of N?

To achieve this, the inputs can be processed recurrently by feeding the vector sequentially and passing in each step not only the current value \(X_t\), but also the activation of the prior step, \(Z_{t-1}\), i.e.:

$$\begin{aligned} Z_{t} = h(Z_{t-1}, X_t)\text {, for } t=1,2,\ldots ,N \end{aligned}$$
(10.28)

The final prediction is then the output of the final calculation:

$$\begin{aligned} f(X_{1},X_{2},\ldots ,X_{N}) = Z_{N} \end{aligned}$$
(10.29)

Here, \(X_t\) can be a vector \(\{L_t, X_{1,t}, \ldots , X_{n,t}\}\) of the load L and n features, e.g., weather or calendar variables that belong to the time step t considered.

Recall that h is an artificial neuron that applies an activation function, introducing nonlinearities. Then Z is the activation, sometimes referred to as the intermediate hidden state.

Fig. 10.17
figure 17

A graph structure with recurrent connections of a recurrent neural network

Neural networks that consist of such recurrent connections are called recurrent neural networks (RNN)  and they are designed to capture the dynamics of sequences more appropriately than regular feed-forward neural networks. Figure 10.17 shows the structure of such an RNN with the recurrent connection shown as a loop.

Fig. 10.18
figure 18

An unfolded recurrent neural network

This activation over time can be thought of as multiple copies of the same network, each passing an activation to a successor. This makes RNNs “deep” neural networks, even if technically only one layer is modeled.Footnote 5 RNNs can be thought of as feed-forward neural networks where each layer's parameters (both conventional and recurrent) are shared across time steps. While the standard connections in a feed-forward neural network are applied synchronously to propagate each layer's activations to the subsequent layer at the same time step, the recurrent connections propagate the activation to the nodes of the same layer at the next time step. This can be visualised as in Fig. 10.18 by providing an unfolded view of the activations over time.

Fig. 10.19
figure 19

A visualisation of the internals of an RNN cell with the tanh activation function

To understand how the RNN neuron, henceforth referred to as a cell, computes the new activation based on the old activation and a new value, see Fig. 10.19. It visualises the computation within the cell by showing that the input \(X_t\) and the activation of the prior time step \(Z_{t-1}\) are concatenated and then passed to the activation function g, which in the context of standard RNNs is often the tanh function. So the activation can be written as:

$$\begin{aligned} \mathbf {Z_t} = g\left( \boldsymbol{\beta }^T [\textbf{X}_{t},\mathbf {Z_{t-1}}]\right) = tanh\left( \boldsymbol{\beta }^T [\textbf{X}_{t},\mathbf {Z_{t-1}}]\right) \end{aligned}$$
(10.30)

Again \(\boldsymbol{\beta }\) denotes a weight vector, g the activation function, and we use \([\bullet ]\) to denote the concatenation of the vectors \(\mathbf {X_t}\), the feature vector, and \(\mathbf {Z_{t-1}}\), the activation of the prior step \(t-1\).
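As an illustration, the recurrence of Eqs. (10.28)-(10.30) can be sketched in a few lines of numpy. Note that this is a toy sketch: the weight vector \(\boldsymbol{\beta }\) is generalised here to a weight matrix so that the activation can be a vector, biases are omitted, and all dimensions are arbitrary choices:

```python
import numpy as np

def rnn_forward(X, B, z0):
    """Unrolled RNN forward pass (Eqs. 10.28-10.30).

    X  : array of shape (N, n_features), the input sequence
    B  : weight matrix of shape (n_hidden, n_features + n_hidden),
         the matrix analogue of the weight vector beta in Eq. (10.30)
    z0 : initial activation, shape (n_hidden,)
    """
    z = z0
    for x_t in X:                                    # feed the sequence step by step
        z = np.tanh(B @ np.concatenate([x_t, z]))    # Z_t = g(B [X_t, Z_{t-1}])
    return z                                         # Z_N is the prediction (Eq. 10.29)

rng = np.random.default_rng(0)
X = rng.normal(size=(48, 3))                 # e.g. 48 half-hourly steps, 3 features
B = rng.normal(scale=0.1, size=(8, 3 + 8))   # hidden state of size 8
print(rnn_forward(X, B, np.zeros(8)))
```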

One can see from Eq. (10.30) that the recursive call of the activation function can lead to vanishing and exploding activations and, more importantly, vanishing and exploding gradients when computing the loss function and its gradients in model training, i.e., when finding the optimal weights of the network. For instance, considering only the first three steps leads to

$$\begin{aligned} \mathbf {Z_3} = g\left( \boldsymbol{\beta }^T \left[ \textbf{X}_{3},g\left( \boldsymbol{\beta }^T \left[ \textbf{X}_{2},g\left( \boldsymbol{\beta }^T \left[ \textbf{X}_{1},\mathbf {Z_{0}}\right] \right) \right] \right) \right] \right) \end{aligned}$$
(10.31)

Here, repeated multiplication with the weights can lead to very small or very large values. While this can be alleviated by tricks such as gradient clipping, standard RNNs tend to be unstable to train when longer dependencies are modelled, and are typically only successful in modelling short-term dependencies. Hence, they have not proven practical for load forecasting, where longer dependencies like weekly or even yearly seasonal patterns are typical. Therefore, the hyperparameters of the standard RNN model are not further discussed here, and the introduction to RNNs serves only as background for more modern variants such as LSTM and GRU, which have been popular for sequence modelling and have proved successful for load forecasting. In particular, their training is more stable than that of the standard RNN. These more modern techniques will be explored in more detail in Sect. 10.5.

10.5 Deep Learning

The algorithms so far in this chapter are considered classical machine learning algorithms. This section introduces neural network architectures that are considered deep neural networks, or part of the subfield of machine learning called deep learning. While the notion of “deep” neural networks has been touched upon in the context of RNNs, where “deep” meant deep in time, it has been found that RNNs in their standard form are not able to model long-term dependencies in time; in this respect they are more similar to a standard feed-forward neural network. This section introduces adaptations to RNNs that make them capable of modelling longer-term dependencies, and which can hence be considered neural networks with many layers, i.e., deep.

It should be noted that there is no clear definition of when an artificial neural network is considered “deep”. Recall the architectural graph from ANN in Fig. 10.16. The number of weights grows rapidly as layers are added, since each fully-connected layer contributes weights proportional to the product of the widths of the adjacent layers. One way to distinguish standard feed-forward networks from deep neural networks is that deep neural networks often have so many layers that fully connected layers are infeasible.

But why add many layers in the first place? This points to a second way of distinguishing classical machine learning from deep learning. In classical machine learning, the modelling flow is to first hand-design features and then fit a model that maps from the features to the target. As discussed in Chap. 4, in classical statistical modelling the goal is to avoid the curse of dimensionality and only include variables that help to understand the process. In machine learning, however, the aim is to make the best possible prediction, and in deep learning, the objective is to automatically find suitable feature embeddings or representations that can be used by a predictive model. When stacking several layers, lower layers learn simpler representations that are passed to subsequent layers, which can use these simple features to model more abstract features to be used in the final prediction model. This process is often also referred to as representation learning.

While this modelling flow of automated representation learning is now the default in disciplines such as computer vision and language modelling, where manual feature engineering has been predominantly displaced, for time series and load forecasting manually designed features often remain a suitable approach, especially in settings where an abundance of data is not available.

Fig. 10.20
figure 20

An overview of an LSTM cell unrolled in time

10.5.1 Modern Recurrent Neural Networks

Recall from the last section that RNNs have issues due to numerical instability, leading to them only being capable of modelling short-term dependencies. This section introduces the two most popular extensions of RNNs, namely gated recurrent units (GRUs) and long short-term memory (LSTM). Recall from Sect. 10.4.2 that, in contrast to layers in feedforward networks, a layer in a recurrent neural network receives the input of the input layer \(\textbf{X}_t\) as well as its own activation signal from the last time step, \(\textbf{Z}_{t-1}\), which was referred to as a hidden state.

LSTMs introduce a second hidden state, the cell state. Now the current state of a cell depends on the current value \(\textbf{X}_t\), as well as on the previous activation \(\textbf{Z}_{t-1}\) and the previous cell state \(\textbf{C}_{t-1}\). This cell state functions as a memory of the cell, where the training determines how long- and short-term values should be memorised. To control what is forgotten, added and output, LSTMs introduce the forget gate, the input gate and the output gate. Figure 10.20 gives an overview of these parts of the LSTM. Note how the gates are essentially made up of ANN layers, i.e., weights and different activation functions.

Fig. 10.21
figure 21

The LSTM cell with the cell state and different gates highlighted

Figure 10.21 gives an overview of each of these gates. Figure 10.21a shows the forget gate, which governs how much of the previous cell state to keep, based on the current input and the previous activation. The previous activation and the input are concatenated and passed through the sigmoid function, which squashes the result to values between 0 and 1. With the pointwise multiplication, this means the closer a value is to 0, the more the cell state “forgets”; the closer it is to 1, the more is kept. It is hence computed as:

$$\begin{aligned} \mathbf {F_t} = \sigma \left( \boldsymbol{\beta }_F^T [\textbf{X}_{t},\mathbf {Z_{t-1}}]\right) \end{aligned}$$
(10.32)

Note that there is a weight vector \(\boldsymbol{\beta }_F\) that is specific to this gate.

The next part, as shown in Fig. 10.21b, is called the input gate. It adds to or subtracts from the current cell state. It computes the input value \(\textbf{I}_t\) and a candidate value \(\tilde{\textbf{C}}_t\) as:

$$\begin{aligned} \mathbf {I_t} = \sigma \left( \boldsymbol{\beta }_I^T [\textbf{X}_{t},\mathbf {Z_{t-1}}]\right) \\ \tilde{\textbf{C}}_t = tanh \left( \boldsymbol{\beta }_C^T [\textbf{X}_{t},\mathbf {Z_{t-1}}]\right) \end{aligned}$$

Then the updated cell state \(\textbf{C}_t\) is computed by:

$$\begin{aligned} \textbf{C}_t = \mathbf {F_t} \textbf{C}_{t-1} + \mathbf {I_t} {\tilde{\textbf{C}}}_{\textbf{t}} \end{aligned}$$
(10.33)

Again, note the weight vectors \(\boldsymbol{\beta }_I\) and \(\boldsymbol{\beta }_C\) that must be determined in the training process. Note that, because the new cell state is computed by adding the gated current input to the previous cell state, this is related to the residual skip connections described in Sect. 10.5.3: the network can learn whether there is something useful to add from the current input, or whether the old cell state should be kept.

The final part is called the output gate; it learns, from the former hidden state and the current input, what to output from the cell state as the next hidden state:

$$\begin{aligned} \mathbf {O_t} = \sigma \left( \boldsymbol{\beta }_O^T [\textbf{X}_{t},\mathbf {Z_{t-1}}]\right) \\ \mathbf {Z_t} = \mathbf {O_t} * tanh \left( \textbf{C}_t \right) \end{aligned}$$

With these gates, LSTMs can be trained to be more stable than standard RNNs and have been popular in sequence modelling and also for time series and load forecasting. Over time, several variants have been proposed, like peephole connections that give each of the gates access to the current cell state and coupled forget and input gates that, instead of separately deciding what to forget and keep, make those decisions jointly. A full discussion of what parts are necessary or the most effective is not part of this book. For more information see [3, Chap. 10].
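To make the gate equations concrete, the following is a minimal numpy sketch of a single LSTM step, implementing Eqs. (10.32) and (10.33) together with the output gate. Again, the gate-specific weight vectors are generalised to matrices, biases are omitted, and all dimensions are arbitrary illustrative choices:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(x_t, z_prev, c_prev, Bf, Bi, Bc, Bo):
    """One LSTM cell step (Eqs. 10.32-10.33 plus the output gate).

    Each B* is a weight matrix of shape (n_hidden, n_features + n_hidden),
    the matrix analogue of the gate-specific weight vectors beta_F, beta_I,
    beta_C and beta_O."""
    v = np.concatenate([x_t, z_prev])   # [X_t, Z_{t-1}]
    f = sigmoid(Bf @ v)                 # forget gate (Eq. 10.32)
    i = sigmoid(Bi @ v)                 # input gate
    c_tilde = np.tanh(Bc @ v)           # candidate cell state
    c = f * c_prev + i * c_tilde        # new cell state (Eq. 10.33)
    o = sigmoid(Bo @ v)                 # output gate
    z = o * np.tanh(c)                  # new activation / hidden state
    return z, c

rng = np.random.default_rng(0)
n_feat, n_hid = 3, 8
Bs = [rng.normal(scale=0.1, size=(n_hid, n_feat + n_hid)) for _ in range(4)]
z, c = np.zeros(n_hid), np.zeros(n_hid)
for x_t in rng.normal(size=(48, n_feat)):   # feed a toy sequence step by step
    z, c = lstm_step(x_t, z, c, *Bs)
```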

The most popular related architecture is the Gated Recurrent Unit (GRU)  cell. The main idea is similar to that of LSTMs, as it likewise introduces gates to control how much to remember from previous states. However, it is somewhat simpler and has fewer parameters: it does not have a dedicated cell state and introduces only a reset and an update gate. Figure 10.22 gives an overview of the GRU cell. We omit a detailed walkthrough from this book as it is similar to the LSTM. Being simpler, the GRU is faster to train and less prone to overfitting (e.g., when used with time series). See [3, Chap. 10] for a description of GRU.

Fig. 10.22
figure 22

An overview of a GRU cell unrolled in time

So far, only a single hidden layer has been discussed. In practice, multiple LSTM or GRU layers can be stacked on top of each other. However, this increases the number of parameters drastically and may lead to overfitting. Several layers should be explored when there is a lot of data available, e.g., when fitting a global model trained on the data of several households, buildings or other metered instances. In terms of hyperparameters, the design decisions are similar to feedforward neural networks, namely the number of layers and the number of hidden units per layer. The activation functions are as introduced in the descriptions before. Further, an optimiser and its hyperparameters need to be chosen.

While more stable than RNNs, LSTM and GRU remain difficult to train and may lead to overfitting for time series. For longer dependencies of, e.g., hundreds of steps back (not uncommon, for instance, with weekly seasonality), both LSTM and GRU become quite deep for practical applications: one hundred steps back in time can be interpreted as a standard feedforward network with 100 layers. Generally, LSTM and GRU are also slow to train, as they are not easy to parallelise since the states have to be computed sequentially.

10.5.2 Convolutional Neural Networks

The beginning of this section has motivated the idea that stacking many layers can enable learning of increasingly complex representations of the input data. Consider modelling a high-resolution time series, for instance, 1-minute load data, with a long receptive field. This leads to a large number of lagged values that need to be included in the model to capture both the short-term local patterns and long-term trends. With a regular fully-connected neural network, this would require connecting each input neuron with a large number of neurons in the next hidden layer. Each node in the hidden layer is, in turn, connected to each neuron in the next hidden layer (and so forth). In particular, where there is multi-dimensional input, for instance, in other domains like images or even videos, then even a few hidden fully-connected layers would be infeasible as the neural network would have an excessive number of parameters.

One of the main drivers of the recent surge of machine learning has been the success of convolutional neural networks (CNN), which avoid an excessive number of parameters by using a different architecture. For instance, to decide if an image contains a rabbit, a CNN can make use of the fact that it does not need to view the full image at once, but can instead view successively smaller parts of the image, as it does not matter where in the image the rabbit is. The architecture makes use of so-called invariances, namely locality and translational invariance. Standard CNNs make use of the successive stacking of convolutional layers and pooling layers, which will be explained below.

Convolutions can identify patterns in data points that are close together. In images, for instance, adjacent pixels are close as they are also close in the physical world that the image represents. Similarly, for many time series, neighbouring data points are related, as certain behaviours occur close in time. This locality can be exploited by convolutions. A convolution is a mathematical operation defined through two functions (in the continuous case) or matrices and sequences (in the discrete case). This section will henceforth only consider the discrete case due to the inherent discrete time steps of the load data which is analysed in this book. Note also that only 1-dimensional convolutions will be considered in this book due to the focus on univariate time series data.

Consider two sequences \(\textbf{X}= (X_0, X_1, \dots , X_{n-1})\) and \(\textbf{K}= (K_{-p}, K_{-p+1}, \dots , K_0, \dots , K_{p})\) of length n and \(2p+1\) respectively. The index of \(\textbf{K}\) starts at \(-p\) because this simplifies the later calculations, as will be shown.

The convolution \((\textbf{X}*\textbf{K})_n\) is calculated by reversing the order of one of the vectors (which, note, is easier to do with \(\textbf{K}\) due to the notation used above!) and taking a sliding dot product with the other sequence, \(\textbf{X}\). In other words the convolution creates a new sequence \(\textbf{Z} = (Z_0, Z_1, \dots , Z_{n-1})\) defined as

$$\begin{aligned} Z_n = (\textbf{X}*\textbf{K})_n = \sum _{m=-p}^{p} X_{n-m} \cdot K_{m}. \end{aligned}$$
(10.34)

Notice that this summation may require values outside the index range of the defined sequences. In this case those values are simply set to zero (this is referred to as zero padding). For example

$$\begin{aligned} Z_0 = X_0 K_{0} + X_1 K_{-1}+ \cdots + X_{p} K_{-p} \end{aligned}$$
(10.35)

While generally there is no constraint on the length of either sequence, i.e., both can even be of the same length, in the context of neural networks one is typically longer (here the input sequence \(\textbf{X}\)) and one shorter (the so-called filter or kernel,Footnote 6 \(\textbf{K}\)). Figure 10.23 shows schematically how the convolution function can be used to calculate a target sequence \(\textbf{Z}\). In summary, the steps to compute a convolution are as follows:

  1.

    Reverse the kernel sequence \(\textbf{K}\), then

  2.

    shift the kernel along the input sequence \(\textbf{X}\) one point at a time, and

  3.

    at each step, calculate the dot product of the two aligned sequences, i.e. multiplying the aligned values and adding these products.

The resulting sequence \(\textbf{Z}\) is the convolution of the kernel and the input sequence. In the context of convolutional neural networks, the result may be referred to as a feature map, or, more generally, it is a representation of the input data in the context of representation learning.

Note that formally a convolution is defined over indices from negative infinity to positive infinity. In practice, one adds padding of zeros on both sides as illustrated above in Eq. (10.35). Then one can choose to keep only the part of the convolution that is non-zero or limit the results to just those values where the kernel completely overlaps with the input sequence.
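A minimal numpy sketch of Eq. (10.34), following the three steps above and using zero padding, may help make the operation concrete; the result agrees with numpy's built-in np.convolve:

```python
import numpy as np

def convolve_1d(X, K):
    """Discrete convolution as in Eq. (10.34), with zero padding.

    K has odd length 2p + 1 and is indexed from -p to p in the text;
    here it is stored as a plain array K[0..2p]."""
    p = len(K) // 2
    Xpad = np.pad(X, p)                     # zero padding on both sides
    Kflip = K[::-1]                         # step 1: reverse the kernel
    # steps 2-3: slide the kernel and take dot products
    return np.array([Xpad[n:n + len(K)] @ Kflip for n in range(len(X))])

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
K = np.array([0.25, 0.5, 0.25])
print(convolve_1d(X, K))
print(np.convolve(X, K, mode='same'))       # the same result from numpy
```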

Fig. 10.23
figure 23

Schematic of the convolution operation

Fig. 10.24
figure 24

The effects of applying different kernels to a load profile at the household-level

To understand the effect of the convolution operation on typical input sequences, consider the example in Fig. 10.24. It shows a load profile at the household level as input \(\textbf{X}\) and the resulting feature map \(\textbf{Z}\). Figure 10.24a considers the effect of the application of a filter \(\textbf{K}=[0.2,0.2,0.2,0.2,0.2]\). Given that the weights add up to 1, it becomes clear from Definition (10.34) that this is simply a moving average of the values before and after the current value; it essentially smooths the profile. A related operation is shown in Fig. 10.24b. The kernel \(\textbf{K}=[0.05, 0.24, 0.40, 0.24, 0.05]\) represents a Gaussian distribution, i.e., the centre point is weighted more than the edges. This also leads to an average, but the shape of the original load profile is more strongly preserved. In image processing, this operation is often called a Gaussian blur and is considered a more natural average filter than the equally weighted filter. Finally, Fig. 10.24c shows the result of a kernel designed to highlight variation between neighbouring data points. In the context of images, this would detect edges. In the context of load profiles, it highlights the sudden increases and decreases in load.
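The filters of Fig. 10.24 can be reproduced in a few lines; the load profile below is a synthetic stand-in, since the point is only the effect of each kernel:

```python
import numpy as np

rng = np.random.default_rng(1)
load = rng.gamma(shape=2.0, scale=0.25, size=96)   # stand-in for a daily household profile

kernels = {
    'moving average': np.full(5, 0.2),
    'gaussian blur':  np.array([0.05, 0.24, 0.40, 0.24, 0.05]),
    'edge detector':  np.array([-1.0, 2.0, -1.0]),  # highlights sudden changes
}
feature_maps = {name: np.convolve(load, k, mode='same')
                for name, k in kernels.items()}
```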

These filters are not defined manually in a convolutional neural network. Similarly to weights in a feed-forward neural network, the filters’ values are determined by the optimiser in the training process. The output, i.e., the feature map, is then passed through an activation function so that it can be thought of as analogous to the activation in feed-forward neural networks. It can be regarded as the learned representation passed to subsequent layers. There are as many feature maps as filters after each layer. The filter can be thought of as a feature extractor as the optimisation process will enforce filters that specialise in finding specific recurring features, like the above-mentioned averaged profile or the highlighted edges, that are helpful for downstream layers. Filters in the first layers could learn basic shapes, such as edges or corners, while later layers can detect more complex compositional patterns.

Fig. 10.25
figure 25

Max pooling operation with pool size 2 and stride 2

Fig. 10.26
figure 26

Max pooling operation with pool size 3 and stride 1

The second operation commonly used in convolutional neural networks is pooling. 1D pooling effectively down-samples the input sequence using an aggregation function. This aggregation function can be the mean or, more commonly, the max function. Figures 10.25 and 10.26 show this schematically. In Fig. 10.25 the pooling factor or pool size is 2, i.e., the aggregation is applied to two values at a time. Hence, the final sequence is half the length of the input sequence. In Fig. 10.26, the pool size is 3, but instead of shifting the pooling operation by the pool size, it is only shifted by one step, the so-called stride.

Similar to above, consider the example in Fig. 10.27. It shows the same load profile as before as input \(\textbf{X}\) and the result of applying the pooling operation with both a pool size and stride of 4. On the left, it shows mean pooling. This operation is typically done when downsampling a load profile, here from a 15 min to a 1 h resolution. Each point is the average of the current and prior 3 values. For a profile at the household level, this smooths the profile and the distinct peaks that are related to high-power appliances that are only used briefly, like a kettle or hair drier. The right figure shows the max pooling operation. It more strongly preserves the peaks of each of the considered intervals compared to mean pooling. Note that each of the sequences is now one-fourth of the length of the original sequence.
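A pooling operation with stride equal to the pool size is a simple reshape-and-aggregate, as this illustrative numpy sketch shows:

```python
import numpy as np

def pool_1d(X, pool_size=4, agg=np.max):
    """Non-overlapping 1D pooling (stride equal to the pool size)."""
    n = len(X) // pool_size * pool_size     # drop any incomplete final window
    return agg(X[:n].reshape(-1, pool_size), axis=1)

X = np.arange(16.0)
print(pool_1d(X, 4, np.max))    # max pooling:  [ 3.  7. 11. 15.]
print(pool_1d(X, 4, np.mean))   # mean pooling: [ 1.5  5.5  9.5 13.5]
```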

Fig. 10.27
figure 27

The effects of applying different pooling functions to a load profile at the household level (here with pooling size 4 and stride 4)

Fig. 10.28
figure 28

The structure of a convolutional neural network for time series data

A convolutional layer consists of the convolution operation followed by passing the resulting feature maps through an activation function. These building blocks, convolutional layers and max pooling layers, are the essential parts of CNNs. Figure 10.28 shows how they can be stacked for sequences as input, in the same manner as in the more common 2D, 3D and 4D architectures used, for instance, in image and object recognition. After stacking several convolutional and max pooling layers, a CNN typically flattens and concatenates the last layer's activation and feeds it into a fully-connected neural network that makes the final prediction as described in Sect. 10.4.1, using the feature representations extracted by the convolution and pooling layers.

The trainable parameters of CNNs are, therefore, the filters of the convolutional layers and the weights of the fully connected layers. Note that the pooling layers have no trainable parameters but merely downsample the output of the convolution layers. As the convolutional layers essentially function as a feature extractor for the fully connected network, filters trained on one dataset can be reused on a completely new dataset without retraining, training only the fully connected layers. This is referred to as transfer learning or fine tuning (see also the discussion in Sect. 13.4). The flattened output of the convolutional layers can further be concatenated with additional features, denoted as \(\textbf{X}\), to condition the forecast on external covariates for which convolutional operations are not useful. For instance, [4] show that for residential load forecasting, it can improve the forecast to condition it in this way on calendar-based variables and the weather forecast.

CNNs have several hyperparameters and architectural choices. First of all, there is the number of convolutional and pooling layers. For convolutional layers, the number of filters and the filter size, i.e., the length of the kernel, are the important hyperparameters. It is common practice to choose odd filter sizes, as this makes implementation easier; often 3 or 5 are reasonable choices. ReLU is most commonly used as the activation function in the convolution layers. For the pooling layers, the max function is the most common choice, and only the pool size needs to be chosen. In 2D, a pool size of 2 reduces feature maps in both dimensions, i.e., to a quarter of the input size, which is, therefore, a reasonable choice. Similarly, for 1D time series, a size of 4 can be a suitable initial choice. Finally, the structure of the fully connected layers, the number of nodes per layer and their activation functions need to be chosen analogously to fully-connected neural networks; ReLU is a reasonable default. Again, as in other neural network models, an optimiser and its hyperparameters need to be chosen (see the discussion in Sect. 4.3), as well as other parameters affecting training, like the batch size and the maximum number of epochs to train.
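Putting these choices together, a minimal Keras sketch of the architecture in Fig. 10.28 might look as follows (assuming TensorFlow is installed; the filter counts, kernel sizes and the input length of 336 half-hourly steps are arbitrary illustrative choices):

```python
import tensorflow as tf

# A week of half-hourly loads (336 steps, 1 channel) in, one value out.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(336, 1)),               # (time steps, channels)
    tf.keras.layers.Conv1D(32, kernel_size=3, activation='relu'),
    tf.keras.layers.MaxPooling1D(pool_size=4),
    tf.keras.layers.Conv1D(64, kernel_size=3, activation='relu'),
    tf.keras.layers.MaxPooling1D(pool_size=4),
    tf.keras.layers.Flatten(),                            # feature maps to a vector
    tf.keras.layers.Dense(64, activation='relu'),         # fully-connected head
    tf.keras.layers.Dense(1)                              # linear output for regression
])
model.compile(optimizer='adam', loss='mse')
```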

CNNs, in general, have several strengths over recurrent neural networks. They can model long-term dependencies better than LSTM, as 100 steps back do not make the model “deeper”, and thus they avoid the numerical issues discussed earlier. Further, they are much faster to train, as the calculation of filters is trivially parallelisable since one filter does not depend on the others. Also, as discussed, parts of a CNN architecture can be reused for new but similar tasks. This process of transfer learning never really worked as well for LSTM and GRU. However, the transfer of a pre-trained model to a new task has proven an effective strategy in practice that can improve generalisation and drastically decrease training time for a new task, effectively saving energy and, therefore, costs and even CO\(_2\) emissions (cf. discussion on sustainable AI [5]).

However, in this standard form, CNNs have several problems in the context of time series. First, with many filters and several layers, a model can still be comparatively large and have many parameters that can overfit with time series, especially for large receptive fields. Hence, some more modern building blocks that made the training of very large CNN architectures possible have also been introduced to time series and will be discussed in the next section. Another problem with time series is the convolution operation itself. As one can see from the definition and Fig. 10.23, the convolution operation includes future values when calculating the dot product of the filter. For time series forecasting, this means that future values can leak from the test period into the training data, even though these values are supposed to be unknown at the time of the forecast, and this can produce an overly optimistic prediction. For these reasons, it is recommended to use adjusted versions of CNNs for load forecasting, as introduced in the next section.

10.5.3 Temporal Convolutional Networks

This section discusses adjustments to convolutional neural networks that have proven effective for working with time series, namely causal convolutions, dilated convolutions, residual skip connections and 1\(\,\times \,\)1 convolutions. While these have been features of the WaveNet architecture [6], which was introduced as a generative model for raw audio data, neural network architectures based on these features for more general time series tasks have since been referred to as temporal convolutional networks (TCN)  [7]. In the following, we describe the most important building blocks.

The most important adjustment is made to avoid the possibility of leaking future data into the training data (see Sect. 13.6.4 on data leakage). While one could add zeros as padding instead of future values, there is a better solution: causal convolutions. For that, the convolution operation is simply shifted to only consider prior values when calculating the sliding dot product with the flipped kernel (cf. Sect. 10.5.2). Figure 10.29 shows this schematically (compare it to regular convolutions in Fig. 10.23). To formally define the operation, consider the input sequence \(\textbf{X}= (X_0, X_1, \dots , X_{n-1})\) and kernel \(\textbf{K}= (K_{0}, K_{1}, \ldots , K_{k-1})\) of length n and k respectively. We can then define the causal convolution as:

$$\begin{aligned} Z_n = (\textbf{X}*_c \textbf{K})_n = \sum _{m=0}^{k-1} X_{n-m} \cdot K_{m}. \end{aligned}$$
(10.36)

A second problem discussed before is that a large receptive field, i.e., a large window of past values to include in the model, may lead to unnecessarily many parameters, as it requires too many filters, and it may be unnecessarily slow, as many convolutions need to be computed. In traditional modelling, one addresses this by manually deciding to feed in only relevant past values, e.g., only of the last day, the same day of a week ago, the same day two weeks ago, or similar. However, in deep learning, we would still want to pass a considerable number of past values and have the model learn relevant features, i.e., internal representations, automatically. To achieve that, TCNs make use of dilated convolutions, which solve this by introducing a dilation factor d that essentially adds steps between calculations of the convolution operations. With this, it is still possible to cover a large receptive field by stacking several layers of dilated causal convolutions. Figure 10.30 compares regular causal convolutions with dilated convolutions, highlighting the operations that are necessary to cover the same receptive field with full causal convolutions (left, with filter size \(k=4\)) and causal dilated convolutions (right, filter size \(k=2\) and dilations \(d=[1,2,4]\)). The same receptive field can be covered with a smaller filter size (i.e., fewer parameters) and much fewer operations (i.e., much faster training). The dilated causal convolution operation is then defined for the same input sequence \(\textbf{X}= (X_0, X_1, \dots , X_{n-1})\) and kernel \(\textbf{K}= (K_{0}, K_{1}, \dots , K_{k-1})\) as:

$$\begin{aligned} Z_n = (\textbf{X}*_d \textbf{K})_n = \sum _{m=0}^{k-1} X_{n-d \cdot m} \cdot K_{m}. \end{aligned}$$
(10.37)
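As an illustration, the following numpy sketch implements Eq. (10.37) directly; setting \(d=1\) recovers the plain causal convolution of Eq. (10.36). The kernel values are arbitrary:

```python
import numpy as np

def dilated_causal_conv(X, K, d=1):
    """Dilated causal convolution (Eq. 10.37); d=1 gives Eq. (10.36).

    Only values at or before X_n enter Z_n; missing past values are zero."""
    Z = np.zeros(len(X))
    for n in range(len(X)):
        for m, k_m in enumerate(K):
            if n - d * m >= 0:              # zero padding of the past
                Z[n] += X[n - d * m] * k_m
    return Z

X = np.arange(8.0)
print(dilated_causal_conv(X, np.array([0.5, 0.5]), d=1))  # mean of X_n and X_{n-1}
print(dilated_causal_conv(X, np.array([0.5, 0.5]), d=4))  # mean of X_n and X_{n-4}
```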
Fig. 10.29
figure 29

Causal convolution operation

Fig. 10.30
figure 30

Comparison of standard and dilated causal convolutions

One important idea for improving CNNs that is common in TCNs is the residual skip connection, popularised with ResNet [8]. Prior to these, there were limits to stacking more layers, since training became unstable and accuracy degraded as more layers were added. However, in theory, adding more layers should not degrade performance: if an additional layer does not improve the result, it would ideally simply learn the identity mapping and leave performance unchanged. Skip connections allow layers to essentially be set to the identity mapping, i.e., skipped, by combining the activation of previous layers with the activations of successive layers. For dense layers, this can be achieved by concatenating those activations; more commonly, it is achieved by addition. So essentially, skip connections allow the model to choose whether layers add value to the outcome or not, thus improving results in large neural network architectures. Recall that this is similar to how the input gate works together with the cell state of prior time steps in LSTM (see Sect. 10.5.1).

Another idea that is common in modern TCNs is the 1\(\,\times \,\)1 convolution, popularised with the Inception model [9]. A 1\(\,\times \,\)1 convolution helps reduce the dimensionality along the filters. When applied to only one feature map (or input), it simply scales each value by a constant weight, which by itself would not be useful. However, when applied across several filters, it essentially learns a linear projection of the feature maps, thus reducing the number of filters. This reduces the number of parameters whilst retaining some feature-related information through the learned weights.

Fig. 10.31
figure 31

The structure of a TCN residual block as introduced in [7]

TCNs have similar hyperparameters to CNNs generally (see Sect. 10.5.2). However, when using dilated convolutions, two hyperparameters can be used to increase the receptive field: the dilation factor d and the filter size k. The most important decision is the architecture, i.e., how the building blocks are connected. Figure 10.31 shows the TCN architecture as introduced in [7]. One TCN residual block contains two dilated causal convolution blocks, each followed by weight normalisation, a ReLU activation and a dropout layer (see Sect. 8.2 on these regularisation techniques). These two blocks are bypassed by a residual skip connection with a \(1 \times 1\) convolution. If the block is in the first layer, it receives the input sequence; otherwise it receives the activation of the prior block \(i-1\), \(Z_{i-1}\), and passes its activation \(Z_i\) to the next block. Note that in the literature and in libraries, details of the implementation of what is referred to as a TCN may vary. An architecture based on WaveNet has been used for residential load forecasting in [10]. In [11], TCNs are used for load forecasting at the system scale. [12] use only the causal dilated convolutions as part of a CNN architecture for fitting the parameters of different probabilistic models for residential load forecasting. Note that modern TCNs do not make use of the pooling layers that are popular in image and object recognition. Instead, dilation and 1\(\,\times \,\)1 convolutions are used to decrease the number of parameters.
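The following is a sketch of such a residual block in Keras, under some simplifying assumptions: weight normalisation is omitted (it is not part of core Keras, though add-on packages provide it), plain dropout stands in for it being applied per channel, and all sizes are illustrative:

```python
import tensorflow as tf

def tcn_residual_block(x, filters, kernel_size=2, dilation=1, dropout=0.1):
    """A sketch of the TCN residual block of [7], without weight normalisation."""
    z = x
    for _ in range(2):                   # two dilated causal convolutions
        z = tf.keras.layers.Conv1D(filters, kernel_size,
                                   padding='causal',
                                   dilation_rate=dilation)(z)
        z = tf.keras.layers.ReLU()(z)
        z = tf.keras.layers.Dropout(dropout)(z)
    # 1x1 convolution on the skip path so the shapes match for the addition
    skip = tf.keras.layers.Conv1D(filters, 1)(x)
    return tf.keras.layers.Add()([z, skip])

inp = tf.keras.Input(shape=(336, 1))
z = inp
for d in [1, 2, 4, 8]:                   # stacked blocks grow the receptive field
    z = tcn_residual_block(z, filters=16, dilation=d)
out = tf.keras.layers.Dense(1)(z[:, -1, :])   # forecast from the last time step
model = tf.keras.Model(inp, out)
```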

Compared to LSTMs and GRUs, TCNs are more efficient to train (see discussion in the last section on CNNs). Compared to standard CNNs, TCNs solve, most importantly, the issue of causality. But several ideas like dilation, residual skip connections and 1\(\,\times \,\)1 convolutions have been shown to improve regular CNNs empirically, as they make training more stable, enable more efficient training and require fewer parameters. This can be beneficial for time series, where deep machine learning models are generally prone to overfit in settings without a lot of training data.

10.5.4 Outlook

As the last section has shown, with more modern deep architectures it becomes less and less clear what works in which situations. The field of deep learning is moving very rapidly with the successes in image and language modelling, and some of these models are being utilised within the field of time series using increasingly specific architectures. Many of the building blocks for such models have been introduced in the previous sections.

One example, N-BEATS [13], is a model based on fully connected residual networks, and has been successful as part of the M5 time series forecasting competition,Footnote 7 placing second. It includes an interpretable version that enforces individual stacks to learn the trend and seasonality independently. Another popular time series model is DeepAR [14], an RNN architecture that has been introduced for probabilistic intermittent demand forecasting. It fits a global model across different products and predicts one step ahead based on previous step values and some covariates (e.g., weather). It uses a Gaussian or Negative Binomial likelihood function, and its parameters are predicted by the neural network.

A whole different approach, not using the aforementioned building blocks, is transformer models [15]. They have surpassed all other approaches in most text processing tasks and have had initial successes in image tasks. They may be well suited for time series as well, as compared to CNNs they are truly sequential, i.e., no receptive field needs to be determined. This allows them to handle inputs of different lengths, which is only possible with CNNs and TCNs using zero padding. Like RNNs, transformers are designed to process sequential input data. Some consider them a version of recurrent neural networks; however, unlike RNNs, transformers process the entire input all at once. They use the so-called self-attention mechanism to provide context for any position in the input sequence. By not having to process one step at a time, they allow for much better parallelisation than RNNs and therefore reduce training times considerably. However, as transformers have not made it into the load forecasting literature (yet), they are not covered in more detail in this book.

This leads to a more general question: given the many possibilities with deep models, it is unclear where to start. Time series forecasting has for a long time been approached only with statistical methods, as discussed in Chap. 9. Only recently have machine learning models been shown to be successful in certain situations (see the M4 and M5 forecasting competitions [16, 17]). As discussed in more detail in Chap. 12, start simple first! Compared to images and text, many time series forecasting problems have lower data availability. So when making predictions for only one instance, like one building, household or substation, and only with the data of that instance (see the discussion of local and global modelling in Sect. 13.4), statistical and traditional time series models perform well. They also perform well when the data is of low resolution (e.g., daily, weekly, or yearly time series), and when covariates like seasonalities and other external influences are well understood. Machine learning and deep learning models tend to perform better when fitting models across multiple time series, like one model for multiple buildings or households, or hierarchical models (see Sect. 13.4 on this topic). In probabilistic forecasting, they perform well when densities are complex (e.g., multi-modal). Further, they can be useful when fitting models to processes with complex, nonlinear external influences. However, one important finding of the M5 competition was that combinations of statistical and machine learning models can reach state-of-the-art results while remaining at least partly interpretable, combining the advantages of both, and hence are particularly useful for real-world applications.

10.6 Feature Importance and Explainable Machine Learning

This chapter introduced several machine learning models for load forecasting. As discussed, in machine learning the objective is to automatically learn representations of the input data to improve the forecast. This comes, however, at the cost of understanding the relationship between the input data and the prediction. As explainable machine learning is a research domain with many different approaches, none of which has been shown to be dominant in load forecasting practice, it will not be covered in depth in this book.

However, this section will briefly discuss the following methods:

  • Feature importance of tree-based methods (model specific),

  • permutation importance (model agnostic),

  • SHAP Values (model agnostic).

As introduced in Sect. 10.3, tree-based forecasting models have feature importance methods built in. These methods attempt to determine the most relevant features for the internal representation of the data, i.e., for fitting the model. Different measures can be used to assess the feature importance, most commonly the Gini importance or the mean decrease in impurity (MDI). These can be output with the model to give some indication of feature importance and to explain the model. Note, however, that these measures are biased towards high-cardinality features. Also, they are computed on the training set in the model fitting phase and may, therefore, not reflect the ability of a feature to produce predictions that generalise to the test set and to the application. So they should merely be seen as an indicator. They work because with tree-based models only a sample of the variables is used to construct each tree (Sect. 10.3). That means some trees do not utilise a given input variable, so the performance when using a particular variable can be compared to the performance when not utilising it. In short, the importance of that feature can be assessed.
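As a hypothetical example with synthetic stand-in features (scikit-learn assumed available), the impurity-based importances can be read off a fitted random forest directly:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))   # stand-ins for, e.g., lag-1 load, temperature, hour
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=500)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)
print(model.feature_importances_)   # impurity-based (MDI) scores, from training data
```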

Another popular method is permutation importance. It was introduced by Breiman in the context of random forests [18] (see Sect. 10.3.2 for more details on random forests). The idea is to randomly permute each feature's values and analyse how this affects the final prediction outcome. The idea is simple and computationally cheap to implement, and it can be applied to any fitted model, not only tree-based methods. While popular for these reasons, it is generally not advised when the dataset contains strongly correlated features, as the importance estimates may then be biased in favour of correlated features.
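Continuing the sketch above, scikit-learn's permutation_importance evaluates the fitted model on held-out data, in contrast to the impurity-based scores computed during training:

```python
from sklearn.inspection import permutation_importance

# Permute each feature in turn on the validation set and measure the
# drop in model performance; larger drops mean more important features.
result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)
print(result.importances_mean)
```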

Permutation importance and feature importance can be combined to help identify all relevant variables in an input dataset. The Boruta algorithm [19] adds duplicates of some of the input variables to a random forest model but permutes them. These permuted variables should hence have no relevance to the dependent variable, and their feature importances can be compared to those of the other (non-permuted) variables to identify which ones have an importance lower than the random inputs. Those with a higher importance can therefore be viewed as more relevant to the supervised learning model's performance.

Finally, there are SHAP values (SHAP stands for SHapley Additive exPlanation), which are grounded in cooperative game theory. A detailed discussion of how they work is not part of this book, but see [20] for a more detailed discussion of SHAP values. The output is also a feature importance score. SHAP values work well for correlated features, as interactions of variables are also analysed. However, they are expensive to calculate, as one step involves building combinations of the features. This is infeasible for a large number of inputs, and hence often only approximations are calculated.
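Continuing the same sketch, a typical usage of the shap package (assumed installed) for a tree ensemble looks as follows; the TreeExplainer exploits the tree structure to compute the values efficiently:

```python
import shap  # assuming the shap package is installed

explainer = shap.TreeExplainer(model)        # fast, exact for tree ensembles
shap_values = explainer.shap_values(X_val)   # one score per sample and feature
shap.summary_plot(shap_values, X_val)        # global overview of feature effects
```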

10.7 Questions

As in the last chapter, for the questions which require using real demand data, try using some of the data listed in Appendix D.4. Again, ideally choose data with at least a year of hourly or half-hourly data and split it into training, validation and testing with a 3 : 1 : 1 ratio.

  1.

    Explain the difference between the activation functions and the loss function in a neural network. How do they relate? Explain how each of them is chosen in the process of modelling a feed-forward neural network for a specific task.

  2.

    The Exponential Linear Unit (ELU) is another activation function that is a strong alternative to ReLU. It is defined as:

    $$ \mathrm {ELU}(z) = {\left\{ \begin{array}{ll} z, &{} \text {for } z \ge 0 \\ \alpha ( \textrm{e}^{z} - 1), &{} \text {for } z < 0 \end{array}\right. } $$

    Plot the function and its derivative. Discuss possible strengths and weaknesses compared to other activation functions discussed in this chapter.

  3.

    Take an example load profile for a day that has some variation over the day, i.e., some distinct peaks. Use a convolution implementation of a library such as scipy or numpy in Python to compute the feature maps for some manually selected kernels, as was done in Fig. 10.24, and observe the resulting influence on the feature map. Try a kernel \(\textbf{K} = [-1.0, 2.0, -1.0]\). Before implementing, try to predict what the result will look like.

  4.

    Using a neural network library such as TensorflowFootnote 8 or PyTorchFootnote 9 implement a simple NARX, i.e., a fully-connected neural network that accepts the last two weeks of a half-hourly load profile as input and outputs a prediction for the next day. How many neurons does the input layer have? How many does the output layer have? Add one fully-connected hidden layer with ReLU activation. Use the appropriate activation function for the output layer. Use the library to output the number of weights (or trainable parameters) each network architecture has and visualise it as a function of the number of layers. How does the number of parameters scale with the number of layers? Train your network with different numbers of hidden layers (e.g., one and five) and visualise the training and validation error over the number of epochs trained. Observe the different training times needed. Do you observe overfitting for the deeper network compared to the shallower one? Add dropout of 10%, 20% and 50% to the hidden layers and observe whether that changes the progress of the training and validation loss. Try another regularisation method from those discussed in Sect. 8.2.