This chapter gives a brief introduction to a range of more advanced topics which are beyond the primary aims of this book. The further reading section in Appendix D gives references to other texts which investigate some of these topics in more depth.

13.1 Combining Forecasts

In Sect. 10.3.2 it was shown that by combining several weaker models a stronger, more accurate model could be produced. This principle can be extended to any selection of forecast models, even ones considered to be quite inaccurate. For example, an average could be taken of the outputs from a linear regression model, an ARIMAX model, and a random forest model to produce a new forecast. Each individual model will have its own strengths and weaknesses. By combining them, the idea is to mitigate the weaknesses of the different models and produce an overall more accurate model. Although this may seem surprising (it could be reasoned that errors would accumulate), combining models has been shown time and time again to be an easy, but effective, way to produce a forecast which is more accurate than any individual model.

Ideally the aim should be to combine models which are as different from each other as possible, so that they capture different features of the forecast (e.g. models with autoregressive components vs. those with few autoregressive components). Differences can be achieved by having models with different assumptions (e.g. combining a machine learning and a statistical model), models that use different features (e.g. combining a model that uses weather information and one that does not), or models trained on different data (e.g. models that are trained on different parts of the available data).

Even when the amount of diversity is limited the combined forecast is often an improvement. Further, it may even be beneficial to retain the least accurate models in any combination as they may model different features of the system that other, more accurate, models may not. For example, one model could better capture weather features, another time of day effects, and another may better estimate the peaks in the demand. The time series forecasting competitions, the M competitions,Footnote 1 have consistently shown that even compared to high-performing singular methods, some of the most accurate models are those which combine traditional machine learning algorithms and statistical models.

The optimal way to combine forecasts is still an active research area but below some simple methods for combining multiple forecasts are introduced for both point and probabilistic forecasts.

There are a number of different ways to combine point forecast models \(f_1, f_2, \ldots , f_n\), but one of the simplest and most popular is to take a linear weighted average

$$\begin{aligned} \hat{f}(x) = \sum _{i=1}^n w_i f_i(x). \end{aligned}$$
(13.1)

with \(\sum _{i=1}^n w_i=1\). Unless there is good reason to do otherwise, a good initial combination is to use equal weights for the forecasts, i.e. \(w_i = \frac{1}{n}\) for \(i=1, \ldots , n\). However, if there is sufficient data for testing the different combinations, and/or there is good evidence that a particular forecast may be more accurate on average than the others, then it may be worthwhile to train or create weights which are tailored towards the most accurate forecasts. The optimal weights can be found by testing a range of values over a validation set; however, this becomes more complex for larger sets of methods and requires enough data to properly train and validate the weightings. Alternatively, the accuracy of each model (e.g. assessed via the RMSE) could be used to give relative weights to the different forecasts, with higher weights for more accurate forecasts. Some other methods are listed in the further reading section in Appendix D.2.
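As a minimal sketch of Eq. (13.1), the snippet below (in Python, with purely illustrative numbers) combines three hypothetical point forecasts, first with equal weights and then with weights proportional to the inverse of each model's validation RMSE, which is one simple heuristic rather than an optimal scheme.

```python
import numpy as np

# Hypothetical actuals and three point forecasts on a validation set
actuals = np.array([2.1, 2.5, 3.0, 2.8, 2.2])
forecasts = np.array([
    [2.0, 2.6, 2.9, 2.7, 2.3],   # e.g. linear regression
    [2.3, 2.4, 3.2, 2.9, 2.0],   # e.g. ARIMAX
    [1.9, 2.7, 3.1, 2.6, 2.4],   # e.g. random forest
])

# Equal weights: w_i = 1/n
equal_combo = forecasts.mean(axis=0)

# Inverse-RMSE weights (more accurate models get larger weights), normalised to sum to one
rmse = np.sqrt(((forecasts - actuals) ** 2).mean(axis=1))
weights = (1 / rmse) / (1 / rmse).sum()
weighted_combo = weights @ forecasts

print("Equal-weight combination:", equal_combo)
print("Inverse-RMSE weights:", weights)
print("Weighted combination:", weighted_combo)
```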

As with point forecasts, individual probabilistic forecasts can be combined together, and there is increasing evidence that these combined forecasts are better than the individual methods. This topic is very much an active research area (e.g. see [1]) and therefore much of it is beyond the scope of this book, but there are some simple, easy-to-apply techniques that have been shown to be effective in certain situations.

Analogous to the weighted averaging presented for point forecasts above, a similar method can be applied to quantile forecasts by averaging the same quantiles from each method. For example, consider forecast models \(f_1^{\tau }, f_2^{\tau }, \ldots , f_n^{\tau }\) for the same quantile \(\tau \); then a combined quantile forecast is given by

$$\begin{aligned} \hat{f}^{\tau } = \sum _{i=1}^n w_i f_i^{\tau }. \end{aligned}$$
(13.2)

The same weights should be used for each quantile and can be found by minimising a probabilistic scoring function such as the CRPS via cross-validation (see Chap. 7). As with point forecast model combination, the weights should sum to one.
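The following sketch applies Eq. (13.2) to two hypothetical sets of quantile forecasts: a single weight is applied to every quantile and is chosen by a simple grid search minimising the average pinball (quantile) loss on a validation set. The data and the grid search are placeholders for illustration only.

```python
import numpy as np

def pinball_loss(actual, pred, tau):
    diff = actual - pred
    return np.mean(np.maximum(tau * diff, (tau - 1) * diff))

taus = [0.1, 0.5, 0.9]
actuals = np.array([2.1, 2.5, 3.0, 2.8, 2.2])

# Hypothetical quantile forecasts from two models: shape (n_quantiles, n_times)
f1 = np.array([[1.8, 2.2, 2.6, 2.4, 1.9],
               [2.1, 2.5, 2.9, 2.7, 2.2],
               [2.5, 2.9, 3.3, 3.1, 2.6]])
f2 = np.array([[1.7, 2.1, 2.7, 2.5, 1.8],
               [2.0, 2.6, 3.1, 2.8, 2.1],
               [2.4, 3.0, 3.5, 3.2, 2.5]])

# Grid search a single weight w (applied to every quantile), with 1 - w on the other model
best_w, best_score = None, np.inf
for w in np.linspace(0, 1, 21):
    combo = w * f1 + (1 - w) * f2
    score = np.mean([pinball_loss(actuals, combo[i], tau) for i, tau in enumerate(taus)])
    if score < best_score:
        best_w, best_score = w, score

print(f"Best weight on model 1: {best_w:.2f}, mean pinball loss: {best_score:.4f}")
```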

Further literature on combining forecasts is given in Appendix D.

13.2 Hierarchical Forecasting

Electricity networks are naturally hierarchical since traditionally electricity is generated and then transmitted to lower levels of the network. The network is becoming more complicated, especially due to the increased installation of distributed generation sources such as wind farms and solar photovoltaic farms.

Matching supply and demand must be achieved throughout the network rather than simply at the aggregate level. Hierarchical forecasting refers to generating forecasts at different levels of the hierarchy. This is demonstrated in Fig. 13.1. In load forecasting this could include forecasting the demand for an individual household, the aggregated demand at the secondary or primary substation, and also at the national level. Of course all the intermediate levels in between can also be included.

A useful property of hierarchical forecasts is that they can be made coherent. In other words, aggregations of forecasts at lower levels equal the forecast at the aggregated level. This is useful for anticipating or co-ordinating multiple applications (e.g. flexibility services such as batteries) at different levels of the hierarchy.

Fig. 13.1
figure 1

Electricity network hierarchy showing multiple levels from Transmission level down to individual households and buildings. Dotted lines show potential linking between load points at the same levels, i.e. between bulk supply points

Fig. 13.2
figure 2

Illustration of the effect of demand switching on the total demand on a distribution substation

Note that the aggregation of the demand monitored at the lower level is unlikely to match the demand at the higher level because not all loads are monitored (for example street furniture). Further, there is often switching on the network where demand is rerouted to other networks. Switching on a substation is illustrated in Fig. 13.2. Demand from a nearby network can be rerouted to another substation when there is, for example, a temporary fault. Hence the demand on a network may shift to a different regime of behaviour (see Sect. 13.6). Finally, there are electrical losses on a network since some energy is lost in the distribution process. This is another reason the demand at a substation is unlikely to match the aggregation of the downstream connected loads. By extension, the forecasts of the demand at the substation are unlikely to match the aggregation of the forecasts of the individual loads (even in the unlikely situation of a perfect forecast!).

To simplify the following discussion let’s assume that there is no switching, there are minimal losses, and we happen to have access to all the major downstream loads connected to the substation. To make this more precise, let \(D_t\) be the demand at time t for a substation and let \(L^{(k)}_t\) be the downstream load from connection \(k \in \{1, \ldots , K \}\) at the same time. In this scenario the following holds

$$\begin{aligned} D_t = \sum _{k=1}^K L^{(k)}_t. \end{aligned}$$
(13.3)

Now assume that a forecast is produced for each time series, which we denote by using a hat, e.g. \(\hat{D}_t\) is the forecast estimate at time t for the substation demand, and similarly \(\hat{L}^{(k)}_t\) are the forecasts for the downstream demands. However, in general, due to forecast errors and biases it is unlikely that the forecasts match, i.e.

$$\begin{aligned} \hat{D}_t \ne \sum _{k=1}^K \hat{L}^{(k)}_t. \end{aligned}$$
(13.4)

A simple way to produce coherent forecasts would be to only produce the forecasts for the downstream levels \(\hat{L}^{(k)}_t\) and estimate the other levels through aggregation. The issue with this approach is that the aggregated forecast is likely to be less accurate than a direct forecast of \(D_t\). This is because the less aggregated series are more volatile and therefore harder to forecast accurately. Further, if errors are correlated in the downstream forecasts, they may accumulate when aggregated. An alternative is to forecast the substation demand and then split it across the lower levels. However, given the complex behaviour across the downstream demands it is likely these forecasts will be inaccurate since it is not obvious how to disaggregate the demand, especially if it changes depending on time of day, time of year and on special days (Christmas, New Year, etc.).

More generally, let \(\hat{\textbf{L}}_t =(\hat{D}_t, \hat{L}^{(1)}_t, \ldots , \hat{L}^{(K)}_t)^T\) be the vector of forecasts for each time series in the hierarchy. A coherent forecast, \(\tilde{\textbf{L}}_t\), can be written generally as

$$\begin{aligned} \tilde{\textbf{L}}_t = \textbf{S} \textbf{G} \hat{\textbf{L}}_t \end{aligned}$$
(13.5)

where \(\textbf{S} \in \mathbb {R}^{(K+1) \times K}\) is the summation matrix which sums the lower levels up to the higher levels of the hierarchy, and \(\textbf{G} \in \mathbb {R}^{K\times (K+1)}\) is a matrix that maps the forecasts to the bottom level of the hierarchy and depends on the method deployed (in the example below \(K=4\), so \(\textbf{S}\) is \(5 \times 4\) and \(\textbf{G}\) is \(4 \times 5\)).

In the special case of the simple coherent approach which just sums the lowest-level forecasts, the matrices are

$$\textbf{S} = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} $$

and

$$\textbf{G} = \begin{bmatrix} 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{bmatrix} $$

The choice of \(\textbf{G}\) drives the different approaches, but a manual choice is likely to be suboptimal, i.e. it will not produce the most accurate set of coherent forecasts. There are many ways to define optimal, but one such method was given by the authors in [2], which minimises the total variance of the forecasts. The details are beyond the scope of this book (see [2] or [3] for complete details) but the final choice is given by

$$\begin{aligned} \textbf{G} = (\textbf{S}^T\textbf{W}^{-1}\textbf{S})^{-1}\textbf{S}^T\textbf{W}^{-1}, \end{aligned}$$
(13.6)

where \(\textbf{W}=\mathbb {V}ar(\hat{\textbf{L}}_t-\textbf{L}_t)\) is the covariance matrix of the baseline forecast errors.
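The sketch below builds the matrices of this section for \(K=4\) bottom-level series and applies Eq. (13.6), assuming, purely for illustration, a diagonal error covariance \(\textbf{W}\); the forecast vector and variances are placeholders.

```python
import numpy as np

K = 4
# Summation matrix S: first row aggregates the K bottom series, then the identity
S = np.vstack([np.ones((1, K)), np.eye(K)])

# Bottom-up G simply picks out the K bottom-level forecasts
G_bu = np.hstack([np.zeros((K, 1)), np.eye(K)])

# Baseline (incoherent) forecasts: [substation, load 1, ..., load 4]
L_hat = np.array([10.0, 2.4, 2.9, 1.8, 2.5])

# Illustrative diagonal W from assumed error variances of each baseline forecast
W = np.diag([1.0, 0.2, 0.3, 0.15, 0.25])
W_inv = np.linalg.inv(W)

# Eq. (13.6): G = (S' W^-1 S)^-1 S' W^-1
G_opt = np.linalg.inv(S.T @ W_inv @ S) @ S.T @ W_inv

bottom_up = S @ G_bu @ L_hat     # coherent, ignores the substation forecast
reconciled = S @ G_opt @ L_hat   # coherent, uses all baseline forecasts

print("Bottom-up:  ", bottom_up)
print("Reconciled: ", reconciled)
# In both cases the first element equals the sum of the remaining four
```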

This example has only shown one level of aggregation, but of course this can be extended to multiple levels. For example, in distribution networks this could consist of residential smart meters at the lowest level, which aggregate up to the secondary substations, several of which aggregate to primary substations (see Fig. 13.1).

Coherency can also be applied to probabilistic forecasts, but this is much more complicated and beyond the scope of this book. This area is attracting a lot of interest, and more new and novel results can be expected. Further reading on this topic is cited in Appendix D.2.

13.3 Household Level Forecasts

Forecasts at the household level (or aggregations of only a few households) have specific features and challenges. In this section the focus is on the specific problem of trying to measure the errors for point forecasts of household data.

Fig. 13.3
figure 3

Overlay of three consecutive Mondays’ demand profiles from the same household. Constructed using data from the CER Smart Metering Project—Electricity Customer Behaviour Trial, 2009–2010 [4]

The unique problems occur because data at this level is particularly spikey and hence very sensitive to small changes in the behaviour of the occupants within the household. To illustrate this consider a household with a single occupant who, every weekday, wakes up, leaves, and returns home at roughly the same time each day. When the occupant comes home they turn on various appliances: lights, heating, TV, cooker, etc. Although there are often similarities from week to week, there is still significant irregularity. Figure 13.3 shows the daily demand for three consecutive Mondays for a real household, overlaid on each other. This example shows volatility, especially in the evenings (in fact this household’s demand is much more regular than that of many other households, which highlights how irregular such demand can be).

Due to natural variation in behaviour, and other unexpected events, demand profiles are likely to change even for the most regular consumers. For example, unexpected traffic in the morning, or a missed alarm, will result in the occupant arriving late to work. In turn, the late start may mean the occupant now decides to work late and thus arrive home later than usual. An illustration of such a profile is shown in Fig. 13.4, where the original “typical” profile is shifted slightly.

This unique feature does not typically occur at aggregations of over 10 households. The individual demands and their irregularities smooth out and the data is no longer spikey (this is shown in Fig. 1.2 in Sect. 1.2). This relationship is emphasised in the case study in Sect. 14.2, which shows the power law relationship between aggregation size (the size of the feeder, equating to more households) and relative error (Fig. 14.8). The scaling law shows it becomes increasingly difficult to forecast accurately at low aggregations relative to higher aggregations.

Fig. 13.4
figure 4

Daily demand for one profile, together with the shifted profile. Constructed using data from the CER Smart Metering Project—Electricity Customer Behaviour Trial, 2009–2010 [4]

Fig. 13.5
figure 5

Demonstration of the double-penalty effect. The actual demand (bold line) is compared to a shifted version (dashed line), and a flat estimate (grey line), which is simply the uniform profile formed from the average of the actual demand

The “spikeyness” of household level demand also produces a specific problem in terms of measuring the errors of household level forecasts. Consider comparing the two forecasts illustrated in Fig. 13.5. One forecast is a simple flat forecast (made, say, by taking the average half hourly demand from the actual daily demand); the second forecast is a simple shift of the actuals (this could be viewed as a seasonal persistence forecast (Sect. 9.1) where the profile is relatively regular but with small shifts in demand). The second forecast is subjectively quite good, and potentially useful. It correctly estimates the peak magnitude and is only slightly misaligned with the peak timing. For an application like storage control for peak reduction (Sect. 15.1), this second forecast can help inform the appropriate charging schedule for a storage device since the peak is correctly anticipated, albeit at an adjusted time. This means a battery can be prepared with a full charge and ready to discharge when the peak does finally appear (this presumes there is some sort of monitoring of the demand which identifies the appearance of the peak, of course).

However, in the situation described in Fig. 13.5, despite the flat estimate being a useless forecast that provides no information about the peaks, its RMSE (0.71) is smaller than that of the peaky forecast (0.91). The reason for this is the so-called double-penalty effect. For any pointwise error metric like RMSE the peaky forecast will be penalised twice: once for estimating a peak that didn’t occur, and a second time for missing the peak that did occur. In contrast, the flat forecast is only penalised once.
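A small numerical illustration of the double-penalty effect (with made-up numbers, not the data behind Fig. 13.5): a forecast that reproduces the peak one step late scores worse under RMSE than an uninformative flat profile.

```python
import numpy as np

rmse = lambda a, f: np.sqrt(np.mean((a - f) ** 2))

# Simple daily profile with one evening peak
actual = np.array([0.2, 0.2, 0.2, 2.0, 0.2, 0.2, 0.2, 0.2])

shifted = np.roll(actual, 1)                  # peak predicted one step late
flat = np.full_like(actual, actual.mean())    # uninformative flat profile

print("RMSE shifted:", round(rmse(actual, shifted), 3))
print("RMSE flat:   ", round(rmse(actual, flat), 3))
# The shifted forecast is penalised twice (missed peak + false peak), so the
# flat forecast wins on RMSE even though it says nothing about the peak.
```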

There are two main options to deal with this situation:

  1. 1.

    Develop new error measures that reduce the double penalty effect and produce a more representative and smaller score for forecasts that more closely describe the peak magnitude and timing differences.

  2. 2.

    Consider probabilistic forecasts. Since these forecasts estimate the spread of the data, variable timing of peaks will be more accurately captured. In particular, multivariate probabilistic forecasts (Sect. 11.6) will capture the uncertainty and the interdependencies in the data, and if properly calibrated will include ensembles which represent the wide range of possible household profiles.

Probabilistic methods are the most desirable since they have proper scoring functions (Sect. 7.2) with consistent interpretations. This means they can be assessed in an objective way. The drawbacks are the computational costs, and the large amount of data required to train a probabilistic forecast. Techniques for probabilistic forecasts have already been introduced in Chap. 11, so the rest of this section discusses alternative error measures for point forecasts.

One set of options is the so-called time series matching algorithms. These are popular techniques used in areas such as speech recognition to show how close individual signals are. One such technique, dynamic time warping (DTW), has already been introduced in Sect. 10.1. It creates two new time series by stretching the originals to find the closest match. The original pointwise forecast error measures can then be used to measure the difference between the matched series.

There are a number of drawbacks with this technique. Firstly, in the standard DTW approach there is no penalty or limit on how much a time series can be stretched to enable a match of the features. Secondly, the ordering is fixed. If there are multiple peaks at similar periods but in a different order then dynamic time warping will match one peak but not the other. Since household demand can involve appliances being used in different orders (TV on then the oven, or oven on then the TV), this may be too restrictive for household demand forecasts.

Fig. 13.6
figure 6

Illustration of the matching between two time series performed by the adjustment error approach

An alternative method, as developed by the authors, is the so-called adjusted error measure [5]. This allows a restricted permutation of the time series around each point. This can be solved relatively efficiently for small numbers of time series using an assignment algorithm called the Hungarian method. A basic illustration of the matching is shown in Fig. 13.6. The adjusted error measure matches a time series like dynamic time warping but allows reordering of peaks (within a limited area). The drawback to the adjusted error measure is that there is no penalty on permuting the time series within the limited area, and the size of this permutation window must be chosen beforehand.
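The sketch below illustrates the general idea, though it is not the exact adjusted error measure of [5]: actual points are matched to forecast points within a limited window using the Hungarian method (via scipy's linear_sum_assignment), and the error is computed on the matched pairs. With a shifted peak and a window of one step, the adjusted error is much smaller than the ordinary MAE.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def adjusted_error(actual, forecast, window=2):
    """Rough sketch: match each actual point to a forecast point within
    +/- `window` time steps via the Hungarian method, then average the
    absolute errors of the matched pairs."""
    n = len(actual)
    cost = np.full((n, n), 1e6)  # effectively forbid matches outside the window
    for i in range(n):
        for j in range(max(0, i - window), min(n, i + window + 1)):
            cost[i, j] = abs(actual[i] - forecast[j])
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].mean()

actual = np.array([0.2, 0.2, 0.2, 2.0, 0.2, 0.2, 0.2, 0.2])
shifted = np.roll(actual, 1)   # same peak, one step late

print("MAE:           ", np.mean(np.abs(actual - shifted)))
print("Adjusted error:", adjusted_error(actual, shifted, window=1))
```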

In summary, no matter the measure, point forecasts for households will require some subjective choices in order to deal with the spikey nature of the demand.

13.4 Global Versus Local Modelling

Traditionally in time series forecasting, a model’s parameters are fitted based on historical data from the same instance one wants to forecast. As discussed in Chap. 2, in the context of load forecasting this instance can, for example, be a building, a household, a substation or another point of interest in the grid. So, if one trains a model for a specific building, the forecast model parameters are estimated using this building’s available historical data. This was also the initial approach taken with the advent of machine learning models.

However, as discussed in Chap. 10, machine learning models, especially deep models such as LSTMs and CNNs, tend to overfit when there is insufficient data. To mitigate this, a strategy referred to as global modelling was developed for implementing more complex machine learning models. In this approach the model is trained on data from multiple diverse instances, e.g. different buildings, simultaneously. Note that the instance that the forecast is made for may or may not be part of the training data. The global modelling approach makes use of the fact that deep machine learning models can learn feature representations that are general enough to often also generalise to other instances.

The main benefit of global modelling is that having more data is often advantageous for deep learning models and helps avoid overfitting. Further, learning a model for each instance can be highly impractical. For example, the aim in many countries is to ensure most homes have smart meters and, if the intention is to provide smart services, such as storage control or home management systems, it may be impractical to train an individual model on each home. Instead global models may be more feasible. The traditional approach of fitting the model on the same instance is now referred to as local modelling. Figures 13.7 and 13.8 illustrate these two approaches to load forecasting.

Fig. 13.7
figure 7

Traditional local modelling process

Fig. 13.8
figure 8

A global modelling process for deep learning models

However, if there are too many different instances in the data, there are typically diminishing returns, i.e., adding more data to the dataset does not lead to improvements. In fact, performance can even degrade as more data is added. This could especially be the case if many load profiles are added that are too diverse, or if the amount of data is too large relative to the capacity of the model used. To mitigate this problem, alternative hybrid strategies have been proposed. A typical approach is to initially cluster the data to find groups of similar instances and then to train a global model on each cluster. At inference time one needs to determine to which cluster a specific new instance belongs; e.g. in k-means this would be through comparing the profile to each cluster’s representative profile, while for finite mixture models (Sect. 11.3.2) this would be via calculating its membership posterior probability and assigning it to the cluster with the highest value. The prediction is then made by applying the trained model from this cluster to the particular instance. Figure 13.9 illustrates this process, which is sometimes referred to as pooling.
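A minimal sketch of the pooling idea using synthetic data and scikit-learn: household series are clustered with k-means on a simple profile representation, one (global) autoregressive model is trained per cluster on the pooled data of its members, and a new instance is assigned to the nearest centroid at inference time. The data, features and model choices are all placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic demand series for 50 households, 4 days of 48 half-hourly steps each
series = rng.gamma(shape=2.0, scale=0.3, size=(50, 192))

# Cluster households on a simple representation (here, their mean daily profile)
profiles = series.reshape(50, -1, 48).mean(axis=1)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(profiles)

def lagged_dataset(y, n_lags=4):
    X = np.column_stack([y[i:len(y) - n_lags + i] for i in range(n_lags)])
    return X, y[n_lags:]

# One global model per cluster, trained on the pooled data of its members
models = {}
for c in range(3):
    Xs, ys = zip(*(lagged_dataset(s) for s in series[kmeans.labels_ == c]))
    models[c] = LinearRegression().fit(np.vstack(Xs), np.concatenate(ys))

# New household: assign to nearest centroid, then predict with that cluster's model
new_series = rng.gamma(shape=2.0, scale=0.3, size=192)
cluster = kmeans.predict(new_series.reshape(-1, 48).mean(axis=0, keepdims=True))[0]
X_new, _ = lagged_dataset(new_series)
prediction = models[cluster].predict(X_new[-1:])
print("Assigned cluster:", cluster, "next-step prediction:", prediction)
```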

Fig. 13.9
figure 9

A “pooling” approach to global modelling

Fig. 13.10
figure 10

Process of fine-tuning a pre-trained model

As briefly discussed with CNNs in Sect. 10.5.2, the lower layers of an ANN, or the filters in a CNN, usually learn feature representations that are often general enough to also work as a feature extractor for other similar tasks. In other words, one can re-use these pre-trained parts of the neural network for another related task. This general idea of reusing parts of an already trained ANN is called transfer learning. Figure 13.10 illustrates this approach to forecasting. This is the common procedure when working with state-of-the-art image or language models, where such models are pre-trained on very large image or text datasets to first learn some general features of images or text before being fine-tuned on the dataset for a specific task.
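A minimal PyTorch sketch of the fine-tuning idea (not a specific architecture from this book): a small feed-forward network is assumed to have already been pre-trained on many instances, its lower layers are frozen, and only the final layer is re-trained on the target instance's data, which is synthetic here.

```python
import torch
import torch.nn as nn

# Assume this network has already been pre-trained on data from many instances
pretrained = nn.Sequential(
    nn.Linear(48, 64), nn.ReLU(),   # lower layers: general feature extractor
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 1),               # output layer: next-step demand
)

# Freeze everything except the final layer
for param in pretrained.parameters():
    param.requires_grad = False
for param in pretrained[-1].parameters():
    param.requires_grad = True

optimizer = torch.optim.Adam(
    (p for p in pretrained.parameters() if p.requires_grad), lr=1e-3
)
loss_fn = nn.MSELoss()

# Fine-tune on (synthetic) data from the target instance
X = torch.randn(256, 48)       # e.g. previous day's 48 half-hourly values
y = torch.randn(256, 1)        # e.g. next half-hour's demand
for _ in range(20):
    optimizer.zero_grad()
    loss = loss_fn(pretrained(X), y)
    loss.backward()
    optimizer.step()
```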

Local modelling is often not an effective strategy for deep machine learning models unless there is a lot of data available for the instance, or unless the resolution is sufficiently high. Hence, any of the global strategies, or variations thereof, should be explored. This has the benefit of improving generalisation and avoiding overfitting, but it is also, in practice, a much more computationally efficient strategy, as training a global neural network on multiple instances is typically cheaper than training multiple local models individually. This saves computing cost and energy, and hence avoids unnecessary CO\(_2\) emissions from the compute (see, for instance, [6] on the carbon emissions of machine learning training). Even if the model is transferred to a local computer (e.g. a building energy management system), there are no real data privacy concerns in the context of models trained on smart meters: when the model was trained on multiple instances, no method is known to re-create the individual instances used to train the model.

13.5 Forecast Evaluation: Statistical Significance

It is often the case that several methods perform very similarly and in fact there is no statistically significant difference between them. This can be a desirable situation as it means you may be able to choose a forecast model with other useful properties (say high interpretability, or low computational cost) without sacrificing accuracy. It is also important to rule out that a forecast is performing well by chance alone.

To tell if two time series forecasts are significantly different requires a statistical test. One of the most popular methods is the Diebold-Mariano test. As with many statistical tests, it begins with a null hypothesis, denoted \(H_0\), which in the case of load forecasting states that “the two time series forecasts are equally accurate on average”. The aim of the test is to see if the null hypothesis can be rejected, i.e. trying to show that the forecasts are in fact not equally accurate.

Suppose one forecast produces one-step ahead forecast errors \(e_1, e_2, \ldots , e_N\), and the second forecast produces errors denoted by \(\epsilon _1, \epsilon _2, \ldots , \epsilon _N\). Consider the loss differential series given by

$$\begin{aligned} d_t = g(e_t) - g(\epsilon _t), \end{aligned}$$
(13.7)

where g is a loss function, usually defined as \(g(x) = x^2\) or \(g(x)=|x|\). Note that the two forecasts have the same accuracy if the expected value of the loss differential is zero. Hence the null hypothesis can be reframed as \(\mathbb {E}(d_t)= 0 \) \(\forall t\). The main assumption for the Diebold-Mariano test is that the loss differential series is stationary.

To perform the test requires producing a test statistic, i.e. a value derived from the observations. This statistic should follow a particular distribution under the null hypothesis. If the actual calculated value is found to be very unlikely under that distribution, it is evidence for rejecting the null hypothesis. “Unlikeliness” is determined by a significance level \(\alpha \) and a p-value. The p-value is the probability of obtaining a value at least as extreme as the observed value, assuming the null hypothesis holds. The significance level is a value determined by the user before the experiment/test statistic is derived and sets the threshold for rejecting the null hypothesis.

Fig. 13.11
figure 11

Illustration of hypothesis testing with assumed standard Gaussian distribution. Shaded are the areas which represent the \(5\%\) extreme values which determine a rejection of the null hypothesis

An example of a hypothesis test is shown in Fig. 13.11 for a standard normal distribution, i.e. the test statistic under \(H_0\) follows \(N(0,1)\), which represents the distribution of values assuming the null hypothesis is true. Suppose in this example it is undesirable for the test statistic, T, to lie in the extremes of the distribution. If it does occur in the extremes, the null hypothesis can be rejected. This is a two-tail test.Footnote 2 Note then that this means the p-value is given by \(p= 2\min \{Pr(x>T|H_0), Pr(x<T|H_0)\}\). Suppose the significance level is chosen as \(\alpha = 5\%\), which is represented as the shaded part in Fig. 13.11 with a probability of 0.025 in each tail (i.e. a total of \(\alpha = 0.05\)). In other words, if the statistic lies in the shaded area (i.e. beyond the quantiles \(z_{\alpha /2}, z_{1-\alpha /2}\) determined by \(\alpha /2\) and \(1 - \alpha /2\)) the null hypothesis is rejected, i.e. if \(|T| > z_{1-\alpha /2}\). If it is not in the tails, then the null hypothesis is not rejected. Note that since we are considering a standard normal distribution, \(z_{1-\alpha /2} = 1.96\) (the 97.5th percentile) and \(z_{\alpha /2} = -1.96\).

Returning to our example of comparing time series forecasts, the Diebold-Mariano statistic [7] is defined as

$$\begin{aligned} DM = \frac{\bar{d}}{\sqrt{\left( \sum _{n=-(h-1)}^{h-1}\gamma _n\right) /N}}, \end{aligned}$$
(13.8)

where \(\bar{d}= \frac{1}{N}\sum _{n=1}^N d_n\) is the sample mean of the loss differentials and \(\gamma _n\) is the autocovariance of the loss differentials at lag n (Sect. 3.5). The h used in the denominator should be large enough to include any significant lags in the autocovariance and can be checked through the autocorrelation plots (Sect. 3.5). Assuming the loss differential time series is stationary, the DM statistic follows a standard normal distribution (Sect. 3.1).Footnote 3 A two-tailed test can thus be performed using the same distribution as shown in Fig. 13.11. Rejecting the null hypothesis in this case means that the two forecasts are not of similar accuracy.

For small samples the DM statistic can result in the null hypothesis being rejected too often. To account for this there are adjustments which can be applied, such as the Harvey, Leybourne and Newbold test [8]. The adjusted statistic is as follows

$$\begin{aligned} HLN = DM\left( \frac{(N+1-2h+(h(h-1)/N))}{N}\right) ^{1/2}. \end{aligned}$$
(13.9)

Instead of the standard normal, this corrected statistic must be compared to a Student-t distribution with \(N-1\) degrees of freedom. A Student-t distribution is similar to a Gaussian/normal distribution (Sect. 3.1) but with heavier tails, i.e. the extremes of the distribution converge to zero more slowly than for a Gaussian distribution. Thus, for this type of distribution, very large or very small values are more likely to occur.
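The sketch below implements Eqs. (13.8) and (13.9) directly for a squared-error loss, using the sample autocovariances of the loss differential and a Student-t p-value. It is a bare-bones illustration on synthetic error series rather than a fully featured implementation.

```python
import numpy as np
from scipy import stats

def dm_test(e1, e2, h=1, loss=lambda x: x ** 2):
    """Diebold-Mariano test with the Harvey-Leybourne-Newbold correction."""
    d = loss(np.asarray(e1)) - loss(np.asarray(e2))    # loss differential
    N = len(d)
    d_bar = d.mean()

    # Sample autocovariances gamma_n of d for lags 0, ..., h-1
    gamma = [np.mean((d[n:] - d_bar) * (d[:N - n] - d_bar)) for n in range(h)]
    var_d = (gamma[0] + 2 * sum(gamma[1:])) / N        # sum over lags -(h-1)..(h-1)

    dm = d_bar / np.sqrt(var_d)
    hln = dm * np.sqrt((N + 1 - 2 * h + h * (h - 1) / N) / N)

    p_value = 2 * stats.t.sf(abs(hln), df=N - 1)       # two-tailed, Student-t(N-1)
    return hln, p_value

rng = np.random.default_rng(1)
errors_a = rng.normal(0, 1.0, size=200)   # errors of forecast A
errors_b = rng.normal(0, 1.2, size=200)   # errors of forecast B (slightly worse)
stat, p = dm_test(errors_a, errors_b)
print(f"HLN statistic: {stat:.3f}, p-value: {p:.3f}")
```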

It should be noted that the DM test can be applied to more general error functions, including those for multivariate forecasts [9]. Further reading in this area is provided in Appendix D.2.

13.6 Other Pitfalls and Challenges

There are other common difficulties that can be encountered when developing short term forecasts. A few of them are outlined here.

13.6.1 Collinearity and Confounding Variables

Correlation is an extremely useful property when producing forecast models, but it can also create complications. As is commonly known, “correlation doesn’t equal causation”. However, this does not mean a non-causal variable cannot be used to produce accurate forecasts. It does mean care should be taken when assuming and interpreting the relationship between these variables. A variable with a spurious correlation may initially be useful for the model, but since the relationship is not causal it may not generalise well, and can create inaccuracies in your model at a later point.

Fig. 13.12
figure 12

Causal diagram showing the relationship between temperature, demand and wind chill

The challenges extend beyond the relationship between the independent and dependent variables. When two of the independent/predictor variables are highly (linearly) correlated they are called collinear, and this can create difficulties in interpreting their effect on the dependent variable. It can also increase the chances of overfitting. Adding a collinear variable may not add much accuracy to the forecast since much of its effect has already been captured by the correlated variable already included in the model. As an example, consider ambient temperature and wind chill. Wind chill is a combination of temperature and wind speed and estimates how the temperature feels to a human. Thus these two variables are often strongly correlated with each other. The causal relationships between them are shown in Fig. 13.12.

Collinearity can also affect the sensitivity of the model coefficients to whether those variables are in the model or not [10]. For a linear regression it therefore affects the precision of the associated coefficients. There are tests to help identify collinearity effects, one of which is the variance inflation factor (VIF) (see, for example, the book by Ruppert and Matteson [10]). This measures how much the variance of an estimated coefficient for an independent variable is inflated by its correlation with the other independent variables. In addition, correlation functions can be used to identify which variables are highly correlated.
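A minimal sketch of computing VIFs by regressing each predictor on the others and using \(\mathrm{VIF}_j = 1/(1-R_j^2)\). The synthetic data deliberately constructs a wind-chill-like variable from temperature and wind speed so that its VIF is large; all numbers are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 500
temperature = rng.normal(10, 5, n)
wind_speed = rng.normal(5, 2, n)
# "Wind chill" constructed mostly from temperature and wind speed -> collinearity
wind_chill = temperature - 0.7 * wind_speed + rng.normal(0, 0.5, n)

X = np.column_stack([temperature, wind_speed, wind_chill])
names = ["temperature", "wind_speed", "wind_chill"]

for j, name in enumerate(names):
    others = np.delete(X, j, axis=1)
    # R^2 from regressing predictor j on the remaining predictors
    r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
    print(f"VIF({name}) = {1 / (1 - r2):.1f}")
```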

Ideally, it should be checked whether the collinear variables improve the forecast model accuracy when they are included versus when one, or both, are removed. If interpretation and stability are not important then the most accurate forecast should be retained. However, note that if both are retained and there is a concept drift (Sect. 13.6.3) in only one of the variables, this can reduce the accuracy on new unseen data. Regular model retraining can reduce the likelihood of this.

There are a few approaches which may be able to reduce the collinearity effects. The most obvious is to prune the variables amongst those which have the strongest correlation and/or highest VIF, possibly keeping those which have the strongest causal link with the dependent variable (causality is not a particularly easy thing to prove, of course). Model selection techniques, such as the information criteria presented in Sect. 8.2.2, are another possibility for finding an accurate model with the fewest required variables.

Of course this collinearity can extend across multiple variables, in which case it is referred to as multicollinearity.

A related concept is that of confounding variables. If a variable causally affects both the dependent and an independent variable it is a confounder for them both. In Fig. 13.12 temperature is a confounding variable since it causes changes in demand but is also a major determinant of the wind chill values. Not taking into account confounders can make it difficult to understand the causal relationship between variables, and may mean an effect or correlation is overestimated. As an example, consider a model of the net demand linked to the wind chill values. Suppose wind chill has no effect on demand, but temperature does. Since temperature is also related to wind chill (Sect. 6.2.6), it could be perceived that a change in wind chill is correlated with the change in demand, whereas in fact this is spurious since it is the temperature which is driving the related changes. Regardless, there may still be some influence of the wind chill on the demand which can improve the accuracy of the forecast model, but this may not be clear because it is confounded by the temperature variable. If you are interested in the effect of wind chill you need to isolate it by controlling for the temperature. One way to do this is to learn the relationship for fixed (or binned) values of the temperature.Footnote 4 This is not easy as it reduces the overall data from which to train the model, since you need to learn on subsamples of the overall data. For linear relationships, linear regression can also be used to identify confounding variables and the size of the effect. This is done by comparing the effect (i.e. the associated coefficient) in two regressions of the dependent variable on the independent variable, where the suspected confounding variable is included in only one of the models, as sketched below. Large changes in the coefficient can suggest a confounding variable, with the magnitude of this change indicating the size of the effect.Footnote 5
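Continuing the wind-chill example, the sketch below fits two regressions of demand on wind chill, with and without temperature included. In this synthetic setup demand depends only on temperature, so the wind-chill coefficient shrinks towards zero once the confounder is controlled for; all relationships are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 500
temperature = rng.normal(10, 5, n)
wind_speed = rng.normal(5, 2, n)
wind_chill = temperature - 0.7 * wind_speed + rng.normal(0, 0.5, n)

# Demand driven by temperature only (heating demand falls as it warms up)
demand = 5.0 - 0.2 * temperature + rng.normal(0, 0.5, n)

# Model 1: demand ~ wind chill only (confounded by temperature)
m1 = LinearRegression().fit(wind_chill.reshape(-1, 1), demand)
# Model 2: demand ~ wind chill + temperature (confounder controlled for)
m2 = LinearRegression().fit(np.column_stack([wind_chill, temperature]), demand)

print("Wind-chill coefficient without temperature:", round(m1.coef_[0], 3))
print("Wind-chill coefficient with temperature:   ", round(m2.coef_[0], 3))
```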

If the confounding variables are not included in the models then there could be a loss of accuracy. However, including them means that interpreting the results can be difficult, as shown above. Again, this may not be a major concern if the focus is on performance rather than interpretation. However, spurious correlations may decrease the generalisability of a model. Cross validation methods can help with model and input selection (Sect. 8.1.3) and thus ensure that the model is still generalisable.

Ideally all confounding variables, and all those which have causal effects on the dependent variable, would be included in a model, but this may be very difficult. One issue caused by not identifying confounding variables is that the assumed relationship between the independent and dependent variables may be tenuous, and thus sudden changes in the independent variable may mean the model no longer holds or becomes very inaccurate. For example, suppose a model has been created for the total demand at a local substation connected to five unmonitored households. If it is found that the monitored demand behaviour of another household (but on a different network) is very similar, it could be included in the model to help increase the accuracy. However if, unknown to the modellers, the occupants of one of the five households move out, the correlation with the demand no longer holds and the model may produce very inaccurate results.

The case study in Sect. 14.2 gives a real example where it is possible that seasonality effects are a confounding variable for the temperature.

13.6.2 Special Days and Events

The forecast models described throughout this book typically assume that the future load will look similar to the historical load. To do this, data analysis and visualisation methods are deployed to look for common patterns in the data. Therefore a forecast model which uses these features will, on average, be very accurate for typical days. Unfortunately it may not do well for less common days or events.

More often than not these untypical days will be holidays such as New Year’s Day, bank holidays, etc., i.e. special days where businesses may be closed, houses may be occupied in different ways, and behaviours may differ from usual. These special days will vary depending on the country and culture: for instance, predominantly Christian countries will include Christmas, predominantly Muslim countries will include Ramadan, and North American countries will have Thanksgiving, etc. On days like these, workplaces may be closed, and there may be other behavioural changes (e.g. extra home cooking).

Another cause of less typical energy usage days may be special events. This could be a World Cup final, the Olympics, or a Royal Wedding, which causes many more households to watch TV in their own homes, or perhaps go to public areas to watch communally. Other special days can be created by highly atypical weather events, for example a heat wave, a storm or an unusual cold snap.Footnote 6

Each of these types of events can cause energy demand to be very different from what regularly occurs in individual homes, or over an aggregate of homes. If they are not taken into account then the forecast models are likely to be inaccurate on these special days.

Ideally, these special days should be treated within the model. This can be through dummy variables (Sect. 6.2.6), or a separate model could be created for them. The problem with these approaches is that these special events are often very rare. In other words, there is not usually much data with which to train the models to accurately forecast special day demand. Take for example Christmas Day, which occurs on 25th December each year. There is only one such day per year to train on and, further to this, each year Christmas will fall on a different day of the week. It could be a Monday one year, and a Tuesday another, etc. This means that each Christmas Day may be slightly different depending on which day of the week it falls on. This further reduces the relevance of the data for training.

There are not many robust ways to alleviate these problems. With so few examples it is difficult to learn the regular patterns, which means heuristics or educated guesses must be made about what data can be used to inform the models. A common way to do this is to use other data as proxies for the demand on the special day. For example, it is reasonable to assume holiday dates (like bank holidays) are very similar to weekend dates, with occupants acting similarly on both. In these models the special days can simply be assigned the dummy variables for the weekend (or, say, Sunday), as in the sketch below.
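A small pandas sketch of this proxy idea: a weekend dummy is built from the calendar and a hypothetical list of bank holidays is simply assigned the same dummy value as a weekend, so the model treats those days like Sundays.

```python
import pandas as pd

# Hourly index over a year and a placeholder list of bank holidays
index = pd.date_range("2024-01-01", "2024-12-31 23:00", freq="h")
bank_holidays = pd.to_datetime(["2024-01-01", "2024-03-29", "2024-12-25", "2024-12-26"])

features = pd.DataFrame(index=index)
features["weekend"] = (index.dayofweek >= 5).astype(int)

# Treat bank holidays as if they were weekend days (simple proxy)
is_holiday = index.normalize().isin(bank_holidays)
features.loc[is_holiday, "weekend"] = 1

print(features.loc["2024-12-25"].head())
```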

13.6.3 Concept Drift

Concept drift in machine learning refers to the change in the statistical properties of a dataset over time. Often these changes can occur without the modeller being aware, as they can be subtle and/or take place very quickly. This creates problems for forecasts since the distribution of the random variable is changing, which may mean the distribution in the test set is not the same as that in the training set.

There are many reasons that energy demand may change over time, including:

  • Churn of customers: If the occupants of many households or businesses connected to an LV network change, then the demand behaviours will likely change.

  • Uptake of new technologies: Disruptive technologies such as heat pumps, electric vehicles and photovoltaics can have major effects on the demand and change the aggregated loads on the distribution networks.

  • Atypical Events: Special events (See Sect. 13.6.2) can create unusual demand. This includes unusual weather and special sporting events (Olympics, World Cup etc.).

  • Improving Energy Efficiency: New technologies can mean reductions in demand. For example, new LED lights and TVs are much more efficient than their predecessors.

  • Change in Energy Prices: Volatile energy prices can mean households and businesses have to change their behaviours or appliances to better reduce their costs.

Many of the concept drifts in electricity time series will be permanent. Efficient versions of appliances may change the demand behaviour when they are first installed, but the future demand will then be stable until another major change. This suggests a way to reduce the effects of permanent changes in time series: adaptive or tracking methods. These methods use a limited window of the most recent observations to ensure that models can react to changes in the demand. When a change in the distribution occurs, the model will eventually train to this new distribution. Of course, it may require a certain period of time to pass until model accuracy is restored, since the training data may consist of a mix of data from before and after the distribution change.
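A minimal sketch of such a windowed (adaptive) approach: at each forecast origin a simple autoregressive model is re-fitted on only the most recent window of observations, so after a step change the training data gradually consists only of post-change behaviour. The series, window length and model are placeholders.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
# Synthetic series with a step change (e.g. a new appliance) half way through
y = np.concatenate([rng.normal(1.0, 0.1, 300), rng.normal(2.0, 0.1, 300)])

window, n_lags = 100, 3
preds = []
for t in range(window, len(y)):
    recent = y[t - window:t]    # train only on the most recent observations
    X = np.column_stack([recent[i:len(recent) - n_lags + i] for i in range(n_lags)])
    target = recent[n_lags:]
    model = LinearRegression().fit(X, target)
    last_lags = y[t - n_lags:t].reshape(1, -1)
    preds.append(model.predict(last_lags)[0])   # one-step-ahead forecast of y[t]

errors = np.abs(np.array(preds) - y[window:])
print("MAE just after the change:   ", errors[200:210].mean().round(3))
print("MAE once the window adapts:  ", errors[-50:].mean().round(3))
```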

The change in demand can be detected by so-called change-point analysis, which aims to identify the time point where the data before and after has different statistical properties. A related time series property to concept drift is regime switching. This is where the time series has a finite number of states or “regimes”, which the time series may switch between. To generate forecasts in these cases means generating a forecast model for each regime. This has the challenge of predicting which regime the time series will be in, as well as the demand within that regime. An example in energy demand may be a commercial building which has different uses across the year; for example, university halls of residence may be empty over the summer whereas the rest of the year they are occupied by students.

13.6.4 Unrealistic Modelling and Data Leakage

Another pitfall in forecast experiments is the use of unrealistic assumptions and modelling choices. Ideally, forecasts should be designed to replicate the real-world scenario for which they are being generated. However, in many cases, to simplify the analysis, or due to a lack of resources, a forecast model used in a desktop study may be non-replicable or impractical to apply in reality.

The most common mistake is the use of data which would not actually be available in practice. For demand forecasting this is usually weather data. Weather is a strong driver of electricity demand due to heating or cooling needs, and therefore weather forecasts can be useful inputs for accurate load forecasts. Unfortunately, weather data, in particular weather forecast data, can be difficult to source. Instead many experiments resort to using weather observation data, which of course would not be known in advance but is often easier to collect. Any forecast that uses data which would only be available after the forecast is generated is known as an ex-post forecast. Those which only use data which is available at the time the forecast is generated are known as ex-ante forecasts. Using observations in place of forecasts can be viewed as a form of data leakage. This term refers to using data in machine learning experiments which would not be available in practice. This is often the case when test data is inadvertently included in the training, leading to unrealistic performance on the test set. In time series forecasting, other forms of data leakage may occur if non-time-series splits are utilised in the cross-validation, such as the shuffled split (see Sect. 8.1.3 and the sketch below).
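As a small illustration of the cross-validation point, the sketch below contrasts scikit-learn's TimeSeriesSplit, which always trains on the past and tests on the future, with a shuffled KFold split, which leaks future observations into the training folds; the data is a placeholder.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit, KFold

X = np.arange(20).reshape(-1, 1)   # stand-in for time-ordered features

print("TimeSeriesSplit (no leakage):")
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    print("  train up to", train_idx.max(), "-> test", test_idx.min(), "to", test_idx.max())

print("Shuffled KFold (future data leaks into training):")
for train_idx, test_idx in KFold(n_splits=3, shuffle=True, random_state=0).split(X):
    print("  test indices", sorted(test_idx)[:5], "... scattered amongst training indices")
```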

Another practical consideration when developing load forecasts is understanding what load data would be available for use in the forecasts. For example, household smart meter data is often only transmitted to the data platforms/exchanges at most once a day. Any application waiting for such data to train a model, for example a storage control scheduling algorithm (Sect. 15.1), may not have timely access to it due to the training and transmission delays. In addition to the data not being available due to collection restrictions, the speed of an algorithm to train or generate a forecast may be insufficient to be able to use the new data in time for the required application.

There may also be external constraints that limit what data is available. For example, if you are utilising weather forecasts, most numerical weather prediction centres only release updated weather forecasts at particular times of the day, e.g. midnight, 6AM, noon, and 6PM, since it is often computationally infeasible to run global predictions more frequently. Therefore the weather inputs used for a rolling load forecast may not be the most recent, depending on the time of day. In summary, communication limitations, regulatory restrictions, commercial considerations, and computational expenses can all have implications for what data should or could be used when training or developing a forecast. The main point is that as many realistic assumptions as possible should be embedded in a forecast trial or experiment to ensure it mimics the real-world conditions of the application it will be applied in.

13.6.5 Forecast Feedback

In some cases the forecast can affect the demand itself, which will mean that the forecast itself becomes incorrect. This feedback is especially likely if the forecast is linked to electricity costs. For example, suppose that a commercial business sees a load forecast which identifies that a large peak will occur during a period when electricity prices are high. The business may then decide to change its behaviour, e.g. turning off non-essential equipment, or shifting its demand. This will reduce the predicted peak and hence reduce its costs. This means that the forecast is now technically incorrect, a condition created by the forecast itself. This is completely valid; after all, forecasts are there to support many applications (Chap. 15), in many cases with the objective of reducing the demand and costs for consumers.

The feedback effect not only reduces the accuracy of the forecast but also changes the training data itself. Where the forecast has influenced the observed demand, this data may not describe the normal behaviour or features of the time series. As another example, consider the battery storage device example in Sect. 15.1. The forecast is used to help plan the charging and discharging of the battery on a feeder; however, this also changes the demand on the feeder. What was originally going to occur is now unknown unless the charging of the battery is also recorded. These adjustments effectively change the regime of the data during those periods (see Sect. 13.6.3) and the affected data should therefore not be used when training for the ‘typical’ behaviour of the time series.

The question is how to deal with such feedback for future training. In the case of a storage device, if the changes are known and recorded then the uninfluenced demand can be recovered. However, in cases where the original underlying demand is not known (such as with demand side response, see Sect. 15.2), one solution is to not train the normal demand model on those periods where there have been interventions. This is only practical if there is sufficient data where no interventions have occurred, otherwise the final models may be inaccurate. Alternatively, it may be possible to learn the intervention effect itself by incorporating it into the model, possibly even as a dummy variable (Sect. 6.2.6); that way more of the training data can be utilised.

13.7 Questions

For the questions which require real demand data, try using some of the data listed in Appendix D.4. Preferably choose data with at least a year of hourly or half hourly observations. In all cases using this data, split it into training, validation and testing sets in a \(60, 20, 20\%\) split (Sect. 8.1.3).

  1. 1.

    Generate some day ahead point forecasts for a demand time series, preferably including a few benchmark models, a few statistical methods and a few machine learning models. On the test set compare the RMSE errors of the individual models. Now combine all the models and calculate the RMSE: does it improve compared to any or all of the individual models? Try different combinations of the models, for example in one case combine the two best, or the two worst. Try mixing some of the best and worst. Do some combinations give a smaller RMSE than the others? Take the two best forecasts and check whether they are significantly different using the Diebold-Mariano test.

  2. 2.

    Take a collection of at least 100 smart meter demand time series. Partition them into ten sets of ten and aggregate each set. Now create a forecast for each aggregate over the test set. Next aggregate all ten sets (so they now form an aggregate of 100 smart meters) and, using the same forecast model (trained on the aggregated demand series of 100 smart meters), generate a forecast over the test period. Calculate the RMSE for this forecast. Now aggregate the forecasts of the ten sets and calculate the RMSE there too. Compare these two values: which is more accurate? Which would you expect to be more accurate?

  3. 3.

    Can you think of some other forms of data leakage?

  4. 4.

    What else may cause concept drift? Can you think of changes in your home which would cause reasonably large changes in your usual demand behaviour? Can you think of other buildings which may have some dramatic changes in their behaviour?

  5. 5.

    Think of some special days where the energy demand in your house may be different. What are some other reasons for changes in typical behaviour? Which of these days may be similar to each other?