The previous sections have described in detail the steps required to develop a time series forecast including: how to generate useful explanatory variables; how to train the model; how to avoid overfitting; and how to evaluate the accuracy of the model. What has not been investigated is the models themselves. This chapter will be the first of three chapters looking at a wide range of models and some of their properties.

This chapter and the next will look at point forecast methods, and then in Chap. 11, probabilistic forecasts will be examined which provide models for handling highly uncertain data, something which is often required for low voltage feeders and substations (Chap. 2).

Of the point forecasting chapters, this chapter looks at traditional statistical methods, whereas Chap. 10 will look at what are sometimes referred to as machine learning models. Each type of model has advantages and disadvantages, some of which have already been described in Sect. 5.3, and further criteria will be described in Sect. 12.2. In short, statistical models are typically more transparent and easier to interpret and understand. That makes them not only useful for investigating some of the core features of the data, but also good benchmark candidates.

The majority of the models presented in this chapter are easily implemented through packages in open source programming languages for scientific computing, such as Python and R, as well as in popular proprietary software such as MATLAB. However, they can also be easily derived and trained from scratch (since they are often linear functions and hence can be easily trained using, e.g., linear least squares, see Sect. 8.2), which may be preferable when you want to extend the models or make bespoke adjustments.

This chapter starts by considering some simple models and then introduces progressively more complicated ones (in terms of more parameters and computational expense) starting with exponential smoothing (Sect. 9.2), multiple linear regression models (Sect. 9.3), ARIMA and SARIMA models (Sects. 9.4 and 9.5 respectively), and then finally generalised additive models (Sect. 9.6).

Before diving into the models it is worth highlighting the context for these forecasts: short term load forecasts (STLF). A common way to categorise load forecasts is in terms of the forecast horizon. Short term forecasts estimate the demand between a day and a week ahead (sometimes two weeks). In contrast, those from one week up to a year ahead are referred to as medium term load forecasts, and those beyond a year, long term load forecasts. Note these definitions can vary slightly depending on the context but are typically in the ranges specified. Models which are good for STLF may not be suitable for medium and long term load forecasts, and vice versa. Hence the models presented here are specifically chosen for their use in shorter term load forecasts, which usually rely heavily on the most recently observed information.

9.1 Benchmark Methods

This section begins by considering basic and commonly used benchmark methods. As discussed in Sect. 8.1.1, developing appropriate benchmarks is essential for any well-designed forecast experiment. As usual throughout this chapter, a time series of the form \(L_1, L_2, \ldots \) will be considered and the aim is to produce estimates \(\hat{L}_{N+k}\) for the time steps \(N+k\), where \(N \in \mathbb {N}\) is the forecast origin and \(k \in \mathbb {N}\) the forecast horizon (see Sect. 5.2 for further details on these terms).

One of the simplest benchmarks is the persistence model, which can be described as

  • Persistence:       \(\hat{L}_{N+1} = L_{N}\)

This model assumes that the demand of the next time step is simply the current load. Additional future time steps can be estimated by simply repeating this value.

This can be an effective model if the data has a single strong autocorrelation at lag one (see Sect. 6.2.2). However, for most applications with more variable demand, the method will provide very little accuracy. Instead, energy demand often has strong daily, weekly and annual seasonal components (Sect. 5.1). Hence, effective adjustments can be made to the simple persistence model to produce a much more accurate forecast model which presumes that the behaviour at the current step is the same as that exactly one seasonal cycle away. These are called seasonal persistence models and have the following form:

  • Seasonal persistence:             \(\hat{L}_{N+k} = L_{N+k-s_1}\)

where \(\hat{L}_{N+k}\) is the k-step ahead prediction, N is the forecast origin, while \(s_1\) denotes the seasonal period (note it is assumed the last seasonal point is observed, i.e. occurs before the forecast origin, so \(N+k-s_1 \le N\)). As an example, for half hourly data, seasonal persistence models for daily, weekly and yearly seasonality can be produced by setting \(s_1\) to 48, 336 or \(52 \times 336\) respectively. These seasonal persistence forecasts are very easy to implement and require no training data whatsoever. An example of day ahead persistence and seasonal persistence models is shown in Fig. 9.1, where the seasonal persistence model uses daily seasonality (i.e. yesterday is the same as today).

Fig. 9.1

The plot shows a simple persistence forecast (grey flat line) and a daily seasonal persistence (red line) for the fourth day of the half hourly data. The observations are shown as a black line. The data in this example has daily seasonality and hence the seasonal persistence model picks up important features of the data
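To make the benchmark definitions concrete, the following minimal Python sketch implements the persistence and seasonal persistence forecasts for half hourly data. The synthetic series, the variable names and the daily period of 48 half hours are illustrative assumptions rather than part of the original example.

```python
import numpy as np

def persistence_forecast(load, horizon):
    """Repeat the last observed value for every step of the horizon."""
    return np.full(horizon, load[-1])

def seasonal_persistence_forecast(load, horizon, season=48):
    """Use the value observed one seasonal cycle ago (48 half hours = 1 day).
    Assumes horizon <= season and at least one full season of history."""
    load = np.asarray(load, dtype=float)
    return np.array([load[len(load) + k - season] for k in range(horizon)])

# Example: day-ahead (48 half-hourly steps) forecasts from a synthetic series
rng = np.random.default_rng(0)
history = 10 + np.sin(2 * np.pi * np.arange(7 * 48) / 48) + 0.1 * rng.standard_normal(7 * 48)
print(persistence_forecast(history, 48)[:3])
print(seasonal_persistence_forecast(history, 48, season=48)[:3])
```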

For seasonal data, an extension (and usually an improvement) to these methods is to include several historical observations at the same period and take a simple seasonal moving average. In other words:

  • Seasonal Moving Average (SMA):             \(\hat{L}_{N+k} = \frac{1}{p}\sum _{i=1}^{p}L_{N+k-i s_1}\).

Fig. 9.2

Plot shows a daily seasonal moving average forecast (blue line) over three historic days used to generate a forecast for the fourth day of the half hourly data. The observations are shown as a black line. The data in this example has daily seasonality but is more volatile on the third day. Hence in this case the daily seasonal persistence would not produce an accurate forecast for the fourth day, in contrast to the simple moving average which smooths out the errors over the three days

As with the seasonal persistence, often a weekly period (\(s_1=336\) for half hourly data) is used. The weekly simple averages often perform much better than the equivalent seasonal persistence models since they smooth out the random week-to-week aberrations around the expected value and therefore better replicate the typical weekly behaviour. An example is illustrated in Fig. 9.2 for daily seasonal data. Day 3 had unusually large demand, hence a daily seasonal persistence model would not be as accurate as it was in the previous example in Fig. 9.1. Instead the simple average over Days 1–3 reduces the effect of the unusual day 3 and hence provides a better estimate of day 4. For the simple average method, slightly more training data is required than for the persistence models, and in addition a validation period is required to choose the most appropriate value of the hyperparameter p (Sect. 8.1.3). However, the model is very quick to calculate and in practice typically only requires setting \(p=4\) or 5 weeks to optimise the model and offer significant improvements over the persistence models.
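The seasonal moving average above can also be implemented in a few lines. The sketch below assumes a half hourly series with a weekly period (\(s_1=336\)) and an illustrative choice of \(p=4\); in practice p would be tuned on a validation set as described above.

```python
import numpy as np

def seasonal_moving_average(load, horizon, season=336, p=4):
    """Average the values at the same point in each of the previous p
    seasonal cycles (season=336 half hours for weekly seasonality).
    Assumes horizon <= season and at least p full seasons of history."""
    load = np.asarray(load, dtype=float)
    N = len(load)
    forecast = np.empty(horizon)
    for k in range(horizon):
        lags = [load[N + k - i * season] for i in range(1, p + 1)]
        forecast[k] = np.mean(lags)
    return forecast
```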

9.2 Exponential Smoothing

Despite their simplicity, the benchmarks introduced in Sect. 9.1, especially the seasonal moving average, can be surprisingly accurate. However, one of their disadvantages is that each historical week they utilise is given equal weighting, whereas it would be expected that older data is less relevant to the current forecast period. In other words, older data should contribute less than more recent data to the final forecast. This is particularly relevant for load data as it is strongly driven by seasonalities and trends. For example, it would be expected that data from a few months ago, say in summertime, is less relevant to a winter forecast period.

Exponential smoothing methods take weighted averages of past observations, where the weights decay for older observations. To illustrate this, consider the simplest form of exponential smoothing, which creates a smoothed 1-step ahead output \(\hat{L}_{N+1}\), updated at each step using the latest observation, \(L_N\), in the following way

$$\begin{aligned} \hat{L}_{N+1}= \alpha L_{N} +(1-\alpha ) \hat{L}_{N} = \hat{L}_{N} +\alpha (L_N-\hat{L}_{N}), \end{aligned}$$
(9.1)

where \(\alpha \in (0,1)\) is a smoothing constant to be optimised in the validation period (see Sect. 8.1.3). The estimate, \(\hat{L}_{N+1}\), of the next observation \(L_{N+1}\) is a weighted average of the current estimate \(\hat{L}_{N}\) and the most recent observation \(L_N\). Similarly, the previous estimate is also a weighted average of the previous observation \(L_{N-1}\) and the estimate \(\hat{L}_{N-1}\) before that, and so on. In other words Eq. (9.1) can be written in the expanded form

$$\begin{aligned} \hat{L}_{N+1}= \alpha \left( L_{N} +(1-\alpha )L_{N-1}+(1-\alpha )^2 L_{N-2} + \ldots + (1-\alpha )^{N-1}L_1 \right) +(1-\alpha )^{N}\hat{L}_{1}, \end{aligned}$$
(9.2)

a geometric sum. Since \(\alpha \), and hence \(1-\alpha \), lies in (0, 1), older observations are given less weight and thus contribute less to the final estimate. In the special case of \(\alpha =1\) the forecast is simply the last observation and is equivalent to the simple persistence model given in Sect. 9.1. This method is a 1-step ahead forecast and hence if multiple steps are required the forecasts are fed back into the model in place of the unobserved values. The optimal parameter can be found by minimising the sum of squared errors for the 1-step ahead forecasts (over the validation period), but in addition to \(\alpha \) an initial estimate must also be produced. This can be generated as a simple average over the earliest values. The sum of squared errors is a nonlinear function of the smoothing constant, due to its nested application, and therefore has to be optimised using numerical methods rather than being solved directly.

Fig. 9.3

Simple example of exponential smoothing for different values of the smoothing parameter

To illustrate the exponential smoothing method consider a basic example given in Fig. 9.3. Two exponential models are applied using two different values of \(\alpha \) to produce a 1-step ahead forecast. The model that uses \(\alpha =0.7\) is less smooth and is driven mainly by the most recent points. The model that uses \(\alpha =0.2\) is the smoothest and takes a weighted average which has more contributions from older historical values. In this case less smoothing (higher \(\alpha \) value) is more useful for prediction since the data has a decreasing trend and hence older points are much less relevant to the recent data.
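The recursion in Eq. (9.1) is straightforward to implement directly. The sketch below is one possible minimal version: the initialisation by a short average and the small placeholder series are assumptions, and in practice \(\alpha\) would be selected by minimising the sum of squared one-step ahead errors over a proper validation period.

```python
import numpy as np

def simple_exp_smoothing(load, alpha):
    """One-step ahead simple exponential smoothing (Eq. 9.1).
    The initial estimate is a short average of the earliest
    observations (one common, but arbitrary, choice)."""
    load = np.asarray(load, dtype=float)
    forecasts = np.empty(len(load))
    forecasts[0] = load[:5].mean()
    for t in range(len(load) - 1):
        forecasts[t + 1] = forecasts[t] + alpha * (load[t] - forecasts[t])
    return forecasts

def sse(load, alpha):
    """Sum of squared one-step ahead errors, used to tune alpha."""
    return float(np.sum((np.asarray(load) - simple_exp_smoothing(load, alpha)) ** 2))

# crude grid search for alpha over an illustrative (placeholder) series
val = np.array([10.2, 10.5, 10.1, 9.8, 9.9, 10.4, 10.6, 10.3])
alphas = np.linspace(0.05, 0.95, 19)
best_alpha = alphas[np.argmin([sse(val, a) for a in alphas])]
```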

In this basic form, exponential smoothing is relatively limited since it ignores trends and seasonalities, which are important components of demand. A more advanced exponential smoothing algorithm that does take seasonalities into account is the Holt-Winters-Taylor (HWT) exponential smoothing method, which models two levels of seasonality. This method estimates the load \(\hat{L}_{N+1}\) at time step \(N+1\) using the following set of equations:

$$\begin{aligned} \hat{L}_{N+1}&= l_{N} + d_{N+1-s_1}+ w_{N+1-s_2} + \phi e_{N} \nonumber \\ e_{N+1}&= L_{N+1} - (l_{N} + d_{N+1-s_1}+w_{N+1-s_2}) \nonumber \\ l_{N+1}&= l_{N} + \lambda e_{N+1} \nonumber \\ d_{N+1}&= d_{N+1-s_1} + \delta e_{N+1} \nonumber \\ w_{N+1}&= w_{N+1-s_2} + \omega e_{N+1}, \end{aligned}$$
(9.3)

where the parameters \(\phi , \lambda , \delta , \omega \) must be trained on the historical data. The load is broken down into three core components: a level \(l_t\), which corresponds to the first order correlation, and two seasonal terms, \(d_t\) and \(w_t\), which in load forecasting often correspond to intraday and intraweek seasonality respectively (although of course different periods can be used depending on the data). The intraday seasonality period \(s_1\) and intraweek period \(s_2\) are the number of time steps covering one day or week, and for hourly data would be 24 and 168 respectively. Notice that each of the level and seasonal terms has its own simple exponential smoothing equation as in Eq. (9.1). The error terms \(\epsilon _{N+1} = L_{N+1} - \hat{L}_{N+1}\) are assumed to be normally distributed with zero mean. At each time step \(N+1\), a forecast, \(\hat{L}_{N+1}\), is made using the current values of the level and seasonal terms \(l_N, d_{N+1-s_1}, w_{N+1-s_2}\) as well as the first order error term \(e_N\). Once the new observation arrives, the error and the other terms can then be updated using their respective smoothing equations as described in Eq. (9.3). Due to the recursive nature of the algorithm the older values contribute less to the updates, and the amount of contribution is determined by the size of the respective parameters, \(\phi , \lambda , \delta \), and \(\omega \).

Training the model parameters can be achieved by numerical optimisation of the one-step ahead sum of squared errors (i.e. Eq. (8.5)) over the training data (Sect. 8.2) as before. However, note that there must be an initial estimate for the level and seasonal components before the parameters can be trained. There are a few ways to do this, but a simple method is to take an average over the oldest observations to ensure that there is initial data to train the algorithm. An example of the double seasonal exponential smoothing model will be given in the case study in Sect. 14.2.
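To make the recursion in Eq. (9.3) concrete, the following sketch applies the HWT updates to hourly data with \(s_1=24\) and \(s_2=168\). The crude initialisation of the level and seasonal indices, and the parameter values shown, are illustrative assumptions only; the four smoothing parameters would normally be found by numerically minimising the one-step ahead squared errors as described above.

```python
import numpy as np

def hwt_one_step(load, s1=24, s2=168, phi=0.9, lam=0.1, delta=0.2, omega=0.2):
    """Holt-Winters-Taylor double seasonal smoothing updates (Eq. 9.3).
    Returns the one-step ahead forecasts; the parameter values above are
    purely illustrative and would normally be optimised numerically."""
    load = np.asarray(load, dtype=float)
    n = len(load)
    level = load[:s2].mean()                 # crude initial level
    d = load[:s1] - load[:s1].mean()         # crude initial intraday indices
    w = np.zeros(s2)                         # crude initial intraweek indices
    e_prev = 0.0
    forecasts = np.full(n, np.nan)
    for t in range(s2, n - 1):
        i, j = (t + 1) % s1, (t + 1) % s2    # positions of d_{t+1-s1}, w_{t+1-s2}
        forecasts[t + 1] = level + d[i] + w[j] + phi * e_prev
        e = load[t + 1] - (level + d[i] + w[j])
        level += lam * e
        d[i] += delta * e
        w[j] += omega * e
        e_prev = e
    return forecasts
```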

9.3 Multiple Linear Regression

Standard regression is a statistical process for estimating the relationship between single or multiple variables. One of the simplest and most common of such models is multiple linear regression, as it is easy to explain, fast to compute and very versatile. Suppose there are \(n\ge 1\) input variables \(X_{1,t}, X_{2, t}, \ldots , X_{n, t}\) which are assumed to be linearly related to the load \(L_t\) at time t; in other words, the following forecast model is constructed

$$\begin{aligned} \hat{L}_{N+1} = \sum _{k=1}^{n} \beta _k X_{k,N+1}. \end{aligned}$$
(9.4)

The coefficients (or regression parameters), \(\beta _k\), describe the explanatory power of each of the variables in modelling the load \(L_t\) (although this does depend on each variable having similar magnitude). The independent variables are assumed to be uncorrelated with each other,Footnote 1 as are the 1-step ahead errors \(\epsilon _{N+1} = L_{N+1} -\hat{L}_{N+1}\), which are often assumed to be distributed as a Gaussian (see Eq. (3.5) Chap. 3) with mean zero and constant variance (the constant variance means the errors are homoskedastic—see Sect. 11.6.1). These assumptions simplify the training of the coefficients and the modelling of the prediction intervals. However, as always it is a good idea to check these assumptions by plotting the residuals as well as their ACF (see Sect. 7.5 for further details).

If there is only one explanatory variable then the model is simply called linear regression, whereas with more than one it is called multiple linear regression. Multiple linear regression can often be written in a more succinct vectorised form

$$\begin{aligned} \hat{L}_{N+1} = \boldsymbol{\beta }^T \textbf{X}_{N+1}, \end{aligned}$$
(9.5)

where \(\textbf{X}_{t} = (X_{1,t}, X_{2, t}, \ldots , X_{n, t})^T\) and \(\boldsymbol{\beta }= (\beta _1, \ldots , \beta _n)^T\) are the vectors of independent variables and regression parameters respectively.

Although Eqs. (9.4) and (9.5) only show independent variables \(X_{k,t}\) at the same time step t as the dependent variable, \(\hat{L}_t\), the equations can of course include lagged time points and autoregressive variables.

Fig. 9.4

Linear regression line \(Y=(X-2)^2+1.2 = X^2 - 4X + 5.2\) (black) and noisy, Gaussian observations (red crosses) around the line

As an example, consider the situation in Fig. 9.4 where the best regression fit for the observations (in red) is the curve \(Y=(X-2)^2+1.2 = X^2 - 4X + 5.2\). Notice that, although the function contains a quadratic term \(X^2\), it is still clearly linear in the coefficients, with independent variables \(\textbf{X} = (X^2, X, 1)^T\) and corresponding regression parameters \(\boldsymbol{\beta }= (1, -4, 5.2)^T\). Hence it is important to understand that nonlinear relationships can still be modelled within linear regression. For an example in demand forecasting, notice that the nonlinear relationship between demand and temperature in Fig. 6.7 in Sect. 6.2.2 could be modelled by a linear regression using a polynomial (if chosen with sufficient order).
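As a quick illustration of fitting a nonlinear relationship with a model that is linear in its coefficients, the following sketch regenerates a noisy version of the curve in Fig. 9.4 (with synthetic noise, so the recovered coefficients are only approximate) and fits it by ordinary least squares on the features \((X^2, X, 1)\).

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 4, 50)
y = (x - 2) ** 2 + 1.2 + 0.3 * rng.standard_normal(x.size)   # noisy observations

# design matrix with columns (x^2, x, 1): still linear in the coefficients
X = np.column_stack([x ** 2, x, np.ones_like(x)])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)                  # least squares fit
print(beta)   # should be close to (1, -4, 5.2)
```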

Linear regression is also well suited to modelling the impact of categorical/discrete variables through the use of dummy variables (see Sect. 6.2.6). This is particularly useful in load forecasting, which often requires day-of-the-week or time-of-the-year effects. For example, different days of the week are likely to have different demand characteristics, in which case the model should include the effect of the different days. In multiple linear regression this is done by including the dummy variables \(D_j(k)\) for \(j=1, \ldots , 7\) (one for each day of the week—with Monday represented by \(j=1\) and Sunday by \(j=7\)) which indicate the day of the week, defined by

$$D_j(k) = {\left\{ \begin{array}{ll} 1, & \text {if time step } k \text { occurs on day } j \text { of the week} \\ 0, & \text {otherwise} \end{array}\right. } $$

In a linear regression model we often use only six of the dummy variables as inputs to avoid the dummy variable trap (see Sect. 6.2.6), since the effect of the remaining day can be modelled by the other six (the seventh day is modelled by simply setting the other six to zero, presuming there is at least one other term, such as a constant, which ensures that its effect can be modelled).

Another useful feature of linear regression is that we can include interaction terms. This is where we model the joint effect of two or more variables on the dependent variable. For example, it may be that temperature \(T_k\) has an effect on demand, but only for a particular hour of the day, say 2–3 pm. In this case we can include a term for the temperature variable multiplied by a dummy variable which indicates the time of day and is zero at all times except the hour 2–3 pm. In the linear regression the interaction term is often denoted as a multiplication of the two terms, e.g. \(T_k D_j(k)\) or \(T_k * D_j(k)\). The case is similar if the simultaneous effect of more than two variables is modelled. An example of interaction terms in a linear regression model will be given in the case study in Sect. 14.2.
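A sketch of how such dummy and interaction features might be constructed with pandas is given below. The column names, the synthetic temperature series and the choice of the 2–3 pm band are illustrative assumptions; the resulting design matrix can then be fed into the linear model of Eq. (9.4).

```python
import numpy as np
import pandas as pd

# illustrative half-hourly index with a synthetic temperature reading
idx = pd.date_range("2021-01-01", periods=4 * 336, freq="30min")
df = pd.DataFrame({"temperature": 10 + 5 * np.sin(2 * np.pi * idx.hour / 24)}, index=idx)

# day-of-week dummies, dropping one column to avoid the dummy variable trap
dummies = pd.get_dummies(idx.dayofweek, prefix="dow", drop_first=True).set_index(idx)

# interaction: temperature only acts between 14:00 and 15:00 (2-3 pm)
afternoon = (idx.hour == 14).astype(float)
df["temp_x_afternoon"] = df["temperature"] * afternoon

design = pd.concat([df, dummies], axis=1)   # columns for the regression model
```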

Given the assumptions on the errors, the coefficients of a linear regression model are often found by minimising the least squares estimate (see Sect. 8.2) and are therefore quite easy, and quick, to train. Recall that, since the errors are assumed to be Gaussian with constant variance, the least squares estimate of the model is also the maximum likelihood estimate, as shown in Sect. 8.2. This is particularly convenient since the loglikelihood (see Eq. (8.8)), and hence the Bayesian information criterion (BIC) and Akaike information criterion (AIC), are easy to calculate. Recall from Sect. 8.2.2 that identifying the models with the smallest values of AIC or BIC is one way to choose the best models on the training data; these criteria trade off accuracy against model complexity, helping to limit the potential for overtraining the models.

As described in Sect. 8.2.2, linear models can be easily adapted to regularisation frameworks such as LASSO and ridge regression. Much like the AIC and BIC, these techniques penalise the number and/or size of the coefficients by including a penalty term in the normal least squares regression. In particular, LASSO can be used as a model selection technique as it tends to set the coefficients of irrelevant (or less influential) explanatory variables to zero. Finally, of course, as with all the methods, the models can also be selected through cross-validation by finding the model which minimises the error on the validation set. This can be quite inefficient if there are a lot of independent variables being considered.
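As a brief, hedged illustration of LASSO-based variable selection, the sketch below uses scikit-learn's LassoCV on synthetic data in which only two of twenty candidate inputs are genuinely informative; the coefficients of the remaining inputs are typically shrunk to (near) zero.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(2)
n, p = 500, 20
X = rng.standard_normal((n, p))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + 0.1 * rng.standard_normal(n)  # only 2 relevant inputs

lasso = LassoCV(cv=5).fit(X, y)                      # regularisation strength chosen by CV
print(np.flatnonzero(np.abs(lasso.coef_) > 1e-6))    # indices of retained variables
```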

Given the final trained model, the simple linear structure means the coefficients provide a useful way to interpret the effect of each variable (assuming they are independent). Essentially they tell you how much the expected value of the dependent variable will change given a unit change in the independent variable, assuming all the other independent variables are fixed. The interpretation becomes a little more complex when there are interaction terms as the effect size will now depend on the value of the other variable(s). In these cases inserting a range of reasonable values for these other variables may help to show the range of effects.

9.4 ARIMA and ARIMAX Methods

The autoregressive moving average (ARMA) technique is a traditional linear time series model which has been extensively used in time series forecasting. An ARMA(p, q) model for a time series is a linear model described by

$$\begin{aligned} \hat{L}_N= \ C+ \sum _{i=1}^{p} {\psi _i}{L_{N-i}}+ \sum _{j=1}^{q} {\varphi _j}{\epsilon _{N-j}}, \end{aligned}$$
(9.6)

where \(\epsilon _{t}\) is a time series of error terms and C is a constant. ARMA models require the time series \(L_t\) to be stationary (see Sect. 5.1); however, this may not always be the case. When the series is not stationary, differencing can be applied to the time series until the resulting series is stationary. If d differences are applied this can be written

$$\begin{aligned} L^{(d)}_{N}= L^{(d-1)}_{N} - L^{(d-1)}_{N-1}, \end{aligned}$$
(9.7)

where the differences are applied iteratively d times. When differencing is used, the ARMA(p, q) model is referred to as an ARIMA(p, d, q) model (autoregressive integrated moving average, with the integrated part referring to the differencing) and can be written

$$\begin{aligned} \hat{L}^{(d)}_N= \ C+ \sum _{i=1}^{p} {\psi _i}{L^{(d)}_{N-i}}+ \sum _{j=1}^{q} {\varphi _j}{\epsilon _{N-j}}. \end{aligned}$$
(9.8)

A convenient and concise way to write ARIMA models is in terms of the backshift operator B (also known as the lag operator), where B is defined on elements of a time series by \(BL_{t} = L_{t-1}\). Applying the operator repeatedly gives \(B^kL_t = L_{t-k}\). Thus the ARMA(p, q) model can be written as follows

$$\begin{aligned} \left( 1-\sum _{i=1}^{p}\psi _{i}B^{i}\right) L_{N}=\left( 1+\sum _{j=1}^{q}\varphi _{j}B^{j}\right) \epsilon _{N} + C, \end{aligned}$$
(9.9)

and an ARIMA(p, d, q) model can be written as

$$\begin{aligned} \left( 1-\sum _{i=1}^{p}\psi _{i}B^{i}\right) (1-B)^d L_{N}=\left( 1+\sum _{j=1}^{q}\varphi _{j}B^{j}\right) \epsilon _{N} + C, \end{aligned}$$
(9.10)

where \((1-B)^d L_{N}\) is the \(d\textrm{th}\) order difference.

ARIMA models are quite versatile, being able to estimate a wide range of time series. They consist of three main components: the difference d, an autoregressive (AR(p)) component, \(\sum _{i=1}^{p} {\psi _i}{L^{(d)}_{N-i}}\), of order p, and a moving average, MA(q), term, \(\sum _{j=1}^{q} {\varphi _j}{\epsilon _{N-j}}\), of historical white noise error terms of order q. As with multiple linear regression, the error terms are generally assumed to be Gaussian distributed (although other distributions can be used), with mean zero and uncorrelated with each other.

Fig. 9.5

Example of autocorrelation (top) and partial autocorrelation (bottom) for a simple AR(4) model

Autoregressive models, AR(p), and moving average models, MA(q), are special cases of ARMA models (namely ARMA(p, 0) and ARMA(0, q) models respectively) and are worth considering in a bit more detail before looking at the full ARMA model. Autoregressive models are ARMA processes with \({\varphi _j}=0\) for all j, and hence are simple models in which the p past values of the time series influence the current value. An AR(p) model can be written

$$\begin{aligned} \hat{L}_N= \ C+ \sum _{i=1}^{p} {\psi _i}{L_{N-i}}. \end{aligned}$$
(9.11)

Recall from Sect. 3.5 that the partial autocorrelation is a measure of autocorrelation between the time series and its lagged values but with the influence from the in-between lags removed. This means the PACF is a natural way to identify the order, p, of an AR process since the PACF should be zero for lags \(k > p\). In practice, the order can be detected by considering the sample PACF plot from the available time series and identifying when the lagged values are effectively zero, i.e. are consistently within the 95% confidence bounds which are often plotted with the PACF. Further, the ACF should decay exponentially to zero for an AR(p) process. An example for a simple AR(4) model is shown in Fig. 9.5. Notice that in the PACF there are no correlations outside the confidence bounds beyond lag 4, as expected.

In contrast, moving average models are influenced by past values of the error terms, so large past deviations can have an influence on the current time series values. One of the useful properties of a pure MA(q) process is that the autocorrelation function should be zero from lag \(q+1\) onwards. So the ACF plot can be used to identify an MA time series and its order. It should be noted that, although the ACF and PACF can be used to identify AR and MA models and their orders, in practice the sample versions of these functions are used, applied to real observed data, and hence the results may deviate from the more clear-cut theoretical solutions. In other words, some autocorrelations may exceed the confidence bounds but these may be spurious and simply occur due to random chance.
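These identification ideas are easy to explore numerically. The sketch below simulates an AR(4) process with statsmodels (the coefficient values are arbitrary illustrative choices) and plots the sample ACF and PACF; the PACF should cut off after lag 4 while the ACF decays more gradually, echoing Fig. 9.5.

```python
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.arima_process import ArmaProcess
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# AR(4) process: ArmaProcess takes the AR polynomial coefficients
# (1, -psi_1, ..., -psi_p), i.e. with the signs flipped
ar = np.array([1, -0.4, -0.2, -0.1, -0.2])
ma = np.array([1])
series = ArmaProcess(ar, ma).generate_sample(nsample=1000)

fig, axes = plt.subplots(2, 1)
plot_acf(series, lags=20, ax=axes[0])    # should decay gradually
plot_pacf(series, lags=20, ax=axes[1])   # should cut off after lag 4
plt.show()
```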

Now consider an ARIMA model, where the optimal orders p, q and d must be found in order to train the coefficients \(\psi _i\), \(\varphi _j\) of the final model. Typically this is done by comparing the AIC values (see Sect. 8.2.2) for a range of different choices of p, q and d. It is impractical to compare all possible values and hence typically a search is only performed around a good approximation to the orders. A common method for finding the best orders of an ARIMA model for historical time series data is the Box-Jenkins method, which utilises the ACF and PACF to identify the autoregressive and moving average orders as discussed above. The process typically consists of the following steps:

  1.

    Check if the time series is stationary. If it is not, perform differencing until the resulting series is stationary. Stationarity can be checked in many ways. In addition to a time series plot, another indication of a non-stationary time series is a slowly decaying autocorrelation function as a function of lag (see Chap. 3). However, there are also stationarity tests, as outlined in Appendix A.

  2.

    Identify the orders of the autoregressive (AR) and moving average (MA) terms. This can be estimated by examining the autocorrelation function (ACF) and partial autocorrelation function (PACF) plots (see Chap. 3 and Sect. 6.2.4). In particular, if the model has an AR component of order p then the PACF should be effectively zero from lag \(p+1\) and above. Similarly, for an MA model of order q, the ACF should be effectively zero from lag \(q+1\) and higher. In practice these orders can be found by looking at the respective plots and considering whether the values lie outside the \(95\%\) confidence interval (which is usually included on the plot, see Sect. 6.2.4).

  3.

    Using the ACF and PACF as an approximation for the correct orders, check the AIC (or BIC) values for a selection of ARIMA models with different p, d, q values (around the approximate values). The final orders are those that give the smallest AIC (BIC) values.

It should be emphasised that the ACF and PACF do not often give a definitive answer on the correct orders, and hence in practice they are used to approximate the correct orders which are then tested in step 3 using the AIC/BIC.

Fig. 9.6

Example time series, generated from the ARIMA(3, 0, 1) model \(y_t = 0.14+0.609y_{t-1}-0.5y_{t-2}+0.214y_{t-3}+0.624e_{t-1}+e_t\)

Fig. 9.7

ACF (top) and PACF (bottom) plot for the time series \(y_t = 0.14+0.609y_{t-1}-0.5y_{t-2}+0.214y_{t-3}+0.624e_{t-1}+e_t\)

The Box-Jenkins methodology is illustrated here for a specific example using an ARIMA(3, 0, 1) (or equivalently an ARMA(3, 1)) model given by \(y_t = 0.14+0.609y_{t-1}-0.5y_{t-2}+0.214y_{t-3}+0.624e_{t-1}+e_t\). The time series is shown in Fig. 9.6 and was generated using the Matlab simulate function.Footnote 2 The term \(e_t\) is the error series, distributed according to the standard normal distribution. In this case the series is stationary so no differencing is required. To check the autoregressive and moving-average orders the ACF and PACF plots are considered; these are shown in Fig. 9.7, together with the confidence bounds for the \(95\%\) significance level. The ACF (the top plot) indicates the MA order and shows that the largest correlation is at lag 1, as expected; however, there are also significant correlations (significant in terms of being clearly outside of the confidence interval) at lags 16 and 17. Notice that the ACF does not gradually decrease as a function of lag; this supports the conclusion that the time series is stationary. The PACF indicates the AR order and in this example shows significant peaks at lags 1–4, which suggests a slightly larger order than expected. In addition there are smaller peaks outside the confidence interval at larger lags as well. This analysis indicates that ACF and PACF analysis is limited in terms of giving a complete answer to the exact order. In fact, the plots have limitations as it would be expected that \(5\%\) of the autocorrelations lie outside of the confidence interval by random chance anyway. This means that the ACF and PACF must be interpreted with caution and in conjunction with the AIC.

Using the correlation analysis helps to locate the approximate area of the correct orders. In this example the ACF and PACF have suggested orders of around \(q=1\) and \(p=4\), and a test of the AIC for a variety of combinations of autoregressive and moving-average orders close to these values should be performed. Since the number of parameters for an ARIMA(p, 0, q) model is \(p+q+1\) (the one is due to the constant term), the Akaike Information Criterion (AIC) has a particularly simple form

$$\begin{aligned} AIC = 2(p+q+1) - 2\ln (L), \end{aligned}$$
(9.12)

where L is the likelihood function of the ARIMA model. The AIC is checked for all combinations of orders with \(p=1,2,3,4\) and \(q=1,2,3,4\).Footnote 3 The results for each combination of p and q are shown in Table 9.1, which shows that a minimum AIC value of 50.45 is achieved for \(p=3\) and \(q=1\), correctly identifying the ARIMA(3, 0, 1) model.

Table 9.1 Akaike Information Criterion results for different AR (p) and MA (q) values for the ARIMA example given in the text
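A simple way to reproduce this kind of AIC comparison is to loop over candidate orders and record the AIC of each fitted model. The sketch below uses statsmodels' ARIMA class; the order ranges mirror those used in the example, and `series` is a placeholder for the observed time series.

```python
import itertools
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

def aic_table(series, p_values=range(1, 5), q_values=range(1, 5), d=0):
    """Fit ARIMA(p, d, q) for each (p, q) pair and collect the AIC values."""
    results = pd.DataFrame(index=p_values, columns=q_values, dtype=float)
    for p, q in itertools.product(p_values, q_values):
        try:
            fit = ARIMA(series, order=(p, d, q)).fit()
            results.loc[p, q] = fit.aic
        except Exception:                     # some orders may fail to converge
            results.loc[p, q] = np.nan
    return results

# usage (series is a placeholder): table = aic_table(series)
# best (p, q) pair: table.stack().idxmin()
```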

It should be noted that any (invertible) MA model can be approximated by an AR model with a sufficiently large number of lags (p value). Since the coefficients of an AR model can be calculated much more quickly than those of a full ARIMA model, it can be preferable to replace an ARIMA model with an AR model (with differencing if the series is not stationary) of large enough order. This can also simplify the analysis and interpretation of the models. However, this may require a relatively large order and thus many more parameters in the AR model compared to a simple ARMA model, reducing parsimony and interpretability.

There are a number of useful extensions to ARIMA models. One of the most important for load forecasting purposes is to extend the model to include other explanatory variables. This model is then called an Autoregressive Integrated Moving Average with Explanatory Variables (ARIMAX) model. An ARIMAX(p, d, q) model includes extra external variables and is described by Eq. (9.13)

$$\begin{aligned} \begin{aligned} \hat{L}^{(d)}_N= C+{\sum _{i=0}^{h} {\mu _i}{X_{N-i}}}+ \sum _{i=1}^{p} {\psi _i}L^{(d)}_{N-i}+{ \sum _{i=1}^{q} {\varphi _i}{\epsilon _{N-i}}} \end{aligned} \end{aligned}$$
(9.13)

with the differencing in (9.7) as before. The term \({\sum \nolimits _{i=0}^{h} {\mu _i}{X_{N-i}}}\) describes the contribution of the explanatory variables. The model can be analysed by first considering an ARIMA model without exogenous inputs to isolate the orders of the equations, and then fitting the full model with the exogenous variables. Note that, when finalising the AR and MA orders, the AIC should be applied to the full ARIMAX equation.

9.5 SARIMA and SARIMAX Models

An important extension to ARIMA models is to include seasonality. Seasonal ARIMA (SARIMA) includes an extra set of hyperparameters, denoted P, D, Q, which extends the model to include autoregressive, differencing and moving average terms at a specified seasonal level. These models are written ARIMA\((p, d, q)(P,D, Q)_S\) where the S indicates the seasonality. For example, for hourly data with daily seasonality, the SARIMA model would be written ARIMA\((p, d, q)(P, D, Q)_{24}\). The ACF and PACF are interpreted differently for seasonal ARIMA models. Consider a simple case where \(d=D=q=Q=0\) but \(p=2\) and \(P=1\). This means the time series would have autoregressive lags at 1, 2, 24, 25, 26. Notice the combination of p and P terms means the intra-seasonal lags (1, 2) are applied on top of the seasonal lag 24. An example of an ARIMA\((2, 0, 0)(1, 0, 0)_{10}\) is shown in Fig. 9.8 together with the partial autocorrelation function. Notice the significant spikes in the PACF at the periodic intervals 10 and 20, as well as at 11, 12, 21, and 22.

Fig. 9.8

Example of a ARIMA\((2,0,0)(1, 0, 0)_{10}\) series (top), and the corresponding PACF

The backshift operator is particularly useful for representing SARIMA models. For example, an ARIMA\((p, d, q)(P, D, Q)_{24}\), for hourly seasonal data, can be represented as (note that no constant is included here for clarity)

$$\begin{aligned} \left( 1-\sum _{i=1}^{p}\psi _{i}B^{i}\right) \left( 1-\sum _{j=1}^{P}\zeta _{j}B^{24j}\right) (1-B)^d (1-B^{24})^D L_{N}= \nonumber \\ \left( 1+\sum _{i=1}^{q}\varphi _{i}B^{i}\right) \left( 1+\sum _{j=1}^{Q}\theta _{j}B^{24j}\right) \epsilon _{N}, \end{aligned}$$
(9.14)

where \(\psi _{i}\) are the coefficients for the nonseasonal AR components, \(\zeta _{j}\) are the coefficients for the seasonal AR components, \(\varphi _{i}\) are the coefficients for the nonseasonal MA components, and \(\theta _{j}\) are the coefficients for the seasonal MA components. Note that \((1-B^{24})\) represents a seasonal difference, i.e. \((1-B^{24})L_N = L_N - L_{N-24}\). A seasonal difference of \(D=1\) is often sufficient.
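In practice, SARIMA (and SARIMAX) models are usually fitted with existing packages rather than from scratch. The following sketch uses the statsmodels SARIMAX class for an hourly series with daily seasonality; `load_series` is a placeholder for the data and the orders are illustrative. Passing weather or other regressors via the `exog` argument gives the SARIMAX variant.

```python
from statsmodels.tsa.statespace.sarimax import SARIMAX

# ARIMA(2, 0, 0)(1, 0, 0)_24 for an hourly series `load_series` (placeholder);
# exogenous regressors (e.g. temperature) could be supplied via `exog`
model = SARIMAX(load_series,
                order=(2, 0, 0),
                seasonal_order=(1, 0, 0, 24))
fit = model.fit(disp=False)
print(fit.summary())
forecast = fit.forecast(steps=24)     # day-ahead forecast for hourly data
```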

For more details on ARIMA and SARIMA models check out [2] as well as other literature listed in Appendix D.

9.6 Generalised Additive Models

The linear models specified in Sect. 9.3 have various limitations. The two strongest and most common assumptions are that the errors follow a Gaussian distribution and that the model is a simple linear combination of various input variables.

Generalised linear models (GLMs) extend simple multiple linear models by including a link function, which allows for more diverse types of relationships. Using the notation of Sect. 9.3, a dependent variable \(L_t\) at time t follows a GLM if, for \(n\ge 1\) input variables \(X_{1,t}, X_{2, t}, \ldots , X_{n, t}\),

$$\begin{aligned} g(\mathbb {E}(\hat{L}_{N+1})) = \sum _{k=1}^{n} \beta _k X_{k,N+1}, \end{aligned}$$
(9.15)

for some (possibly nonlinear) link function g(.), and such that the response variables are from a probability distribution from the exponential family (for example Gaussian, binomial or Gamma distributions—see Sect. 3.1). In other words, for a GLM, a transformation (via g) of the expected value of the dependent variable is a linear model. Notice that, as with the linear model, all the linear coefficients \(\beta _k\) must be estimated but, in addition, the link function and a probability distribution model for the errors must also be chosen. When the link function is simply the identity (\(g(x) =x\)), and the dependent variables are assumed to be Gaussian, then Eq. (9.15) reverts to the simple multiple linear regression model introduced in Sect. 9.3. The choice of link function and distribution depends on the problem being considered. For example, if a dependent variable is non-negative then a log link function could be valid.

In this work general GLMs are not investigated. Instead the focus is on a very specific, and powerful, form of GLMs called Generalised Additive Models (GAMs) which have been very successful in load forecasting.Footnote 4 A GAM has the general form of

$$\begin{aligned} g(\mathbb {E}(\hat{L}_{N+1})) = \sum _{k=1}^{n} f_k(X_{k,N+1}), \end{aligned}$$
(9.16)

for some (possibly nonlinear) smooth functions \(f_k\).

GAMs have several advantages over GLMs. Firstly, the functions \(f_k\) allow the modelling of a much more diverse set of, possibly nonlinear, relationships, whereas GLMs are restricted to the form \(f_k(X_{k, N+1}) = \beta _k X_{k, N+1}\). In addition, these functions are often modelled nonparametrically, whereas GLMs often assume parametric transforms and distributions (GAMs can also utilise common parametric forms as well, e.g. log functions, or polynomials for each \(f_k\)). Note that GAMs still use a link function g which can be used to transform the dependent variable into a more suitable form for training.

A nonparametric approach for each of the functions (\(f_k\)) in the additive model allows the algorithm to learn the relationship between each input variable \(X_{k, N+1}\) and the dependent variable from the observed data. A common way to do this is to model each function using basis functions (see Sect. 6.2.5). Hence each function \(f_k\) is modelled as

$$\begin{aligned} f_k(X_{k, N+1}) = \sum _{i=1}^m \alpha _{k, i} \phi _{k, i}(X_{k, N+1}), \end{aligned}$$
(9.17)

for basis functions \(\phi _{k, i}(X)\). Notice that this form transforms the GAM (9.16) into a GLM since the sum of the additive functions is now a sum of functions which are linear in the basis coefficients.

Fig. 9.9

Example of linear spline (top) and cubic spline (bottom). The squared markers are the knots which the polynomials interpolate

For GAMs, it is common to choose splines for these basis functions. A spline is a piece-wise continuous function which is composed of simpler polynomial functions. One of the simplest examples of a spline is a piecewise linear combination. Examples of a linear and a cubic spline are shown in Fig. 9.9. Note that since a spline is continuous, the end of one polynomial must join onto the start of the next polynomial. The knots specify where the polynomials join to each other. The cubic version is regressed on the observations (red points) between the knots to determine the other two coefficients in each cubic polynomial (two of the coefficients are already found by the interpolation constraints).

In more precise terms, consider the one dimensional case where the aim is to approximate a function \(f: [a,b] \longrightarrow \mathbb {R}\) defined on an interval \([a,b] \subset \mathbb {R}\). For m knots at \(a = z_1< z_2< \cdots< z_{m-1} < z_m =b\) a spline is fitted to some data by a polynomial \(s_i(z)\) on each subinterval \([z_i, z_{i+1}]\). Further, \(s_i(z_{i+1}) = s_{i+1}(z_{i+1})\) since the spline should be continuous at the knots.

Fig. 9.10

Example of a smooth cubic spline interpolated through the same points as in Fig. 9.9

Other constraints can be applied to the spline to either make it easier to train or to satisfy other criteria. One of the most common requirements for a GAM is to ensure that the spline has a particular level of smoothness. As can be seen the cubic interpolation in Fig. 9.9 is smooth between the knots but not across the knots themselves. Constraining the cubic spline to be smooth whilst interpolating across the knots means all coefficients can be determined uniquely. Another way of saying the spline is smooth is to say that the derivative (up to a sufficient order) is continuous at the knot points. An example of a cubic spline which is smooth across the knots is shown in Fig. 9.10.

Note that the aim in forecasting is to regress on the data, and therefore it is not necessary (or desirable) to strictly interpolate through the observations. However, the principle is still the same and the final spline should be continuous throughout, including at the knots.Footnote 5 This is achieved by regressing the basis version of the relationship on the observations (See Eq. (9.17) above).

Certain basis functions, such as B-splines, have very desirable properties such as providing smoothness at the knots. Further, although the number and type of the basis functions should be sufficiently flexible to fit the data, without any additional constraints or regularisation (Sect. 8.2.4) large numbers of knots and high polynomial degrees will increase the chance of overfitting to the noise. In addition, this will mean the polynomials will be very "wiggly". To prevent this, one approach is to add an extra term which penalises the lack of smoothness in the final solution. Recall this is much like the LASSO (Sect. 8.2.4) method and other regularisation techniques used to prevent overfitting.

A trivial example is where the link function is the identity. Since the GAM is linear in the basis functions, a least squares fit (see Sect. 8.2.4) to N observed dependent values \(\textbf{L} =(L_1, \ldots , L_N)^T\) can be considered. In other words, the aim is to minimise

$$\begin{aligned} \sum _{l=1}^N\left( L_l - \sum _{k=1}^n \sum _{i=1}^m \alpha _{k, i} \phi _{k, i}(X_{k, l}) \right) ^2 \end{aligned}$$
(9.18)

by training the parameters \(\alpha _{k, i}\) for \(k=1, \ldots , n\) and \(i=1, \ldots , m\). For a large number of basis functions this model is likely to overfit the data. To prevent this a penalty can be applied, i.e.

$$\begin{aligned} \left( \sum _{l=1}^N\left( L_l - \sum _{k=1}^n \sum _{i=1}^m \alpha _{k, i} \phi _{k, i}(X_{k, l}) \right) ^2\right) + K(f_1, \ldots , f_n). \end{aligned}$$
(9.19)

The function \(K(f_1, \ldots , f_n)\) is a penalty based on the individual functions \(f_k\). In order to penalise deviations from smoothness the following penalty is commonly used

$$\begin{aligned} K(f_1, \ldots , f_n)= \sum _{k=1}^n\lambda _{k}\int f_{k}^{\prime \prime }(x_k)^{2}dx_k, \end{aligned}$$
(9.20)

where the size of the penalty for each variable is controlled by the smoothing parameter, \(\lambda _k\). The minimisation of the integrated squared second derivative of each function reduces the wiggliness of the function, i.e. encourages more smoothness, with the strength controlled by the value of each \(\lambda _k\). These smoothing parameters are, as usual, often found by cross-validation (see Sect. 8.1.3) or by optimising an information criterion (Sect. 8.2.2).

Due to the basis function representation in (9.17) it can be shown that the penalty takes a particularly convenient quadratic form

$$\begin{aligned} \int f_{k}^{\prime \prime }(x_k)^{2}dx_k = \boldsymbol{\alpha }_k ^T \textbf{S}_k \boldsymbol{\alpha }_k, \end{aligned}$$
(9.21)

where \(\boldsymbol{\alpha }_k = (\alpha _{k,1}, \ldots , \alpha _{k, m})^T\) and \(\textbf{S}_k \in \mathbb {R}^{m \times m}\) is a matrix formed from derivatives of the basis functions evaluated at the input values for \(X_{k, l}\).
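The penalised least squares problem in Eq. (9.19) can be illustrated with a small self-contained sketch. Note this is only a rough stand-in: it uses a truncated power cubic basis and replaces the integrated squared second derivative penalty of Eq. (9.20) with a second-order difference penalty on the basis coefficients, a common and convenient approximation, so the penalty still takes the quadratic form of Eq. (9.21).

```python
import numpy as np

def cubic_basis(x, knots):
    """Truncated power cubic basis: [1, x, x^2, x^3, (x-k1)_+^3, ...]."""
    cols = [np.ones_like(x), x, x ** 2, x ** 3]
    cols += [np.clip(x - k, 0, None) ** 3 for k in knots]
    return np.column_stack(cols)

def fit_penalised_spline(x, y, knots, lam):
    """Minimise ||y - Phi a||^2 + lam * a^T S a, with S built from
    second-order differences of the coefficients (a stand-in for the
    integrated squared second derivative penalty of Eq. (9.20))."""
    Phi = cubic_basis(x, knots)
    m = Phi.shape[1]
    D = np.diff(np.eye(m), n=2, axis=0)       # second-difference matrix
    S = D.T @ D
    a = np.linalg.solve(Phi.T @ Phi + lam * S, Phi.T @ y)
    return a, Phi

# rough usage: larger lam gives a smoother (less wiggly) fit
rng = np.random.default_rng(3)
x = np.linspace(0, 1, 200)
y = np.sin(4 * np.pi * x) + 0.3 * rng.standard_normal(x.size)
a, Phi = fit_penalised_spline(x, y, knots=np.linspace(0.1, 0.9, 15), lam=1.0)
y_hat = Phi @ a
```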

As in multiple linear regression models, GLMs and GAMs can be used to model the interaction of two or more features, for example

$$\begin{aligned} g(\mathbb {E}(\hat{L}_{N+1})) = f_1(X_{1,N+1})+ f_2(X_{2,N+1})+ f_3(X_{1,N+1}, X_{3,N+1}) \end{aligned}$$
(9.22)

In this case the first two functions \(f_1(X_{1,N+1}), f_2(X_{2,N+1})\) model a single variable each, but the third function models the effect of the interaction of \(X_{1,N+1}, X_{3,N+1}\). In these cases multi-dimensional versions of spline functions can be used.

Fig. 9.11

Reprinted from [4] under CC 4.0

Example of partial contributions for a single weekday term (left) and the interaction between hour-of-day and outside temperature.

The additive nature of the GAM model makes it interpretable since the contributions of individual features and interactions can be analysed and visualised, even if complex nonlinear functions are used. Figure 9.11a and b show exemplary visualisations of the contribution of individual terms to the final prediction. Figure 9.11a shows the weekday (\(W_k\)) contribution to the demand, indicating that for this specific model, the load is much lower on weekends and is highest on Thursdays. Figure 9.11b shows the combined effect of the interaction of the hour of the day (\(H_k\)) and outside temperature (\(T_k^{out}\)): for example, the influence is lowest overnight and for cold temperatures, and highest around noon for high temperatures. Plots of small subsets (typically one or two) of the full input variables are called partial dependence plots and allow us to examine, and better interpret, the overall effects of the different components.

There are a whole host of different approaches and parameters to choose from, and many GAM programming packages, such as gam or mgcv in R and pygam in Python,Footnote 6 support a selection of splines, smoothing parameter selection methods, and link functions. Often these packages will have their own default settings, but in many cases these can be tweaked to ensure a more accurate fit and better performance. In particular, if it is known that the errors are not Gaussian, or that a particular independent variable only has a linear relationship to the dependent variable, then this can be specified in the implementation. Other parameters or data assumptions should also be checked, but if you are uncertain then several values can be compared via cross-validation methods. Since regularisation is employed within most packages it is better to specify more degrees of freedom for the splines than too few. As usual, residual checks (Sect. 7.5) can be used to evaluate the final models and identify incorrect assumptions or areas of improvement.
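As a hedged example of how such a package might be used, the sketch below fits a GAM with pygam to synthetic data with hour-of-day, temperature and day-of-week inputs (all invented for illustration), including a tensor interaction term and a partial dependence calculation along the lines of Fig. 9.11. The exact API details may differ between package versions, so treat this as a sketch rather than a definitive recipe.

```python
import numpy as np
from pygam import LinearGAM, s, f, te

# synthetic, illustrative data: hour of day, temperature, day of week
rng = np.random.default_rng(4)
n = 2000
hour = rng.integers(0, 24, n)
temp = 10 + 8 * rng.standard_normal(n)
dow = rng.integers(0, 7, n)
y = (5 + 2 * np.sin(2 * np.pi * hour / 24)     # synthetic "load"
     - 0.1 * temp + 0.5 * (dow >= 5) + 0.3 * rng.standard_normal(n))
X = np.column_stack([hour, temp, dow])

# s() = smooth spline term, f() = factor (categorical) term,
# te() = tensor interaction term between two inputs
gam = LinearGAM(s(0) + s(1) + f(2) + te(0, 1)).gridsearch(X, y)

# partial dependence of the temperature term (cf. Fig. 9.11)
XX = gam.generate_X_grid(term=1)
pdep = gam.partial_dependence(term=1, X=XX)
```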

Note that there may be additional constraints applied to the basis/spline functions to better model the features in the demand data. In particular, since there is often periodicity in many of the independent variables (e.g. hour of the day or week), basis functions can be chosen to reflect this, e.g. periodic B-splines, which are available in some of the aforementioned packages.

The above is a basic introduction to GAMs and a more detailed description for a very complicated area is beyond the scope of this book. Some further reading is included in Appendix D.2.

9.7 Questions

For the questions which require using real demand data, try using some of the data as listed in Appendix D.4. Preferably choose data with at least a year of hourly or half hourly data. In all the cases using this data, split it into training, validation and testing with a 3 : 1 : 1 ratio (Sect. 8.1.3).

  1.

    Select a demand time series. Analyse the seasonalities (see Sect. 6.2). Generate some simple benchmark forecasts for the test set, including the persistence forecast, and seasonal persistence forecasts, one for each seasonality you found. Calculate the RMSE errors. Which one is lower? How does this compare with the seasonalities you observed? Compare these results to the ACF and PACF plots for the time series.

  2.

    Continuing the experiment from the previous question, generate seasonal moving averages using the identified seasonalities. Using a validation set (Sect. 8.1.3) identify the optimal number of seasonal terms, p, to include in the average. If there are multiple seasonalities, which one has the smallest errors overall? How does the RMSE error on a test set for the optimal seasonal average forecasts compare to the persistence forecasts in the previous question?

  3.

    Generate simple 1-step ahead exponential smoothing forecasts (Sect. 9.2) for a load forecast time series (preferably one which has double seasonal patterns, usually daily and weekly). Manually select different values of the smoothing parameter, \(\alpha \). Plot the RMSE errors against the smoothing parameters. Do a grid search to find the optimal smoothing parameter (Sect. 8.2.3). How does the optimal forecast compare to a simple persistence forecast? Now consider the Holt-Winters-Taylor forecast and perform a grid search for the four parameters \(\phi , \lambda , \delta , \omega \).

  4.

    Investigate a LASSO fit for a linear model. Set the coefficients of a model with a few sine terms, e.g. \(\sum _{k=0}^N \alpha _k \sin {k x}\), for N about 5, and \(x\in [0, 4\pi ]\). Sample 20 points from this data (and add a small amount of Gaussian noise). Now fit a multiple linear equation of the form \(\sum _{k=0}^{50} \gamma _k \sin {k x}\) using least squares regression to find the coefficients \(\gamma _k\). Now plot the trained model on 20 new \(x \in [0, 4\pi ]\) values. Is it a good fit? Now try to minimise the LASSO function using different values of the regularisation parameter \(\lambda \) (see Sect. 8.2.4). How does the fit change as you change the parameter? How many of the coefficients \(\gamma _k\) are zero (or very small)? Use inbuilt functions to do the LASSO fit such as sklearnFootnote 7 in Python, or glmnetFootnote 8 in R.

  5.

    Show, for the basis representation for GAMs, that the second order penalty term (9.20) takes the form \(\boldsymbol{\alpha }_k ^T \textbf{S}_k \boldsymbol{\alpha }_k\).

  6.

    Try to generate a linear model that fits a demand profile. Consider what features to use; if time of day is important, consider using dummy variables. If weather data is available, check if there is a relationship with the demand (see Chap. 14). In the case study in Sect. 14.2 a linear model will be generated for modelling the low voltage demand. Come back to this question once you've reached that part of the book and see what similarities there are. What have you done differently? What would you like to change in the model? Fit the linear model using standard packages in Python and R such as sklearnFootnote 9 and lmFootnote 10 respectively. Now using the same features implement a GAM. Again these forecasts can be trained using standard Python and R packages such as pygamFootnote 11 and mgcvFootnote 12 respectively. These packages often have similar syntax to the linear models. Now compare the forecasts and the errors. For the GAM look at the partial dependency plots. What is the relationship for each variable chosen?