This chapter presents the main definitions and concepts for time series forecasting. It begins by introducing time series before leading into the general form and definitions of a time series forecast. The following sections lay the foundations for many of the tools, models and concepts in the later chapters. A basic understanding of statistical concepts is assumed; Chapter 3 contains a crash course in some of the important elements of statistics and probability and will be referred to throughout.

5.1 Time Series: Basic Definitions and Properties

Time series data will be the core object of study for this book. Time series data are simply a sequence of data points, measured at discrete time points, that are ordered in terms of an increasing time index, i.e. chronologically. Typically, the points are spaced equally in time and the majority of time series analysis and methods will assume this is the case. Monitoring equipment used for recording demand, for example smart meters, is designed to collect data at regular intervals, usually half hourly, so this is not an unrealistic assumption and it also simplifies the analysis.Footnote 1

Throughout this book it is assumed that the time series is sampled at uniform (regularly spaced) time steps. A time series will often be denoted by a sequence of letters \(X_1, X_2, X_3, \ldots , X_N\) where the subscript denotes the time step, with a larger index indicating a chronologically later point. Alternatively, the time series can be written as \(X_k\) with time steps \(k \in K =\{1, 2, \ldots , N \}\). If the series continues forever into the future then \(N=\infty \) and \(K= \mathbb {N}=\{1, 2, \ldots \}\), the set of natural numbers.

The values a time series can take are very diverse and can be discrete values, real numbers, sets of values, or even letters. In the case of this book, since the data are typically energy demand, \(X_k\) is a single real-valued random variable (a more detailed explanation of random variables can be found in Sect. 3.1) denoting either power (which will have units watts, W, or kilowatts, kW), or energy (which will have units watt-hours, Wh, or kilowatt-hours, kWh). If the values at each time step consist of only a single variable the series is said to be univariate. However, if the series consists of more than one variable per time step, for example \(X_k = (L_k, T_k)\) at each time step, where \(L_k\) is the load and \(T_k\) is the temperature, then the time series is called multivariate. The special case \((L_k, T_k)\), of only two variables, is referred to as bivariate.

An important feature of a time series is whether it is stationary or not. A time series of random variables \(X_t\) is stationary if the joint distribution over any fixed segment of the data, \(X_k, X_{k+1}, \ldots , X_{k+M}\) (for some positive integer M), is the same whatever the temporal shift in the data, i.e. the same as the joint distribution of \(X_{k+m}, X_{k+1+m}, \ldots , X_{k+M+m}\) whatever the choice of \(m \in \mathbb {Z}\) (see Sect. 3.3 for the definition of joint distribution). In particular it means the expected value and the variance at each time step are fixed.Footnote 2 Time series that are not stationary are called, unsurprisingly, non-stationary. Examples of basic stationary and non-stationary time series are shown in Fig. 5.1. Plot (a) is a stationary time series, with each point coming from the same distribution with fixed mean and variance. Plot (b) is non-stationary, with values coming from a distribution whose mean and variance increase as time increases. Finally, plot (c) shows a time series with a fixed variance but with a seasonal mean. Stationarity is an important property for many time series forecasting models, for example the ARIMA models which will be introduced in Sect. 9.4. Stationary series are also easier to model since they have fixed properties in time. It is not trivial to prove that a time series is stationary. Plotting the time series is a typical first check for stationarity, and there are also statistical tests, which are briefly discussed in Appendix A.
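For readers who want a quick computational check, the following is a minimal sketch of one such statistical test, the augmented Dickey-Fuller test, applied via statsmodels to two synthetic series; the series, seed and variable names are illustrative assumptions, not data from this book.

```python
# A minimal sketch of a stationarity check with the augmented
# Dickey-Fuller test from statsmodels; the synthetic series are
# illustrative assumptions only.
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(42)
stationary = rng.normal(0.0, 1.0, 500)           # fixed mean and variance
trending = stationary + 0.05 * np.arange(500)    # mean grows over time

for name, series in [("stationary", stationary), ("trending", trending)]:
    stat, p_value = adfuller(series)[:2]
    # A small p-value rejects the unit-root null, i.e. supports stationarity.
    print(f"{name}: ADF statistic = {stat:.2f}, p-value = {p_value:.3f}")
```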

Two important features that occur often in non-stationary data are trends and seasonality. Trend in a time series is the general macroscopic (i.e. low frequency) change in the data, with the most common being a linear trend, where there is a gradual, linear growth in the time series. Figure 5.1b is an example with a positive linear trend. In energy-based applications, an increasing trend in energy demand could be due to, for example, the gradual uptake of less energy efficient technologies, or perhaps simply the uptake of more devices.

Fig. 5.1 Examples of different time series which are (a) stationary, (b) non-stationary with linear trend, (c) non-stationary with periodic behaviour

Seasonality is defined as changes in the time series that occur at fixed regular intervals, or fixed periods. Demand is often driven by human behaviour, hence there are often strong periodicities at the daily, weekly and annual levels corresponding to typical behavioural patterns. Seasonal time series are often also called periodic time series. Not all oscillations in behaviour will be of fixed period. For example, shift workers such as doctors and nurses will likely not have standard daily or weekly patterns (working different days of the week, and perhaps doing a mix of day shifts, long days or night shifts). These are often called cyclic patterns. The focus in this book will be on seasonalities with regular periods. An example of a seasonal time series is shown in Fig. 5.1c, which has a seasonal pattern that repeats after every interval of length five.
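As an illustration, the short sketch below generates synthetic series analogous to the three panels of Fig. 5.1; the particular coefficients and the period of five are assumptions chosen purely for demonstration.

```python
# Synthetic series mirroring Fig. 5.1: (a) stationary noise,
# (b) noise with a linear trend, (c) noise with a period-5 seasonal mean.
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(200)

stationary = rng.normal(0, 1, t.size)                              # (a)
trended = 0.05 * t + rng.normal(0, 1, t.size)                      # (b)
seasonal = np.sin(2 * np.pi * t / 5) + rng.normal(0, 0.3, t.size)  # (c)
```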

Finally, another important property of a time series is its autocorrelation, which describes how the time series at one point relates to the time series at lagged (i.e. earlier) points. A more detailed definition of autocorrelation is given in Sect. 3.5 and it is investigated in more detail in Sect. 6.2.4. For now only the general principle is described, since autocorrelations are often very important for producing accurate forecasts. As a simple example, take a person who gets to work regularly at 8AM every day. One day they may be late to work due to their alarm not going off or their car breaking down, in which case they may decide to work later than usual. In this case their later behaviour is correlated with their earlier behaviour. Notice that a seasonal time series with period P will have relatively high autocorrelation with itself at lags which are multiples of P. Finding correlations in the data is an important part of identifying which historical values may be important for estimating future points (see Sect. 6.2.4 for more details).
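A minimal sketch of a sample autocorrelation computation is given below, using a period-5 seasonal series as above; as noted, the autocorrelation is high at lags which are multiples of the period. The use of statsmodels' acf here is one choice among many.

```python
# Sample autocorrelation of a period-5 series: lags 5 and 10 stand out.
import numpy as np
from statsmodels.tsa.stattools import acf

rng = np.random.default_rng(0)
t = np.arange(500)
seasonal = np.sin(2 * np.pi * t / 5) + rng.normal(0, 0.3, t.size)

for lag, rho in enumerate(acf(seasonal, nlags=10)):
    print(f"lag {lag:2d}: autocorrelation = {rho:+.2f}")
```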

So far, only properties of the time series itself have been considered. However, energy usage is often influenced by external drivers. For example, heating and air-conditioning use is obviously related to how cold the occupants of a household feel. Further, the use of lighting will be related to how dark it is outside, which in turn also depends on the time of year. Hence the energy demand will strongly depend on external explanatory variables. Choosing which external variables to include in a time series model (and its corresponding forecast model) is called feature selection and will be described in more detail in Sect. 6.2.

5.2 Time Series Forecasting: Definitions

In its simplest form, a forecast for a time series is an individual, or collection of, estimates for future values using currently available information. For the purposes of this book, the aim will almost always be to accurately forecast the future electricity demand on a low voltage network or application. How the accuracy of a forecast is measured will be defined in Chap. 7.

For simplicity, the majority of the following arguments will be in terms of a univariate time series (see Sect. 5.1) but the definitions easily extend to multivariate time series as well. For the following discussion consider a real-valued, univariate time series \(L_1, L_2, \ldots ,\) defined at uniformly spaced time steps \(t_1, t_2, \ldots ,\) where the current time point is \(t_n\) and the aim is to produce a forecast for the next h time steps \(t_{n+1}, t_{n+2}, \ldots , t_{n+h}\). Given this scenario a few terms can be defined:

  • The data \(L_1, L_2, \ldots , L_n\), up to the current time \(t_n\), is often referred to as the historical data and is a core component of any forecast, especially those with regular seasonal patterns (see Sect. 5.1).

  • The current time period \(t_n\) is often called the forecast origin as it is the starting point for the forecast.

  • The value h is referred to as the forecast horizon and defines how many time steps beyond the forecast origin are to be estimated by the forecast. These forecasts are referred to as h-step ahead forecasts.

These definitions are illustrated in Fig. 5.2 which demonstrates a time series on a uniform, hourly time step grid, with a forecast origin at time step \(t_6=6\) and a forecast horizon of \(h=4\) time steps. Note that although lines have been drawn between markers (observations) for clarity, there are no observations between the time steps.

Fig. 5.2 An illustration of a 4-step ahead forecast with historical data and forecast origin labelled
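To make the notation concrete, here is a small sketch of the quantities in Fig. 5.2; the series values are made up, and the persistence "forecast" (repeating the last observation) is only a placeholder for a real model.

```python
# Historical data, forecast origin and horizon in code form.
import numpy as np

load = np.array([2.1, 2.3, 2.0, 2.4, 2.8, 2.5, 2.6, 2.9, 3.0, 3.2])  # L_1, ..., L_10
n, h = 6, 4                              # forecast origin t_6, horizon h = 4

historical = load[:n]                    # L_1, ..., L_6
forecast = np.repeat(historical[-1], h)  # hat{L}_{7|6}, ..., hat{L}_{10|6} (persistence)
actuals = load[n:n + h]                  # the observations the forecast estimates
```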

Often forecasts are written using the same lettering as the original time series but with a hat, e.g. \(\hat{L}_{n+k}\). To signify the starting (or origin) point of the forecast this can also be written as \(\hat{L}_{n+k|n}\) for a forecast which indicates both the forecast origin, \(t_n\), and the time step being estimated, \(t_{n+k}\). In this book both forms will be used and the origin and horizon should be clear from the context.

A special case of forecast is the 1-step ahead forecast, which is often used to compare the accuracy of different methods. 1-step ahead forecasts can be applied iteratively to produce h-step ahead forecasts by applying the 1-step ahead forecast h times, where each new forecast value is fed back into the model for the next time-step forecast. Unsurprisingly, these are referred to as iterative forecasts. Alternatively, the entire forecast horizon can be estimated in one go, in which case such forecasts are called direct. Both types of forecasts will be considered in this book.
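The toy sketch below illustrates the distinction; the AR(1)-style 1-step model and its coefficient are assumptions for demonstration only (in practice a direct forecast would train a separate model for each horizon step).

```python
# Iterative vs direct h-step ahead forecasts with a toy 1-step model.
import numpy as np

def one_step(last_value: float, a: float = 0.95) -> float:
    """Toy 1-step ahead model: hat{L}_{t+1} = a * L_t."""
    return a * last_value

def iterative_forecast(history: np.ndarray, h: int) -> np.ndarray:
    """Apply the 1-step model h times, feeding each forecast back in."""
    forecasts, last = [], history[-1]
    for _ in range(h):
        last = one_step(last)   # previous forecast becomes the next input
        forecasts.append(last)
    return np.array(forecasts)

def direct_forecast(history: np.ndarray, h: int) -> np.ndarray:
    """One model per horizon step k, each predicting L_{n+k} directly
    (here the toy closed form plays the role of the k-th model)."""
    return np.array([0.95 ** k * history[-1] for k in range(1, h + 1)])

history = np.array([2.5, 2.7, 3.0])
print(iterative_forecast(history, 4))
print(direct_forecast(history, 4))
```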

In many applications, including the core application of storage control presented in this book (Sect. 15.1), forecasts can be updated as new observations become available. This has the advantage of using the most recent data and thus improving the future estimates, especially those at the shortest horizons. These are called rolling forecasts. Consider an h-step ahead forecast with forecast origin \(t_n\), with estimates \(\hat{L}_{n+1|n}, \hat{L}_{n+2|n}, \ldots , \hat{L}_{n+h|n}\). When a new observation becomes available at \(t_{n+1}\) the forecast model can be retrained on the updated dataset to produce new estimates \(\hat{L}_{n+2|n+1}, \hat{L}_{n+3|n+1}, \ldots , \hat{L}_{n+h+1|n+1}\). Since more recent information is now incorporated into the model, the forecasts at \(t_{n+2}, \ldots , t_{n+h}\) should be more accurate than the previous forecast. The forecast horizon is thus a moving window of width h. An example of a rolling forecast, for the same situation presented in Fig. 5.2, is shown in Fig. 5.3. A forecast is originally made at the initial forecast origin at \(t = 6\) for the next four time steps (\(t=7, \ldots , 10\)). When a new observation is made at time \(t = 7\), a new forecast can be produced at this new forecast origin for the next four time steps (\(t=8, \ldots , 11\)). Notice that the new forecast trajectory has been updated given the new observation.
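A minimal rolling-forecast loop might look as follows; fit_and_forecast is a hypothetical stand-in for retraining whichever model is in use (here just persistence), and the data are invented.

```python
# Rolling forecast: the origin advances with each new observation.
import numpy as np

def fit_and_forecast(history: np.ndarray, h: int) -> np.ndarray:
    # Placeholder model: persistence (repeat the last observation).
    return np.repeat(history[-1], h)

load = np.array([2.1, 2.3, 2.0, 2.4, 2.8, 2.5, 2.6, 2.9])
h = 4
for origin in range(6, len(load) + 1):       # origins t_6, t_7, t_8
    history = load[:origin]                  # all data up to the current origin
    forecast = fit_and_forecast(history, h)  # hat{L}_{origin+1|origin}, ...
    print(f"origin t_{origin}: forecast = {forecast}")
```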

As suggested in Sect. 5.1, a time series is actually a function of several other factors such as weather variables, time of day, seasonalities and other, perhaps unseen, factors. The aim of the forecast is to approximate the function which ‘accurately’ describes the future behaviour of this time series. Accuracy can be a difficult term to define but is often based on error measures (these will be introduced in Chap. 7) or on how well the forecast optimises the application of interest. The forecast can be written in a functional form. The following is a general form for a 1-step ahead forecast

$$\begin{aligned} L_{n+1} = f(L_1, \ldots , L_n, Z_1, \ldots , Z_k, \boldsymbol{\beta }) +\epsilon _{n+1}, \end{aligned}$$
(5.27)

for some function f which generates the forecast \(\hat{L}_{n+1|n}\) and is dependent on the historical data \(L_1, \ldots , L_n\) and k explanatory variables \(Z_1, \ldots , Z_k\) (methods for selecting these variables will be considered in Sect. 6.2). For example, in electricity demand forecasting these explanatory variables could be weather variables or electricity prices. If one of the explanatory variables is itself a forecast, e.g. a temperature forecast, then estimates for future time steps \(t_{n+1}, \ldots , t_{n+k}\) can be included in the model (although note they are still generated prior to the current time step). It is important to note that the larger the horizon (the bigger the k), the less accurate a forecast explanatory variable will be, and hence it may be less effective as a model input. This should be tested as part of the model development. Similarly one can describe an h-step ahead forecast

$$\begin{aligned} \hat{L}_{n+h|n} = f(L_1, \ldots , L_n, Z_1, \ldots , Z_k, \boldsymbol{\beta }), \end{aligned}$$
(5.28)

for forecast origin n. Since an h-step ahead forecast can be produced by repeated application of a 1-step ahead forecast, many of these steps may include forecast values of L as inputs.
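As a concrete (and deliberately simplified) instance of Eq. (5.27), the sketch below fits a linear f with one autoregressive input and one temperature input using scikit-learn; the synthetic data and the choice of features are assumptions for illustration, with proper feature selection deferred to Sect. 6.2.

```python
# A linear instance of Eq. (5.27): L_{t+1} = f(L_t, Z_{t+1}, beta) + epsilon.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
temp = rng.normal(15, 5, 200)                    # explanatory variable Z (temperature)
load = 5 - 0.1 * temp + rng.normal(0, 0.2, 200)  # synthetic demand
load[1:] += 0.5 * load[:-1]                      # inject autoregressive dependence

X = np.column_stack([load[:-1], temp[1:]])       # inputs: L_t and Z_{t+1}
y = load[1:]                                     # target: L_{t+1}

model = LinearRegression().fit(X, y)             # beta is learned from the data
epsilon = y - model.predict(X)                   # in-sample errors
print(model.coef_, model.intercept_, epsilon.mean())
```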

Fig. 5.3 Example of a rolling forecast updated as a new observation becomes available. A new observation is made at time step 7, at which point an updated forecast is produced with a rolling window of size \(h=4\)

Every forecast model has parameters or hyperparameters (Sect. 8.2.3) which determine the response to the inputs. The parameters are represented by \(\boldsymbol{\beta }\) in Eq. (5.27), and must be appropriately trained in order to produce an accurate forecast (see Sect. 8.2 for an introduction to how to train these models). As a basic example, consider a simple linear regression \(ax + b\) (Sect. 9.3). In this case, the parameters are the coefficients of the model, \(\boldsymbol{\beta }= (a, b)\), i.e. the slope and intercept.
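For instance, the parameters \(\boldsymbol{\beta } = (a, b)\) of this simple linear regression can be trained by least squares; a tiny sketch with made-up data:

```python
# Fitting beta = (a, b) of the model a*x + b by least squares.
import numpy as np

x = np.arange(50, dtype=float)
y = 2.0 * x + 1.0 + np.random.default_rng(2).normal(0, 1, 50)

a, b = np.polyfit(x, y, deg=1)  # slope and intercept minimising squared error
print(f"a = {a:.2f}, b = {b:.2f}")
```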

There are many terms to describe the inputs, outputs and other elements of a forecast model as represented in Eq. (5.27):

  • The variables within the function f, \(L_1, \ldots , L_n\) and \(Z_1, \ldots , Z_k\), are often known as the predictor or independent variables.

  • The variable to be estimated/predicted is often called the dependent or predicted variable. For this book this almost always will be electricity demand.

  • When the independent inputs are historical versions of the dependent variable, e.g. \(L_1, \ldots , L_n\), then these are often referred to as autoregressive features.

  • \(\epsilon _{n+1} = L_{n+1} - f(L_1, \ldots , L_n, Z_1, \ldots , Z_k, \boldsymbol{\beta })\) are the errors between the actual observations and the forecast estimates. Since no forecast is ever perfect these will rarely, if ever, be zero. In time series forecasting, errors are also often called residuals, although sometimes this latter term is reserved for what is left over after fitting a model on the training set (see Sect. 8.1.3). The latter convention will typically be used throughout this book.

Given any of the models which will be introduced in Chaps. 9–11 (also assume for simplicity that the hyperparameters, Sect. 8.2.3, have already been selected), the role of the forecaster is to find the ‘best’ version of the model (i.e. the optimal choice of function f()) and this will require optimising the parameters, \(\boldsymbol{\beta }\), which define that model. As will be seen in Chap. 7, ‘best’ is often defined in terms of generalisation, which is measured by minimising the errors on a test set (Sect. 8.1.3). If the forecast is used for a specific application then an appropriate error measure must be carefully chosen in order to optimise the overall performance.

As will be shown in Chaps. 9–11, there are a wide variety of forecast models, each with their own advantages and disadvantages, which are suited to different applications. A good forecast model will have zero mean errors, because otherwise the forecast can be improved by simply shifting the current forecast model, e.g. \(\hat{f}(L_1, \ldots , L_n, Z_1, \ldots , Z_k, \boldsymbol{\beta })= f(L_1, \ldots , L_n, Z_1, \ldots , Z_k, \boldsymbol{\beta })+b\) where \(\mathbb {E}(\epsilon ) = b \ne 0\) is the mean value of the errors (see Sect. 3.1 for the definition of the mean).
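The shift argument can be seen in a few lines; the arrays below are illustrative stand-ins for real actuals and forecasts.

```python
# Removing a systematic bias b from a forecast.
import numpy as np

actuals = np.array([3.0, 3.2, 2.9, 3.1, 3.3])
forecasts = np.array([3.4, 3.5, 3.3, 3.6, 3.7])  # systematically too high

b = (actuals - forecasts).mean()     # E(epsilon) = b, here negative
corrected = forecasts + b            # shifted forecast has zero mean error
print((actuals - corrected).mean())  # ~0
```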

Fig. 5.4 Example of the different types of forecasts, including the three different types of probabilistic forecasts. The blue crosses are historical observations and the forecasts are in red starting at time step \(t=31\). Top left is the point forecast. Top right is a quantile forecast, showing the 0.1, 0.5 (median) and 0.9 quantiles. Bottom left is the density forecast and bottom right is the ensemble forecast

The above mainly describes forecasts in the context of point forecasts, which only provide a single estimate for each time step \(t_{n+1}, t_{n+2}, \ldots , t_{n+h}\) in the forecast horizon. This is usually some measure of centrality such as the mean or median. A more descriptive alternative is a probabilistic forecast, which provides multiple values for each time step and better describes the uncertainty in the future values. Methods for generating such estimates will be given in Chap. 11. Probabilistic forecasts typically take one of the following three forms:

  1. Quantile Forecast: Here several quantiles (see Sect. 3.2 for more details on quantiles) of the future values are estimated. If two quantiles are used (a high and a low) then the area between the two values is often called the prediction interval or forecast interval. The \(10\%\) and \(90\%\) quantiles are common choices. An example is shown in the top right of Fig. 5.4 (a minimal empirical sketch is given after this list).

  2. Density Forecast: For a density forecast the full continuous distribution (see Sect. 3.1) is estimated for each time step. This is illustrated in the bottom left of Fig. 5.4.

  3. Ensemble ForecastFootnote 3: The quantile and density forecasts only estimate a distribution at each time step \(t_{n+1}, t_{n+2}, \ldots , t_{n+h}\) in the forecast horizon. In reality the time steps are often interdependent, with the values at earlier time periods influencing the values at later time periods. Ensemble forecasts estimate realisations from the full joint multivariate distribution for the set of random variables \(\hat{L}_{n+1}, \hat{L}_{n+2}, \ldots , \hat{L}_{n+h}\) (see Sect. 3.3 for more details on multivariate distributions). This is illustrated in the bottom right of Fig. 5.4 for 30 ensembles.

A drawback to probabilistic forecasts is the extra computational cost and the requirement for much more training data in order to generate an accurate estimate. If there are sufficient computational resources and data, then probabilistic models provide a much more descriptive and informative estimation of the uncertainty in the future values.
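As one simple construction (among the many proper methods of Chap. 11), the sketch below builds an empirical quantile forecast by attaching error quantiles to a persistence point forecast; the data and quantile levels are assumptions.

```python
# Empirical quantile forecast: point forecast plus error quantiles.
import numpy as np

rng = np.random.default_rng(3)
steps = np.arange(300)
load = 3.0 + np.sin(2 * np.pi * steps / 48) + rng.normal(0, 0.2, steps.size)

point = load[:-1]                           # persistence point forecasts
errors = load[1:] - point                   # historical 1-step ahead errors

q10, q90 = np.quantile(errors, [0.1, 0.9])  # empirical 10% and 90% error quantiles
next_point = load[-1]                       # point forecast for the next step
print(f"80% interval: ({next_point + q10:.2f}, {next_point + q90:.2f})")
```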


5.3 Types of Forecasts

As briefly introduced in Sect. 5.2, forecasts can be classified according to whether they are iterative or direct, or as point forecasts or various forms of probabilistic forecasts. Different types and families of forecasts are desirable for different situations, applications and scenarios. Some of the most common groupings of forecasts and their features are listed below.

  • Rolling Forecast Frequency: Rolling forecasts are updated at regular time steps (it could be every time step) and produce estimates over a horizon of fixed length. For load forecasts this could be a day-ahead forecast which is updated every half hour, utilising new observations as they are recorded. Alternatively the updates may only be once a day, say at midnight. The latter is still technically a rolling forecast but is much less frequently updated. Forecasts that are more frequently updated will give much better predictions at very short time horizons as they utilise the most recent information. However, the drawback is that they require infrastructure to be in place to collect, transmit and integrate the most up-to-date information.

  • Point or Probabilistic Forecasts: A wide variety of point and probabilistic forecast models will be introduced in the following chapters. As introduced in the previous section, in contrast to a point forecast, a probabilistic forecast provides multiple values per time step to describe an estimate of the spread of the future values. Point forecasts are quicker to generate as they have fewer parameters to learn, and require less training data. Further, they are often easier to integrate into applications, e.g. storage control models (see Sect. 15.1), since it is easier to utilise a single value per time point rather than a range of values. However, point forecasts do not describe the uncertainty in demand, and hence applications involving more volatile demand may require probabilistic forecasts. A drawback of probabilistic methods is that they are much more computationally expensive to produce and require more storage. In this book, methods for creating both types of forecasts will be considered.

  • Statistical and Machine Learning Methods: Traditionally, time series forecasts have been implemented using statistical models such as ARIMA and exponential smoothing (see Sects. 9.4 and 9.2), which are easy to implement, computationally inexpensive and easy to interpret. More recently, machine learning techniques such as neural networks and random forests have become popular (see Chap. 10). Despite being more computationally expensive, they can engineer unseen features and learn complex nonlinear relationships. Statistical models can be preferable when there are clear, well understood relationships in the data, e.g. daily/weekly seasonality, or clear links to external influences such as weather. They can also be preferable when there is a relatively small amount of data, since model assumptions are used to replace learning the relationships directly from the data (although of course if the model assumptions are wrong then the model will be inaccurate). Machine learning methods generally excel for complicated data with nonlinear and possibly unclear relationships, where manual feature engineering is less feasible. They are also preferable when learning across a large number of time series or for hierarchical time series (see below). The question of which type of model is better is ongoing. The most popular time series forecasting competitions, the M-Competitions,Footnote 4 have shown that in some cases either type can be preferable. More recently, combinations of both types of models have been shown to have the best accuracy (see Sect. 13.1 for more information on model combination).

  • Hierarchical Forecasts: Often data are arranged in hierarchies. In power systems, as shown in Chap. 2, the distribution network is a hierarchy with electricity stepped down at substations as it is distributed to consumers. The demand aggregates from the individual customers up the hierarchy to the substations, all the way up to the transmission and national level. The objective of hierarchical time series forecasting is to ensure that forecasts are coherent across the hierarchy, i.e. that forecasts at one level of the hierarchy are consistent with the forecasts at the next level. Another way of saying this is that the aggregate of the forecasts should match the forecast of the aggregate (a minimal bottom-up sketch is given after this list). This will be discussed in more detail in Sect. 13.2.

  • Local versus Global Forecasts: When forecasting multiple time series there are two main approaches that can be taken. You can take a local approach, where you train a separate model on each time series, or you can take a global approach, in which you fit the same model to all time series. The global approach can be preferable when there are a lot of time series and it would be prohibitive to produce a model for each of them. This is particularly relevant when considering smart meter forecasting: if every home in a country is to have a smart meter, this is a very large number of time series, and therefore a global forecasting approach is preferable to a local approach. This is described in more detail in Sect. 13.4.

  • Peak Forecasts: The above approaches have been described in the context of forecasts for an entire period of a time series (e.g. each half hour of a day or week). In fact, in many cases only specific features are of interest. One of the most important aims of forecast models is to predict the peak of a demand time series over a period (typically a day). The advantage of peak forecasts is that only a single value needs to be estimated for each period, although the timing may also be important. However, it should be noted that there are fewer historical examples of peaks, and since they are, by definition, extreme values they may be trickier to accurately predict than baseload demand. Furthermore, for volatile demand, such as household smart meter data (see Sect. 13.3), the timing of peaks may be very irregular.
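As promised above, here is a minimal bottom-up sketch of coherence: if the aggregate forecast is defined as the sum of the child forecasts, the hierarchy is coherent by construction. The household values are invented, and reconciliation methods proper are covered in Sect. 13.2.

```python
# Bottom-up aggregation makes the hierarchy coherent by construction.
import numpy as np

household_forecasts = np.array([
    [1.2, 1.4, 1.1],  # household 1, next three time steps
    [0.8, 0.9, 1.0],  # household 2
    [2.0, 2.2, 2.1],  # household 3
])

substation_forecast = household_forecasts.sum(axis=0)
print(substation_forecast)  # matches the sum of the child forecasts exactly
```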

5.4 Notation

Some basic notation for time series and time series forecasts was introduced in Sects. 5.1 and 5.2. Here some of the most important notation used throughout this book is reiterated and expanded on in the context of load forecasting:

  • The actual monitored electricity demand will be modelled as a time series, \(L_1, L_2, \ldots \), of real numbers with \(L_{n}\) representing the demand at the nth time step \(t_n\). \(L_1\) represents the oldest data point in the data set. Unless otherwise stated the time steps are uniformly spaced, i.e. have the same time difference between one time step and the next, \(t_{n+1}-t_n = \Delta t\) for all \(n\). For load data, if not stated otherwise, we report the average load over the respective interval in kilowatts, denoted kW.

  • Forecasts are denoted as another time series, \(\hat{L}_{n}\), with a hat indicating that this is an estimate of the true demand at time step \(t_n\).

  • The notation \(\hat{L}_{N+k|N}\) will often be used to indicate that the forecast is for the time step \(N+k\) and has been generated starting from the forecast origin at time N, for a forecast horizon of length k time steps. However, this notation can be a bit cumbersome and hence it is often simply written as \(\hat{L}_{N+k}\) when the forecast origin is obvious.

  • Explanatory time series, for example temperature, will be denoted by other capital letters, e.g. \(X_t\). If there is more than one explanatory variable, for example when using multiple weather variables, then another index will be used to distinguish the different variables. For example, given M explanatory variables they can be denoted \(X_{1, t}, X_{2, t}, \ldots , X_{M, t}\) for their values at time t. Alternatively, different letters may be used for each individual time series.

5.5 Questions

Some of the following questions will require using some demand data. A list of possible resources is given in Appendix D.4.

  1. List some other types of time series that you can think of. This can be anything, not necessarily energy demand related. What is the range of values that the series can take?

  2. Download a demand time series. Are there any trends or seasonality in the data? If you have several time series compare them: do some have different types of seasonality? How many different seasonalities can you see? Is there a difference between the weekday demand and the weekend demand? Are there any other patterns you can see in the data?

  3. From the data listed in Appendix D.4, take some aggregated state level demand (GEFCOM 2014) and household level demand (e.g. the Low Carbon London data set). Plot the data. Compare some of the features: What is the average size of the demand, and when are the peaks in the demand? Are there several peaks in a day? When do they typically occur? Do the daily peaks vary much from one day to the next?

  4. Generate simple rolling forecasts. Consider a half hourly time series. Create a simple day ahead forecast for the following day by using the previous day as the forecast (i.e. a 48 half hour shift). For example, to predict Tuesday, use the previous Monday's values. Consider the difference between the actuals and the forecast (see Sect. 7.1). Now create a basic half hour ahead rolling forecast for each half hour of the day by using the previous half hour as a forecast (i.e. a half hour shift of the data). Try this with some of the time series from the GEFCOM 2014 data and some household data (say from the Low Carbon London dataset). Are the errors smaller or bigger than the day ahead forecast? How do the absolute errors compare between the GEFCOM and household data? What about the relative errors? (A minimal sketch of these persistence forecasts is given after this list.)
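As referenced in question 4, a minimal sketch of the two persistence forecasts might look as follows; the synthetic half hourly series is an assumption standing in for whichever dataset is being used.

```python
# Day-ahead (48-step) and half-hour ahead (1-step) persistence forecasts.
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
idx = pd.date_range("2014-01-01", periods=48 * 14, freq="30min")  # two weeks, half hourly
demand = pd.Series(
    1.0 + 0.5 * np.sin(2 * np.pi * np.arange(idx.size) / 48) + rng.normal(0, 0.1, idx.size),
    index=idx,
)

day_ahead = demand.shift(48)       # previous day's value as the forecast
half_hour_ahead = demand.shift(1)  # previous half hour as the forecast

print("day-ahead MAE:", (demand - day_ahead).abs().mean())
print("half-hour MAE:", (demand - half_hour_ahead).abs().mean())
```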