The previous chapters have discussed all of the aspects of developing and testing a short term load forecast. This includes:

  • How to analyse data to identify patterns and relationships.

  • How to identify, extract and select features to include in the forecast model.

  • How to split data for training, validation and testing.

  • The different types of forecasts and their features.

  • Popular statistical and machine learning models for point and probabilistic forecasting.

  • How to select error measures and scores to assess your forecast accuracy.

However, what are the steps required for actually producing a forecast? In Sect. 12.1 the general steps in developing a forecast experiment are discussed, and in Sect. 12.2 some of the criteria for choosing among the plethora of forecast models introduced in Chaps. 9–11 are given.

12.1 Core-Steps for Forecast Development

The following are the main steps in developing a forecast model for your chosen application. They are written in the approximate order they should be applied but many steps can be repeated or ignored depending on the circumstances. For example, sometimes further data analysis may be required if new data becomes available, or the initial model reveals other relationships to be checked. The process can be repeated when looking to refine the forecasts but ideally this should only be tested on data that has not been seen or used previously to ensure no bias or cheating (even unconscious) is included in the experiment.

  1.

    Understand the problem: Presumably you are creating the forecasts for a specific application or purpose (such as those in Chap. 15). In that case it is worth fully understanding what the objectives are and what would be a good measure of success. Without solid aims it is easy to lose focus and produce suboptimal results. Often the aims will be set by business objectives, in which case it is essential to translate them into a well-defined problem for which you will know when a suitable result has been achieved. Proper development of the problem framing early on can influence or determine the choice of error measure, model, or even the data analysis later in the modelling process.

  2.

    Access the Core Data: Although it’s likely the data is available in some format (otherwise the project itself may not be possible in the first place), it is important to perform some form of data audit early in the study to ensure you have the minimum required data to tackle the problem. Your understanding of what data may be needed will likely change as you investigate the data further but a quick check of what and how much data is available is essential to prevent wasting time on an impossible task and allows time to collect data you discover may be vitally needed.

  3.

    Initial Data Checks: Now you have the data, a deeper check of the quality and usability is required. You can start to understand some of the relationships and features of the data at this point but the main objective is to understand the amount of cleaning and preprocessing that is required, and whether you have sufficient data to continue. This is a good point to check for missing data, anomalous values, outliers etc. (Sect. 6.1).

  4.

    Data Splitting: Once you understand the complexity, quality and quantity of the data, you can determine the split of the dataset into Training, Validation and Testing data (Sect. 8.1.3). This may be determined by the length of any seasonalities, or the number of parameters (more parameters need more training data). This choice is important for achieving the right bias-variance trade-off (Sect. 8.1.2) so that your models neither overfit nor underfit the data. The split also prevents the researcher utilising information which would not be available at the time of producing the prediction (also known as data leakage). Using such information would be unrealistic and create unscientific results, since in real-world scenarios you would be testing on unseen data.

  5.

    Data Cleaning: Here anomalous values and outliers should be removed or replaced with other values. If there is sufficient data then it may be better to remove the values so as not to introduce biases from any imputation methods used (Sect. 6.1.2). If the missing or removed values are not imputed you must make sure to adjust your model so it can handle missing cases or you should reduce the data so no missing instances are included (although this reduces the amount of data available for training/testing). If time permits it may be worth including tests of models with and without cleaning to see if it affects the model performance.

  6.

    Visualisation and Data Analysis: Next is the deep dive into the data, arguably the most important aspect of developing forecasting models. This will help you understand the relationships and patterns in the data, the types of relationships, the important exogenous variables, perhaps even the models which may be most suitable. This step will likely iterate with the data cleaning step since you won’t necessarily know what an outlier is without doing some preliminary analysis, and similarly you can’t complete your analysis until the data is cleaned. Visualisation techniques and feature identification methods are discussed in Sect. 6.2.

  7.

    Further Pre-processing: Given the data analysis, further preprocessing may be required. For example, perhaps the data must be normalised or scaled to help with training the model. Alternatively a transformation may have to be applied to change the data distribution to one which is more appropriate for the model being used, e.g. to be Normally distributed so that a linear regression model can be used (Sect. 6.1.3).

  8.

    Initial Model and Error Choices: Given the analysis, choose and derive some models. You can always add further models later on, in particular once you’ve finally tested the models on the test set and discovered limitations or possible improvements. It is also important at this stage to choose an appropriate benchmark (or benchmarks) with which to compare your model (Sect. 8.1.1). Further criteria for choosing the initial models are given in Sect. 12.2. In addition, at this stage a suitable error measure must be chosen. When tackling a specific application, the performance within the actual application is the true test of the usefulness/effectiveness of the forecast model. However it may not be practical to test all models (and their adaptions) within the application due to high computational costs. In these cases a computationally cheaper error measure which correlates with the application performance is more appropriate. Different error measures for point and probabilistic forecasts are discussed in Chap. 7.

  9.

    Training and Model Selection: Using the data split (determined in a previous step), train the models on the training dataset (Sect. 8.2). Utilise the validation set to compare the accuracy of the models and determine the optimal hyperparameters within each family of models (Sect. 8.2.3), including any weights for regularisation methods (Sects. 8.2.4 and 8.2.5). For some models such as ARIMA (Sect. 9.4) a model can be chosen without using the validation set as a hold-out set. In these cases you can use Information Criteria (Sect. 8.2.2) to choose the hyperparameters on the combined training and validation set. These final chosen models (including the benchmarks) are taken through to the testing phase. Note if you are considering rolling forecasts (Sect. 5.2) or forecasts that are updated at regular intervals then you will have to apply a rolling window over the validation set.

  10.

    Apply the models to the test set: Retrain the selected models on the combined training and validation datasets (this should improve the fit, see Data augmentation in Sect. 8.2.5) and then use them to generate forecasts for the test set. Again, if you are considering rolling forecasts (Sect. 5.2) or forecasts that are updated at regular intervals then you will have to apply a rolling window over the test set.

  11.

    Evaluation: Now you must evaluate the final models in a variety of ways to understand where they under (or over) perform. The most basic assessment is to ask which model performs best according to the chosen metric or measure? Are there different periods of the day or week which have different accuracy for different models? How does the accuracy change with horizon? How do the models rank compared to the chosen benchmark(s)? What were the common features of the models that perform best? Or worst? Finally, consider the residuals and their distributions. Are there any remaining features or biases? (Sect. 7.5).

  12.

    Model Corrections and Extensions: Forecast corrections (Sect. 7.5) should be considered (with the correction ideally trained on the validation set, not the test set) where possible. However, a simple improvement is to combine the models that you’ve already produced (Sect. 13.1). This has been shown to be a very effective way to utilise the diversity across the models to improve the overall accuracy.

  13.

    Evaluation with the Application: If the forecasts are used within an application then a true test of their usefulness is within the application or an in silico model for the application (for example for controlling a storage device, Sect. 15.1) rather than the error measure. If an error measure has been appropriately chosen, the application performance will correlate with the accuracy scores. This should be confirmed, and if there are inconsistencies they should be further investigated.

  14.

    Next Steps: Looking at your analysis of the results there could be new inputs which could be appropriate (e.g. different weather variables), or different ways of using the same inputs (e.g. more lagged values from the temperature time series or combining different inputs to create a new variable). The process should now be repeated (ideally on new, unseen data) to test further updates or adaptions. This process can be repeated until a sufficient forecasting accuracy or application performance has been achieved.
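The initial data checks and the cleaning step above can be sketched in a few lines. The following is only a minimal illustration using the Python standard library, not a method prescribed in this chapter: the function name `audit_and_clean`, the use of `None` to mark missing readings, and the outlier rule (more than `threshold` scaled median absolute deviations from the median) are all assumptions made for the example.

```python
from statistics import median

def audit_and_clean(load, threshold=5.0):
    """Count missing readings and flag anomalous ones in a load series.

    `load` is a list of floats, with None marking a missing reading.
    A reading is flagged when it lies more than `threshold` scaled median
    absolute deviations (MAD) from the median (an illustrative rule only).
    Returns (cleaned, report); flagged readings are replaced by None.
    """
    observed = [x for x in load if x is not None]
    centre = median(observed)
    mad = median(abs(x - centre) for x in observed)
    scale = 1.4826 * mad  # makes the MAD comparable to a standard deviation
    cleaned, outliers = [], []
    for i, x in enumerate(load):
        if x is not None and scale > 0 and abs(x - centre) > threshold * scale:
            outliers.append(i)
            cleaned.append(None)  # remove rather than impute
        else:
            cleaned.append(x)
    report = {
        "n_total": len(load),
        "n_missing": sum(x is None for x in load),
        "outlier_index": outliers,
    }
    return cleaned, report

# Toy series with one gap and one spike; the spike at index 4 is flagged
cleaned, report = audit_and_clean([1.0, 1.2, None, 1.1, 25.0, 0.9, 1.3])
```

A single global median ignores the daily and weekly patterns typical of load data, so in practice the same rule would more plausibly be applied per hour of the day or to the residuals of a simple model. Whether removed values are then imputed or left missing is the choice discussed in the Data Cleaning step.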

Fig. 12.1 Forecasting procedure as outlined in this section

The steps in producing a forecast are outlined in the diagram in Fig. 12.1. The procedure can be seen in terms of two components. The first is the data collection and analysis, which describes how the data is mined, analysed and wrangled to get it ready for testing. It is also used to understand the features to use in the models. The second is the model selection, training and testing.
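The model selection, training and testing stage can be illustrated with a small, self-contained sketch. Everything in it is an assumption made for illustration: the synthetic hourly series from `make_series`, the 60/20/20 chronological split, the candidate model y[t] = a + b·y[t − lag] with the lag treated as the hyperparameter, the mean absolute error as the error measure, and previous-day persistence as the benchmark.

```python
import math
import random

def make_series(n_days=70, seed=0):
    """Synthetic hourly load with daily and weekly patterns (illustration only)."""
    rng = random.Random(seed)
    series = []
    for t in range(24 * n_days):
        daily = 1.0 + 0.5 * math.sin(2 * math.pi * (t % 24) / 24)
        weekend = 0.3 if (t // 24) % 7 >= 5 else 0.0
        series.append(daily + weekend + rng.gauss(0, 0.05))
    return series

def fit_lag_model(y, lag, start, end):
    """Least-squares fit of y[t] = a + b * y[t - lag] for t in [start, end)."""
    ts = range(max(start, lag), end)
    xs = [y[t - lag] for t in ts]
    zs = [y[t] for t in ts]
    mx, mz = sum(xs) / len(xs), sum(zs) / len(zs)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (z - mz) for x, z in zip(xs, zs)) / sxx
    return mz - b * mx, b

def mae(y, forecast, start, end):
    """Mean absolute error of forecast(t) over t in [start, end)."""
    return sum(abs(y[t] - forecast(t)) for t in range(start, end)) / (end - start)

y = make_series()
n = len(y)
# Chronological split (no shuffling), so no future data leaks into training
train_end, val_end = n * 6 // 10, n * 8 // 10

# Train each candidate on the training set, compare them on the validation set
val_scores = {}
for lag in (24, 168):
    a, b = fit_lag_model(y, lag, 0, train_end)
    val_scores[lag] = mae(y, lambda t: a + b * y[t - lag], train_end, val_end)
best_lag = min(val_scores, key=val_scores.get)

# Refit the chosen model on training + validation data, then score the test set
a, b = fit_lag_model(y, best_lag, 0, val_end)
test_model = mae(y, lambda t: a + b * y[t - best_lag], val_end, n)
test_benchmark = mae(y, lambda t: y[t - 24], val_end, n)  # previous-day persistence
```

Each forecast here only uses observations that are at least 24 hours old, so the evaluation behaves like a day-ahead forecast rolled over the validation and test periods. The test set is touched once, after the lag has been fixed, which mirrors the ordering of the steps in this section.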

12.2 Which Forecast Model to Choose?

In the last few chapters a wide variety of methods were introduced from the different types listed in Sect. 5.3. There are point and probabilistic forecasts, those suited more to direct forecasts than rolling, and a mix of statistical and machine learning methods. There are no hard and fast rules to determine which are the most appropriate models to use, and the choices will depend on the application and context. However, there are some general principles which can be followed to help narrow down the model choice:

  1.

    Computational costs: short term load forecasts require at least daily updating. For this reason, models which are computationally quick to train, but are less accurate, may be preferable to more accurate but computationally expensive models. If the model is taking too long to run then it may be worth trying one which uses less processing power, memory, or time to run.

  2.

    The type of relationship believed to exist between the predictor and the dependent variable: if the relationships are not clear, or appear to be relatively complex, then machine learning techniques may be preferable (see Chap. 10). Are the relationships between explanatory variables linear or nonlinear? If linear then simple statistical models such as linear regression and ARIMA may be suitable (Sects. 9.3 and 9.4). If nonlinear then perhaps GAMs (Sect. 9.6) or neural network models (Sect. 10.4) should be considered.

  3.

    The type of features (Sect. 6.2): For example, if lagged components of the data are important then an autoregressive model may be the most appropriate (see Sect. 9.4). If only exogenous variables like weather or econometric variables are important, maybe a cross-sectional forecast model is more appropriate than a time series one. In these cases, tree-based models (Sect. 10.3) and simple feed-forward neural networks (Sect. 10.4) could be applied.

  4.

    Number of features: If a lot of features are required to train an accurate model then you risk overtraining. Not only are regularisation techniques likely to be required (Sect. 8.2.5), but the larger number of model inputs will mean more computational cost. Simpler models like linear regression can be quickly trained and they can be utilised within weighted regularisation methods such as LASSO (Sect. 8.2.4) to help with variable selection.

  5.

    The amount of data available for training: many machine learning techniques require large amounts of data to create accurate forecast models, whereas some simpler statistical models can be trained with relatively little data.

  6.

    Number of time series: If you are forecasting many time series (e.g. forecasts for smart meters from thousands of households) then it may not be practical to generate a forecast model for each time series. Instead you can train a single model over all (or a selection) of the time series. This is called global modelling, in contrast to the individual training (local modelling). This is discussed in a little more detail in Sect. 13.4.

  7.

    Interpretability: some models, including support vector regression and most statistical methods, are easier to understand than others, such as neural networks, in terms of relating the outputs to the original inputs. This means sources of forecast errors can be more easily detected and fixed, and relationships more easily understood. It also makes further development of the method easier, since weaknesses in the model are more easily identified. The trade-off is that more interpretable models can often have lower performance.

  8.

    Forecast judgment: As a forecaster gains more experience they may come to better understand which methods tend to work well and which ones do not. Unfortunately, this can only come with time and practice.
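The distinction between local and global modelling in the list above can be made concrete with a deliberately simple sketch. It assumes a first-order autoregressive model on mean-removed data, fitted by least squares; the function names `ar1_sums`, `fit_local` and `fit_global` are hypothetical, and a realistic global model would normally be far richer than a single shared coefficient.

```python
def ar1_sums(series):
    """Return (sum of x[t]*x[t-1], sum of x[t-1]**2) for a mean-removed series."""
    mean = sum(series) / len(series)
    x = [v - mean for v in series]
    num = sum(x[t] * x[t - 1] for t in range(1, len(x)))
    den = sum(x[t - 1] ** 2 for t in range(1, len(x)))
    return num, den

def fit_local(all_series):
    """Local modelling: one AR(1) coefficient per time series."""
    coeffs = []
    for s in all_series:
        num, den = ar1_sums(s)
        coeffs.append(num / den)  # assumes the series is not constant
    return coeffs

def fit_global(all_series):
    """Global modelling: a single AR(1) coefficient shared by all series."""
    sums = [ar1_sums(s) for s in all_series]
    return sum(num for num, _ in sums) / sum(den for _, den in sums)

# Two toy "households" with very different dynamics
households = [[1.0, 2.0, 3.0, 4.0], [1.0, 3.0, 1.0, 3.0]]
local_coeffs = fit_local(households)   # one parameter per household
global_coeff = fit_global(households)  # one parameter in total
```

The local fit needs one set of parameters per series, whereas the global fit pools the data from all series into a single estimate. The pooled estimate is cheaper to maintain for thousands of meters, but as the toy example suggests it can be a compromise when individual series behave very differently.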

In the long run the only true test of a method's accuracy is to implement it. When creating forecasts a good adage to remember, attributed to statistician George Box (co-creator of the Box-Jenkins method for ARIMAX models, Sect. 9.4), is “All models are wrong, but some are useful”.

At first, however, try models from different classes to see what works well. For instance, if LASSO regression does not work well because the variables have highly non-linear relationships, ridge regression is also unlikely to perform well. Similarly, if there is insufficient data to train a complex LSTM model, so that it suffers from overfitting and needs a lot of effort to tune the regularisation parameters in order to outperform much simpler models, a CNN will likely be similarly hard to train for the specific problem.
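One way to act on this advice is to score one candidate from each model class against the benchmark on the validation set, and then try a simple combination of the better ones (Sect. 13.1). The sketch below is an illustration under stated assumptions: the model names and forecast values are invented, the error measure is the mean absolute error, skill is defined as one minus the ratio of a model's error to the benchmark error, and the combination is an equal-weight average.

```python
def mae(actual, forecast):
    """Mean absolute error between two equally long sequences."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)

def compare_models(actual, forecasts, benchmark):
    """Rank candidate forecasts by MAE and report skill relative to a benchmark.

    forecasts: dict mapping a model name to its list of validation forecasts.
    Skill = 1 - MAE(model) / MAE(benchmark); positive means better than benchmark.
    """
    base = mae(actual, forecasts[benchmark])
    scores = {name: mae(actual, f) for name, f in forecasts.items()}
    ranking = sorted(scores, key=scores.get)
    skill = {name: 1 - s / base for name, s in scores.items()}
    return ranking, skill

def combine(forecasts, names):
    """Equal-weight average of the named forecasts."""
    return [sum(vals) / len(names) for vals in zip(*(forecasts[n] for n in names))]

# Invented validation data: one benchmark and one model from each of two classes
actual = [10, 12, 11, 13]
forecasts = {
    "naive": [9, 10, 12, 11],
    "linear": [11, 13, 12, 14],
    "tree": [9, 11, 10, 12],
}
ranking, skill = compare_models(actual, forecasts, benchmark="naive")
combined = combine(forecasts, ["linear", "tree"])
```

The numbers are contrived so that the two candidates err in opposite directions, which is why their average matches the observations exactly. Real forecasts will not cancel this neatly, but the example shows why diversity across model classes is what makes combination worthwhile.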