1 Introduction

Numerous sectors rely heavily on accurate weather forecasting, including renewable energy production, energy consumption, agriculture and emergency services. Numerical weather prediction is an established forecasting technique in which the fluid transport equations for momentum, energy and scalar quantities are solved using the current atmospheric state as an input. The output is the temperature, humidity, pressure, etc. at a desired forecast length. Modelling large-scale weather is notoriously difficult due to uncertain boundary conditions and the chaotic nature of the underlying fluid mechanics equations. The accuracy of numerical forecast predictions has improved steadily since the 1960s, carried mostly by the increase in computational power and advances in turbulence modelling techniques [1]. To reduce the uncertainty of the predictions, expensive ensemble modelling is used, where simulations are run many times with small differences in initial conditions. Beyond five days, chaotic effects become dominant and the simulations demand large computational resources, becoming exceedingly expensive [2].

Ensemble modelling is computationally demanding, requiring numerous runs of each model with different initial conditions. To make meaningful seasonal predictions, the number of runs should be between 100 and 200 [3], increasing the cost 100-fold over deterministic approaches. Moreover, the multi-scale nature of the fluid equations and the associated physical processes forces simplifications, and the approximation of the initial state may be inaccurate [4]. Indeed, the acquisition of representative initial conditions is one of the biggest hurdles in numerical weather prediction [5]. This characterisation becomes increasingly challenging in cities, where the landscape drastically affects wind and temperature behaviour. Machine learning approaches can complement existing numerical weather prediction, or in some cases even substitute it, thereby reducing the enormous computational demands associated with numerical weather prediction.

The present work proposes to use historical data from weather stations to produce short-term local forecasts. The locality of the data and forecast simplifies the spatial correlations that exist in turbulent fluid dynamics and reduces the size and training cost of the network. Moreover, local data is attractive for Deep Learning, which can account for the "unpredictability" of the local conditions.

The novelty of the method resides in the use of large historical datasets from nearby locations to create simple input–output network models independent of the date. The approach is purely data-driven, without any kind of data assimilation or hybridisation. The method is tested by using historical data from two London-based locations to train a Bi-LSTM recurrent neural network to predict temperature and relative humidity.

The main contributions of this article are:

  • The creation of a Deep Neural Network framework that uses historical weather data to create forecasts of selected weather features over a desired forecast length.

  • The development of two models to predict the hourly evolution of temperature and humidity over 24 and 72 h at two locations in London.

  • The study of forecasting errors with respect to seasonal variations and forecast length.

The rest of the paper is structured as follows. In Section 2, the relevant literature on the use of Machine Learning in weather forecasting is discussed, while in Section 3, the architecture and the dataset used for testing are described. In Section 4, the results with the two models developed are presented, while Section 5 concludes the paper and outlines future research directions.

2 Related work

Machine Learning (ML) is showing large potential in fluid mechanics [6, 7], where it can be used to model sub-grid stresses [8, 9] or extract turbulent structures [10]. One of the first ML applications in weather forecasting was by Schizas et al. [11] in 1991, where Artificial Neural Networks (ANN) were used to predict minimum temperatures. Similarly, Ochiai et al. [12] used ANN in 1995 to predict rainfall and snowfall. These models were able to improve the forecasting accuracy compared to statistical models [13]. However, the limited forecast horizon of 30–180 min and difficulties in obtaining solution convergence made practical application impossible. Traditional machine learning techniques, such as support vector machines or linear regression, are typically far less computationally demanding than neural networks and have been investigated as forecasting candidates. For example, Ma et al. [14] deployed XGBoost, a traditional machine learning model comprised of gradient-boosted decision trees, to predict air temperature and humidity over a 3-h period, with a resulting root mean square error (RMSE) in temperature of 1.77 °C. Despite the relatively good results of traditional machine learning approaches, there are several reasons why a deep learning approach is preferred for weather prediction. Traditional algorithms struggle to model the non-linearity that is essential in predicting the evolution of the weather [15, 16]. Similarly, Shao et al. [17] reported that statistical and traditional ML techniques are not well-suited for complex wind forecasting, attributing this to the turbulent and chaotic behaviour of wind. Recent efforts have focused on using Support Vector Machines and their variations for short-term series forecasting and for the classification of non-linear data and time series [18,19,20].

Deep Learning (DL) leverages the growing volume and accessibility of data. While traditional machine learning models reach a point beyond which additional training data no longer improves performance, deep learning models have been observed to benefit from the increase in data [21]. DL networks have been increasingly used in time series forecasting in several applications, including finance [22], sugarcane yield prediction [23] and power load forecasting [24] among others. DL has the potential to significantly improve the accuracy of weather forecasting, and its applications have increased exponentially. Bauer et al. [4] showed that their Convolutional Neural Network (CNN) ensemble forecasting model can predict anomalies such as Hurricane Irma. Weyn et al. [25] increased the accuracy of weather prediction by applying ensemble modelling of separate CNN models, each with different starting conditions and sets of weights. Roy et al. [26] evaluated a multilayer perceptron, a long short-term memory (LSTM) model and a hybrid CNN/LSTM model, concluding that more complex architectures in general improve performance, while Ravuri et al. [27] demonstrated that their neural network model can predict precipitation more accurately in 89% of instances compared to existing weather prediction techniques. Hewage et al. [13] report that their ML models predict weather conditions 12 h into the future with higher accuracy than conventional weather forecasting.

Neural networks have been identified as particularly promising in precipitation forecasting. The MetNet model developed at Google [28] was shown to predict precipitation accurately over the course of eight hours. In this hybrid approach, several models were used at different stages, including LSTMs and CNNs. Despite its good performance, the model requires large volumes of data. An improvement was obtained by MetNet-2 [29], outperforming state-of-the-art weather models over the Continental United States for forecasts up to 12 h. Fu et al. [30], upon evaluating many neural network architectures, settled on a combination of a Bidirectional-LSTM (Bi-LSTM) and a one-dimensional CNN to predict ground air temperature, relative humidity and wind speed over seven days. They used data from ten weather stations in Beijing, and the final model contained more than a million nodes. Despite its size and complexity, the quantitative performance relative to the local weather observations was uncertain. The latest trends include, among others, hybrid LSTM/GAN models to predict cloud movement [31] and LSTM/CNN models for drought forecasting [32]. Wind forecasting is of great importance in wind power and load estimation, and DL has recently been applied to it [33,34,35,36]. Most of these applications focused on short-term predictions of up to 24 h.

The recent literature shows that DL applications in weather forecasting are accelerating, with CNN-variant architectures dominating large-scale forecasts and LSTMs dominating point forecasts. However, several research bottlenecks remain in short-term forecasting. Most applications have targeted wind-farm sites with "simple" weather patterns, while urban environments are harder to predict as the turbulence content of the signal is larger. Moreover, predictions deteriorate after several hours, and there is no single optimal forecast length, which appears to depend on the application.

3 Methodology and data processing

LSTMs are applied frequently to sequential problems as they address the issue of loss of long-term memory [37]. The Bi-LSTM recurrent neural network builds upon the LSTM structure. In a Bi-LSTM model a duplicate layer is produced: sequential information flows in chronological order through the first layer, while the duplicate layer processes the same sequence in reversed order. This provides the model with far more context, as key information at both the start and end of the sequence is available.
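As a minimal illustration, the sketch below (using tf.keras; the framework is an assumption, as the paper does not name one) shows how a Bidirectional wrapper duplicates an LSTM so that one copy reads the sequence chronologically and the other in reverse, with the two outputs concatenated:

```python
import tensorflow as tf

# Hypothetical batch: 32 sequences of 120 hourly steps with 6 features each.
x = tf.random.normal((32, 120, 6))

# The Bidirectional wrapper duplicates the LSTM: one copy reads the sequence
# forwards, the duplicate reads it in reverse; the outputs are concatenated.
bilstm = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(256))
h = bilstm(x)
print(h.shape)  # (32, 512): 256 forward units + 256 backward units
```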

The training data is openly available from the Met Office for two London weather observation stations: Kew Gardens (51.482, -0.294) and Heathrow (51.479, -0.451). The data was extracted from the Centre for Environmental Data Analysis [38] and contains weather information from 2015–2021 with dozens of hourly weather parameters, hereinafter referred to as features for consistency. However, not all features are available for all weather stations, so the selection was limited to six unique features (three per weather station). The features of particular interest are air temperature, relative humidity and wind speed at both Heathrow and Kew Gardens, see Fig. 1.

Fig. 1 Joint probability density functions of two features (off-diagonal) and single-feature probability density functions (diagonal) for the two locations

With the features selected, the dataset is normalised using the mean and standard deviation of each feature. These statistics are calculated from the training dataset only, as including data from the validation and test sets may result in overfitting [39].
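A minimal sketch of this normalisation step, assuming the features are held in pandas DataFrames, could look as follows:

```python
import pandas as pd

def normalise(train: pd.DataFrame, val: pd.DataFrame, test: pd.DataFrame):
    """Z-score normalisation using statistics from the training set only,
    so that no information from the validation/test sets leaks into training."""
    mean, std = train.mean(), train.std()
    return (train - mean) / std, (val - mean) / std, (test - mean) / std
```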

The training, validation and test datasets are split in fractions of 0.7, 0.15 and 0.15 respectively, with the chronological sequence of the data maintained. This corresponds to sample sizes of 36,825, 7,891 and 7,892 observations respectively.
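Since the chronological order must be preserved, the split cannot be a random shuffle; a simple sketch of such a split is:

```python
import pandas as pd

def chronological_split(df: pd.DataFrame, train_frac=0.70, val_frac=0.15):
    """Split a time-ordered DataFrame into train/val/test without shuffling,
    preserving the chronological sequence of the observations."""
    n = len(df)
    i, j = int(n * train_frac), int(n * (train_frac + val_frac))
    return df.iloc[:i], df.iloc[i:j], df.iloc[j:]
```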

Two networks were created: Model A, to forecast 24 h, and Model B, to forecast 72 h. The same dataset with the same split ratio for training, validation and testing was used in both models. However, Model B is deeper, with a denser Bi-LSTM containing more cells and an additional Feed-Forward neural network (FNN) in the second hidden layer.

The architecture of Model A is characterised in Table 1 and determines the number of calculations performed. The input layer shape is defined by the length of the context and the number of features. The hidden layer shape is defined by the batch size and the number of Bi-LSTM units: 256 forward and 256 backward units. A batch size of 32 results in 1,151 batches per epoch from a total of 36,825 training observations, with the remainder assigned to the final batch. Finally, the output layer shape is defined by the number of features and the batch size. The total number of trainable parameters in the model is the sum of those in the hidden layer and output layer, totalling 541,702.

Table 1 Architecture of the Bi-LSTM used in Model A, which includes the number and type of layers and the number of nodes in each layer
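A minimal tf.keras sketch consistent with Table 1 (the framework and exact layer ordering are assumptions) is shown below; with six input features, this configuration reproduces the 541,702 trainable parameters quoted above:

```python
import tensorflow as tf

CONTEXT, N_FEATURES = 120, 6   # 120 h of context, six features

model_a = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(CONTEXT, N_FEATURES)),
    # Hidden layer: 256 forward + 256 backward LSTM units -> 538,624 parameters
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(256)),
    # 25% of the hidden-layer outputs are zeroed at random during training
    tf.keras.layers.Dropout(0.25),
    # Output layer: one value per feature -> 3,078 parameters
    tf.keras.layers.Dense(N_FEATURES),
])
model_a.summary()   # total trainable parameters: 541,702
```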

A dropout layer is included to minimise the impact of overfitting by randomly setting the output of 25% of the units in the hidden layer to zero during training. Dropout is a well-established technique in neural network modelling to overcome overfitting and is considered a more practical approach than regularisation, which is a common approach to reduce overfitting in traditional machine learning problems (Table 2) [40].

Table 2 Parameters used in Model A including number of epochs and optimiser settings

The training process was performed using a Jupyter Notebook within a Google Colaboratory environment. The complete runtime was 78 s, after which predictions could be made within 10 s. The maximum memory usage during training was less than 16 GB. The entire test dataset corresponds to roughly one year of data in 2020 (while training covers 2015–2019). The model uses 120 h of measured hourly data as input and the output is the desired number of forecast hours. A benefit of having a context length greater than the forecast length is that some measured data will always be used in making the prediction. However, the returns diminish as the temporal gap between the measured data and the forecast increases. A model with a larger context of 240 h captured the data trend but failed to express the peaks and troughs accurately. The approach was first tested with a single-hour forecast (see Fig. 2). This process is repeated across the entire test dataset and 7,772 single-hour predictions are generated. The root mean squared, mean absolute and maximum errors were 0.89 °C, 0.62 °C and 12.81 °C respectively.
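The single-hour evaluation can be sketched as a sliding window over the test series; the helper below is illustrative only (the window construction and the position of the temperature column are assumptions):

```python
import numpy as np

def rolling_one_hour_forecast(model, series: np.ndarray, context: int = 120):
    """Slide a 120-h context window over the test series (shape (T, n_features))
    and predict the next hour; returns RMSE, MAE and maximum absolute error
    for the temperature column, assumed here to be column 0."""
    preds, truth = [], []
    for t in range(context, len(series)):
        window = series[t - context:t][np.newaxis, ...]   # shape (1, 120, n_features)
        preds.append(model.predict(window, verbose=0)[0])
        truth.append(series[t])
    err = np.array(preds)[:, 0] - np.array(truth)[:, 0]
    return np.sqrt(np.mean(err ** 2)), np.mean(np.abs(err)), np.max(np.abs(err))
```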

Fig. 2 Comparison between predicted and measured temperature at Kew Gardens using a forecast length of one hour and a context length of 120 h. Scatter plot (left), one-year predictions (right)

4 Results

4.1 24-h temperature forecast

To predict 24 h ahead, a comparison was initially made between the single-step model (predicting all 24 h in one step) and the multi-step model to assess the impact of error propagation, see Fig. 3. Table 3 shows that the multi-step prediction error according to all three metrics is approximately twice as large as the single-step error.
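The multi-step variant can be sketched as an autoregressive roll-out in which each 1-h prediction is fed back as input, which is why its errors compound; a single-step variant instead widens the output layer to emit all 24 hours at once:

```python
import numpy as np

def multi_step_forecast(model, context_window: np.ndarray, horizon: int = 24):
    """Autoregressive roll-out: each 1-h prediction is appended to the context
    and fed back in, so errors can propagate through the 24 steps."""
    window = context_window.copy()        # shape (context, n_features)
    preds = []
    for _ in range(horizon):
        y = model.predict(window[np.newaxis, ...], verbose=0)[0]
        preds.append(y)
        window = np.vstack([window[1:], y])   # drop oldest hour, append prediction
    return np.array(preds)
```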

Fig. 3 Comparison between predicted and measured temperature at Kew Gardens using the 24-h and 1-h temperature predictions

Table 3 Root mean squared error (RMSE), mean absolute error (MAE) and maximum error between hourly and 24-h temperature predictions in Fig. 3

To quantify how well the 24-h model generalises to different time periods and seasons, four prediction windows spaced 90 days apart are illustrated in Fig. 4. A benchmark model, the naive model, is used for comparison. The naive model uses the last measured temperature for the entire 24-h forecast; it makes no assumptions about the future state and is completely uninformed. The root mean squared errors confirm the neural network performs significantly better than the naive model in all instances (Table 4), with an average error of 1.45 °C and 6.00 °C for the neural network and the naive forecast respectively.
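This persistence-style baseline is trivial to implement; a minimal sketch follows:

```python
import numpy as np

def naive_forecast(last_measured: float, horizon: int = 24) -> np.ndarray:
    """Persistence baseline: repeat the last measured temperature for every hour."""
    return np.full(horizon, last_measured)

def rmse(pred: np.ndarray, truth: np.ndarray) -> float:
    return float(np.sqrt(np.mean((pred - truth) ** 2)))
```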

Fig. 4 24-h forecast of the air temperature at Kew Gardens during four days in different seasons

Table 4 Root mean squared error (RMSE), mean absolute error (MAE) and maximum errors for the 24-h temperature prediction (Fig. 4); values in parentheses are normalised RMSE

To contextualise the performance, the neural network was compared to performance metrics from the Met Office. The 24-h predictions produced by the neural network were accurate to within ±2 °C in 72.9% of all instances. By comparison, the Met Office states 92.5% of its 24-h temperature predictions are accurate to ±2 °C, while 92% of its 24-h wind speed predictions are within 5 knots [41]. Note that the measurements used in the weather stations were acquired with a resolution of ±0.1 °C (Fig. 5).

Fig. 5 Temperature probability density functions at Kew Gardens. Full temperature dataset of 52,608 samples and two predicted and measured distributions from 96 samples

A better statistical comparison is obtained by looking at the probability density functions of the predicted and measured data. The 96 individual forecasts are derived from the four windows in Fig. 4. These points were used to compute a distribution function, which is compared to the measured temperature distribution for the same period, while the entire yearly dataset was used to create a benchmark. The 96-sample measured temperature peak is wider than the predicted peak, indicating that the predictions are conservative, with both curves demonstrating bimodal behaviour. Nonetheless, the predicted and measured distributions agree very well, except at the tails on very hot days: outlier temperatures above 40 °C were measured that are not predicted.
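Such distribution comparisons can be sketched with kernel density estimates, for example (the smoothing choices are assumptions):

```python
import numpy as np
from scipy.stats import gaussian_kde
import matplotlib.pyplot as plt

def compare_pdfs(predicted: np.ndarray, measured: np.ndarray):
    """Overlay kernel density estimates of predicted and measured temperatures
    so that peak width and tail behaviour can be compared visually."""
    lo = min(predicted.min(), measured.min())
    hi = max(predicted.max(), measured.max())
    grid = np.linspace(lo, hi, 200)
    plt.plot(grid, gaussian_kde(measured)(grid), label="measured")
    plt.plot(grid, gaussian_kde(predicted)(grid), label="predicted")
    plt.xlabel("Temperature (°C)")
    plt.ylabel("Probability density")
    plt.legend()
    plt.show()
```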

Using the same network (Model A), the forecast length was varied next to understand the deterioration of the predictions without adapting the model and parameters. Ten different forecast lengths were tested, ranging from one to 168 h (seven days). The RMSE mean and standard deviation are plotted against forecast length in Fig. 6 to indicate the uncertainty for increasing forecast lengths. For consistency, each prediction was run with a single epoch rather than attempting to optimise performance by identifying the most suitable number of epochs for each forecast length. The single-hour prediction has the smallest mean and standard deviation, both of which increase with forecast length but become more stable after 24 h. Predictions of 1–24 h have a mean error of less than 3 °C. Beyond 24 h, the prediction uncertainty continues to increase before levelling off around 4 °C. While there are many caveats to this information, the results suggest that, without further optimisation, the model should not be used for predictions exceeding one day.
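A sketch of this forecast-length sweep is given below; the model_factory and windows_for_horizon callables are hypothetical stand-ins for re-instantiating Model A with a different output horizon and rebuilding the training/test windows accordingly:

```python
import numpy as np

def rmse(pred, truth):
    return float(np.sqrt(np.mean((np.asarray(pred) - np.asarray(truth)) ** 2)))

def rmse_vs_horizon(model_factory, windows_for_horizon,
                    horizons=(1, 3, 6, 12, 24, 48, 72, 96, 120, 168)):
    """For each forecast length, build a model, train it for a single epoch
    (as in the text) and record the RMSE mean and spread over the test windows.
    Both callables are hypothetical stand-ins for the paper's pipeline."""
    stats = {}
    for h in horizons:
        model = model_factory(h)                             # fresh Model A variant
        (x_tr, y_tr), test_windows = windows_for_horizon(h)  # horizon-specific windows
        model.fit(x_tr, y_tr, epochs=1, verbose=0)
        errs = [rmse(model.predict(x, verbose=0), y) for x, y in test_windows]
        stats[h] = (np.mean(errs), np.std(errs))
    return stats
```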

Fig. 6 RMSE of Model A predictions against forecast length. The error bars correspond to the min/max RMSE in the windows

4.2 72-h temperature, relative humidity and wind velocity forecasts

The Model B setup is shown in Table 5. The main differences with respect to Model A are the addition of a linear layer within the hidden layer and a reduction of the dropout percentage to 10%. The hyperparameters used in the optimised model are recorded in Table 6.

Table 5 Architecture of Bi-LSTM model, Model B, including the number and type of layers and nodes in each layer
Table 6 The finalised hyperparameters used to train Model B including the number of epochs and optimiser settings
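A structural sketch of Model B is given below; since the unit counts from Table 5 are not reproduced in the text, the layer sizes are assumptions, and only the differences named above (a denser Bi-LSTM, an extra linear feed-forward layer, 10% dropout) follow the text:

```python
import tensorflow as tf

CONTEXT, N_FEATURES, HORIZON = 168, 12, 72   # 168-h context, 12 features, 72-h output

# Layer sizes below are assumptions, not the paper's published values.
model_b = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(CONTEXT, N_FEATURES)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(512)),
    tf.keras.layers.Dense(256),                     # additional FNN (linear) layer
    tf.keras.layers.Dropout(0.10),
    tf.keras.layers.Dense(HORIZON * N_FEATURES),
    tf.keras.layers.Reshape((HORIZON, N_FEATURES)),
])
```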

As with the first model, an increase in the number of epochs resulted in a reduction of the error and an increase of the R² value. However, there was no direct correlation between the optimisation of these two parameters and how the 72-h forecast performed over different time periods. Therefore, once a capable architecture was identified, a similar trial-and-error approach was used to optimise the hyperparameters and context length based on the RMSE from the four windows. Initially, 120 h were used for the context length, later changed to 168 h as this gave optimal performance. After upwards of twenty iterations with different conditions, the hyperparameters listed in Table 6 resulted in the best performance. Once the model was trained, it was possible to make new predictions rapidly, within 15 s. The single-step hourly prediction RMSE was 0.94 °C, the MAE 0.68 °C and the maximum error 14.94 °C when calculated over the entire test dataset. While these numbers are comparable to the single-hour predictions generated with Model A, the model did not perform quite as well over three days as over one day. This is to be expected, as the forecast window is three times longer and the likelihood of error propagation is much higher.

The four windows in Fig. 7 illustrate that the Bi-LSTM model with the linear layer is highly capable of making predictions with excellent generalisability across different periods and seasons. The three-day forecasts resulted in an RMSE mean and standard deviation of 2.26 °C and 0.316 °C respectively (compared to 1.45 °C and 0.244 °C for the single-day predictions), with 79.5% of the 72-h temperature forecasts within ±3 °C (Table 7).

Fig. 7 72-h forecast of the air temperature at Heathrow during four days in different seasons. Symbols as in Fig. 4

Table 7 Root mean squared error (RMSE), mean absolute error (MAE) and maximum errors for the 72-h temperature prediction (Fig. 7); values in parentheses are normalised RMSE

Figure 8 shows the predicted distribution for the 72-h forecasts. Despite the qualitatively good agreement, the modelled distribution has a narrower peak, with extreme high temperatures underestimated (similarly to Model A), showcasing the difficulty of representing the tails of the distribution.

Fig. 8 Temperature probability density functions (left) and scatter plot (right) at Heathrow

The model takes in all features from both locations, resulting in six unique features and 12 features in total. As before, it is possible to generate a prediction for any one of the features introduced to the model in training. While the model does take all inputs into consideration during training and seeks to minimise the loss function with respect to all features, the performance arising from this approach does not necessarily translate into good generalisability across all timescales. When training the model, a weighted sum over all 12 features is minimised as the loss, assigning different levels of importance to each feature. Since the objective during the training of Model B was to optimise the 72-h temperature predictions, there was no guarantee that this performance would translate into comparable performance for another feature, in this case relative humidity. The accuracy of the results in Fig. 9 is a byproduct of the process to optimise the air temperature. If relative humidity were the focus of the optimisation, the forecast would probably show considerable improvement (Table 8).
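A per-feature weighted loss of this kind can be sketched as follows; the weight values are placeholders, as the paper does not list them:

```python
import tensorflow as tf

# Placeholder per-feature weights emphasising the temperature channels; the
# paper states the loss is a weighted sum over all 12 features but does not
# list the weights, so these values are purely illustrative.
FEATURE_WEIGHTS = tf.constant([3.0, 1.0, 1.0, 3.0, 1.0, 1.0,
                               1.0, 1.0, 1.0, 1.0, 1.0, 1.0])

def weighted_mse(y_true, y_pred):
    """Mean squared error weighted per feature, so that optimisation favours
    the features assigned the largest weights (here, temperature)."""
    return tf.reduce_mean(tf.square(y_true - y_pred) * FEATURE_WEIGHTS)

# Usage sketch: model_b.compile(optimizer="adam", loss=weighted_mse)
```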

Fig. 9 72-h forecast of the relative humidity at Heathrow during four days in different seasons

Table 8 Root mean squared error (RMSE), mean absolute error (MAE) and maximum errors for the 72-h relative humidity prediction (Fig. 9); values in parentheses are normalised RMSE

5 Conclusions and future work

This paper presented a novel, flexible deep learning approach for local weather forecasting. The approach is capable of rapidly predicting weather features and generating cheap, reliable short-duration forecasts. The model is purely data-driven, in contrast with earlier approaches that required varying degrees of data assimilation or hybrid models. Two models were trained and used to predict air temperature and relative humidity. The dataset used to train the models contained six years of historical weather observations from the Kew Gardens and Heathrow weather observation stations in London. The objective of having multiple locations is to infer a topographical representation for the model to learn from. As the two weather observation stations are positioned 11 km apart, it is expected that they share similar weather characteristics. Discrepancies in wind speed and humidity between the locations can be explained by local land features and artificial structures: Kew Gardens is positioned near the river Thames in a built-up area, while the nearest body of water to Heathrow is several kilometres away and the Heathrow observation station is situated within the airport boundaries with few obstructions.

Model A is a 24-h prediction network designed to predict air temperature. This model was intended as a proof of concept and was trained with wet bulb, air and dew point temperatures. Model A achieved its objective of establishing a baseline for further predictions. It showed that air temperature could be predicted with reasonable accuracy compared to the Met Office, predicting the air temperature within a range of 2 °C in 72.9% of instances, with a maximum error of 3.85 °C occurring mostly on very hot days. Model B is a 72-h prediction network that predicts air temperature, relative humidity and wind speed. Despite a three-fold increase in the forecast length, the model was able to predict the air temperature at Heathrow with an RMSE of 2.26 °C, and to within ±3 °C in 79.5% of instances. It was able to predict the relative humidity at the same location with an RMSE of 14%. However, Model B was optimised with respect to air temperature, which impacted the accuracy of the other features.

The flexibility and speed of the model make it attractive for short-term local forecasting in locations where weather stations are present but accurate weather predictions may be difficult to obtain (due to topography, local effects, etc.). The results show that predictions up to three days ahead have an accuracy comparable to expensive numerical weather predictions. However, feature-based optimisation may be required to improve the accuracy of features such as wind speed or humidity. Future lines of research will be in this direction.