1 Introduction

About 23% of the German (and European) energy demand is met by natural gas [2]. Additionally, for about the same amount Germany serves as a transit country. Thereby, the German network represents a central hub in the European natural gas transport network. In light of the German Energy transition (“Energiewende”) with an increasing share of renewable energy sources as well as the envisioned international transition towards substantially less fossil fuels and related greenhouse gas emissions, the importance of natural gas will increase even more. A critical task of gas power plants is to deliver electricity in peak load situations when electricity from renewable energy sources is not sufficient to cope with the demands. From the gas network point of view, this leads to huge gas demands on very short notice.

The gas transmission network is run by transmission system operators, hereinafter referred to as TSOs. An integrated organization would jointly conduct operations including gas trading, running storage facilities and operating the transmission network. Thereby, the network capacities could affect transport requirements. However, the liberalization of the European gas markets [1] no longer allows TSOs to own, trade or store gas, which poses novel challenges for ensuring the security of supply. Instead, trading is conducted by independent companies to ensure discrimination-free access to the transport network for all traders. Therefore, natural gas forecasting has become a fundamental input to the TSOs’ decision-making. Meanwhile, the natural gas market is becoming more and more competitive and is moving towards more short-term planning, e.g., day-ahead contracts, which makes the dispatch of natural gas in the pipeline network even more challenging [12]. Therefore, high-accuracy and high-frequency forecasting of local natural gas supplies and demands is essential for efficient network operation by TSOs.

Although on the contractual level all gas transports of a market area have to be balanced, this needs only to be achieved on average over time. Some outflow might actually only be balanced by an inflow at a later time.

Despite these challenges, the TSOs need to meet all transport demands. The TSO is obliged to monitor the situation, foresee possible shortages and react accordingly to ensure the security of supply. Since changes in gas networks happen rather slowly, it is extremely important to have accurate forecasts of the demands and supplies of the network in order to react in time.

We collaborated with one of the biggest German TSOs, operating a gas network with a pipe length of about 12,000 km in total (see Fig. 1), to improve their hourly forecasts for demand and supply. We aim to:

  • Predict as precisely as possible the average hourly gas flows for the upcoming gas day, i.e., from 6 am to 6 am, just before the start of the gas day (at about 5:59 am);

  • Provide predictions that are appropriate for all types of nodes, ranging from connections to other networks or countries to industrial users or municipal consumers, which leads to very diverse data characteristics.

To reach these goals, we investigated real data from the transport network operated by Open Grid Europe. We propose a powerful and robust hybrid forecast model that benefits from the combination of state-of-the-art forecasting approaches and optimisation, leading to improved forecast accuracy. We also interpret the most important features that our model automatically selects.

The following subsections present the nomenclature and an overview of related work. Section 2 describes the data we used in this study. Section 3 gives details on the proposed models. Section 4 describes the evaluation methodology and the evaluation of the computational experiments. Finally, we draw some conclusions.

Fig. 1 Map of the gas transmission network operated by Open Grid Europe

1.1 Nomenclature

OGE: Open Grid Europe GmbH

MUN: Municipal power stations

IND: Power stations and industry

NET: Transfer points to other networks

MP: Mathematical programming

LP: Linear program

MILP: Mixed integer linear programming

FAR: Functional autoregressive

LSTM: Long short term memory

HYB: Hybrid model

Heating day degree

Baseline forecast (persistence)


1.2 Related work

Models for natural gas demand forecasting are mainly focused on long-term issues. There are quite a few publications regarding electricity demand forecasting (see, e.g., [3, 30, 32, 47]), but electricity behaves very differently from gas.

A survey on models to predict natural gas consumption published between 1949 and 2010 is presented by [42], which shows that only a few works are focused on hourly gas flow prediction. A more recent survey [51] considers 187 papers published between 2002 and 2017. The authors point out that the majority of works provide daily predictions and recognize that neural networks are the most used models. The authors also show that, in the considered period, most of the works were performed at an aggregated level (i.e., country or city) and only three papers proposed models to forecast the hourly gas consumption.

In [49], two neural networks were tested to forecast natural gas consumption based on historical data and environmental variables. The authors found a better prediction accuracy when using the multi-layer perceptron compared to the radial basis function network. In [48], a model similar to a radial basis neural network was proposed to predict gas consumption in a distribution system. In this work, input variables were selected using a genetic algorithm. Residential hourly gas consumption was predicted with neural networks by [17]. In this work, the heating degree-hour method, which considers the gap between outdoor and indoor temperature, was used. The best hyper-parameter configuration consisted of 29 neurons, a feed-forward backpropagation algorithm and tangent, sigmoid and linear functions for the input, hidden and output layers, respectively. Similarly, [45] proposed neural networks to forecast residential natural gas demand. The proposed network consisted of a multi-layer perceptron with one hidden layer. The input features included calendar (i.e., month, day of the month, day of the week, hour) and weather (temperature) information. The authors found that the average prediction error was higher during the winter months because gas flow was higher. More recently, [23] compared several machine learning models to predict residential natural gas hourly demand and found that recurrent neural networks and linear regression were the most accurate models. The prediction results of monthly gas consumption of residential buildings using Extreme Learning Machine (ELM), artificial neural networks (ANNs) and genetic programming (GP) were presented by [24]. The ELM is characterized by higher training speed compared to backpropagation and it was found to perform better, in terms of RMSE, than the other two techniques. In [26], the authors set up a two-stage methodology to predict the daily gas consumption of utility companies. In the first stage, two NNs are run in parallel to produce daily forecasts; in the second stage, a nonlinear transformation of some features of the input vector is performed. The combination of the two stages is based on several methods such as average forecast, recursive least squares, etc. The results show that the mix of the two forecasters has higher accuracy, although the combination of the two models increases the complexity. Overall, these works show that the consumer profile is very important when forecasting gas flow. In this regard, [38] identified seventeen group profiles based on their historical consumption and predicted the daily gas demand. The overall prediction was obtained from the combination of single predictions.

The backpropagation algorithm optimized with a genetic algorithm was implemented by [54] to increase the training speed and to achieve a global minimum. The authors predict next-day gas loads based on temperature and weather conditions. Furthermore, the authors tested the algorithm on a real three-year dataset recorded in Shanghai to predict one and a half months of gas load. Similarly, [55] propose a recurrent neural network to predict daily gas flow. The Output-Input-Hidden Feedback-Elman neural network takes into account not only the hidden nodes’ feedbacks but also the output nodes’ feedbacks. The results improved compared to those obtained with the standard Elman network. However, the authors recognize that further research is needed to forecast gas demand during holidays. In [4], an adaptive network-based fuzzy inference system (ANFIS) consisting of a neural network integrated with fuzzy logic was proposed to forecast short-term natural gas demand. The main advantage of this model was its ability to handle uncertainty, noise and non-linearity in the data; compared to standard neural network models, it provided more accurate results. The wavelet transform has been deployed by [44] to decompose the hourly gas demand time series, and Bi-LSTM and LSTM are optimized using a genetic algorithm. The model was applied to winter data, on which it has shown good prediction accuracy.

Several static and adaptive models have been tested by [37] for short-term gas consumption forecast (random-walk, temperature correlation model, linear regression model, ARX, adaptive (recursive) linear auto-regressive model (RARX), neural network (NN), Recurrent NN, Support Vector Regression). They found that the best performance was obtained by the RARX of order 3. Furthermore, they found that nonlinear models such as neural networks and support vector machines had a lower generalization capacity compared to linear models. Finally, they concluded that the adaptive models overall performed better than static models.

The traditional approaches are regression and econometric models. In this regard, the performance of nonlinear mixed-effects, ARIMAX and ARX models to predict the gas consumption of 62 residential and small commercial customers was assessed by [10]. The authors forecast the daily consumption of an entire month based on the previous 18 months. The time series included zero flows and missing data, which were excluded from the training process. The prediction performance was similar in terms of daily mean absolute error, which was close to zero for all the tested models. Thus, the authors propose to combine multiple models, although they recognize that this might be a difficult task because of increased computational complexity. Multiple linear regression has been proposed by [40], who predicted annual gas consumption based on socio-economic variables (GDP and inflation in the case of Turkey) that have been selected based on their statistical significance. Based on the forecast, the authors propose alternative energy policies. A robust least squares method combined with a log-linear Cobb–Douglas model has been proposed by [15]. The authors compared the proposed robust and ordinary least squares methods for the yearly forecast of the natural gas demand in Brazil, considering the total demand as well as the demand of the industrial and power sectors. The authors showed that the proposed model can be very useful when a large amount of past data, which is usually necessary for the calibration of more sophisticated forecast models, is not available.

A hybrid model formed by a grey model and an autoregressive integrated moving average model has been proposed by [52] to predict monthly shale gas production. The authors conclude that the results of the combined model are more accurate than the single linear and nonlinear models.

In [33], Multivariate Adaptive and Conic Multivariate Adaptive Regression Splines were proposed to predict residential daily gas demand. The two models provided better results in terms of prediction errors (MAE and RMSE) compared to those obtained with linear regression and neural networks. In [41], the nonlinear characteristics of the natural gas consumption are modeled with several Grey models that are compared to predict the yearly natural gas consumption in China. Nonlinear programming and a genetic algorithm have been proposed by [19] to predict natural gas consumption in the residential and commercial sectors on a yearly basis. Similarly, [25] proposed the breeder hybrid algorithm, which consists of three steps, for natural gas demand forecasting. In the first stage, the coefficients of a nonlinear regression model are estimated. Subsequently, the estimates are improved using a genetic algorithm. Finally, the optimized coefficients are deployed as initial solutions for simulated annealing. Nearest neighbor and local regression were proposed by [6] to predict gas flow in a small gas network with a 15-minute resolution. The authors highlight the importance of environmental variables such as the temperature. Their method allowed the detection of anomalies and consumption patterns based on one year of historical data. In the literature, there are also combinations of several methods to predict one-day-ahead natural gas consumption. In [34], the time series were decomposed into low-frequency and high-frequency components using the wavelet transform. In a second step, a genetic algorithm and an Adaptive Neuro-Fuzzy Inference System were deployed to predict each of the decomposed time series. The output was finally fed into a feed-forward neural network to refine the prediction. The research was focused on different types of natural gas distribution points. The authors obtained better prediction results using the data of distribution points located near the city center.

Neural networks have also been compared to autoregressive models. In [46], for instance, short-term natural gas consumption in Turkey was predicted using a SARIMAX model, neural networks (multilayer and radial basis) and multivariate regression. They found that SARIMAX had better prediction performance. The temperature correlation model, proposed by [43], was compared with several configurations of ARX, stepwise regression, Support Vector Regression and neural networks. The authors found that SVR and NN performed better on the training set, while the high-order ARX model performed better on the test set. Support Vector Regression has been deployed with a false-neighbours-filtered approach to predict short-term natural gas consumption [56]. The local predictor was based on the nearest-neighbour approach using the Euclidean distance between the training and test data, and the neighbour filter was applied to determine the validity of the predicted values based on the exponential separation rate. The authors obtained better prediction performance compared to ARIMA, neural networks and Support Vector Regression.

Overall, the analyzed literature shows that there are few works that are focused on the comparison between methods to predict hourly gas flow of different types of nodes in a gas network or combining the advantages of different forecasting methods to a hybrid model for hourly gas flows. Therefore, we propose a hybrid model based on optimisation and machine learning and compare its results to four different models to predict hourly gas flow. To address the heterogeneity of the time series for the different node types we compare results obtained for four different types of nodes.

2 Data

We consider high-resolution natural gas inflows and outflows in the high-pressure gas pipeline network operated by Open Grid Europe GmbH (OGE).

The gas transmission network has more than 1000 boundary nodes which can be classified into four different groups:

  • Network Transfer Points (labeled NET) are large nodes with natural gas imported and exported to other networks mostly outside Germany. These can be entries and/or exits.

  • Municipal Utility nodes (MUN) serve residential and small commercial constituents and are always only exits. They are often temperature dependent, exhibit daily and seasonal patterns, and simultaneously are influenced by weekends/holidays.

  • Industry and Power Stations (IND) represent electricity generation and factory production nodes. These are also always exits and naturally exhibit weekly patterns due to working routines.

  • Storage nodes (STO) usually have a large number of zero flow hours with some substantial, often constant transfer in between. The nodes can always be both entries and exits.

While, in principle, we know the above classification, it proved not to be reliable regarding the behavior of the nodes, so we do not use this information as part of the forecast, but only to explain certain behavior.

Table 1 depicts the number of nodes belonging to the different groups and the percentage of gas flow explained by each group.

Table 1 Number of nodes and percentage of flow

As an illustration, we carefully selected three nodes for each type. The three network nodes we selected account for 22% of the whole network flow. The municipal nodes are considered important by the TSO. The industry nodes selected represent power plants and play a key role in energy generation with high renewable energy shares, as they are fast to start and can produce the necessary energy in times of peak demand. For the representative nodes from the storage group, we selected the most frequently used nodes in the observed period. Figure 2 shows the flows of the nodes considered in this study, normalized to the range [0, 1].

Fig. 2 Normalized flow of selected nodes

For each node, the gas flows are measured hourly. Additionally, we were given the average daily temperatures measured at the nodes. Some statistical properties of the selected nodes are given in Table 2. As can be seen from Fig. 2 and Table 2, some nodes have continuous flow, while others are active only occasionally. Storage nodes have the highest percentage of zero flows. For the ones considered in this study, hours with zero flow account for 26–53% of the time. Network nodes show the highest variability and are always inflows. Industry nodes are always outflows and are clearly not temperature dependent. Municipal nodes are usually temperature dependent and have strong daily, weekly and seasonal patterns.
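The two descriptive quantities used above, the scaling of flows to [0, 1] and the share of zero-flow hours, can be sketched as follows. The text does not state how the normalization was done, so min-max scaling is an assumption here, and the function names and the toy series are purely illustrative.

```python
def min_max_normalize(flows):
    """Scale hourly flows linearly to [0, 1] (assumed min-max scaling)."""
    lo, hi = min(flows), max(flows)
    if hi == lo:
        # A constant series carries no range; map it to zeros.
        return [0.0 for _ in flows]
    return [(x - lo) / (hi - lo) for x in flows]


def zero_flow_share(flows, tol=0.0):
    """Fraction of hours whose absolute flow does not exceed tol."""
    return sum(1 for x in flows if abs(x) <= tol) / len(flows)


# Hypothetical storage-like node: idle hours, injection (+) and withdrawal (-).
hourly = [0.0, 0.0, 120.0, 120.0, 0.0, -60.0, -60.0, 0.0]
normalized = min_max_normalize(hourly)
share = zero_flow_share(hourly)
print(normalized)
print(share)
```

A small tolerance `tol` may be useful in practice if measurement noise prevents flows from being exactly zero.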

Table 2 Properties of nodes used in the study

3 Methods

Many research studies have shown that combining forecasts improves accuracy relative to individual forecasts [16, 27]. In this section, we first present three individual forecasting methods: the Functional AutoRegressive (FAR) model, the Long Short-Term Memory (LSTM) network and a Mathematical Programming (MP) model. We then propose a hybrid model (HYB) based on the MP method, which uses the outputs of the two other forecasting models, FAR and LSTM, as additional inputs (features).

3.1 Functional autoregressive (FAR) method

Recent developments in functional data analysis provide an efficient way of jointly analyzing high-dimensional processes such as natural gas flows. On each day, the hourly gas flows are represented as a continuous gas flow curve over the daily time interval that naturally inherits the serial and cross-dependence in the raw data. The serial cross-dependence among the daily gas flow curves is described by the functional autoregressive (FAR) model [8], which extends discrete time series analysis from a finite-dimensional space to an infinite-dimensional one. The most popular estimation methods for FAR-type models include the functional Yule–Walker estimation and the sieve maximum likelihood (SML) estimation; see [7, 8, 9, 21] and [12, 13, 14, 31].

Our interest is to predict the daily gas flow curves based on the learned dynamic dependence over time. We detail the FAR setup for the daily curves of natural gas flows and show how to obtain the SML estimator of the functional parameters.

Let \(X_{t}(\tau )\), \(t = 1, \ldots , n\), denote the gas flow curve on day t, which is a square-integrable random function in the Hilbert space \({\mathcal {H}}\) defined, without loss of generality, over the time domain \(\tau \in [0,1]\). The functional autoregressive model of order 1, i.e. the FAR(1) model, is defined as:

$$ X_t(\tau )-\mu (\tau )=\int _0^1K(\tau -s)[X_{t-1}(s)-\mu (s)]\,ds + \varepsilon _t(\tau ), \tag{1} $$

where \(\mu (\tau )\) is the time-dependent mean function of \(X_t(\tau )\). The innovation \(\varepsilon _t(\tau )\) is a strong \({\mathcal {H}}\)-white noise with zero mean and bounded second moment \(E\Vert \varepsilon _t(\tau )\Vert ^2<\infty \). The norm \(\Vert \cdot \Vert \) is induced by the inner product \(\langle \cdot ,\cdot \rangle \) of \({\mathcal {H}}\). The AR operator is represented by a kernel \(K \in L^2([0,1])\), which is one implementable form of a Hilbert-Schmidt operator specifying the serial dependence of the curve on its own past value. This choice of a convolution kernel operator is very common in the study of functional linear and autoregressive processes; see [29, 31, 53] and [14]. The kernel K is usually taken to be an even function with \(\Vert K\Vert _2<1\). Here, \(\Vert \cdot \Vert _2\) denotes the standard \( L_2 \) norm.

Since the functional observations and the autoregressive terms are defined on an infinite-dimensional space, we project them onto Fourier basis functions, which suits their periodic features and makes it easy to derive a closed-form solution. We represent the functional terms in the basis of \(L^2([0,1])\) given by the trigonometric functions \(\Phi _{0}=I_{[0, 1]}\), \(\Phi _{2k}(\tau )=\sqrt{2}\cos (2\pi k\tau )\) and \(\Phi _{2k-1}(\tau )=\sqrt{2}\sin (2\pi k\tau )\) for \(k = 1, 2, \ldots \), as follows:

$$\begin{aligned} \begin{aligned}&X_{t}(\tau )=a_{t,0}+\sum _{k=1}^{\infty }[b_{t,k}\Phi _{2k-1}(\tau )+a_{t,k}\Phi _{2k}(\tau )],\\&K(\tau )=c_{0}+\sum _{k=1}^{\infty }c_{k}\Phi _{2k}(\tau ),\\&\delta (\tau )=\omega _{0}+\sum _{k=1}^{\infty }[\eta _{k}\Phi _{2k-1}(\tau )+\omega _{k}\Phi _{2k}(\tau )],\\&\varepsilon _{t}(\tau )=\epsilon _{t,0}+\sum _{k=1}^{\infty }[e_{t,k}\Phi _{2k-1}(\tau )+\epsilon _{t,k}\Phi _{2k}(\tau )], \end{aligned} \end{aligned}$$

where \(a_{t,0}\), \(a_{t,k}\), \(b_{t,k}\) denote the constant, cosine, and sine Fourier basis coefficients corresponding to the observed gas flow curves \(X_{t}(\tau )\); \(c_{0}\) and \(c_{k}\) are the constant and cosine basis coefficients for the unknown even kernel \(K(\tau )\); \(\omega _{0}\), \(\omega _{k}\), \(\eta _{k}\) are the coefficients of the intercept function \( \delta (\tau )=\mu (\tau )-\int _{0}^{1}K(\tau -s)\mu (s)\,ds \); and \(\epsilon _{t,0}\), \(\epsilon _{t,k}\), \(e_{t,k}\) are the coefficients of the innovation \(\varepsilon _{t}(\tau )\).
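As an illustration of this projection step, the following sketch computes the truncated Fourier coefficients of one daily curve from its 24 hourly values. It assumes that the hourly averages are placed at the midpoints of 24 equal sub-intervals of [0, 1] and that the integrals are approximated by simple averages; neither choice is stated in the text, and the function names are hypothetical.

```python
import numpy as np


def fourier_coefficients(curve, m):
    """Project one daily curve onto Phi_0, Phi_{2k-1}, Phi_{2k}, k = 1..m.

    Returns (a0, a, b): constant, cosine and sine coefficients.
    Assumes equally spaced samples at interval midpoints of [0, 1].
    """
    curve = np.asarray(curve, dtype=float)
    n_pts = curve.size
    tau = (np.arange(n_pts) + 0.5) / n_pts
    k = np.arange(1, m + 1)[:, None]
    cos_basis = np.sqrt(2.0) * np.cos(2 * np.pi * k * tau)  # shape (m, n_pts)
    sin_basis = np.sqrt(2.0) * np.sin(2 * np.pi * k * tau)
    a0 = curve.mean()                 # inner product with Phi_0 = 1
    a = cos_basis @ curve / n_pts     # inner products with Phi_{2k}
    b = sin_basis @ curve / n_pts     # inner products with Phi_{2k-1}
    return a0, a, b


# Synthetic day: constant level 3, one cosine term (k=1) and one sine term (k=2).
tau = (np.arange(24) + 0.5) / 24
day = (3.0
       + 2.0 * np.sqrt(2.0) * np.cos(2 * np.pi * 1 * tau)
       - 1.0 * np.sqrt(2.0) * np.sin(2 * np.pi * 2 * tau))
a0, a, b = fourier_coefficients(day, m=3)
print(a0, a, b)
```

With 24 equally spaced samples the discrete basis stays orthonormal only for small m (below 12), which is one practical reason to truncate the expansion.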

Plugging the Fourier expansions into (1) and rearranging the equations, we obtain the recursive relationship of the Fourier coefficients. However, it is infeasible to estimate the Fourier coefficients in infinite-dimensional parameter spaces. This makes it necessary to conduct regularization or dimension reduction. Among others, [20] proposed the method of sieves, which conducts estimation over approximating subspaces \(\{\Theta _{m_n}\}\), called sieves, rather than over the original infinite-dimensional space \(\Theta \). We refer to [11] for more theoretical details and to [13] for a specific application example of the sieve method.
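For reference, the recursive relationship of the Fourier coefficients can be written out explicitly. The following is our own worked version, obtained by inserting the expansions into the FAR(1) equation, with the kernel coefficients written as \(c_0\), \(c_k\) as in the transition density; it should be read as a sketch rather than as a statement taken from the cited references:

$$\begin{aligned} a_{t,0}&=\omega _0+c_0\,a_{t-1,0}+\epsilon _{t,0},\\ a_{t,k}&=\omega _k+\frac{1}{\sqrt{2}}c_k\,a_{t-1,k}+\epsilon _{t,k},\\ b_{t,k}&=\eta _k+\frac{1}{\sqrt{2}}c_k\,b_{t-1,k}+e_{t,k},\quad k\ge 1. \end{aligned}$$

The factor \(1/\sqrt{2}\) comes from the convolution of the basis functions: \(\int _0^1\Phi _{2k}(\tau -s)\Phi _{2k}(s)\,ds=\Phi _{2k}(\tau )/\sqrt{2}\) and \(\int _0^1\Phi _{2k}(\tau -s)\Phi _{2k-1}(s)\,ds=\Phi _{2k-1}(\tau )/\sqrt{2}\), while terms with different frequencies integrate to zero.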

Under sieves, the unknown parameters are estimated over a subspace \(\Theta _{m_n}\). The estimation of the FAR model is thus converted into an estimation problem with a finite number of unknown Fourier coefficients. Further assuming that the Fourier coefficients of the innovation function \(\varepsilon _t(\tau )\), i.e., \(\epsilon _{t,0}\), \(\epsilon _{t,k}\), \(e_{t,k}\), are IID Gaussian distributed with zero mean and variance \( \sigma _k^2 \), the transition density under \(\Theta _{m_n}\) is defined as follows:

$$\begin{aligned} \ell (X_t,K)={}&\frac{(2\pi )^{-(2m_n+1)/2}}{\sigma _0\prod _{k=1}^{m_n}\sigma _k^2}\cdot \exp \left\{ -\frac{1}{2\sigma _0^2}( a _{t,0}-\omega _0-c_0 a _{t-1,0})^2\right. \\&-\left. \sum ^{m_n}_{k=1}\frac{1}{2\sigma ^2_k}\left[ ( b _{t,k}-\eta _k-\frac{1}{\sqrt{2}}c_k b _{t-1,k})^2+( a _{t,k}-\omega _k-\frac{1}{\sqrt{2}}c_k a _{t-1,k})^2\right] \right\} , \end{aligned}$$

based on which the maximum likelihood estimator can be obtained with closed-form solution under sieve \(\Theta _{m_n}\) as:

$$\begin{aligned} \begin{aligned}&{\hat{\omega }}_0=\frac{-{\hat{c}}_0\sum _{t=2}^{n} a _{t-1,0}+\sum _{t=2}^{n} a _{t,0}}{n-1},\quad {\hat{c}}_0=\frac{\sum _{t=2}^na_{t,0}\sum _{t=2}^{n}a_{t-1,0}-(n-1)\sum _{t=2}^{n}a_{t,0}a_{t-1,0}}{(\sum _{t=2}^{n}a_{t-1,0})^2-(n-1)\sum _{t=2}^{n}a_{t-1,0}^2}\\&{\hat{\omega }}_k=\frac{\sum _{t=2}^{n} a _{t,k}-\frac{1}{\sqrt{2}}{\hat{c}}_k\sum _{t=2}^{n} a _{t-1,k}}{n-1},\quad {\hat{\eta }}_k=\frac{\sum _{t=2}^{n} b _{t,k}-\frac{1}{\sqrt{2}}{\hat{c}}_k\sum _{t=2}^{n} b _{t-1,k}}{n-1}.\\&{\hat{c}}_k=\sqrt{2}\frac{\sum _{t}(a_{t,k}a_{t-1,k}+b_{t,k}b_{t-1,k})-(\sum _{t}a_{t,k}\sum _{t}a_{t-1,k}+\sum _{t}b_{t,k}\sum _{t}b_{t-1,k})/(n-1)}{\sum _{t}(a_{t-1,k}^2+b^2_{t-1,k})-\{(\sum _{t}a_{t-1,k})^2+(\sum _{t}b_{t-1,k})^2\}/(n-1)}, \end{aligned} \end{aligned}$$

which leads to the kernel estimator \( {\hat{K}}(\cdot ) \).
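The closed-form estimators above translate directly into a few array operations. The sketch below is an illustrative transcription under the assumption that the Fourier coefficients of n consecutive days are already available as arrays; the function name, the array layout and the noise-free toy data are hypothetical.

```python
import numpy as np


def far1_sieve_mle(a0, a, b):
    """Closed-form sieve ML estimates for the FAR(1) model.

    a0: shape (n,), constant coefficients per day.
    a, b: shape (n, m), cosine and sine coefficients per day.
    Returns (omega0, c0, omega, eta, c).
    """
    N = a0.shape[0] - 1                      # number of (t-1, t) pairs
    y0, x0 = a0[1:], a0[:-1]
    c0 = ((y0.sum() * x0.sum() - N * (y0 * x0).sum())
          / (x0.sum() ** 2 - N * (x0 ** 2).sum()))
    omega0 = (y0.sum() - c0 * x0.sum()) / N

    ya, xa = a[1:], a[:-1]
    yb, xb = b[1:], b[:-1]
    num = ((ya * xa + yb * xb).sum(axis=0)
           - (ya.sum(0) * xa.sum(0) + yb.sum(0) * xb.sum(0)) / N)
    den = ((xa ** 2 + xb ** 2).sum(axis=0)
           - (xa.sum(0) ** 2 + xb.sum(0) ** 2) / N)
    c = np.sqrt(2.0) * num / den
    omega = (ya.sum(0) - c / np.sqrt(2.0) * xa.sum(0)) / N
    eta = (yb.sum(0) - c / np.sqrt(2.0) * xb.sum(0)) / N
    return omega0, c0, omega, eta, c


# Noise-free toy data generated from known parameters (m = 2 frequencies).
n = 20
c_true = np.array([0.6, -0.4])
w_true = np.array([0.2, 0.1])
e_true = np.array([-0.3, 0.4])
a0 = np.zeros(n)
a = np.zeros((n, 2))
b = np.zeros((n, 2))
a[0], b[0] = [1.0, -2.0], [0.5, 3.0]
for t in range(1, n):
    a0[t] = 1.0 + 0.5 * a0[t - 1]
    a[t] = w_true + c_true / np.sqrt(2.0) * a[t - 1]
    b[t] = e_true + c_true / np.sqrt(2.0) * b[t - 1]

omega0_hat, c0_hat, omega_hat, eta_hat, c_hat = far1_sieve_mle(a0, a, b)
print(omega0_hat, c0_hat, omega_hat, eta_hat, c_hat)
```

Because the toy data contain no innovation noise, the estimates should reproduce the generating parameters up to rounding; with real data they are least-squares-type estimates per frequency.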

We implement the FAR modelling to forecast the daily gas flow curves. The h-step ahead gas flow forecast, denoted as \({{\hat{X}}}_{t+h}(\tau )\) is:

$$\begin{aligned} {{\hat{X}}}_{t+h}(\tau )={{\hat{\mu }}}(\tau )+\int _0^1{{\hat{K}}}(\tau -s)[X_{t}(s)-{{\hat{\mu }}}(s)]ds. \end{aligned}$$

At each forecast point, we estimate the Fourier coefficients to obtain the estimated mean function \({{\hat{\mu }}}(\cdot )\) and kernel operator \({{\hat{K}}}(\cdot )\). The fitted model is then used to compute 1- and 2-step-ahead forecasts (\(h=1, 2\)) of the gas flow curves.
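In coefficient space, the forecast can be sketched as below. The code assumes that estimates of \(\omega \), \(\eta \) and \(c\) are given, and it obtains multi-step forecasts by iterating the one-step recursion h times; this iteration is our assumption about how the 2-step-ahead forecast is formed, and all function names and numbers are hypothetical.

```python
import numpy as np


def forecast_coefficients(a0_t, a_t, b_t, omega0, c0, omega, eta, c, h=1):
    """Iterate the coefficient recursion h times starting from day t."""
    a0 = float(a0_t)
    a = np.asarray(a_t, dtype=float)
    b = np.asarray(b_t, dtype=float)
    omega = np.asarray(omega, dtype=float)
    eta = np.asarray(eta, dtype=float)
    slope = np.asarray(c, dtype=float) / np.sqrt(2.0)
    for _ in range(h):
        a0 = omega0 + c0 * a0
        a = omega + slope * a
        b = eta + slope * b
    return a0, a, b


def curve_from_coefficients(a0, a, b, tau):
    """Evaluate the truncated Fourier expansion at the time points tau."""
    tau = np.asarray(tau, dtype=float)
    k = np.arange(1, len(a) + 1)[:, None]
    cos_part = np.sqrt(2.0) * np.cos(2 * np.pi * k * tau)
    sin_part = np.sqrt(2.0) * np.sin(2 * np.pi * k * tau)
    return a0 + a @ cos_part + b @ sin_part


# Hypothetical parameters with a single frequency (m = 1).
hours = (np.arange(24) + 0.5) / 24
a0_f, a_f, b_f = forecast_coefficients(
    4.0, [2.0], [0.0],
    omega0=1.0, c0=0.5, omega=[0.0], eta=[0.0], c=[0.5 * np.sqrt(2.0)], h=2,
)
forecast_curve = curve_from_coefficients(a0_f, a_f, b_f, hours)
print(forecast_curve.round(3))
```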

3.2 Long short-term memory network

In this section, we use a Long Short-Term Memory (LSTM) network to predict the gas flow based on the previous 24 h. LSTMs are a special type of Recurrent Neural Network (RNN). RNNs were introduced in the eighties (see, e.g., [18, 39]) to model time interrelations by allowing connections between hidden units with a time delay [35]. At each iteration, the hidden state vector receives the input vector and its previous hidden state. The hidden state vector can therefore be seen as a representation of time sequences [36].

Long Short-Term Memory (LSTM) networks, proposed by [22], include a memory gate that controls what passes through the network and what is blocked, so that some of the information that is fed back to the network is remembered and some is forgotten. An additional gate keeps memory and filters out what has to be forgotten. At each time step, the network memorizes the information and filters out what is not relevant for the prediction. Finally, another set of gates ignores what is irrelevant. The formulation of the LSTM, as presented in [28], consists of three layers that are called gates: the input (3), forget (4) and output (7) gates, respectively:

$$\begin{aligned} i_t = \mathrm{sig}(W_i \cdot [h_{t-1}, x_t] + b_i) \end{aligned} \tag{3}$$
$$\begin{aligned} f_t = \mathrm{sig}(W_f \cdot [h_{t-1}, x_t] + b_f) \end{aligned} \tag{4}$$

where \(x_t\) is the input at time t and \(h_{t-1}\) is the hidden state computed at time \(t-1\), from which the current hidden state \(h_t\) is calculated. Each gate has a sigmoid function so that its values range between 0 and 1.

The decision on which information will be stored into a cell is based on the input layer (3) and on a hyperbolic tangent (or sigmoid) function assigned to the layer that returns the set of candidate values, \( {\hat{C}}\) (5):

$$\begin{aligned} {\hat{C}}_t = \tanh (W_C \cdot [h_{t-1}, x_t] + b_C) \end{aligned} \tag{5}$$

To update the cell state \(C_{t-1}\) into \(C_t\), the old state is multiplied by the forget gate, and the new candidate values, scaled by the input gate, are added (6):

$$\begin{aligned} C_t = f_t \circ C_{t-1} + i_t \circ {\hat{C}}_t \end{aligned} \tag{6}$$

Finally, the output gate (7) decides which part of the information is passed to the output. The cell state \(C_t\) is passed through a hyperbolic tangent (or rectified linear) function and multiplied by the output gate to obtain the hidden state (8).

$$\begin{aligned} o_t= \mathrm{sig}(W_o \cdot [h_{t-1},x_t] + b_o) \end{aligned} \tag{7}$$
$$\begin{aligned} h_t= \tanh (C_t) \circ o_t \end{aligned} \tag{8}$$

Thanks to this architecture, the LSTM has the ability to look back several time steps and, thus, to improve the predictions. Standard recurrent neural networks can also look back several time steps, but they suffer from the vanishing or exploding gradient problem, in which the gradients become either very small or very large.
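To make the gate equations concrete, a single LSTM time step can be written in a few lines. This is a minimal sketch of the forward pass only, using tanh in the last step; the weight shapes, the dictionary layout, the random initialization and the toy input are assumptions for illustration and do not correspond to the trained networks described here.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, W, b):
    """One forward step of an LSTM cell.

    W[g] has shape (n_hidden, n_hidden + n_input), b[g] shape (n_hidden,),
    for g in {"i", "f", "C", "o"}.
    """
    z = np.concatenate([h_prev, x_t])          # [h_{t-1}, x_t]
    i_t = sigmoid(W["i"] @ z + b["i"])         # input gate
    f_t = sigmoid(W["f"] @ z + b["f"])         # forget gate
    c_hat = np.tanh(W["C"] @ z + b["C"])       # candidate values
    c_t = f_t * c_prev + i_t * c_hat           # cell state update
    o_t = sigmoid(W["o"] @ z + b["o"])         # output gate
    h_t = np.tanh(c_t) * o_t                   # new hidden state
    return h_t, c_t


# Run the cell over 24 hypothetical hourly values (one input feature).
rng = np.random.default_rng(0)
n_hidden, n_input = 4, 1
W = {g: rng.normal(scale=0.1, size=(n_hidden, n_hidden + n_input)) for g in "ifCo"}
b = {g: np.zeros(n_hidden) for g in "ifCo"}
h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for flow in np.linspace(0.0, 1.0, 24):
    h, c = lstm_step(np.array([flow]), h, c, W, b)
print(h)
```

In a full forecasting model the final hidden state would be mapped to the predicted flows by an additional output layer, which is omitted here.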

The LSTM deployed to forecast gas flow consisted of a single layer and an early stopping function with patience set to four. This means that the training of the network stops as soon as the value of the loss function has not improved for four consecutive iterations. In this work, the parameters used to forecast gas flow are selected manually by trial and error, depending on the type of node.
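The early stopping rule can be illustrated with a short, framework-independent sketch. It assumes that "no improvement" means the loss did not drop below the best value seen so far; the function name, the `min_delta` parameter and the loss sequence are hypothetical.

```python
def early_stopping_epoch(losses, patience=4, min_delta=0.0):
    """Return the index of the iteration at which training would stop."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(losses):
        if loss < best - min_delta:
            best = loss          # improvement: remember it and reset the counter
            wait = 0
        else:
            wait += 1            # no improvement in this iteration
            if wait >= patience:
                return epoch
    return len(losses) - 1       # patience never exhausted


# Hypothetical loss curve that stalls at 0.7 for four iterations.
losses = [1.0, 0.8, 0.7, 0.7, 0.7, 0.7, 0.7, 0.6]
stop_at = early_stopping_epoch(losses, patience=4)
print(stop_at)
```

Note that with this rule the later improvement to 0.6 is never reached, which is the usual trade-off of a small patience value.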

Table 3 Parameters of the LSTM

The configuration of the parameters of the network for each node is reported in Table 3. The most influential parameter is the batch size, i.e., the number of training examples that are used in each iteration. The higher the value of this parameter, the faster the training of the network. Only two types of activation functions are selected for the hidden state, depending on the node: the hyperbolic tangent function (tanh) or the sigmoid. The transfer function of the output gate selected for all nodes, except for one network node, is the Rectified Linear Unit (ReLU). Overall, for storage nodes, which are characterized by high variability between negative and positive values and by a high number of hours with zero flows, the batch size was set between 48 and 80 and the number of neurons between 70 and 80.

3.3 Mathematical programming (MP) for time series forecasting

In this section, we use Linear Programs (LP) together with Mixed Integer Linear Programs (MILP) to predict the flows, i.e., the supplies and demands of the gas network. We are given a set of measurements \(m_{d,h} \in M\) for each day \(d \in D\) and each hour \(h \in H\), and define \(M_{d} \subseteq M\) as the subset of the measurements before day d. The features \(i \in F_h = \{1, \ldots , n_h\}\) are defined as arbitrary functions of historical flow values, \(f_{h,i}(d):D \rightarrow M_d\) for \(i \le p_h \le n_h\), and exogenous variables \(f_{h,i}(d)\) for \(i \in \{p_h + 1, \ldots , n_h\}\). We can approximate the gas flow with a weighted sum of features

$$\begin{aligned} p_{d,h} = \sum _{i \in F_h} w_{h,i} f_{h,i}(d) \end{aligned} \tag{9}$$

where \(p_{d,h}\) is the flow value which is approximated, and \(w_{h,i}\) define the weights.

The approximation error is defined as

$$\begin{aligned} e_{d,h} = p_{d,h}-m_{d,h} \end{aligned}$$

and the optimal weights are calculated by minimizing the sum of absolute errors for each day d and hour h

$$\begin{aligned} \min \sum _{d \in D, h \in H} |e_{d,h}| \end{aligned}$$

This problem is not an LP because of the nonlinear absolute value in the objective function, but it can be transformed into an LP. We can rewrite the error \(e_{d,h}\) as the difference of two non-negative variables:

$$\begin{aligned} e_{d,h} = e^+_{d,h}-e^-_{d,h} \end{aligned}$$

Then the transformed objective function is

$$\begin{aligned} \min \sum _{d \in D, h \in H} |e^+_{d,h} - e^-_{d,h}| \end{aligned}$$

For a solution to be optimal with respect to the objective, \(e_{d,h}^+\cdot e_{d,h}^- = 0\) must hold, so we can write

$$\begin{aligned} |e_{d,h}^+ - e_{d,h}^-|=|e_{d,h}^+|+|e_{d,h}^-|=e_{d,h}^+ + e_{d,h}^- \end{aligned}$$

and consequently the final LP problem becomes

$$\begin{aligned} \min \quad&\sum _{d \in D, h \in H} (e^+_{d,h} + e^-_{d,h}) \\ {\text {subject to}} \quad&\sum _{i \in F_h} f_{h,i}(d) \cdot w_{h,i} - m_{d,h}= e^+_{d,h} - e^-_{d,h} \quad {\text {for all }} d \in D, h \in H\\&e^+_{d,h}, e^-_{d,h} \ge 0 \\&w_{h,i} \in {\mathbb {R}} \end{aligned}$$

Furthermore, we can force our model to be unbiased by requiring

$$\begin{aligned} \sum _{d \in D, h \in H} (e^+_{d,h} - e^-_{d,h}) = 0 \end{aligned}$$

and we can set bounds \(l \le w_{h,i} \le u\) on the weights to limit the influence of any single feature.

For each day in the test set (the out-of-sample days that we want to forecast), the forecasted flow values are obtained in two steps: first, the weights are computed by solving an LP on 16 weeks of historical data; then, the weighted sum of features (9) is evaluated for each hour. The lower and upper bounds for the weights are set to \(l=-2\) and \(u=2\), respectively. When a feature refers to an hour that does not lie in the past at forecast time, the forecasted flow value of that hour is used as input in place of a measurement.
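The recursive use of forecasted values can be sketched as follows. This is a simplified illustration, not the study's implementation: it assumes only three features per hour (flow of the previous hour, flow of the same hour on the previous day, and an offset), and the dictionary keys and numbers are hypothetical.

```python
def forecast_day(last_measured_flow, flows_yesterday, weights):
    """Forecast 24 hourly flows for one gas day.

    weights[h] is a dict with the hypothetical keys "prev_hour",
    "same_hour_yesterday" and "offset" holding the weights for hour h.
    """
    forecast = []
    prev = last_measured_flow  # a measurement: last hour before the forecast day
    for h in range(24):
        w = weights[h]
        p = (w["prev_hour"] * prev
             + w["same_hour_yesterday"] * flows_yesterday[h]
             + w["offset"])
        forecast.append(p)
        prev = p  # later hours use the forecast, since no measurement exists yet
    return forecast


# Toy usage with identical (made-up) weights for every hour.
weights = [{"prev_hour": 0.5, "same_hour_yesterday": 0.5, "offset": 0.0}] * 24
day_forecast = forecast_day(80.0, [100.0] * 24, weights)
```

Only the first hour sees a measured previous-hour flow. Every later hour is fed the forecast of the hour before it, so early errors can propagate through the day.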

3.4 Training: feature selection

In the training procedure of this method, a slightly different model is used, which automatically chooses for each hour the B most important features in order to limit over-fitting in the LP. To this end, we add binary variables \(x_{h,i}\) to the problem, which indicate whether feature i is chosen for hour h, i.e., whether the weight of feature i for hour h is allowed to be nonzero. We then link these variables to the weight variables

$$\begin{aligned} x_{h,i} \cdot l \le w_{h,i} \le x_{h,i} \cdot u \end{aligned}$$

and limit the number of chosen features by B

$$\begin{aligned} \sum _{i \in F} x_{h,i} \le B \end{aligned}$$

The solution of the resulting MILP yields, for each hour h, one feature set \(F_h\) of at most B features that are most important for this hour.
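Putting the pieces together, the training problem can be written as the following MILP. This is only a compact restatement of the constraints above under three assumptions: the linking constraint acts on the weight variable \(w_{h,i}\), the cardinality constraint is imposed for every hour h, and the bounds satisfy \(l \le 0 \le u\), so that \(x_{h,i} = 0\) forces \(w_{h,i} = 0\).

$$\begin{aligned} \min \quad&\sum _{d \in D, h \in H} (e^+_{d,h} + e^-_{d,h}) \\ {\text {subject to}} \quad&\sum _{i \in F} f_{h,i}(d) \cdot w_{h,i} - m_{d,h} = e^+_{d,h} - e^-_{d,h} \quad {\text {for all }} d \in D, h \in H \\&x_{h,i} \cdot l \le w_{h,i} \le x_{h,i} \cdot u \quad {\text {for all }} h \in H, i \in F \\&\sum _{i \in F} x_{h,i} \le B \quad {\text {for all }} h \in H \\&e^+_{d,h}, e^-_{d,h} \ge 0, \quad w_{h,i} \in {\mathbb {R}}, \quad x_{h,i} \in \{0,1\} \end{aligned}$$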

Table 4 List of features

The features used in this study are listed in Table 4. The whole feature set F consists of 29 features based on historical flow values, one temperature feature, two features describing the position of the predicted gas day in the week, and the offset feature. Based on a sensitivity analysis, the number of chosen features was limited to six. One year of historical measurements was used for training and for selecting the optimal set of features for every node and hour. Figure 3 shows a heatmap of the features selected by the MP model for each group of nodes, summed over 24 h.

Fig. 3 Heatmap of selected features for different node groups

For all nodes, the flow of the previous hour (\(f_1\)) is the most frequently used feature. For all hours except the first predicted gas hour, this feature value is calculated from the forecasted flow of the previous hour when the final forecast is computed. The Same hour yesterday feature (\(f_4\)) is also widely selected in all groups. As expected, the Weekend (\(f_{31}\)), Evening (\(f_{32}\)) and Mean temperature difference (\(f_{30}\)) features are usually chosen only for the Municipal utilities, since the behaviour of those nodes shows strong daily, weekly and seasonal patterns. For Industry nodes, the Mean flow of the same and previous day (\(f_{29},f_{15}\)) together with the Ratio features \(f_{11},f_{12}\) are the most frequently chosen features. For Transfer Points and Storages, the mean flow of the same and previous day (\(f_{29},f_{15}\)) also dominates, except for the first gas hour, where this pattern is not present. For all observed nodes, the Offset feature (representing the bias in the model) is selected very frequently.

Figure 4 shows scatter plots of the flow versus the three most frequently chosen features for different types of nodes and all hours of the day. The flow depends linearly on \(f_4\), while the other features show a nonlinear dependency.

Fig. 4 Scatter plot of some frequently chosen features vs flow

3.4.1 Hybrid model

The main advantage of the mathematical programming method (MP) proposed in this paper is its flexibility: new features can easily be added as they become available in order to improve the forecast.

In this section, we propose a hybrid model (HYB), which combines mathematical programming (MP) with the two other proposed methods by adding the outputs of the LSTM and FAR models as exogenous inputs to the MP model. The optimal sets of features chosen for every node and hour were kept from the previous MP training and extended with the forecasts from the LSTM and FAR models as additional features. The final forecast is calculated as the weighted sum of all features from the extended feature set:

$$\begin{aligned} p_{d,h}=\sum _{i \in F} f_{h,i}(d) \cdot w_{h,i} + {\text{LSTM}}_h(d)\cdot w_{h,{\text{LSTM}}}+ {\text{FAR}}_h(d) \cdot w_{h,{\text{FAR}}} \end{aligned}$$

New optimal weights are calculated by running an LP:

$$\begin{aligned} \min \quad&\sum _{d \in D, h \in H} (e^+_{d,h} + e^-_{d,h}) \\ {\text {subject to}} \quad&p_{d,h} - m_{d,h} = e^+_{d,h} - e^-_{d,h} \quad {\text {for all }} d \in D, h \in H \\&p_{d,h}=\sum _{i \in F} f_{h,i}(d) \cdot w_{h,i} + {\text{LSTM}}_h(d)\cdot w_{h,{\text{LSTM}}}+ {\text{FAR}}_h(d) \cdot w_{h,{\text{FAR}}} \\&e^+_{d,h}, e^-_{d,h} \ge 0 \\&w_{h,i}, w_{h,{\text{LSTM}}},w_{h,{\text{FAR}}} \in {\mathbb {R}} \end{aligned}$$

where \({\text{ LSTM}}_h(d)\) and \({\text{ FAR}}_h(d)\) represent the forecasted values obtained from the LSTM and FAR models, respectively.
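A minimal sketch of how the hybrid forecast for one day and hour is evaluated once the weights are known. The function name and all numbers are hypothetical. In the study the weights come from the LP above rather than being set by hand.

```python
def hybrid_forecast(features, weights, lstm_forecast, far_forecast, w_lstm, w_far):
    """Weighted sum of the selected MP features plus the LSTM and FAR forecasts."""
    mp_part = sum(f * w for f, w in zip(features, weights))
    return mp_part + lstm_forecast * w_lstm + far_forecast * w_far


# Made-up values: two MP features (a flow feature and the offset) and two
# model forecasts. A zero weight removes the LSTM forecast from the result.
p = hybrid_forecast(
    features=[100.0, 1.0],
    weights=[0.5, 10.0],
    lstm_forecast=500.0,
    far_forecast=120.0,
    w_lstm=0.0,
    w_far=0.5,
)
```

Because the LSTM and FAR forecasts enter as ordinary features, a weight close to zero is enough to suppress a model that performed poorly on the training window.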

4 Testing and results

4.1 Influence of temperature

Temperature is one of the most important factors that influence gas consumption. When natural gas is consumed for heating, such as in residential areas, the temperature usually has an inverse relationship with gas consumption, which also depends on other factors such as the time of day, the day of the week, the season, etc. Furthermore, temperature and time of day are the factors that most strongly affect the forecast error [45].

The majority of models presented in the literature focus on residential and small commercial consumers at the individual or aggregate level. Several authors have considered the temperature (e.g. [19, 33, 34]) or meteorological data [46] in their models to forecast gas flow and reduce the prediction error. In [50], the authors point out that the nonlinear influence of temperature was established long ago and that gas consumption is proportional to the Heating Degree Day (HDD) [5]. This proportionality becomes evident when plotting the average daily temperature versus the gas consumption.
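For readers unfamiliar with the HDD, the following sketch shows its usual definition: the number of degrees by which the mean daily temperature falls below a base temperature, and zero otherwise. The base temperature of 18 °C is an assumption for this example. The value varies between countries and studies, and the source does not state which one is used.

```python
def heating_degree_day(mean_daily_temp_c, base_temp_c=18.0):
    """HDD for one day: positive only when the day is colder than the base.

    base_temp_c = 18.0 is an assumed, illustrative base temperature.
    """
    return max(0.0, base_temp_c - mean_daily_temp_c)


cold_day = heating_degree_day(5.0)    # colder than the base temperature
warm_day = heating_degree_day(25.0)   # warmer than the base temperature
```

The kink at the base temperature is what makes the relationship between temperature and heating-driven consumption nonlinear: above the base, further warming no longer changes the HDD.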

Fig. 5 Scatter plot of gas consumption daily change vs temperature change

The scatter plots of daily changes of temperature versus daily changes of gas flow for the nodes considered in this work are shown in Fig. 5. Storage nodes have a high percentage of zero flows; they are the nodes that best approximate the nonlinear relationship expressed by the HDD. As expected, the Municipal nodes present a positive correlation between the flow and the temperature. One of the network nodes presents a negative relationship with the temperature. The remaining nodes appear to be independent of the temperature.

4.2 Results

4.2.1 Objectives, setup and evaluation metrics

In this paper, we consider a data set of hourly gas flow time series from 12 nodes with 17,520 observations each (2 years).

Our goal is to predict values \(p_{d,0}\) to \(p_{d,23}\) for a given \(d\in D\). Of course only data for days \(d-1\) and earlier can be used. All proposed methods are tested on the last 60 days of the data set.
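The evaluation setup can be sketched as a rolling day-ahead scheme. The sketch assumes that the two years correspond to 730 gas days (17,520 hourly values divided by 24) and that the 16-week training window described for the MP model is applied to every test day. The function name is hypothetical, and the other models may use different training windows.

```python
def rolling_windows(n_days=730, n_test_days=60, train_days=16 * 7):
    """Yield (train_day_indices, test_day) pairs for a day-ahead evaluation.

    Each test day is forecast using only the train_days days before it.
    """
    for test_day in range(n_days - n_test_days, n_days):
        yield range(test_day - train_days, test_day), test_day


windows = list(rolling_windows())
first_train, first_test = windows[0]
last_train, last_test = windows[-1]
```

Every training range ends at day \(d-1\), so no information from the forecast day itself enters the fit.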

Our basis for comparison is the mean absolute deviation (MAD) between the hourly forecast and the measured flow over one day (24 h) of day-ahead forecasts, defined as

$$\begin{aligned} {MAD_d}:= \frac{1}{|H|}\sum _{h \in H}|p_{d,h}-m_{d,h}| \end{aligned}$$

and the mean absolute percentage error (MAPE), defined as

$$\begin{aligned} {MAPE_d}:= \frac{1}{|H|}\sum _{h \in H}\left| \frac{p_{d,h}-m_{d,h}}{m_{d,h}}\right| \end{aligned}$$

All results are also compared to a baseline (BAS) forecast defined as

$$\begin{aligned} {\hat{p}}_{d,h}:= m_{d-1,h} \end{aligned}$$

Even though our main goal is to predict the average hourly gas flows for the next 24 h as precisely as possible, we adopted mean daily errors (MAD and MAPE) as accuracy metrics, bearing in mind that the transmission system operator has a certain flexibility throughout the day. In particular, this means that, in principle, even the maximum hourly error could be compensated by technical measures of the TSO, as long as errors of the same direction do not persist for several hours.
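The two daily metrics and the baseline can be sketched as follows. The function names are hypothetical. The MAPE here is returned as a fraction (multiply by 100 for percent), and the sketch assumes non-zero measured flows: how hours with zero flow are treated is not stated in the text, and this code would raise a ZeroDivisionError for them.

```python
def mad(forecast, measured):
    """Mean absolute deviation over the hours of one day."""
    return sum(abs(p - m) for p, m in zip(forecast, measured)) / len(measured)


def mape(forecast, measured):
    """Mean absolute percentage error as a fraction; assumes measured != 0."""
    return sum(abs((p - m) / m) for p, m in zip(forecast, measured)) / len(measured)


def baseline_forecast(measured_previous_day):
    """Baseline (BAS): repeat the measured flows of the previous day."""
    return list(measured_previous_day)


# Toy example with two hours instead of 24 (made-up numbers).
measured_today = [100.0, 200.0]
model_forecast = [110.0, 180.0]
bas_forecast = baseline_forecast([90.0, 220.0])

mad_model = mad(model_forecast, measured_today)
mape_model = mape(model_forecast, measured_today)
mad_bas = mad(bas_forecast, measured_today)
```

Because both metrics average over the day, a single large hourly error weighs less than a persistent error of the same sign, which matches the operational flexibility described above.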

4.2.2 Forecasting results

The mean MAD and MAPE (over the 60 days in the test set) achieved by the proposed models are presented in Table 5.

It can be seen that, among the individual forecasting methods, the LSTM model is the most robust one and obtained the best results for nodes from all 4 behavior groups. The MP model achieved the best results for all Municipal utilities, and its performance is especially good for the MUN2 node, where all models had a particularly high MAPE. The FAR model outperformed the others for the Industry node IND2. Even though the FAR errors are slightly higher than those of the other two proposed models (MP and LSTM), the performance is very similar for most of the nodes. For Storage nodes with intermittent behaviour, none of the proposed individual methods demonstrated adequate accuracy.

The Hybrid model showed an improvement for all 4 groups of nodes. The improvement is especially significant for Storage nodes, where the average MAPE is lower by more than 2% for nodes STO1 and STO2. For node STO3, the LSTM model has the lowest MAPE, but the lowest MAD is achieved by the HYB model.

The four types of nodes considered in the paper have very different behaviors, and the time series of the corresponding flows can be forecasted with widely varying accuracy. The Storage nodes are characterized by intermittent, large-scale flows that can, in addition, go in both directions. The MUN2 node also shows some intermittent behavior, especially in the test period, which results in higher mean daily errors.

Figure 6 shows the 24 h ahead forecasts of all proposed models together with the measured flow for a 1 week period.

Table 5 Comparison of mean daily performance

Fig. 6 24 hours ahead forecast for 1 week test set

The proposed Hybrid model is built on top of the Mathematical Programming (MP) method, which represents a weighted sum of features (calculated from previous flow values as well as some exogenous variables) chosen by Mixed Integer Programming (MIP) in the offline regime. The Hybrid model’s main advantage is that new features, such as forecasts from different models, can easily be included in order to improve the final result. The Hybrid model aims to provide a forecast that is at least ‘as good as’ the best model’s result, since the fitted weights can discard the influence of a previously inaccurate model. Even though single forecast models outperform the Hybrid model for some nodes, the proposed method shows robust, stable accuracy very similar to that of the best individual model and, in some cases, brings a significant improvement. For example, for the MUN2 node, which shows some intermittent behaviour and significant changes in the daily levels, the LSTM model fails to give a good forecast, with a mean daily MAPE of more than 50%. In contrast, the Hybrid model sets the weights of the corresponding features such that the LSTM forecast is discarded and the forecast is provided by a linear combination of the other features, which reduces the MAPE by 26%.

5 Conclusions

In this paper, we proposed a robust and powerful hybrid forecast model combining Mathematical Programming with Functional AutoRegressive and Long Short-Term Memory neural network models for forecasting gas flows at the boundary nodes of a gas transport network. Our experiments are based on real-world data from one of Germany’s largest transmission system operators, Open Grid Europe. We showed that the proposed method is appropriate for choosing an optimal set of features and for forecasting the various behaviours of different node groups in a complex gas transmission network. The obtained results show that, even though single forecast models outperform the Hybrid model in some specific cases, the proposed method achieves stable accuracy close to that of the best individual model and in some cases brings a significant improvement to the forecast quality.