1 Introduction

Mixed-Integer Linear Programming (MILP) is concerned with the modelling and solving of problems from discrete optimisation. These problems can represent real-world scenarios, where discrete decisions can be appropriately captured and modelled by integer variables. In real-world scenarios a MILP model is rarely solved only once. More frequently, the same model is used with varying data to describe different instances of the same problem, which are solved on a regular basis. This holds true in particular for decision support systems, which can utilise MILP to provide real time optimal decisions on a continual basis, see (Beliën et al. 2009) and (Ruiz et al. 2004) for examples in nurse scheduling and vehicle routing. The MILPs that these decision support systems solve have identical structure due to both their underlying application and cyclical nature, and thus often have similar optimal solutions. Our aim is to exploit this repetitive structure, and create generative neural networks that generate binary decision encodings for subsets of important variables. These encodings can be used in a primal heuristic by solving the induced subproblem following variable fixations. Additionally, the result of the primal heuristic can be used in a warm start context to help improve solver performance in achieving global optimality. We demonstrate the performance of our neural network (NN) design on the transient gas optimisation problem (Ríos-Mercado and Borraz-Sánchez 2015), specifically on real-world instances embedded in day ahead decision support systems.

The design of our framework is inspired by the recent development of Generative Adversarial Networks (GANs) (Goodfellow 2016). Our design consists of two NNs, a generator and a discriminator. The generator is responsible for generating binary decision values, while the discriminator is tasked with predicting the optimal objective function value of the reduced MILP that arises after fixing these binary variables to their generated values. Our NN design and its application to transient gas network MILP formulations is an attempt to integrate Machine Learning (ML) into the MILP solving process. This integration has recently received an increased focus (Tang et al. 2019; Bertsimas and Stellato 2019; Gasse et al. 2019), which has been encouraged by the success of ML integration into other aspects of combinatorial optimisation, see (Bengio et al. 2018) for a thorough overview.

The paper is structured as follows: Sect. 2 contains an overview of the literature with comparisons to our work. In Sect. 3, we introduce our main contribution, a new generative NN design for learning binary variables of parametric MILPs. Afterward, we outline a novel data generation approach for generating synthetic gas transport instances in Sect. 4. Section 5 outlines the exact training scheme for our new NN and how our framework can be used to warm start MILPs. Finally, in Sect. 6, we show and discuss the results of our NN design on real-world gas transport instances. This represents a major contribution, as the trained NN generates a primal solution in 2.5 s and via warm start reduces solution time to achieve global optimality by 60.5%.

2 Background and related work

As mentioned in the introduction, the intersection of MILP and ML is currently an area of active and growing research. For a thorough overview of deep learning (DL), the relevant subset of ML used throughout this article, we refer readers to Goodfellow et al. (2016), and for MILP to Achterberg (2007). We will highlight previous research from this intersection that we believe is either tangential, or may have shared applications to that presented in this paper. Additionally, we will briefly detail the state-of-the-art in transient gas transport, and highlight why our design is of practical importance. It should be noted as well that there are recent research activities aiming at the reverse direction, with MILP applied to ML instead of the orientation we consider, see (Wong and Kolter 2017) for an interesting example.

Firstly, we summarise applications of ML to adjacent areas of the MILP solving process. Gasse et al. (2019) create a method for encoding MILP structure in a bipartite graph representing variable-constraint relationships. This structure is the input to a Graph Convolutional Neural Network (GCNN), which imitates strong branching decisions. The strength of their results stems from intelligent network design and the generalisation of their GCNN to problems of a larger size, albeit with some generalisation loss. Zarpellon et al. (2020) take a different approach, and use a NN design that incorporates the branch-and-bound tree state directly. In doing so, they show that information contained in the global branch-and-bound tree state is an important factor in variable selection. Furthermore, their publication is one of the few to present results on heterogeneous instances. Etheve et al. (2020) show a successful implementation of reinforcement learning for variable selection. Tang et al. (2019) show preliminary results of how reinforcement learning can be used in cutting plane selection. By restricting themselves exclusively to Gomory cuts, they are able to produce an agent capable of selecting better cuts than default solver settings for specific classes of problems.

There exists a continuous trade-off between model fidelity and complexity in the field of transient gas optimisation, and there is no standard model for transient gas transport problems. Moritz (2007) presents a piecewise linear MILP approach to the transient gas transport problem, (Burlacu et al. 2019) a nonlinear approach with a novel discretisation scheme, and (Hennings et al. 2020) and (Hoppmann et al. 2019) a linearised approach. For the purpose of our experiments, we use the model of Hennings et al. (2020), which uses linearised equations and focuses on gas subnetworks with many controllable elements. The current research of ML in gas transport is still in the early stages. Pourfard et al. (2019) use a dual NN design to perform online calculations of a compressor's operating point to avoid re-solving the underlying model. The approach constrains itself to continuous variables and experimental results are presented for a gunbarrel gas network. MohamadiBaghmolaei et al. (2014) present a NN combined with a genetic algorithm for learning the relationship between compressor speeds and the fuel consumption rate in the absence of complete data. More often, ML has been used in fields closely related to gas transport, as in Hanachi et al. (2018), with ML used to track the degradation of compressor performance, and in Petkovic et al. (2019) to forecast demand values at the boundaries of the gas network. For a more complete overview of the transient gas literature, we refer readers to Ríos-Mercado and Borraz-Sánchez (2015).

Our framework, which predicts the optimal objective value of an induced sub-MILP, can be considered similar to Baltean-Lugojan et al. (2019) in what it predicts and similar to Ferber et al. (2019) in how it works. In the first paper (Baltean-Lugojan et al. 2019), a NN is used to predict the associated objective value improvements on cuts. This is a smaller scope than our prediction, but is still heavily concerned with the MILP formulation. In the second paper (Ferber et al. 2019), a technique is developed that performs backward passes directly through a MILP. It does this by solving MILPs exclusively with cutting planes, and then receiving gradient information from the KKT conditions of the final linear program. This application of a NN, which produces input to the MILP, is very similar to our design. The differences arise in that we rely on a NN discriminator to appropriately distribute the loss instead of solving a MILP directly, and that we generate variable values instead of parameter values with our generator.

While our framework is heavily inspired by GANs (Goodfellow 2016), it is also similar to actor-critic algorithms, see (Pfau and Vinyals 2016). These algorithms have shown success for variable generation in MILP, and are notably different in that they sample from a generated distribution for downstream decisions instead of always taking the decision with highest probability. Recently, (Chen et al. 2020) generated a series of coordinates for a set of UAVs using an actor-critic based algorithm, where these coordinates were continuous variables in a Mixed-Integer Non-Linear Program (MINLP) formulation. The independence of separable subproblems and the easily realisable value function within their formulation resulted in a natural Markov Decision Process interpretation. For a better comparison of the similarities between actor-critic algorithms and GANs, we refer readers to Pfau and Vinyals (2016).

Finally, we summarise existing research that also deals with the generation of decision variable values for Mixed-Integer Programs. Bertsimas and Stellato (2018, 2019) attempt to learn optimal solutions of parametric MILPs and Mixed-Integer Quadratic Programs (MIQPs), which involves both outputting all integer decision variable values and the active set of constraints. They mainly use optimal classification trees in Bertsimas and Stellato (2018) and NNs in Bertsimas and Stellato (2019). Their aim is tailored towards smaller problem classes, where speed is an absolute priority and parameter value changes are limited. Masti and Bemporad (2019) learn binary warm start decisions for MIQPs. They use NNs with a loss function that combines binary cross entropy and a penalty for infeasibility. Their goal of obtaining a primal heuristic is similar to ours, and while their design is much simpler, it has been shown to work effectively on very small problems. Our improvement over this design is our non-reliance on labelled optimal solutions, which are needed for binary cross entropy. Ding et al. (2019) present a GCNN design which is an extension of Gasse et al. (2019), and use it to generate binary decision variable values. Their contributions are a tripartite graph encoding of MILP instances, and the inclusion of their aggregated generated values as branching decisions in the branch-and-bound tree, both in an exact approach and in an approximate approach with local branching (Fischetti and Lodi 2003). Very recently, (Nair et al. 2020) combined the branching approach of Gasse et al. (2019) with a novel neural diving approach, in which integer variable values are generated. They use a GCNN for generating both branching decisions and integer variable values. Different to our generator-discriminator based approach, they generate values directly from a learned distribution, which is based on an energy function that incorporates resulting objective values.

3 The solution framework

We begin by formally defining both a MILP and a NN. Our definition of a MILP is an extension of more traditional formulations, see (Achterberg 2007), but still encapsulates general instances.

Definition 1

Let \(\pi \in {\mathbb {R}}^p\) be a vector of problem-defining parameters, where \(p \in {\mathbb {Z}}_{\ge 0}\). We use \(\pi \) as a subscript to indicate that a variable is parameterised by \(\pi \). We call \({\mathbb {P}}_{\pi }\) a MILP parameterised by \(\pi \).

$$\begin{aligned} \begin{aligned} {\mathbb {P}}_{\pi }:={\left\{ \begin{array}{ll} \quad \text {min} &{} \quad c_{1}^{\mathsf {T}}x_{1} + c_{2}^{\mathsf {T}}x_{2} + c_{3}^{\mathsf {T}}z_{1} + c_{4}^{\mathsf {T}}z_{2} \\ \quad \text {s.t.} &{} \quad A_{\pi } \begin{bmatrix} x_{1} \\ x_{2} \\ z_{1} \\ z_{2} \end{bmatrix} \le b_{\pi } \\ &{} \quad c_{k} \in {\mathbb {R}}^{n_{k}}, k \in \{1,2,3,4\}, A_{\pi } \in {\mathbb {R}}^{m \times n}, b_{\pi } \in {\mathbb {R}}^{m} \\ &{} \quad x_{1} \in {\mathbb {R}}^{n_{1}}, x_{2} \in {\mathbb {R}}^{n_{2}}, z_{1} \in \{0,1\}^{n_{3}}, z_{2} \in \{0,1\}^{n_{4}}, n_{k} \in {\mathbb {Z}}_{\ge 0} \end{array}\right. } \end{aligned} \end{aligned}$$

Furthermore let \(\Sigma \subset {\mathbb {R}}^p\) be a set of valid problem-defining parameters. We then call \(\{{\mathbb {P}}_{\pi } \; \vert \; \pi \in \Sigma \}\) a problem class for \(\Sigma \). Lastly, we denote the optimal objective function value of \({\mathbb {P}}_{\pi }\) by \(f({\mathbb {P}}_{\pi })\).

Note that the explicit parameter space \(\Sigma \) is usually unknown, but we assume in the following to have access to a random variable \(\Pi \) that samples from \(\Sigma \). In addition, note that \(c_{i}\) and \(n_{i}\) are not parameterised by \(\pi \), and as such the objective function and variable dimensions do not change between scenarios. In Definition 3 we define an oracle NN \({\mathbb {G}}_{\theta _{1}}\), which predicts a subset of the binary variables of \({\mathbb {P}}_{\pi }\), namely \(z_{1}\). Additionally, the continuous variables \(x_{2}\) are separated in order to differentiate the slack variables in our example, which we will introduce in Sect. 4.

We now provide a simple definition for feed forward NNs. For a larger variety of definitions, see (Goodfellow et al. 2016).

Definition 2

Let \(N_{\theta }\) be defined by:

$$\begin{aligned} \begin{aligned} N_{\theta }&: {\mathbb {R}}^{|a_{1}|} \xrightarrow {} {\mathbb {R}}^{|a_{k+1}|} ; \quad a_{1} \xrightarrow {} N_{\theta }(a_{1}) = a_{k+1} \\ h_{i}&: {\mathbb {R}}^{|a_{i}|} \xrightarrow {} {\mathbb {R}}^{|a_{i}|} \quad \forall i \in \{2,...,k+1\} \\ a_{i+1}&:= h_{i+1}(W_{i}a_{i} + b_{i}) \quad \forall i \in \{1,...,k\}\\ W_i&\in {\mathbb {R}}^{|a_{i+1}|\times |a_{i}|}, \;\; b_i \in {\mathbb {R}}^{|a_{i+1}|} \quad \forall i \in \{1,...,k\} \end{aligned} \end{aligned}$$
(1)

We then call \(N_{\theta }\) a k-layer feed forward NN. Here, \(\theta \) is the vector of all weights \(W_{i}\) and biases \(b_{i}\) of the NN. The functions \(h_{i}\) are non-linear element-wise functions, called activation functions. Additionally, \(a_i\), \(b_i\), and \(W_i\) are tensors for all \(i \in \{1, \dots , k\}\).
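The recursion in (1) can be illustrated with a minimal sketch in plain Python. The helper names (`affine`, `forward`), the dense list-of-lists representation, and the example weights are assumptions made purely for illustration; they are not the implementation used in the paper.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def relu(v: Vector) -> Vector:
    """Element-wise ReLU, used here as an example activation h_i."""
    return [max(0.0, x) for x in v]


def affine(W: Matrix, a: Vector, b: Vector) -> Vector:
    """Compute W a + b for a dense weight matrix W and a bias vector b."""
    return [sum(w * x for w, x in zip(row, a)) + bias for row, bias in zip(W, b)]


def forward(a1: Vector,
            weights: Sequence[Matrix],
            biases: Sequence[Vector],
            activations: Sequence[Callable[[Vector], Vector]]) -> Vector:
    """Apply a_{i+1} = h_{i+1}(W_i a_i + b_i) for i = 1..k and return a_{k+1}."""
    a = a1
    for W, b, h in zip(weights, biases, activations):
        a = h(affine(W, a, b))
    return a


# Hypothetical 2-layer network mapping R^2 -> R^2 -> R^1.
W1 = [[1.0, -1.0], [0.5, 0.5]]
b1 = [0.0, -1.0]
W2 = [[2.0, 1.0]]
b2 = [0.5]
output = forward([3.0, 1.0], [W1, W2], [b1, b2], [relu, relu])
```

Here \(\theta \) corresponds to the collection `W1, b1, W2, b2`, and `output` plays the role of \(a_{k+1}\) with \(k = 2\).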

Definition 3

For a problem class \(\{{\mathbb {P}}_{\pi } \; \vert \; \pi \in \Sigma \}\), let the generator \({\mathbb {G}}_{\theta _{1}} \) be a NN predicting \(z_{1}\), and the discriminator \({\mathbb {D}}_{\theta _{2}}\) be a NN predicting \(f({\mathbb {P}}_{\pi })\) for \(\pi \in \Sigma \), i.e.

$$\begin{aligned} \begin{aligned} {\mathbb {G}}_{\theta _{1}}&: {\mathbb {R}}^{p} \xrightarrow {} (0,1)^{n_{3}} \\ {\mathbb {D}}_{\theta _{2}}&: (0,1)^{n_{3}} \times {\mathbb {R}}^{p} \xrightarrow {} {\mathbb {R}} \end{aligned} \end{aligned}$$
(2)

Furthermore, a forward pass of both \({\mathbb {G}}_{\theta _{1}}\) and \({\mathbb {D}}_{\theta _{2}}\) is defined as follows:

$$\begin{aligned} \hat{z_{1}}&:= {\mathbb {G}}_{\theta _{1}} (\pi ) \end{aligned}$$
(3)
$$\begin{aligned} {\hat{f}}({\mathbb {P}}_{\pi }^{\hat{z_{1}}})&:= {\mathbb {D}}_{\theta _{2}} (\hat{z_{1}},\pi ) \end{aligned}$$
(4)

The hat notation is used to denote quantities that were approximated by a NN. We use superscript notation to create the following instances:

$$\begin{aligned} {\mathbb {P}}_{\pi }^{\hat{z_{1}}}:= {\mathbb {P}}_{\pi } \quad \text {s.t.} \quad z_{1} = [\hat{z_{1}}] \end{aligned}$$
(5)

The square brackets around \(\hat{z_{1}}\) denote the rounding of values from the range (0, 1) to \(\{0,1\}\), which is required as the variable values must be binary.

Fig. 1. The general design of \({\mathbb {N}}_{\{\theta _{1},\theta _{2}\}}\)

The goal of this framework is to generate good initial solution values \(\hat{z_{1}}\), which lead to an induced sub-MILP, \({\mathbb {P}}_{\pi }^{\hat{z_{1}}}\), whose optimal solution is a good feasible solution to the original problem \({\mathbb {P}}_{\pi }\). Further, the idea is to use this feasible solution as a first incumbent for warm starting the solution process of \({\mathbb {P}}_{\pi }\). To ensure feasibility for all choices of \(z_{1} \), we divide the continuous variables into two sets, \(x_{1}\) and \(x_{2}\), as seen in Definition 1. The variables \(x_{2}\) are potential slack variables that ensure all generated decisions result in feasible \({\mathbb {P}}_{\pi }^{\hat{z_{1}}} \) instances, and are penalised in the objective. We now describe the design of \({\mathbb {G}}_{\theta _{1}}\) and \({\mathbb {D}}_{\theta _{2}}\).
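To make the relation between \({\mathbb {P}}_{\pi }\) and an induced sub-MILP concrete, the following sketch solves a tiny, purely hypothetical binary program by enumeration, once without fixings and once with \(z_{1}\) fixed. The costs, the covering constraint, and the function `solve` are illustrative assumptions and have no connection to the gas transport model; the point is only that the optimum of the induced problem is feasible for the original problem and therefore gives an upper bound on its optimal value.

```python
from itertools import product
from typing import Optional, Tuple

# Hypothetical costs for the binary variables (z1[0], z1[1], z2[0], z2[1]).
COSTS = (4.0, 1.0, 3.0, 2.0)


def feasible(z: Tuple[int, ...]) -> bool:
    """Hypothetical covering constraint: at least two variables are switched on."""
    return sum(z) >= 2


def solve(fixed_z1: Optional[Tuple[int, int]] = None
          ) -> Optional[Tuple[float, Tuple[int, ...]]]:
    """Enumerate all binary assignments, optionally with z1 fixed.

    Returns (optimal value, optimal assignment), or None if infeasible.
    """
    best = None
    for z in product((0, 1), repeat=4):
        if fixed_z1 is not None and z[:2] != fixed_z1:
            continue  # assignment violates the fixing z1 = fixed_z1
        if not feasible(z):
            continue
        value = sum(c * v for c, v in zip(COSTS, z))
        if best is None or value < best[0]:
            best = (value, z)
    return best


global_optimum = solve()          # original problem
induced_optimum = solve((1, 0))   # induced problem with z1 fixed to (1, 0)
```

In this toy instance the fixing \(z_{1} = (1, 0)\) is a poor choice, since the induced optimum is strictly worse than the global one; a good generator would instead propose \(z_{1} = (0, 1)\), for which both values coincide.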

3.1 Generator and discriminator design

\({\mathbb {G}}_{\theta _{1}}\) and \({\mathbb {D}}_{\theta _{2}}\) are NNs whose structure is inspired by Goodfellow (2016), as well as both inception blocks and residual NNs, which have greatly increased large-scale model performance (Szegedy et al. 2017). We use the block design Resnet-v2 from Szegedy et al. (2017), see Fig. 3, albeit with a modification that primarily uses 1-D convolutions, with that dimension being time. Additionally, we separate initial input streams by their characteristics, and when joining two streams use 2-D convolutions. These 2-D convolutions reduce the data back to 1-D, see Fig. 2 for a visualisation of this process. The final layer of \({\mathbb {G}}_{\theta _{1}}\) contains a softmax activation function with temperature. As the softmax temperature parameter increases, this activation function’s output approaches a one-hot vector encoding. The final layer of \({\mathbb {D}}_{\theta _{2}}\) contains a softplus activation function. All other intermediate layers use the ReLU activation function. We refer readers to Goodfellow et al. (2016) for a thorough overview of deep learning, and to Fig. 14 in Appendix 1 for our complete design.

Fig. 2. Method of merging two 1-D input streams

For a vector \(x=(x_{1}, \cdots , x_{n})\), the softmax function with temperature \(T \in {\mathbb {R}}\), \(\sigma _{1}\), the ReLU function, \(\sigma _{2}\), and the softplus function with parameter \(\beta \in {\mathbb {R}}\), \(\sigma _{3}\), are defined as:

$$\begin{aligned} \sigma _{1}(x_{i},T)&:= \frac{\exp (Tx_{i})}{\sum _{j=1}^{n}\exp (Tx_{j})} \end{aligned}$$
(6)
$$\begin{aligned} \sigma _{2}(x_{i})&:= \max (0,x_{i}) \end{aligned}$$
(7)
$$\begin{aligned} \sigma _{3}(x_{i},\beta )&:= \frac{1}{\beta }\log (1 + \exp (\beta x_{i})) \end{aligned}$$
(8)
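The activation functions (6) to (8) can be sketched in plain Python as follows. The max-shift in the softmax and the rearranged softplus are standard numerical-stability reformulations and an assumption of this sketch; they do not change the mathematical definitions above. The example scores and temperatures are hypothetical.

```python
import math
from typing import List


def softmax_with_temperature(x: List[float], T: float) -> List[float]:
    """Eq. (6): exp(T x_i) / sum_j exp(T x_j), shifted by the maximum for stability."""
    shift = max(T * v for v in x)
    exps = [math.exp(T * v - shift) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def relu(x: List[float]) -> List[float]:
    """Eq. (7): element-wise max(0, x_i)."""
    return [max(0.0, v) for v in x]


def softplus(x: List[float], beta: float) -> List[float]:
    """Eq. (8): log(1 + exp(beta x_i)) / beta, written to avoid overflow."""
    out = []
    for v in x:
        z = beta * v
        # log(1 + exp(z)) = max(z, 0) + log(1 + exp(-|z|))
        out.append((max(z, 0.0) + math.log1p(math.exp(-abs(z)))) / beta)
    return out


scores = [1.0, 2.0, 0.5]                          # hypothetical pre-activations
soft = softmax_with_temperature(scores, 1.0)      # smooth distribution
sharp = softmax_with_temperature(scores, 50.0)    # close to one-hot
```

With the convention of (6), in which \(T\) multiplies the input, a larger \(T\) concentrates the output on the largest entry, which is why `sharp` is nearly a one-hot vector while `soft` is not.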
Fig. 3. 1-D Resnet-v2 Block Design

We compose \({\mathbb {G}}_{\theta _{1}}\) and \({\mathbb {D}}_{\theta _{2}}\) to make \({\mathbb {N}}_{\{\theta _{1},\theta _{2}\}}\). The definition of this composition is given in (9), and a visualisation in Fig. 1.

$$\begin{aligned} {\mathbb {N}}_{\{\theta _{1},\theta _{2}\}} (\pi ) := {\mathbb {D}}_{\theta _{2}} ({\mathbb {G}}_{\theta _{1}} (\pi ),\pi ) \end{aligned}$$
(9)

3.2 Interpretations

In a similar manner to GANs and actor-critic algorithms, see (Pfau and Vinyals 2016), the design of \({\mathbb {N}}_{\{\theta _{1},\theta _{2}\}}\) has a bi-level optimisation interpretation, see (Dempe 2002) for an overview of bi-level optimisation. Here we list the explicit objectives of both \({\mathbb {G}}_{\theta _{1}}\) and \({\mathbb {D}}_{\theta _{2}}\), and how their loss functions represent these objectives.

The objective of \({\mathbb {D}}_{\theta _{2}}\) is to predict \(f({\mathbb {P}}_{\pi }^{\hat{z_{1}}})\), the optimal induced objective value of \({\mathbb {P}}_{\pi }^{\hat{z_{1}}}\). Its loss function is thus:

$$\begin{aligned} L(\theta _{2}, \pi ) := \big | {\mathbb {D}}_{\theta _{2}} ({\mathbb {G}}_{\theta _{1}} (\pi ), \pi ) - f({\mathbb {P}}_{\pi }^{{\mathbb {G}}_{\theta _{1}} (\pi )}) \big | \end{aligned}$$
(10)

The objective of \({\mathbb {G}}_{\theta _{1}}\) is to minimise the induced prediction of \({\mathbb {D}}_{\theta _{2}}\). Its loss function is thus:

$$\begin{aligned} L'(\theta _{1}, \pi ) := {\mathbb {D}}_{\theta _{2}} ({\mathbb {G}}_{\theta _{1}} (\pi ), \pi ) \end{aligned}$$
(11)

The corresponding bi-level optimisation problem can then be viewed as:

$$\begin{aligned} \begin{aligned} \min \limits _{\theta _{1}}&\quad {\mathbb {E}}_{\pi \sim \Pi } [ {\mathbb {D}}_{\theta _{2}} ({\mathbb {G}}_{\theta _{1}} (\pi ), \pi ) ] \\ \text {s.t.}&\quad \min \limits _{\theta _{2}} \quad {\mathbb {E}}_{\pi \sim \Pi } [ |{\mathbb {D}}_{\theta _{2}} ({\mathbb {G}}_{\theta _{1}} (\pi ), \pi ) - f({\mathbb {P}}_{\pi }^{{\mathbb {G}}_{\theta _{1}} (\pi )}) |] \end{aligned} \end{aligned}$$
(12)
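The composition (9), the loss functions (10) and (11), and sample-average versions of the two objectives in (12) can be written down generically, as in the sketch below. The generator, discriminator, and oracle are passed in as plain callables; the toy stand-ins at the bottom are hypothetical and only serve to make the sketch runnable. In particular, `oracle` stands for a routine returning \(f({\mathbb {P}}_{\pi }^{\hat{z_{1}}})\), which in practice requires solving the induced MILP.

```python
from typing import Callable, Sequence, Tuple

Vector = Sequence[float]
Generator = Callable[[Vector], Vector]             # pi -> z1_hat
Discriminator = Callable[[Vector, Vector], float]  # (z1_hat, pi) -> predicted objective
Oracle = Callable[[Vector, Vector], float]         # (z1_hat, pi) -> true induced objective


def composed_network(G: Generator, D: Discriminator, pi: Vector) -> float:
    """Eq. (9): N(pi) = D(G(pi), pi)."""
    return D(G(pi), pi)


def discriminator_loss(G: Generator, D: Discriminator, oracle: Oracle, pi: Vector) -> float:
    """Eq. (10): absolute error between predicted and true induced objective."""
    z1_hat = G(pi)
    return abs(D(z1_hat, pi) - oracle(z1_hat, pi))


def generator_loss(G: Generator, D: Discriminator, pi: Vector) -> float:
    """Eq. (11): the discriminator's prediction for the generated decisions."""
    return composed_network(G, D, pi)


def empirical_objectives(G: Generator, D: Discriminator, oracle: Oracle,
                         samples: Sequence[Vector]) -> Tuple[float, float]:
    """Sample averages of the upper- and lower-level objectives in (12)."""
    g_mean = sum(generator_loss(G, D, pi) for pi in samples) / len(samples)
    d_mean = sum(discriminator_loss(G, D, oracle, pi) for pi in samples) / len(samples)
    return g_mean, d_mean


# Hypothetical stand-ins, chosen only so that the values are easy to check by hand.
toy_G = lambda pi: [0.75, 0.25]
toy_D = lambda z, pi: sum(z) + sum(pi)
toy_oracle = lambda z, pi: 2 * round(z[0]) + round(z[1]) + sum(pi)

objectives = empirical_objectives(toy_G, toy_D, toy_oracle, [[2.0], [4.0]])
```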

3.3 Training method

For effective training of \({\mathbb {G}}_{\theta _{1}}\), a capable \({\mathbb {D}}_{\theta _{2}}\) is needed. We therefore pre-train \({\mathbb {D}}_{\theta _{2}}\). The following loss function, which replaces \({\mathbb {G}}_{\theta _{1}} (\pi )\) with synthetic \(z_{1}\) values in (10), is used for this pre-training:

$$\begin{aligned} L''(\theta _{2}, \pi ) := \big | {\mathbb {D}}_{\theta _{2}} (z_{1}, \pi ) - f({\mathbb {P}}_{\pi }^{z_{1}}) \big | \end{aligned}$$
(13)

However, performing this initial training requires generating instances of \({\mathbb {P}}_{\pi }^{z_{1}}\), and is therefore done offline, in a supervised manner, on synthetic data.

After the initial training of \({\mathbb {D}}_{\theta _{2}}\), we train \({\mathbb {G}}_{\theta _{1}}\) as a part of \({\mathbb {N}}_{\{\theta _{1},\theta _{2}\}}\), using samples \(\pi \sim \Pi \), the loss function (11), and fixed \(\theta _{2}\). The issue of \({\mathbb {G}}_{\theta _{1}}\) outputting continuous values for \(\hat{z_{1}}\) is overcome by the choice of the final activation function of \({\mathbb {G}}_{\theta _{1}}\). The softmax with temperature (6) ensures that adequate gradient information still exists to update \(\theta _{1}\), and that the results are near binary. When using these results to explicitly solve \({\mathbb {P}}_{\pi }^{\hat{z_{1}}}\), we round our result to a one-hot vector encoding along the appropriate dimension.
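The rounding to a one-hot encoding can be sketched as an arg-max along the relevant dimension. The layout assumed below, with one row per time step and one column per candidate decision, as well as the tie-breaking rule (lowest index wins), are assumptions of this sketch and are not specified in the text.

```python
from typing import List


def round_to_one_hot(z_hat: List[List[float]]) -> List[List[int]]:
    """Round each row of near-binary values to a one-hot vector via arg-max.

    Assumed layout: z_hat[t][j] is the generator output for decision j at
    time step t. Ties are broken in favour of the lowest index.
    """
    rounded = []
    for row in z_hat:
        best = max(range(len(row)), key=row.__getitem__)
        rounded.append([1 if j == best else 0 for j in range(len(row))])
    return rounded


# Hypothetical generator output for two time steps and three candidate decisions.
z1_hat = [[0.70, 0.20, 0.10],
          [0.05, 0.05, 0.90]]
z1_fixed = round_to_one_hot(z1_hat)
```

Each row of `z1_fixed` contains exactly one entry equal to 1, so the result can be used directly as a variable fixing when solving \({\mathbb {P}}_{\pi }^{\hat{z_{1}}}\).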

After the completion of both initial trainings, we alternately train both NNs using updated loss functions in the following way:

  • \({\mathbb {D}}_{\theta _{2}}\) training:

    • As in the initial training, using loss function (13).

    • In an online fashion, using predictions from \({\mathbb {G}}_{\theta _{1}}\) and loss function (10).

  • \({\mathbb {G}}_{\theta _{1}}\) training:

    • As explained above with loss function (11).
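The alternation listed above can be sketched as a simple control loop. The three training steps are passed in as callables, and their order within a round, the number of rounds, and the function name `alternate_training` are assumptions of this sketch; the precise schedule is given by the training algorithm in Sect. 5.

```python
from typing import Callable, List

Step = Callable[[], None]


def alternate_training(num_rounds: int,
                       d_offline_step: Step,
                       d_online_step: Step,
                       g_step: Step) -> None:
    """Alternate between discriminator and generator updates.

    d_offline_step: update theta_2 with loss (13) on synthetic z1 values.
    d_online_step:  update theta_2 with loss (10) on generator predictions.
    g_step:         update theta_1 with loss (11) while theta_2 is kept fixed.
    """
    for _ in range(num_rounds):
        d_offline_step()
        d_online_step()
        g_step()


# Hypothetical stand-ins that only record the order in which the steps are called.
calls: List[str] = []
alternate_training(
    2,
    lambda: calls.append("D with loss (13)"),
    lambda: calls.append("D with loss (10)"),
    lambda: calls.append("G with loss (11)"),
)
```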

Our design allows the loss to be back-propagated through \({\mathbb {D}}_{\theta _{2}}\) and distributed to the individual nodes of the final layer of \({\mathbb {G}}_{\theta _{1}}\) that correspond to \(z_{1}\). This is largely different to other methods, many of which rely on using loss against optimal solutions of \({\mathbb {P}}_{\pi } \), see (Masti and Bemporad 2019; Ding et al. 2019). Our advantage over these is that the contribution to \({\hat{f}}({\mathbb {P}}_{\pi }^{\hat{z_{1}}})\) of each predicted decision \(\hat{z_{1}}\) can be calculated. Additionally, we believe that it makes our generated suboptimal solutions more likely to be near optimal. We believe this because the NN is trained to minimise a predicted objective rather than copy previously observed optimal solutions.

4 The gas transport model and data generation

To evaluate the performance of our approach, we test our framework on the transient gas optimisation problem, see (Ríos-Mercado and Borraz-Sánchez 2015) for an overview of the problem and associated literature. This problem is difficult to solve as it combines a transient flow problem with complex combinatorics representing switching decisions. However, the natural modelling of transient gas networks as time-expanded networks lends itself well to our framework, due to the static underlying gas network and the repeated constraints at each time step. In this section we summarise important aspects of our MILP formulation, and outline our methods for generating artificial gas transport data.

4.1 The gas transport model

We use the description of transient gas networks by Hennings et al. (2020). This model contains operation modes, which are binary decisions corresponding to the \(z_{1}\) of Definition 1. Exactly one operation mode is selected each time step, and this decision decides on the discrete states of all controllable elements in the gas network for that time step. We note that we deviate slightly from the model by Hennings et al. (2020), and do not allow the inflow over a set of entry-exits to be freely distributed according to the group to which they belong. This is an important distinction as each single entry-exit in our model has a complete forecast.

The model by Hennings et al. (2020) contains slack variables that change the pressure and flow demand scenarios at entry-exits. These slack variables are represented by \(x_{2}\) in Definition 1, and because of their existence we have yet to find an infeasible instance \({\mathbb {P}}_{\pi }^{z_{1}}\) for any choice of \(z_{1}\). We believe that infeasible scenarios can be induced with sufficiently small time steps, but this is not the case in our experiments. The slack variables \(x_{2}\) are penalised in the objective.

4.2 Data generation

In this subsection we outline our methods for generating synthetic transient gas instances for training purposes, i.e. generating \(\pi \sim \Pi \) and artificial \(z_{1}\) values. Section 4.2.1 introduces a novel method for generating balanced demand scenarios, followed by Sect. 4.2.2 that outlines how to generate operation mode sequences. Afterward, Sect. 4.2.3 presents an algorithm, which generates initial states of a gas network. These methods are motivated by the lack of available gas network data, see (Yueksel Erguen et al. 2020; Kunz et al. 2017), and the need for large amounts of data to train our NN.

4.2.1 Boundary forecast generation

Let \(d_{v,t} \in {\mathbb {R}}\) be the flow demand of entry-exit \(v \in {\mathcal {V}} ^\text {b} \) at time \(t \in {\mathcal {T}} \), where \({\mathcal {V}} ^\text {b} \) is the set of entry-exit nodes and \({\mathcal {T}} \) is our discrete time horizon. Note that in Hennings et al. (2020) these variables are written with hat notation, which we omit to avoid confusion with predicted values. We generate a balanced demand scenario in which the demands are bounded by the largest historically observed values and the change in demand between consecutive time steps is bounded. Additionally, the demands of any two entries or exits from the same fence group, \(g\in {\mathcal {G}} \), see (Hennings et al. 2020), may differ by at most a fixed amount within the same time step. Let \({\mathcal {I}}\) denote the set of real-world instances, where the superscript \(i \in {\mathcal {I}}\) indicates that the value is an observed value in real-world instance i, and the superscript ‘sample’ indicates the value is sampled. A complete flow forecast consists of \(d_{v,t} ^{\text {sample}}\) values for all \(v \in {\mathcal {V}} ^\text {b} \) and \(t \in {\mathcal {T}} \) that satisfy the following constraints:

$$\begin{aligned}&\sum _{v \in {\mathcal {V}} ^\text {b}} d_{v,t} ^{\text {sample}} = 0 \quad \forall t \in {\mathcal {T}} \end{aligned}$$
(14)
$$\begin{aligned}&\begin{aligned} M_{\text {q}}&= \max _{v \in {\mathcal {V}} ^\text {b}, t \in {\mathcal {T}}, i \in {\mathcal {I}}} | d_{v,t} ^{i} | \\ d_{v,t} ^{\text {sample}}&\in \left[ -\frac{21}{20}M_{\text {q}}, \frac{21}{20}M_{\text {q}}\right] \\ \end{aligned} \end{aligned}$$
(15)
$$\begin{aligned}&\begin{aligned} | d_{v,t} ^{\text {sample}} - d_{v,t-1} ^{\text {sample}} |&\le 200 \quad \forall t \in {\mathcal {T}}, \quad v \in {\mathcal {V}} ^\text {b} \\ \text {sign}(d_{v,t} ^{\text {sample}})&= {\left\{ \begin{array}{ll} 1 \quad \text {if} \quad \text {v is an entry} \\ -1 \quad \text {if} \quad \text {v is an exit} \end{array}\right. } \forall t \in {\mathcal {T}}, \quad v \in {\mathcal {V}} ^\text {b} \\ | d_{v_{1},t} ^{\text {sample}} - d_{v_{2},t} ^{\text {sample}} |&\le 200 \quad \forall t \in {\mathcal {T}}, \quad v_{1},v_{2} \in g, g\in {\mathcal {G}}, v_{1},v_{2} \in {\mathcal {V}} ^\text {b} \end{aligned} \end{aligned}$$
(16)

To generate demand scenarios that satisfy constraints (14) and (15), we use the method proposed in Rubin (1981). Its original purpose was to generate samples from the Dirichlet distribution, but it can be used for a special case of the Dirichlet distribution that is equivalent to a uniform distribution over a simplex in 3 dimensions. Such a simplex is exactly described by (14) and (15) for each time step. Hence we can apply the sampling method for all time steps and reject all samples that do not satisfy constraints (16). We note that 3 dimensions are sufficient for our gas network, and that the rejection method would scale poorly to higher dimensions.
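The two ingredients of this procedure, uniform sampling over a simplex and rejection of samples that violate additional constraints, can be sketched as follows. The sketch draws from the unit simplex using sorted uniform gaps, which yields the flat Dirichlet distribution; the mapping from such samples to actual demand values, and the acceptance test used here, are hypothetical placeholders and do not reproduce constraints (14) to (16).

```python
import random
from typing import Callable, List, Optional


def sample_simplex(n: int, rng: random.Random) -> List[float]:
    """Draw uniformly from {w >= 0, sum(w) = 1} via the gaps of sorted uniforms."""
    cuts = sorted(rng.random() for _ in range(n - 1))
    points = [0.0] + cuts + [1.0]
    return [points[i + 1] - points[i] for i in range(n)]


def sample_with_rejection(draw: Callable[[], List[float]],
                          accept: Callable[[List[float]], bool],
                          max_tries: int = 10_000) -> Optional[List[float]]:
    """Redraw until accept(candidate) holds; return None if max_tries is exhausted."""
    for _ in range(max_tries):
        candidate = draw()
        if accept(candidate):
            return candidate
    return None


rng = random.Random(0)  # fixed seed for reproducibility
weights = sample_with_rejection(
    lambda: sample_simplex(3, rng),
    # Hypothetical acceptance test, loosely analogous to a maximal-difference rule.
    lambda w: max(w) - min(w) <= 0.5,
)
```

The cap `max_tries` reflects the remark above: as the dimension grows or the acceptance region shrinks, the expected number of redraws increases quickly.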

In addition to flow demands, we require a pressure forecast for all entry-exits. Let \(p _{v,t} \in {\mathbb {R}}\) be the pressure demand of entry-exit \(v \in {\mathcal {V}} ^\text {b} \) at time \(t \in {\mathcal {T}} \). We generate pressures that respect bounds derived from the largest historically observed values, and have a maximal change for the same entry-exit between time steps. These constraints are described below:

$$\begin{aligned}&\begin{aligned} M_{\text {p}}^{+}&= \max _{v \in {\mathcal {V}} ^\text {b}, t \in {\mathcal {T}}, i \in {\mathcal {I}}} p _{v,t} ^{i} \quad \quad M_{\text {p}}^{-} = \min _{v \in {\mathcal {V}} ^\text {b}, t \in {\mathcal {T}}, i \in {\mathcal {I}}} p _{v,t} ^{i} \\ p _{v,t} ^{\text {sample}}&\in \left[ M_{\text {p}}^{-} - \frac{1}{20}(M_{\text {p}}^{+} - M_{\text {p}}^{-}), M_{\text {p}}^{+} + \frac{1}{20}(M_{\text {p}}^{+} - M_{\text {p}}^{-})\right] \end{aligned} \end{aligned}$$
(17)
$$\begin{aligned}&\quad | p _{v,t} ^{\text {sample}} - p _{v,t-1} ^{\text {sample}} | \le 5 \quad \forall t \in {\mathcal {T}}, \quad v \in {\mathcal {V}} ^\text {b} \end{aligned}$$
(18)
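One simple way to produce a pressure path that satisfies a box constraint of the form (17) together with the step constraint (18) is a bounded random walk, sketched below. Sampling each value uniformly from the admissible interval is an assumption of this sketch, and the numerical bounds in the example are hypothetical; only the maximal change of 5 is taken from (18).

```python
import random
from typing import List


def sample_pressure_path(num_steps: int, lower: float, upper: float,
                         max_change: float, rng: random.Random) -> List[float]:
    """Sample pressures in [lower, upper] whose consecutive values differ by
    at most max_change, drawing each value uniformly from the admissible interval."""
    path = [rng.uniform(lower, upper)]
    for _ in range(num_steps - 1):
        lo = max(lower, path[-1] - max_change)
        hi = min(upper, path[-1] + max_change)
        path.append(rng.uniform(lo, hi))
    return path


rng = random.Random(1)  # fixed seed for reproducibility
# Hypothetical bounds of 40 and 80; max_change = 5 as in constraint (18).
pressures = sample_pressure_path(12, 40.0, 80.0, 5.0, rng)
```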

We now have the tools required to generate artificial forecast data, with the process described in Algorithm 1.

figure a

4.2.2 Operation mode sequence generation

During offline training, \({\mathbb {D}}_{\theta _{2}}\) requires optimal solutions for a fixed \(z_{1}\). In Algorithm 2 we outline a naive yet effective approach of generating reasonable \(z_{1}\) values, i.e., operation mode sequences:

figure b

4.2.3 Initial state generation

In addition to the boundary forecast and operation mode sequence generators, we require a gas constants generator. As these values are assumed to be constant over the time horizon, we generate them only under the constraint that they are bounded by maximum historically observed values. Let \(gas _{k}\) represent the value for the gas constant \(k \in \{\)temperature, inflow norm density, molar mass, pseudo critical temperature, pseudo critical pressure\(\}\), then the following is the single constraint on sampling such values.

$$\begin{aligned} \begin{aligned} M_{gas _{k}}^{+} = \max _{i \in {\mathcal {I}}} gas _{k}^{i} \quad \quad \quad \quad \quad \quad M_{gas _{k}}^{-} = \min _{i \in {\mathcal {I}}} gas _{k}^{i} \quad \quad \quad \quad \\ gas _{k}^{\text {sample}} \in \left[ {M_{gas _{k}}^{-} - \frac{1}{20}(M_{gas _{k}}^{+} - M_{gas _{k}}^{-})}, {M_{gas _{k}}^{+} + \frac{1}{20}(M_{gas _{k}}^{+} - M_{gas _{k}}^{-})}\right] \end{aligned} \end{aligned}$$
(19)
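The sampling interval in (19) extends the historically observed range by one twentieth of its width on each side. A minimal sketch of this computation follows; drawing uniformly from the interval is an assumption, and the observed values in the example are hypothetical.

```python
import random
from typing import Dict, List, Tuple


def widened_bounds(observed: List[float], margin: float = 1 / 20) -> Tuple[float, float]:
    """Return [min - margin * range, max + margin * range] as in (19)."""
    lo, hi = min(observed), max(observed)
    pad = margin * (hi - lo)
    return lo - pad, hi + pad


def sample_gas_constants(history: Dict[str, List[float]],
                         rng: random.Random) -> Dict[str, float]:
    """Sample one value per gas constant from its widened historical range."""
    return {name: rng.uniform(*widened_bounds(values))
            for name, values in history.items()}


# Hypothetical historical observations for two of the gas constants.
history = {
    "temperature": [280.0, 285.0, 290.0],
    "molar mass": [16.0, 18.0],
}
sample = sample_gas_constants(history, random.Random(2))
```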

We now have all the tools to generate synthetic initial states, which is the purpose of Algorithm 3.

figure c

5 Training scheme and warm start algorithm

In this section we introduce how our framework can be used to warm start MILPs and help achieve global optimality with lower solution times. Additionally, we outline the training scheme used for our NN design, as well as the real-world data used as a final validation set. For our application of transient gas instances, \(\pi \) is fully described by the flow and pressure forecast, combined with the initial state.

5.1 Primal heuristic and warm start

We consider the solution of \({\mathbb {P}}_{\pi }^{\hat{z_{1}}}\) as a primal heuristic for the original problem \({\mathbb {P}}_{\pi }\). We aim to incorporate \({\mathbb {N}}_{\{\theta _{1},\theta _{2}\}}\) in a global MILP context and do this by using a partial solution of \({\mathbb {P}}_{\pi }^{\hat{z_{1}}}\) to warm start \({\mathbb {P}}_{\pi }\). The partial solution consists of \(\hat{z_{1}}\), an additional set of binary variables called the flow directions, which are a subset of \(z_{2}\) in Definition 1, and the realised pressure variables of the entry-exits, which are a subset of \(x_{1}\). Note that partial solutions are used, since instances are numerically difficult. The primal heuristic and warm start algorithm are given in Algorithms 4 and 5 respectively.

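
The interplay between the primal heuristic and the warm start can be sketched on a toy problem. The example below is not the transient gas MILP and not Algorithms 4 and 5 themselves: the costs, the cover constraint, and the enumeration-based "solver" are all hypothetical stand-ins. It also passes a complete solution as the incumbent, whereas the approach described above uses a partial one. The sketch only shows how a solution of the subproblem induced by fixing \(\hat{z_{1}}\) supplies an incumbent that lets the full search discard choices of \(z_{1}\) early.

```python
from itertools import product

Z_COST = (9, 3)      # hypothetical costs of the fixed decisions z1
Y_COST = (2, 5, 1)   # hypothetical costs of the remaining binary variables
COVER = 3            # toy constraint: at least 3 variables must be set to 1


def solve_subproblem(z):
    """Solve the toy problem with z fixed; return (value, y) or None if infeasible."""
    best = None
    for y in product((0, 1), repeat=len(Y_COST)):
        if sum(z) + sum(y) < COVER:
            continue  # violates the cover constraint
        value = (sum(c * v for c, v in zip(Z_COST, z))
                 + sum(c * v for c, v in zip(Y_COST, y)))
        if best is None or value < best[0]:
            best = (value, y)
    return best


def solve_full(incumbent=None):
    """Enumerate z, skipping choices whose lower bound cannot beat the incumbent.

    incumbent is an optional (value, z, y) warm start. Returns the best
    solution found and the number of subproblems that were actually solved.
    """
    best, solved = incumbent, 0
    for z in product((1, 0), repeat=len(Z_COST)):
        # Valid lower bound because all entries of Y_COST are non-negative.
        lower_bound = sum(c * v for c, v in zip(Z_COST, z))
        if best is not None and lower_bound >= best[0]:
            continue  # pruned by the incumbent
        solved += 1
        sub = solve_subproblem(z)
        if sub is not None and (best is None or sub[0] < best[0]):
            best = (sub[0], z, sub[1])
    return best, solved


z_hat = (0, 0)                                # stand-in for the generator output
heur_value, heur_y = solve_subproblem(z_hat)  # primal heuristic
cold, cold_solved = solve_full()
warm, warm_solved = solve_full(incumbent=(heur_value, z_hat, heur_y))
print(heur_value, cold, cold_solved, warm, warm_solved)
```

In this toy setting both runs reach the same optimum, but the warm-started run solves fewer subproblems because the heuristic value prunes the expensive choices of \(z_{1}\) before they are explored.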

5.2 Training scheme

We generate our initial training and validation sets offline. This involves generating \(10^{4}\) initial states with parameter time_step set to 8 in Algorithm 3. Additionally, we generate \(4\times 10^{6}\) coupled demand scenarios and operation mode sequences with Algorithms 1 and 2. All instances contain 12 time steps (excluding the initial state) with 30 minutes between each step. This training data is exclusively used by \({\mathbb {D}}_{\theta _{2}}\), and is split into a training set of size \(3.2\times 10^{6}\), a test set of \(4\times 10^{5}\), and a validation set of \(4\times 10^{5}\). The test set is evaluated at every epoch, while the validation set is only referred to at the end of the initial training. Following this initial training, we begin to train \({\mathbb {N}}_{\{\theta _{1},\theta _{2}\}}\) as a whole as described in Algorithm 6, alternating between \({\mathbb {G}}_{\theta _{1}}\) and \({\mathbb {D}}_{\theta _{2}}\). The complete list of parameters used is given in Table 3, with default values being used otherwise. The exact block design of \({\mathbb {N}}_{\{\theta _{1},\theta _{2}\}}\) can be seen in Fig. 3, and the general layout in Fig. 1. For the complete NN design we refer readers to Fig. 14 and Table 6 in the Appendix.
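
The split sizes above correspond to an 80/10/10 partition of the \(4\times 10^{6}\) generated samples. The following sketch reproduces these sizes and partitions a set of sample indices accordingly. The function names are hypothetical, and the random shuffle with a fixed seed is an assumption; the text does not state how samples are assigned to the three sets.

```python
import random


def split_sizes(n_total):
    """Return (train, test, validation) sizes for an 80/10/10 split."""
    n_train = n_total * 8 // 10
    n_test = n_total // 10
    return n_train, n_test, n_total - n_train - n_test


def split_indices(n_total, seed=0):
    """Partition the indices 0..n_total-1 into train, test and validation lists.

    Assumption: assignment is by a seeded random shuffle.
    """
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)
    n_train, n_test, _ = split_sizes(n_total)
    train = indices[:n_train]
    test = indices[n_train:n_train + n_test]
    validation = indices[n_train + n_test:]
    return train, test, validation


print(split_sizes(4_000_000))
train, test, validation = split_indices(1_000)  # small demo size
print(len(train), len(test), len(validation))
```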

5.3 Real world data

Real-world instances, similar to the artificial data, contain 12 time steps with 30 minutes between each step. We focus on Station D from Hennings et al. (2020), and present only results for this station. The topology for Station D can be seen in Fig. 13 in Appendix 1. Station D can be thought of as a T-intersection, and is of average complexity compared to the stations presented in Hennings et al. (2020). The station contains 6 boundary nodes, but they are paired, such that for each pair only one can be active, i.e., have non-zero flow. Our validation set for the final evaluation of \({\mathbb {N}}_{\{\theta _{1},\theta _{2}\}}\) consists of 15 weeks of live real-world data from our project partner OGE, where instances are on average 15 minutes apart and total 9291. Statistics on these real-world MILP instances are provided in Table 1.

Table 1 MILP instance statistics over real-world data

6 Computational results

We partition our results into three subsections. Section 6.1 focuses on the data generation methods, Sect. 6.2 on \({\mathbb {N}}_{\{\theta _{1},\theta _{2}\}}\) during training and its performance on synthetic data, and Sect. 6.3 on the performance of trained \({\mathbb {N}}_{\{\theta _{1},\theta _{2}\}}\) on 15 weeks of real-world transient gas data.

For our experiments we use PyTorch 1.4.0 (Paszke et al. 2019) as our ML modelling framework, Pyomo v5.5.1 (Hart et al. 2017, 2011) as our MILP modelling framework, and Gurobi v9.02 (Gurobi Optimization 2020) as our MILP solver. The MILP solver settings are available in Table 5 in Appendix 1. \({\mathbb {N}}_{\{\theta _{1},\theta _{2}\}}\) is trained on a machine running Ubuntu 18, with 384 GB of RAM, composed of 2x Intel(R) Xeon(R) Gold 6132 running @ 2.60 GHz, and 4x NVIDIA Tesla V100 GPU-NVTV100-16. The final evaluations are performed on a cluster using 4 cores and 16 GB of RAM of a machine composed of 2x Intel Xeon CPU E5-2680 running @ 2.70 GHz.

6.1 Data generation results

Figure 4 (left) shows how our generated flow prognosis compares to that of historic real-world data. We see that Nodes A, B, and C function both as entries and exits, but are dominated by a single orientation for each node over historical data. Specifically, Node C is the general entry, and Nodes A and B are the exits. In addition to the general orientation, we see that each node has significantly different real-world flow distributions, as opposed to the near identical distributions over our artificial data. Figure 4 (right) shows our pressure prognosis compared to that of historic values. Unlike historic flow values, we observe little difference between historic pressure values of different nodes. This is supported by the optimal choices \(z_{1}^{*}\) over the historic data, see Fig. 11, as in most cases the gas network station is in bypass.

These differences between our synthetic data and real-world data are somewhat expected. The underlying distribution of the demand scenarios for both flow and pressure cannot be assumed to be uniform nor conditionally independent unlike in Algorithm 1. Moreover, the sampling range we use is significantly larger than that of the real-world observed values as we take a single maximum and minimum value over all entry-exits. We expect these differences to continue with our other data generation methods. Algorithm 3 was designed to output varied and valid initial states w.r.t. our MILP formulation; however, the choice of operation modes that occur in reality is unlikely to be uniform as generated in Algorithm 2. In reality, some operation modes occur with a much higher frequency than others. Additionally, we rely on a MILP solver to generate new initial states, and therefore cannot rule out the possibility of a bias. We believe these probable differences in distributions call for further research in realistic prognosis generation methods.

6.2 Training results

Figure 5 visualises the losses of \({\mathbb {D}}_{\theta _{2}}\) throughout the initial offline training. We observe that the loss decreases throughout training, highlighting the improvement in \({\mathbb {D}}_{\theta _{2}}\) for predicting \(f({\mathbb {P}}_{\pi }^{z_{1}})\). This is a required result, as without a trained discriminator we cannot expect to train a generator. Both the training and test loss converge to approximately 1000, which is excellent considering the generated \(f({\mathbb {P}}_{\pi }^{z_{1}})\) range well into the millions. The validation loss on synthetic data also converges to approximately 1000, indicating that \({\mathbb {D}}_{\theta _{2}}\) generalises to unseen \({\mathbb {P}}_{\pi }^{z_{1}}\) instances; however, we note that this generalisation does not translate perfectly to real-world data. Despite this, we believe that an average distance of 10000 between \({\hat{f}}({\mathbb {P}}_{\pi }^{z_{1}})\) and \(f({\mathbb {P}}_{\pi }^{z_{1}})\) is still very good. We discuss the issues of different underlying distributions of real-world data and our generated data distributions in Sect. 6.1.

Fig. 4 Comparison of generated value distributions per node vs. the distribution seen in real-world data, for flow (left) and pressure (right)

Fig. 5 The loss per epoch of \({\mathbb {D}}_{\theta _{2}}\) during the initial training of Algorithm 8. The dashed lines show the performance of \({\mathbb {D}}_{\theta _{2}}\) after \({\mathbb {N}}_{\{\theta _{1},\theta _{2}\}}\) has been completely trained. A log scale is used for better visibility of later epochs

Fig. 6 The loss per epoch of \({\mathbb {D}}_{\theta _{2}}\) as it is trained using Algorithm 6

Fig. 7 The training loss per epoch of \({\mathbb {G}}_{\theta _{1}}\) as it is trained using Algorithm 6. (Left) The loss over all epochs. (Right) A magnified view of the loss starting from epoch 20

The training loss during Algorithm 6 for \({\mathbb {D}}_{\theta _{2}}\) is shown in Fig. 6, and for \({\mathbb {G}}_{\theta _{1}}\) in Fig. 7. The observable cyclical increases in the training and test loss of \({\mathbb {D}}_{\theta _{2}}\) occur during the periodic retraining of \({\mathbb {G}}_{\theta _{1}}\). We believe that \({\mathbb {G}}_{\theta _{1}}\) learns how to induce suboptimal predictions during this periodic retraining. \({\mathbb {D}}_{\theta _{2}}\) in turn quickly relearns, but this highlights that learning how to predict \(f({\mathbb {P}}_{\pi }^{\hat{z_{1}}})\) is unlikely without some error. Figure 7 (left) shows the loss over time of \({\mathbb {G}}_{\theta _{1}}\) as it is trained, with Fig. 7 (right) displaying magnified losses for the final epochs. We observe that \({\mathbb {G}}_{\theta _{1}}\) quickly learns important \(z_{1}\) decision values. We hypothesise that this quick descent is helped by \(\hat{z_{1}}\) that are unlikely given our generation method in Algorithm 2. The loss increases following this initial decrease in the case of \({\mathbb {G}}_{\theta _{1}}\), showing the ability of \({\mathbb {D}}_{\theta _{2}}\) to further improve. It should also be noted that significant step-like decreases in loss are absent in both (left) and (right) of Fig. 7. We believe such steps would indicate \({\mathbb {G}}_{\theta _{1}}\) discovering new important \(z_{1}\) values (operation modes). The diversity of produced operation modes, however, see Fig. 11, implies that early in training a complete spanning set of operation modes is derived, and the usage of their ratios is then learned and improved.

6.3 Real-world results

We now present results of our fully trained \({\mathbb {N}}_{\{\theta _{1},\theta _{2}\}}\) applied to the 15 weeks of real-world data. Note that 651 instances have been removed as warm starting resulted in an inconsistency with the set optimality tolerances. These instances have been kept in the graphics, but are marked and conclusions will not be drawn from them. We also note that the linear programming relaxation of the MILP formulation from Hennings et al. (2020) is rather weak, largely due to the big-M constraints that model the controllable network elements. We believe that the weak relaxation is partly responsible for long run times, especially for scenarios that require a lot of slack and need to branch extensively to prove global optimality. This hypothesis is supported by Fig. 9, where the MILP instances that hit the time limit are predominantly those with large objective values.

Figure 8 compares predicted and true objectives for both artificial and real-world data. As expected, the distribution of objective values is visibly different for the artificial validation set compared to the real-world validation set. Our data generation method was intended to be as independent as possible from the historic data, and as a result, the average scenario has an optimal objective value larger than that of any real-world data point. The performance of \({\mathbb {D}}_{\theta _{2}}\) is again clearly visible here, however, with \({\hat{f}}({\mathbb {P}}_{\pi }^{\hat{z_{1}}})\) and \(f({\mathbb {P}}_{\pi }^{\hat{z_{1}}})\) being near identical over the artificial data, keeping in mind that these data points were never used in training. We see that this ability to generalise is relatively much worse on real-world data, which we hypothesise is mainly due to the lower values of \(f({\mathbb {P}}_{\pi })\).

Fig. 8 \({\hat{f}}({\mathbb {P}}_{\pi }^{\hat{z_{1}}})\) for the validation set, and \({\hat{f}}({\mathbb {P}}_{\pi }^{z_{1}^{*}})\) for real-world data, compared to \(f({\mathbb {P}}_{\pi }^{\hat{z_{1}}})\) and \(f({\mathbb {P}}_{\pi })\) respectively. Linear scale (left) and log scale (right)

Fig. 9 A comparison of \(f({\mathbb {P}}_{\pi }^{\hat{z_{1}}})\) and \(f({\mathbb {P}}_{\pi })\) for all real-world data instances

Figure 9 shows the comparison of \(f({\mathbb {P}}_{\pi }^{\hat{z_{1}}})\) and \(f({\mathbb {P}}_{\pi })\). In a similar manner to \({\mathbb {D}}_{\theta _{2}}\), we see that \({\mathbb {G}}_{\theta _{1}}\) struggles with instances where \(f({\mathbb {P}}_{\pi })\) is small. This is visible in the bottom left, where we see \(f({\mathbb {P}}_{\pi }^{\hat{z_{1}}})\) values much larger than \(f({\mathbb {P}}_{\pi })\) for identical parameter values \(\pi \). This comes as little surprise given the struggle of \({\mathbb {D}}_{\theta _{2}}\) with small \(f({\mathbb {P}}_{\pi })\) values. Drawing conclusions becomes more complicated for instances with larger \(f({\mathbb {P}}_{\pi })\) values, because the majority hit the time limit. However, the value of our primal heuristic is clearly visible from those instances where the heuristic retrieves a better solution than the MILP solver does within an hour. Additionally, we see that no unsolved instance above the line \(f({\mathbb {P}}_{\pi }^{\hat{z_{1}}}) = f({\mathbb {P}}_{\pi })\) is very far from the line, showing that our primal heuristic produces a comparable, sometimes equivalent solution, in a much shorter time than the MILP solver’s one hour. For a comparison of solution times, see Table 2.

Table 2 Solution time statistics for different solving strategies
Fig. 10 A comparison of \({\hat{f}}({\mathbb {P}}_{\pi }^{\hat{z_{1}}})\) and \(f({\mathbb {P}}_{\pi }^{\hat{z_{1}}})\) for all real-world data instances

Figure 10 shows the performance of the predictions \({\hat{f}}({\mathbb {P}}_{\pi }^{\hat{z_{1}}})\) compared to \(f({\mathbb {P}}_{\pi }^{\hat{z_{1}}})\). Interestingly, \({\mathbb {D}}_{\theta _{2}}\) generally predicts \({\hat{f}}({\mathbb {P}}_{\pi }^{\hat{z_{1}}})\) values slightly larger than \(f({\mathbb {P}}_{\pi }^{\hat{z_{1}}})\). We expect this for the smaller valued instances, as we know that \({\mathbb {D}}_{\theta _{2}}\) struggles with \(f({\mathbb {P}}_{\pi }^{\hat{z_{1}}})\) instances near 0, but the trend is evident for larger valued instances too. We observe that no data point is too far from the line \({\hat{f}}({\mathbb {P}}_{\pi }^{\hat{z_{1}}}) = f({\mathbb {P}}_{\pi }^{\hat{z_{1}}})\), and conclude, albeit with some generalisation loss, that \({\mathbb {D}}_{\theta _{2}}\) can adequately predict the objective values of the \(\hat{z_{1}}\) solutions from \({\mathbb {G}}_{\theta _{1}}\) despite the change in data sets.

Fig. 11 Frequency of operation mode choice by \({\mathbb {G}}_{\theta _{1}}\) compared to MILP solver for all real-world instances. (Left) Linear scale, and (Right) log scale

We now compare the operation modes \(\hat{z_{1}}\) that are generated by \({\mathbb {G}}_{\theta _{1}}\), and the \(z_{1}^{*}\) that are produced by our MILP solver. To do so we use the following naming convention: We name the three pairs of boundary nodes N (north), S (south), and W (west). Using W_NS_C_2 as an example, we know that flow comes from W, and goes to N and S. The C in the name stands for active compression, and the final index is to differentiate between duplicate names. As seen in Fig. 11, which plots the frequency of specific \(z_{1}\) if they occurred more than 50 times, a single choice dominates \(z_{1}^{*}\). This is interesting, because we expected there to be a lot of symmetry between \(z_{1}\), with the MILP solver selecting symmetric solutions with equal probability. For instance, take operation modes W_NS_C_1 and W_NS_C_2, which differ by their usage of one of two identical compressor machines. \({\mathbb {N}}_{\{\theta _{1},\theta _{2}\}}\) only ever predicts W_NS_C_2, whereas the MILP solver selects each of them with half the frequency. We now suspect that these duplicate choices do not exist in bypass modes, and that the uniqueness of \(z_{1}\) determined by open flow paths results in different \(f({\mathbb {P}}_{\pi }^{z_{1}})\) values. We believe that the central importance of NS_NSW_1 was not learnt by \({\mathbb {N}}_{\{\theta _{1},\theta _{2}\}}\) as over-generalisation to a single choice is strongly punished. For a comprehensive overview of the selection of operation modes and the correlation between \(\hat{z_{1}}\) and \(z_{1}^{*}\), we refer interested readers to Table 4 in Appendix 1.
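
The naming convention can be made concrete with a small parser. The sketch below is an illustration only: the grammar (sources, sinks, an optional C token, and a trailing index, separated by underscores) is inferred from the examples W_NS_C_2 and NS_NSW_1 in the text, and the names `OperationMode`, `parse_operation_mode` and `is_duplicate_pair` are hypothetical.

```python
from collections import namedtuple

OperationMode = namedtuple("OperationMode", "sources sinks compression index")


def parse_operation_mode(name):
    """Split an operation mode name such as 'W_NS_C_2' into its components.

    Assumed grammar: <sources>_<sinks>[_C]_<index>, where sources and sinks
    are strings over the boundary node pairs N, S and W.
    """
    parts = name.split("_")
    sources = tuple(parts[0])
    sinks = tuple(parts[1])
    compression = "C" in parts[2:-1]   # optional token between sinks and index
    index = int(parts[-1])
    return OperationMode(sources, sinks, compression, index)


def is_duplicate_pair(name_a, name_b):
    """True if two distinct names differ only in their final index."""
    a, b = parse_operation_mode(name_a), parse_operation_mode(name_b)
    return a[:3] == b[:3] and a.index != b.index


print(parse_operation_mode("W_NS_C_2"))
print(parse_operation_mode("NS_NSW_1"))
print(is_duplicate_pair("W_NS_C_1", "W_NS_C_2"))
```

Under this reading, W_NS_C_1 and W_NS_C_2 describe the same flow pattern with active compression and differ only in which of the identical compressor machines is used.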

Fig. 12 The combined running time of solving \({\mathbb {P}}_{\pi }^{\hat{z_{1}}}\), and solving a warm started \({\mathbb {P}}_{\pi }\), compared to solving \({\mathbb {P}}_{\pi }\) directly

As discussed above, \({\mathbb {N}}_{\{\theta _{1},\theta _{2}\}}\) cannot reliably produce \(z_{1}^{*}\). Nevertheless, it produces near-optimal \(\hat{z_{1}}\) suggestions, which are still useful in a warm start context, see Algorithm 5. The results of our warm start algorithm are displayed in Fig. 12. Our warm start suggestion was successful 72% of the time, and the algorithm resulted in an average speed up of 60.5%. We use the shifted geometric mean with a shift of 1 s for this measurement to avoid distortion by relative variations of the smaller valued instances. Especially surprising is that some instances that were previously unsolvable within the time limit were easily solvable given the warm start suggestion. As such, we have created an effective primal heuristic that is both quick to run and beneficial in achieving global optimality.
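
For reference, the shifted geometric mean of values \(t_{1},\dots ,t_{n}\) with shift \(s\) is \(\left( \prod _{i}(t_{i}+s)\right) ^{1/n}-s\). The sketch below computes it with a shift of 1 s and derives a relative reduction in solution time from two sets of running times. The running times are made-up numbers, and reading the reported speed up as one minus the ratio of the two means is an assumption made here for illustration.

```python
import math


def shifted_geometric_mean(values, shift=1.0):
    """Shifted geometric mean: (prod(v + shift)) ** (1 / n) - shift."""
    if not values:
        raise ValueError("values must be non-empty")
    # Summing logarithms avoids overflow of the raw product.
    log_sum = sum(math.log(v + shift) for v in values)
    return math.exp(log_sum / len(values)) - shift


# Hypothetical running times in seconds, not taken from Table 2.
baseline_times = [3.0, 15.0, 63.0]
warm_start_times = [1.0, 7.0, 31.0]

baseline = shifted_geometric_mean(baseline_times)
warm = shifted_geometric_mean(warm_start_times)
reduction = 1.0 - warm / baseline  # assumed definition of the speed up
print(baseline, warm, reduction)
```

The shift dampens the influence of instances that solve in a fraction of a second, where small absolute differences would otherwise appear as large relative changes.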

7 Conclusion

In this paper, we have presented a dual NN design for generating decisions in a MILP. This design is trained without ever solving the MILP with unfixed decision variables. The NN is used both as a primal heuristic and to warm-start the MILP solver for the original problem. We have shown the usefulness of our design on the transient gas transport problem. While doing so we have created methods for generating synthetic transient gas data for training purposes, reserving 9291 unseen real-world instances for validation purposes. Despite some generalisation loss, our trained NN results in a primal heuristic that takes on average 2.5 s to run, and results in a 60.5% decrease in global optimal solution time when used as a warm-start solution.

While our approach is an important step forward in NN design and ML’s application to gas transport, we believe that there exist three primary directions for future research. The first is to convert our approach into more traditional reinforcement learning, and then utilise policy gradient approaches, see (Thomas and Brunskill 2017). The major hurdle to this approach is that much of the computation would be shifted online, requiring many more calls to solve the induced MILPs. However, this could be offset by using our technique to initialise the NN for such an approach, thereby avoiding early stage training difficulties. The second is focused on the recent improvements in Graph NNs, see (Gasse et al. 2019). Their ability to generalise to different input sizes would permit the creation of a single NN over multiple gas network topologies. The final direction is to improve data generation techniques for transient gas networks. There exists a gap in the literature for improved methods that are scalable and result in real-world like data.