In a first step, the dual characteristics of specific design solutions from the Pareto fronts are investigated to illustrate the dual mapping of optimal design solutions. To this end, Fig. 4a shows the primal graph of a single design solution of the large WDN. This solution belongs to the 500,000th (final) generation, and its dual graph has in total 878 dual nodes and 1326 dual edges. The resilience value Ir of that solution is 0.785 and the costs are 5.85 million € (see also Fig. 3). The line width of the edges corresponds to the real diameters, and the colors of the edges and the vertices correspond to the order in which the dual nodes were found.
Figure 4b shows the dual graph of the same WDN with force-directed element placement and marker sizes scaled by node degree. The mean node degree in this dual graph is 3.02, with a maximum node degree of 64. The colors of the vertices again follow the order in which the dual nodes were found, using the same color order as in Fig. 4a. However, the spatial correspondence between Fig. 4a and Fig. 4b is difficult to determine.
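The reported mean node degree can be cross-checked from the node and edge counts given for this dual graph. The following minimal Python sketch assumes an undirected graph in which every dual edge contributes to the degree of exactly two dual nodes (no self-loops); the function name is illustrative and not part of the study's tooling.

```python
def mean_degree(num_nodes: int, num_edges: int) -> float:
    """Mean node degree of an undirected graph.

    Uses the handshake lemma: the sum of all node degrees equals
    twice the number of edges (assumes no self-loops).
    """
    if num_nodes <= 0:
        raise ValueError("graph must have at least one node")
    return 2.0 * num_edges / num_nodes


# Counts reported in the text for the dual graph of the final generation.
final_generation_degree = mean_degree(num_nodes=878, num_edges=1326)
print(round(final_generation_degree, 2))
```

Under these assumptions, the counts from the text reproduce the stated mean node degree of 3.02.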
To approximately preserve the spatial location of the components and to better understand how the generalization model reduces the complexity of the primal graph (e.g., for detailed analysis or discussions), the spatial location is now considered when plotting the dual graph. In Fig. 4c, the median Euclidean coordinates of the primal-graph nodes belonging to a dual node are used as the coordinates of that dual node in the spatial layout of the dual graph. It can be seen that the node with the highest degree indeed connects many nodes from many regions of the network. This demonstrates that the edges integrated into this dual node can be viewed as a connector (connecting and transporting flows to many parts of the network), whose existence is of great importance for the functioning of the entire system. In the generalization, that dual node has the minimum diameter available in the design process (76.2 mm). More than 74% (2987 edges) of the pipes in the primal network have that diameter, 1017 of which are integrated into that dual node, with a total pipe length of 58.48 km. The total head loss in that dual node is 342.41 m (of 1625 m in the entire network), and the median flow in these edges is 1.04 L/s (maximum 1886.7 L/s in the entire network). This means that these small-diameter edges are responsible for 21% of the head losses, while their median flow is only about 0.05% of the maximum flow in the network.
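The spatial placement used for Fig. 4c can be sketched as follows. This is a minimal illustration that assumes the median is taken separately for the x and y coordinates of the primal nodes in each dual node; the function and variable names, as well as the example coordinates, are hypothetical.

```python
from statistics import median


def dual_node_positions(primal_xy, membership):
    """Place each dual node at the coordinate-wise median of its primal nodes.

    primal_xy:  {primal_node: (x, y)}
    membership: {dual_node: iterable of primal nodes belonging to it}
    """
    positions = {}
    for dual_node, members in membership.items():
        members = list(members)  # allow generators; we iterate twice
        xs = [primal_xy[n][0] for n in members]
        ys = [primal_xy[n][1] for n in members]
        positions[dual_node] = (median(xs), median(ys))
    return positions


# Hypothetical toy coordinates (not taken from the case studies).
primal_xy = {"a": (0.0, 0.0), "b": (2.0, 0.0), "c": (10.0, 4.0)}
membership = {"D1": ["a", "b", "c"], "D2": ["c"]}
print(dual_node_positions(primal_xy, membership))
```

Compared with the mean, the median is less sensitive to a few far-away primal nodes, which matters for dual nodes that span many regions of the network.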
As a first indicator of how the dual characteristics differ across generations, a dual graph of a design solution from an early generation is examined as an example in Fig. 4d. The dual graph is again plotted with force-directed element placement and marker sizes scaled by node degree. The mean node degree in this dual graph is 3.18, which is almost the same as for the network extracted from the 500,000th generation, but the maximum node degree is only 25. The design solution taken from the earlier stage of optimization thus yields a much smaller maximum node degree. As before, the colors of the edges and the vertices follow the order in which the dual nodes were found. In comparison to Fig. 4b, one can observe a much more finely resolved structure, with fewer edges generalized into each dual node. This indicates a heterogeneous diameter distribution in the early stages of optimization, which reduces the integration of identical edges into dual nodes.
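To make the link between diameter heterogeneity and the number of dual nodes concrete, the following Python sketch applies a deliberately simplified merging rule: pipes that share a primal node and have the same diameter are merged into one dual node, and two dual nodes are connected if any of their pipes meet at a primal node. This rule is an assumption for illustration only; the generalization model described in the "Methods" section may differ in detail, and all names and example values are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def dual_map(pipes):
    """Simplified dual mapping of a pipe network.

    pipes: list of (node_u, node_v, diameter)
    Returns (dual_of, dual_edges), where dual_of[i] is the dual-node id
    of pipe i and dual_edges is a set of connected dual-node id pairs.
    """
    parent = list(range(len(pipes)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Collect the pipes incident to every primal node.
    incident = defaultdict(list)
    for idx, (u, v, _) in enumerate(pipes):
        incident[u].append(idx)
        incident[v].append(idx)

    # Merge adjacent pipes with identical (catalogue) diameter.
    for edges in incident.values():
        for i, j in combinations(edges, 2):
            if pipes[i][2] == pipes[j][2]:
                parent[find(i)] = find(j)

    dual_of = [find(i) for i in range(len(pipes))]

    # Connect dual nodes whose pipes meet at a primal node.
    dual_edges = set()
    for edges in incident.values():
        for i, j in combinations(edges, 2):
            a, b = dual_of[i], dual_of[j]
            if a != b:
                dual_edges.add((min(a, b), max(a, b)))
    return dual_of, dual_edges


# Hypothetical toy network: heterogeneous versus uniform diameters.
mixed = [("n1", "n2", 100.0), ("n2", "n3", 100.0), ("n3", "n4", 76.2),
         ("n4", "n5", 76.2), ("n3", "n6", 150.0)]
uniform = [(u, v, 100.0) for u, v, _ in mixed]

for name, pipes in (("mixed", mixed), ("uniform", uniform)):
    dual_of, dual_edges = dual_map(pipes)
    print(name, len(set(dual_of)), len(dual_edges))
```

In this toy example, the heterogeneous design keeps three dual nodes, whereas the uniform design collapses into a single dual node, which mirrors the qualitative difference between Fig. 4d and Fig. 4b.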
To gain a complete picture of the dual graph characteristics over the different generations and the populations within them, the properties of all design solutions are analyzed statistically. In the following, statistical evaluations of the entire set of investigated dual graphs are shown. With these, we seek to answer two questions: how do the dual characteristics change during the optimization process (i.e., with increasing generations), and are there useful parameters that describe how close a Pareto front is to an optimal solution?
In Fig. 5, the basic graph characteristics of in total 14,300 dual graphs (47 + 47 + 49 generations with a population of 100 each) are investigated. For the small case study (Fig. 5a), the number of dual nodes and dual edges continuously decreases with an increasing number of generations. While the average number of dual nodes in the first (randomly initialized) generation is around 216, it is substantially reduced to 20 by the 300,000th generation. Notably, all dual-mapped design solutions show an almost constant ratio between the number of dual nodes and the number of dual edges. A linear regression yields a slope of k = 1.54 with a coefficient of determination of R2 = 0.9917. However, this regression tends to overestimate the number of dual edges for the more optimized solutions and to underestimate it for the more random solutions. A generation-wise linear fit produces a relatively constant k = 1.48 for generations above 200 and k≈1.6 for the first generations.
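A fit of this kind can be sketched in a few lines of Python. The example below assumes an ordinary least-squares line with intercept (whether the study's regression includes an intercept is not stated) and uses made-up placeholder values rather than data from the case studies; the function name is hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares fit y = slope * x + intercept.

    Returns (slope, intercept, r_squared).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return slope, intercept, 1.0 - ss_res / ss_tot


# Placeholder values only (exactly 1.5 dual edges per dual node).
dual_nodes = [10, 20, 40, 80]
dual_edges = [15, 30, 60, 120]
slope, intercept, r_squared = fit_line(dual_nodes, dual_edges)
print(slope, intercept, r_squared)
```

Applying the same function separately to the solutions of each generation would give a generation-wise slope k of the kind reported in the text.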
A similar behavior can also be observed for the mean dual node degree of all solutions within all generations of the small case study (Fig. 5d). For the random initialization, the median node degree is 3.19 (with almost no variation). As the optimization progresses, the mean node degree declines sharply at the beginning and then recovers between generations 30 and 100. Within this region, the search algorithm in GALAXY appears to have problems preserving the entire range of design solutions with respect to resilience (see also Fig. 6). These results could be of interest for further enhancing the capabilities of search operators.
For the small case study with a smaller number of demand nodes (Fig. 5b), the linear correlation analysis results in a slope of the regression line of k = 1.48 and R2 = 0.9848. A generation-wise linear fit produces relatively constant k = 1.50 for generations above 200 and k≈1.6 for the first generations.
A similar pattern can be observed for the large case study (Fig. 5c). The average number of dual nodes decreases from 3360 in the initial population to 831 in the final generation. Analogous to both small case studies, there is an almost linear correlation between the number of dual nodes and the number of dual edges, with a regression slope of 1.58 and R2 = 0.992 (see Fig. 5c). However, a slight nonlinear trend can again be observed, with fewer dual edges for more optimized solutions (higher number of generations). A generation-wise linear fit produces a relatively constant k = 1.35 for generations above 60,000 and k≈1.6 for the generations before.
For the mean dual node degrees of the design solutions of the large case study (Fig. 5f), the behavior until the 100th generation resembles that of both small case studies (i.e., a sharp decline in the first generations followed by a quick recovery). Thereafter, however, a plateau forms that lasts approximately until generation 10,000. One could hypothesize that this plateau is also present for the small case study, but to a much lesser extent (only until generation 300). After the plateau period, the median dual node degree for the large case study also decreases, to 2.97. However, this drop is not as pronounced as for both small case studies. In summary, all case studies show a sharp drop in the dual node degrees after the first generations, followed by a plateau zone during the last stages of optimization (see also the interquartile ranges, specifically the large interval shown in Fig. 5e, which suggests that the mean degrees remain constant in the upper range). This indicates that the mean dual node degrees tend to remain constant after around 1000 generations.
In Fig. 5a–c, an approximately linear relationship between dual nodes and dual edges was observed. However, solutions from early generations (dark blue dots) tend to lie at the upper limit of the number of dual elements, while the more progressed solutions (e.g., dark red dots) tend to lie at the minimum. Therefore, the number of dual elements, specifically of dual nodes, is now investigated in detail as a function of the generation to see whether this characteristic can be used as an indicator for the progress of the optimization.
To systematically analyze the dual characteristics as a function of the generation, Fig. 6a–c plots the number of dual nodes (y-axis) against the number of generations (x-axis) for the three case studies throughout the optimization process. For each generation, the entire population (all 100 solutions) is shown. The colors of the markers correspond to the generations. For statistical insight, the median (Q50) and the 25% (Q25) and 75% (Q75) percentiles are also plotted to account for the spread in the number of dual nodes. For the interpretation of the progress of the optimization, the ranges of the resilience values are also important. For example, at the random initialization of the optimization process, only a very small range of resilience values is present (see also Fig. 3, dark blue markers). Therefore, in Fig. 6d–f, the median (Q50), the 25% (Q25), and the 75% (Q75) percentiles of the resilience values for the different generations are plotted. A narrow range of resilience values can be observed in the first generation for the small case studies (Fig. 6d, e), while for the large case study, negative values are obtained for the initial generation (technically infeasible solutions that do not fulfill the pressure constraints).
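The per-generation statistics plotted in Fig. 6 can be computed along the following lines. This sketch assumes linear interpolation between data points for the percentiles (the interpolation rule used in the study is not stated) and uses made-up placeholder populations; the function name is hypothetical.

```python
from statistics import quantiles


def quartiles_per_generation(values_by_generation):
    """Return {generation: (Q25, Q50, Q75)} for one value per solution.

    values_by_generation: {generation: list of values, e.g. the number of
    dual nodes or the resilience of every solution in the population}.
    Each generation needs at least two values.
    """
    summary = {}
    for generation, values in values_by_generation.items():
        q25, q50, q75 = quantiles(values, n=4, method="inclusive")
        summary[generation] = (q25, q50, q75)
    return summary


# Placeholder populations of five solutions each (not study data).
dual_nodes_by_generation = {
    1: [1, 2, 3, 4, 5],
    2: [10, 20, 30, 40, 50],
}
print(quartiles_per_generation(dual_nodes_by_generation))
```

The interquartile range Q75 minus Q25 from this summary corresponds to the spread discussed for the resilience values in Fig. 6d–f.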
A closer look at Fig. 6a reveals a pattern almost analogous to that of the dual node degree over the generations; however, despite some fluctuations in between, the number of dual nodes declines strongly toward the end of the optimization. For the random initialization, the number of dual nodes is slightly smaller than the number of decision variables (edges). This means that occasionally two or more adjacent edges have the same diameter and are therefore integrated into one dual node. Subsequently, the number of dual nodes declines continuously until generation 30, and the resilience range of the solutions shifts to higher values. In this first period of the optimization process, the resilience of the solutions increases continuously, while low-resilience solutions have not yet been found. At the 30th generation of the small case study, the interquartile range of the resilience values of the design solutions is also very small, between 0.81 and 0.98 (for comparison, from the 2000th generation on it is between 0.18 and 0.93; see Fig. 6d). This again indicates that the search operators of the evolutionary algorithm used have trouble preserving a wide range of technically feasible solutions during that stage of the optimization process. After the 30th generation, however, the low-resilience solutions are also exploited, yielding a wide range of resilience values. Interestingly, the median number of dual nodes in Fig. 6a drops from 230 for the initial generation to an approximately constant value of 20 after generation 10,000. Furthermore, the resilience range in Fig. 6d has then reached its full extent and no longer changes. A closer look at the Pareto fronts in Fig. 3a likewise shows that beyond generation 10,000 there is hardly any further progress in the optimization process.
Therefore, the minimal number of dual nodes could be an interesting indicator to assess the progress of the optimization process. While such an indicator could be calculated during the optimization itself, it would be even more beneficial to know the approximate minimal number of dual nodes at the beginning of the process (i.e., based on the topology and the demand distribution). Such an indicator could be developed from the demand edge betweenness centrality, EBC(k). As described in the “Methods” section, dEBC allows estimating how many diameter changes are necessary for a given demand distribution. In the following, the dEBC values are therefore evaluated and compared with the minimal numbers of dual nodes.
The red straight lines in Fig. 6a–c indicate the number of flow classes in the dEBC values (calculated based on Eq. 4). For the small case study, dEBC is 20, which is very close to the median number of dual nodes in the design solutions from the higher generations. This means that once the dEBC value is reached, diameters merely change back and forth without any further progress in the optimization process. This is also supported by Fig. 3, where no significant changes can be observed from that generation onward.
The indicator dEBC is based on the demand edge betweenness centrality and reflects the demand distribution within the WDN. To test whether the identified indicator also works for a different demand distribution, the small case study with only 10 demand points is now investigated regarding the minimal number of dual nodes and dEBC (Fig. 6b). In general, very similar behavior can be observed. Notably, dEBC decreases to 12 with fewer demand nodes. The median number of dual nodes from the middle generations onward (i.e., after the 10,000th generation) again converges towards dEBC. This strongly indicates that dEBC provides an estimate of the expected minimal number of dual nodes during the evolutionary optimization.
As a next step, the relationship between dEBC and the minimal number of dual nodes is investigated for the highly complex large case study. In principle, the same behavior can be observed (Fig. 6c). At the beginning of the optimization process (until the sixth generation), the resilience values are negative (technically infeasible solutions with pressure violations). After the high-resilience solutions have been exploited until the 40th generation, the number of dual nodes rises again while the resilience range of the design solutions widens (Fig. 6f). After a plateau in the mean number of dual nodes until the 10,000th generation, a continuous decline follows, as in the small case studies. Most interestingly, the minimal number of dual nodes again converges towards the dEBC of 831, with the two meeting between the 400,000th and 500,000th generations. In combination with the results from the two scenarios of the small case study, this strongly indicates that no further changes in the mean number of dual nodes are expected from that point on. This region can therefore be regarded as marking the final stage of the optimization process. Note that the optimization with GALAXY already took 35 weeks of computation time for the 500,000 generations; knowing that no further changes are expected can therefore save a significant amount of further computation time.
Any evolutionary optimization process starts with a random initialization of the population. This implies that, for different random starting points, the optimization could take different solution paths. To gain more confidence in the obtained results, the experiments and evaluations are repeated 10 times for the small case study with 118 demand nodes to check whether the conclusion still holds that dEBC is a good indicator for the expected minimal number of dual nodes in the optimization process.
In Fig. 7a, the ten Pareto fronts of the final (300,000th) generation of the independent GALAXY runs are shown. From a visual comparison, hardly any differences in the quality of the Pareto fronts can be identified. This indicates that no further improvements can be expected with the given evolutionary algorithm and the chosen parameters. Note that GALAXY was conceptualized to optimize WDNs with only a minimum number of parameters, namely the population size and the number of generations [26]. In additional numerical tests with larger population sizes (up to 1000), no changes in the quality of the final generation were observed, although the computational burden increased significantly. For the sake of clarity, these results are not shown here, as they do not shed new light on the matter. In Fig. 7b, the median number of dual nodes during the different generations is investigated and compared to the dEBC value. The behavior of the ten simulation runs fully supports the conclusion that the median number of dual nodes reaches a minimal value, after which no further improvements in the optimization can be expected. Again, most interestingly, dEBC provides a target value for the minimal number of dual nodes before the optimization starts. For completeness, Fig. 7c shows the median resilience values of the different generations (without the 25% and 75% percentiles for better clarity). As in Fig. 6, it can be concluded that the resilience values no longer change beyond a certain generation (in this case approximately 10,000).
From the analysis shown in Figs. 6 and 7, it is inferred that 10,000 generations would be enough for the small case study to achieve optimal solutions. Although this number of generations cannot be known at the beginning, dEBC, which represents the topology and the demand distribution, provides an estimate of the median of the minimal number of dual nodes before the optimization starts. Thus, during the optimization process, one can monitor the number of dual nodes in a generation and decide whether more generations should be run with the evolutionary optimization engine. As the Pareto front of that case study should show similar patterns under other evolutionary algorithms, the proposed approach of estimating dEBC and comparing it with the number of dual nodes of the design solutions is also applicable to other evolutionary optimizations of WDNs.
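One possible way to turn this observation into a monitoring rule is sketched below in Python. The relative tolerance, the function names, and the example populations are assumptions made for illustration; the study itself does not prescribe a specific threshold, and only the value dEBC = 20 for the small case study is taken from the text.

```python
from statistics import median


def has_reached_target(dual_node_counts, d_ebc, rel_tol=0.05):
    """True if the population's median number of dual nodes is at most
    rel_tol above the dEBC target (rel_tol is a hypothetical choice)."""
    return median(dual_node_counts) <= d_ebc * (1.0 + rel_tol)


def first_generation_at_target(history, d_ebc, rel_tol=0.05):
    """Return the earliest generation whose population meets the target,
    or None if no recorded generation does.

    history: {generation: list of dual-node counts of all solutions}
    """
    for generation in sorted(history):
        if has_reached_target(history[generation], d_ebc, rel_tol):
            return generation
    return None


# Made-up placeholder populations (not results from the study).
history = {
    1: [228, 230, 233],
    1000: [40, 44, 51],
    10000: [19, 20, 22],
}
print(first_generation_at_target(history, d_ebc=20))
```

Such a check would only signal that the dual-node indicator has levelled off near dEBC; whether to stop the run remains a decision for the analyst, ideally confirmed against the Pareto fronts.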