1 Introduction

Water distribution systems (WDSs) are underground networks designed to transport and distribute safe drinking water. Pipe bursts constitute a major challenge for WDS managers: they severely disrupt the operation of the system, limit the availability of sufficient and clean water (Al-washali et al. 2016; Fox et al. 2016), and cause significant financial losses (Farley et al. 2001).

To reduce the impact of pipe bursts, water utilities resort to digitalization by installing pressure and flow monitoring sensors that automatically relay data to an operations center (Adedeji et al. 2017; Gupta and Kulat 2018). Monitoring allows water utilities to detect bursts early, mobilize their repair crews swiftly, and ultimately limit the negative consequences of bursts while promoting economic and environmental sustainability (Cassidy et al. 2021; Bakker et al. 2012).

1.1 Related Studies

Timely detection of bursts is crucial to water utilities, and two primary approaches exist: model-based and data-driven (Hu et al. 2021). Model-based approaches compare observations from the real network with simulations of the WDS (Pérez et al. 2011). Despite numerous successful applications (Casillas Ponce et al. 2014; Sophocleous et al. 2019), these approaches require expert-calibrated models (Hu et al. 2021; Pérez et al. 2014) and a high degree of supervision by the user. Furthermore, these methods need expensive recalibration of the underlying hydraulic model when the WDS changes (Kang and Lansey 2011). On the other hand, data-driven methodologies rely on signal processing, statistical analysis, and machine learning (ML) to process the acquired data, without requiring an in-depth understanding of the layout and operation of the WDS (Mounce et al. 2002). Recently, ML methods have emerged as the most common data-driven approaches for burst detection. This family of methods usually works in a burst/no-burst binary classification fashion (Caputo and Pelagagge 2003; Mounce and Machell 2006; Mounce et al. 2014). However, acquiring balanced datasets for training is challenging since bursts are infrequent (Wu and Liu 2017). A proven strategy to tackle this issue involves initially training models to reproduce sensor trajectories on burst-free datasets. In the testing phase, the system identifies potential bursts by flagging deviations from the predicted values that surpass a set threshold (Hu et al. 2021; Romano et al. 2014).

All ML-based approaches for burst detection proposed in the literature operate with a fixed number of sensors or a set WDS topology. This requires the development of a new model every time there is a change in the sensor setup or in the physical network structure. Furthermore, training a new model relies on the acquisition of sufficient data under the new configuration, potentially leading to significant delays in detecting bursts. To overcome this issue, new models could reuse the knowledge captured by existing ones, rather than starting from scratch with each modification. This can be achieved by leveraging transfer learning, an ML technique that allows a model to apply knowledge learned from one task to a related task (Pan and Yang 2010; Torrey and Shavlik 2010). Unfortunately, common ML approaches based on non-parametric methods such as Decision Trees and Random Forest (Lučin et al. 2021; Zhang et al. 2022) do not transfer because they rely on fixed architectures that cannot adapt to new data distributions or changes in features. On the other hand, traditional Deep Learning (DL) architectures based on the Multi-Layer Perceptron (MLP) can be retrained to accommodate changes in the data distributions of their inputs/outputs (e.g., due to a change in the physical network), but they cannot transfer knowledge when features are added or removed (i.e., following the installation or removal of sensors). Furthermore, MLPs are prone to the curse of dimensionality and require a considerable amount of data to achieve good performance (Russell and Norvig 2010).

Modern DL methods like Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) are designed to bypass the curse of dimensionality by using inductive biases and shared parameters, which promote better knowledge transfer and smooth adaptation to varying data and input configurations (Bentivoglio et al. 2022). The sequential inductive bias of RNNs is particularly suitable for processing the time-series data measured by sensors in WDSs. Furthermore, gated RNN neurons, such as Long Short-Term Memory (LSTM) cells, can effectively manage long sequences by selectively processing and propagating crucial information across time steps (Hochreiter and Schmidhuber 1996; Lai et al. 2018). This property renders them particularly attractive for handling the long-term correlations in flow, pressure and water demand data sensed in WDSs. Despite these advantages, only a handful of studies have utilized LSTMs for burst detection. Wang et al. (2020) used an LSTM network and flow data to detect bursts in a real-life DMA in China, but their dataset was limited to simulated bursts and lacked pressure information. Similarly, Lee and Yoo (2021) worked with flow data to detect a single burst in a WDS, which is not representative of real DMAs. Xu et al. (2020) used flow and pressure signals with an LSTM but only tested on five simulated fire hydrant bursts in a non-DMA city-wide network.

No study has explored how transfer learning in LSTMs can improve burst detection in operational settings. In this paper, we aim to address this gap by proposing a novel data-efficient LSTM-based approach that leverages transferability to handle a varying number of sensors. When adding new sensors, we augment the original LSTM by duplicating weights for the newly added channels. These augmented models are fine-tuned, not re-trained from scratch, reducing data requirements for burst prediction under the modified setup. We validate our approach on simulated fire hydrant bursts and real bursts across 10 DMAs of Sutton and East Surrey Water Services Ltd (SES Water) in England. We also perform an extended sensitivity analysis to assess the impact of input time resolutions, providing insights into how data granularity affects the overall performance.

2 Case Studies

Table 1 reports the 10 anonymized DMAs of SES Water used in this study. Using satellite imagery, we identify three land use categories: urban, rural and mixed. Urban DMAs are characterized by dense urban fabric and very little unbuilt area. Rural DMAs are sparsely populated and are mostly covered by agricultural fields. Mixed DMAs lie in between the two previous categories. Regardless of their classification, all DMAs follow the layout depicted in Fig. 1 (left), with flow and pressure sensors installed at the inflow point, and an additional pressure sensor installed at the critical point. The three DMAs where fire hydrant bursts were simulated are listed in Table 2, and they have five to seven additional pressure sensors as shown in Fig. 1 (right). The data for this study were collected from 14 October 2016 to 29 March 2022, with varying data availability across the different sensors. All data have an original time resolution of 15 min.

Table 1 Characteristics of DMAs used with training, validation and testing period partitioning
Fig. 1 A schematic representation of a typical DMA with distributed pressure and flow sensors for the case of real bursts (left) and simulated fire hydrant bursts (right)

Table 2 Details on artificial fire hydrant experiments, including duration and burst size compared to mean DMA inflow

The different lengths of the training, validation and testing subsets shown in Table 1 result from the requirement to have consistent flow and pressure signals, unaffected by sensor replacements and/or recalibration.

2.1 Simulated Fire Hydrant Bursts

Fire hydrant bursts were executed on March 10, 2022 (Beta and Delta DMAs) and March 15, 2022 (Zeta DMA) during daytime, after the installation of additional pressure sensors in early January 2022. Each experiment was conducted for a total of 2.5 h with a progressively increasing discharge to avoid unnecessary harm to the network pipes. Details on the experiments, along with the burst discharge relative to the mean DMA inflow \({\alpha }_{burst}\), are shown in Table 2.

At the original 15-min time resolution, each 2.5 h long simulated burst corresponds to 11 time steps, with the start and end time of the burst included.

2.2 Real Bursts

A total of 192 real bursts were available across the 10 DMAs (see Table 1). The burst records included detection datetime, repair datetime and a short description of their nature. As discussed later, the quality of measurements and the veracity of the burst records are inhomogeneous across the different DMAs. Furthermore, the dataset presents some common challenges. The registered bursts started before the operator detected them. As a result, the information available only partially represents “ground truth”. Similarly, “burst-free” records contain background leaks and/or undetected bursts. Finally, it is possible that the sensors were recalibrated or replaced during the recording period, which undermines the consistency of the dataset.

3 Methodology

3.1 Overview of the Approach

The proposed detection mechanism works in a two-step prediction-classification fashion (Wu and Liu 2017). In the first step, the model predicts flow and pressure(s) for the next time step \(t+1\), using autoregressive inputs from time \(t\) back to time \(t-k\), where \(k\) is a fixed time window, together with known datetime-related features at time \(t+1\). If \(H\) is the vector of observed hydraulic features, the first stage can be expressed mathematically as

$${\widehat{H}}_{t+1}=\varphi \left({H}_{t}, {H}_{t-1}, \dots , {H}_{t-k}, {D}_{t+1}\right)$$
(1)

where \(\widehat{H}\) identifies the predicted hydraulic features, \(D\) are the datetime features and \(\varphi\) identifies the DL model. In the first stage, the goal of the DL model is to minimize the prediction error \({E}_{t+1}\) in the training dataset without overfitting, expressed below for a single instance using the mean squared error or \({L}_{2}\) norm.

$${E}_{t+1}=\frac{1}{n}{\Vert {H}_{t+1}-{\widehat{H}}_{t+1}\Vert }_{2}^{2}$$
(2)

where \(n\) is the number of hydraulic features to predict. The choice of the squared error was mainly driven by the convexity of the metric, as well as the emphasis on larger errors, both of which simplify the optimization process in model training. In the second stage, bursts are flagged by comparing the prediction error against a time-varying threshold that changes with the time of the day to account for the cyclical nature of water demand (Hutton and Kapelan 2015). The thresholds are selected based on the distribution of the prediction errors on the validation dataset to strike a compromise between the sensitivity of the method and the excessive flagging of false positives (Taormina and Galelli 2018).
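The two stages can be illustrated for a single time step with a minimal sketch. It assumes plain Python lists for the observed and predicted feature vectors; the function names and the numeric values are hypothetical and serve only to show how Eq. (2) and the threshold comparison fit together.

```python
def prediction_error(observed, predicted):
    """Mean squared error over the n hydraulic features at one time step (Eq. 2)."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    n = len(observed)
    return sum((h - h_hat) ** 2 for h, h_hat in zip(observed, predicted)) / n


def is_flagged(error, threshold):
    """Second stage: raise a flag when the error exceeds the applicable threshold."""
    return error > threshold


# Hypothetical scaled values for flow, inflow pressure and critical-point pressure.
H_obs = [0.50, 0.40, 0.30]
H_hat = [0.40, 0.40, 0.50]
E = prediction_error(H_obs, H_hat)   # (0.01 + 0.00 + 0.04) / 3
```

In practice the threshold passed to `is_flagged` would depend on the time of day and the type of day, as described in Section 3.5.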

3.2 Input Features

In addition to past values of the hydraulic features \(H\), we create two additional datetime features \(D\), to help the DL model recognize the daily and weekly water consumption behavioral patterns (Hutton and Kapelan 2015). The first, named “Day Index” (\(DI\)) represents an engineered version of the weekday index with values in the [0, 1] range

$$DI=\frac{0.2}{1-0.8\cdot \mathrm{cos}\left(\left(\left(Weekday\;Index+1\right) \% 7\right)\cdot \frac{\pi }{3}\right)}$$
(3)

with Monday having a weekday index of 0, Sunday having a weekday index of 6 and “%” representing the modulo operator. According to Eq. (3), the \(DI\) values of working days are low (from about 0.11 on Wednesday to about 0.33 on Monday and Friday), whereas the \(DI\) values of weekends are equal to 1. The \(DI\) values of public holidays were set to 1 after statistical analysis confirmed behavioural patterns resembling weekend consumption, mainly due to the delayed morning peaks. The second datetime feature is the minute-of-the-day (\(MD\)), which accounts for the different expected consumption within the day and takes values in the range [0, 1439]. All hydraulic and datetime features are scaled to the range [0, 1] based on the value range they exhibit in the training dataset.
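Eq. (3) and the minute-of-the-day feature are straightforward to evaluate. The sketch below is one possible implementation; the function names and the `is_public_holiday` flag are illustrative assumptions rather than part of the original method description.

```python
import math
from datetime import datetime


def day_index(weekday_index, is_public_holiday=False):
    """Day Index from Eq. (3); weekday_index is 0 for Monday ... 6 for Sunday."""
    if is_public_holiday:
        return 1.0  # public holidays are assigned the weekend value
    angle = ((weekday_index + 1) % 7) * math.pi / 3
    return 0.2 / (1 - 0.8 * math.cos(angle))


def minute_of_day(ts):
    """Minute-of-the-day in [0, 1439], before scaling to [0, 1]."""
    return ts.hour * 60 + ts.minute


# Day Index for Monday to Sunday, rounded for readability.
di_by_weekday = [round(day_index(d), 3) for d in range(7)]
md_example = minute_of_day(datetime(2022, 3, 10, 23, 45))
```

Evaluating the function for each weekday shows the shape of the feature: the lowest value falls mid-week, Monday and Friday are slightly higher, and Saturday and Sunday sit at the top of the range.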

3.3 Neural Network Architecture

The proposed architecture includes both LSTM cells (Hochreiter and Schmidhuber 1997) and traditional neurons. Specifically, there are two different hidden layers: one consisting of 16 LSTM neurons that takes as input sequences of hydraulic features \(H\), and one consisting of 2 regular neurons that takes as input single values of the datetime features \(D\). The number of LSTM neurons resulted from preliminary hyperparameter tuning. The outputs of both layers are then concatenated and fed into an additional layer of regular neurons that predicts the hydraulic features \(\widehat{H}\). The number of neurons in the output layer is equal to the number of predicted hydraulic features, i.e., 3 for real bursts and 8 (= 3 + 5) or 10 (= 3 + 7) for the engineered fire hydrant bursts. Regardless of the sensor setup, all models are trained to minimize the mean squared error in Eq. (2) computed for the entire training dataset. Finally, to reduce the possibility of overfitting, recurrent dropout is used when training the LSTM cells. Initial hyperparameter tuning over different combinations of dropout rate and other parameters of the NN structure (details not provided here due to limited space) indicated that a dropout rate of 20% is preferable. This high rate is most likely justified by the relatively small size and the noise of the dataset, especially for the simulated bursts.
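To give a sense of the model size and of how it grows with the number of sensors, the following sketch counts trainable parameters for the layout described above. It assumes a standard LSTM formulation (four weight blocks, each with input weights, recurrent weights and a bias, no peephole connections), biases in all fully connected layers, and two datetime inputs (\(DI\) and \(MD\)); these details are assumptions, as the text does not state them explicitly.

```python
def lstm_params(n_inputs, units):
    """Parameters of a standard LSTM layer: 4 blocks of input weights,
    recurrent weights and biases."""
    return 4 * (units * (n_inputs + units) + units)


def dense_params(n_inputs, units):
    """Parameters of a fully connected layer with bias."""
    return n_inputs * units + units


def model_params(n_hydraulic, lstm_units=16, n_datetime=2, datetime_units=2):
    """Total parameters of the two-branch layout for n_hydraulic sensors."""
    lstm = lstm_params(n_hydraulic, lstm_units)
    datetime_branch = dense_params(n_datetime, datetime_units)
    output = dense_params(lstm_units + datetime_units, n_hydraulic)
    return lstm + datetime_branch + output


# Sensor setups mentioned in the text: 3, 8 (= 3 + 5) and 10 (= 3 + 7).
sizes = {n: model_params(n) for n in (3, 8, 10)}
```

Under these assumptions the network stays small in every setup (fewer than two thousand parameters), and each additional sensor only adds one row of LSTM input weights plus one output neuron.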

3.4 Transfer Learning

Transfer learning refers to the “improvement of learning in a new task through the transfer of knowledge from a related task that has already been learned” (Torrey and Shavlik 2010). This technique enables the development of a full-scale model when limited data is available for the new task, by building on an existing model that allows for knowledge transfer. In the burst detection domain, changes in the WDS topology or in the number, calibration or type of sensors can significantly limit the length and consistency of the datasets available for model training. Such changes often require the model to be taken “offline” until new, sufficiently long datasets become available and re-training is possible. In this work, we leverage LSTM transferability for detecting the engineered fire hydrant bursts, since the additional pressure sensors are available only for the first three months of 2022. Specifically, we first develop a model using the data before 2022. Second, we expand this model by adding new input channels to the LSTM, corresponding to the number of new pressure sensors. Then, we initialize the additional trainable parameters in the LSTM gates with the weights from the pressure sensor at the inflow of the DMA. This results in a new model with pre-trained weights for all pressure signals, which is then fine-tuned and tested with limited data to detect the fire hydrant bursts.
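The expansion step can be sketched as a simple operation on the LSTM input weights. The example assumes a Keras-like layout in which the input kernel holds one row per input channel and 4 × units columns (one block per gate), so that the recurrent weights and biases are unaffected by the number of inputs. The function name and the toy weights are hypothetical, and the handling of the output layer, which the text does not detail, is left out.

```python
def expand_input_kernel(kernel, source_row, n_new):
    """Return a copy of an LSTM input kernel with n_new extra input channels.

    kernel is a list of rows, one per input channel; each row holds the
    weights towards all gates (4 * units values). Every new row starts as
    a copy of the row of the source channel.
    """
    if not 0 <= source_row < len(kernel):
        raise IndexError("source_row is out of range")
    expanded = [list(row) for row in kernel]
    expanded.extend(list(kernel[source_row]) for _ in range(n_new))
    return expanded


# Hypothetical toy kernel: 3 input channels and a single LSTM unit,
# hence 4 weights per row.
old_kernel = [
    [0.1, 0.2, 0.3, 0.4],  # flow at the DMA inflow
    [0.5, 0.6, 0.7, 0.8],  # pressure at the DMA inflow
    [0.9, 1.0, 1.1, 1.2],  # pressure at the critical point
]
new_kernel = expand_input_kernel(old_kernel, source_row=1, n_new=5)
```

Starting the new channels from the inflow-pressure weights means the expanded model initially treats every added pressure signal like the one it already knows, and fine-tuning then only has to learn how the new locations differ.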

3.5 Multi-Threshold Classification

The second stage of the proposed approach requires the comparison of the prediction error against set thresholds. These thresholds are set at the 99.9th percentile of the prediction errors observed in the validation dataset. This value was selected to limit the false positives to 0.1% in operational settings, assuming the ideal case where the validation dataset accurately represents test conditions. However, due to the periodic components of water demand patterns, the variation of the prediction error is usually higher during periods of intensely varying water consumption, and smaller when water consumption is relatively stable, e.g., during night-time (Hutton and Kapelan 2015). To account for the heterogeneity in the error distribution, we segmented the prediction errors into \(h=3\) hour intervals, creating contiguous clusters. Moreover, we distinguished between working days and weekends or public holidays. This resulted in 16 distinct thresholds (8 three-hour intervals for each of the 2 day types), each based on the 99.9th percentile of the prediction error distribution within its respective cluster. This number of clusters offered the best trade-off between the threshold resolution and the reliability of the computed percentiles. Smaller values of \(h\) created clusters that were too small to extract a high percentile reliably. This impacted the model performance by “flagging” too many non-burst instances as bursts. Higher values of \(h\) resulted instead in too coarse a clustering of the daily consumption behavior and a loss of fidelity.
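A minimal sketch of the clustering and threshold extraction is given below. The percentile uses linear interpolation between order statistics, which is an assumption since the text does not specify the estimator; the function names, the `holidays` argument and the synthetic validation errors are likewise illustrative.

```python
import math
from collections import defaultdict
from datetime import datetime, timedelta


def cluster_key(ts, holidays=frozenset(), h=3):
    """Cluster id: (day type, index of the h-hour block within the day)."""
    off_day = ts.weekday() >= 5 or ts.date() in holidays
    return ("off" if off_day else "work", ts.hour // h)


def percentile(values, q):
    """q-th percentile with linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def fit_thresholds(timestamps, errors, q=99.9, h=3, holidays=frozenset()):
    """One threshold per cluster, from validation-set prediction errors."""
    groups = defaultdict(list)
    for ts, err in zip(timestamps, errors):
        groups[cluster_key(ts, holidays, h)].append(err)
    return {key: percentile(vals, q) for key, vals in groups.items()}


# Synthetic two-week validation set at 30-min resolution, starting on a Monday.
start = datetime(2022, 1, 3)
stamps = [start + timedelta(minutes=30 * i) for i in range(14 * 48)]
errs = [0.001 * ((i % 48) + 1) for i in range(len(stamps))]
thresholds = fit_thresholds(stamps, errs)
```

At test time, the error at a given timestamp would be compared against `thresholds[cluster_key(ts)]`. With short validation periods each cluster holds few samples, so a high percentile is close to the cluster maximum, which is consistent with the reliability concern raised above for small values of \(h\).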

3.6 Performance Assessment

We employed both event- and timestamp-based metrics to assess the methodology. The event-based metrics factor in unique alarms (instances where prediction errors exceed the time-dependent threshold) that occur between one week before a burst is identified by the operator and the end of the repair timeframe. This one-week lead time accounts for the potential delays in burst detection, which can be due to delayed customer reports or the time it takes to recognize substantial water loss. Conversely, we derived timestamp-based metrics on a per-value basis by comparing the generated alarm instances to the burst records. Event-based metrics are useful to water utilities and engineers, since they show the performance in terms of number of bursts, whereas the timestamp-based ones are more useful to NN experts, since they reflect the per-datapoint performance of the model.
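The event-based matching can be sketched as follows. The sketch assumes that each burst record provides a detection and a repair datetime, and that every alarm falling outside all burst windows counts as one false-positive event; how the study groups consecutive alarms into "unique" ones is not specified, so that part is an assumption. The function name and the example dates are hypothetical.

```python
from datetime import datetime, timedelta


def event_counts(alarms, bursts, lead=timedelta(days=7)):
    """Count event-based TP, FN and FP.

    alarms: list of alarm datetimes.
    bursts: list of (detected, repaired) datetime pairs.
    A burst counts as detected if any alarm falls within
    [detected - lead, repaired].
    """
    windows = [(detected - lead, repaired) for detected, repaired in bursts]
    tp = sum(any(start <= a <= end for a in alarms) for start, end in windows)
    fn = len(windows) - tp
    fp = sum(not any(start <= a <= end for start, end in windows) for a in alarms)
    return tp, fn, fp


# Hypothetical records: two bursts, one alarm inside the lead window of the
# first burst and one alarm unrelated to any burst.
bursts = [
    (datetime(2021, 5, 10, 9, 0), datetime(2021, 5, 11, 17, 0)),
    (datetime(2021, 8, 1, 8, 0), datetime(2021, 8, 2, 12, 0)),
]
alarms = [datetime(2021, 5, 5, 3, 0), datetime(2021, 6, 20, 4, 0)]
tp, fn, fp = event_counts(alarms, bursts)
```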

Three event-based metrics are calculated: \(Recal{l}_{e}\), \(Precisio{n}_{e}\) and \(f1 scor{e}_{e}\), where subscript “e” denotes the event-based calculation of these metrics. The timestamp-based metrics calculated are \(Recall\), \(Fall out\) and \(Precision\), defined as shown in Table 3.

Table 3 Event- and timestamp-based performance metrics

Bearing in mind the distinctions between event- and timestamp-based metrics, \(TP\) in Table 3 represents true positives, which are correct flags raised for actual instances (either events or timestamps); \(FN\) denotes false negatives, where true instances are not flagged; \(TN\) refers to true negatives, cases where no burst occurs and no flag is raised; and \(FP\) indicates false positives, instances where a flag is raised erroneously as no burst is occurring.

Based on the above definitions, the \(Recall\) metrics indicate the proportion of actual burst events that were correctly identified, emphasizing the model's sensitivity to detecting bursts. On the other hand, \(Precision\) represents the proportion of flagged burst events that are true burst events, showcasing the reliability of the alarms generated. The \(f1 score\) is the harmonic mean of Precision and Recall, a composite measure that is suited for imbalanced datasets, such as the one in this study where burst-free conditions are the norm. Lastly, \(Fallout\) is the probability of false alarms. A good methodology should have high \(Recall\), \(Precision\) and \(f1 score\), whereas \(Fall out\) should be minimal. For simulated burst cases, we introduce an additional \(Detection\;Delay\;(DD)\) metric. This metric measures the time interval between the actual onset of a fire hydrant burst and the moment the burst is flagged.
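For reference, the metrics can be computed from the confusion counts as in the sketch below. Table 3 is not reproduced here, so the formulas are the standard definitions of these quantities and are assumed to match it; the function names and the counts are hypothetical.

```python
from datetime import datetime, timedelta


def recall(tp, fn):
    """Share of actual burst instances that were flagged."""
    return tp / (tp + fn) if tp + fn else 0.0


def precision(tp, fp):
    """Share of raised flags that correspond to actual bursts."""
    return tp / (tp + fp) if tp + fp else 0.0


def fall_out(fp, tn):
    """Share of burst-free instances that were flagged (false alarm rate)."""
    return fp / (fp + tn) if fp + tn else 0.0


def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0


def detection_delay(onset, alarms):
    """Time from burst onset to the first alarm at or after it, or None."""
    later = [a for a in alarms if a >= onset]
    return min(later) - onset if later else None


# Hypothetical timestamp-based counts, for illustration only.
tp, fn, fp, tn = 8, 2, 4, 986
p, r = precision(tp, fp), recall(tp, fn)
scores = {"recall": r, "precision": p, "fall_out": fall_out(fp, tn), "f1": f1_score(p, r)}

onset = datetime(2022, 3, 10, 10, 0)
alarm_times = [onset - timedelta(minutes=15), onset + timedelta(minutes=30)]
dd = detection_delay(onset, alarm_times)
```

Because burst-free timestamps vastly outnumber burst timestamps, even a small \(Fall out\) can correspond to many false alarms in absolute terms, which is why \(Precision\) and the \(f1 score\) are reported alongside it.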

3.7 Experimental Setup

We split the datasets for each DMA into two parts: one for training and validation, which is cleaned of any registered bursts, and one for testing. To limit the occurrence of vanishing gradients when detecting real bursts, we reduce the length of the input sequence fed to the LSTM cells by resampling the original data at 30-min resolution. This yields input sequences for the hydraulic variables of 192 time steps, covering up to 4 days before the prediction time. We implement the neural networks using the Keras package in Python. We train all models for up to 100 epochs using the Adam optimizer (Kingma and Ba 2015), storing the model with the best validation performance across epochs. The optimizer’s suitability for this problem stems from its adaptive learning rate mechanism, which relies on moving averages of the gradients and of the squared gradients of the parameters rather than on their full history. The model is trained using a decaying learning rate and early stopping, both regulated by the validation loss. The initial learning rate is 0.010 and is reduced by 20% whenever the validation loss does not improve significantly for 6 consecutive epochs. Similarly, training stops if the validation loss does not improve significantly after 10 epochs. In both cases, a loss improvement of 0.001 is used as tolerance.
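The interplay between learning rate decay and early stopping can be illustrated by replaying a validation-loss history in plain Python. This is a simplified sketch and not the code used in the study: the function name is hypothetical, and the exact bookkeeping of the Keras callbacks that offer comparable behaviour (such as `ReduceLROnPlateau` and `EarlyStopping`) may differ in detail.

```python
def simulate_schedule(val_losses, lr=0.010, factor=0.8, lr_patience=6,
                      stop_patience=10, min_delta=0.001):
    """Replay a validation-loss history.

    Returns (learning rate used in each epoch, last epoch run, best epoch).
    """
    best = float("inf")
    best_epoch = None
    since_best = 0  # epochs without significant improvement (early stopping)
    since_lr = 0    # epochs since the last improvement or learning rate cut
    lrs = []
    for epoch, loss in enumerate(val_losses):
        lrs.append(lr)
        if loss < best - min_delta:
            best, best_epoch = loss, epoch
            since_best = since_lr = 0
        else:
            since_best += 1
            since_lr += 1
        if since_best >= stop_patience:
            return lrs, epoch, best_epoch
        if since_lr >= lr_patience:
            lr *= factor  # reduce the learning rate by 20%
            since_lr = 0
    return lrs, len(val_losses) - 1, best_epoch


# Hypothetical history: the loss improves for three epochs, then plateaus.
history = [0.5, 0.3, 0.2] + [0.2] * 20
lrs, last_epoch, best_epoch = simulate_schedule(history)
```

With these settings a plateau triggers one learning rate reduction after 6 stagnant epochs, and training stops 4 epochs later if the smaller step size does not help, with the best model being the one stored at `best_epoch`.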

4 Results and Discussion

4.1 Results for Fire Hydrant Bursts

Three different scenarios are initially investigated: (i) Scenario A, where only the 3 originally installed sensors are used, but the models are developed using all the data available for training and validation (see Table 1), (ii) Scenario B, where all the available sensors are utilized, but the models are developed only with data from 2022, with training and validation taking place in the “fine-tuning” period of early 2022 (see Table 2), (iii) Scenario C, with all the available sensors, but with transfer learning of the model weights from scenario A, and fine-tuning the expanded model for a period of 1 month (see Table 2).

Table 4 shows that for Scenario A, only 1 of the 3 fire hydrant bursts is detected, and only after a delay of 60 min. This is explained by the distant location of the bursts relative to the original pressure sensors and by the significant fragmentation of the training and validation subsets. Because Scenario A leverages the longest training and validation subsets compared to the other scenarios, it is plagued by multiple sensor replacements/recalibrations and operator-induced pressure adjustments, which limited the “usable” dataset length that could be fed to the LSTM neurons. Scenario B results are also poor due to the very short length of the training subset with all sensors, which does not capture the inter-annual seasonality.

Table 4 Detection performance on fire hydrant bursts

The benefits of training on data collected before the installation of additional sensors and then employing transfer learning become clearer when we compare Scenarios B and C, of which only the latter uses transfer learning. We see improvements in performance across the board; most notably, all three bursts are detected within either 15 min or 30 min.

Regardless of the considered Scenario, \(Fall out\) values for the DMA Delta are high. Upon closer investigation of the raw time-series, it was found that the pressure sensor installed at the critical point was faulty, with pressure readings increasing by more than 30 m. To assess the impact of the faulty sensor, Table 4 reports two additional scenarios: Scenarios D and E, which are identical to B and C respectively, with the removal of the pressure sensor at the critical point.

As expected, the performance on DMA Delta improves for both Scenarios, with \(Fall out\) plummeting from 23.0% and 22.6% to 2.1% and 1.7%, respectively. \(Precision\) also increases from 1.8% and 1.9% to 9.6% and 7.0%, respectively. For the same DMA, \(DD\) is not improved and \(Recall\) is reduced. The latter is caused by the misclassification of a few burst instances as false negatives. However, this has limited to no impact on the actual detection, since \(DD\) remains unchanged at 30 min.

For DMAs Beta and Zeta, which had no faulty sensor, the exclusion of the sensor at the critical point deteriorates the performance in Scenarios D and E compared to B and C, respectively. This is reflected in the decrease of \(Recall\), the increase of \(Fall out\), and the decrease of \(Precision\). For DMA Zeta in particular, the sensor selection of Scenario D is detrimental and leads to the burst going completely unnoticed (with a \(Recall\) of 0%). This underlines the importance of the spatial coverage provided by every sensor, and shows that removing information unnecessarily has consequences for the overall performance.

The robustness of the method in detecting the simulated bursts is supported by the fact that the detection of the bursts in DMAs Beta and Zeta for Scenario C, and in DMA Delta for Scenario E (where the problematic pressure sensor signal was removed), takes place within either one (i.e., 15 min) or two (i.e., 30 min) time steps. Furthermore, for these cases, \(Fall out\) is 2.8%, indicating a very low rate of false alarms.

Figure 2 presents a 24-h snapshot of the fire hydrant bursts, obtained from a separate run of the algorithm with an identical setup. A visual inspection of these figures reveals the existence of residual alarms, i.e., additional exceedances of the error threshold, after the simulated burst stops. This phenomenon is likely due to the lingering effects of the recent burst; the LSTM cells continue to use the disrupted hydraulic data from the burst period in their subsequent predictions for some time. Although operators can readily dismiss these incorrect flags following a burst repair, we have chosen to categorize them as false alarms in this study. This classification yields higher \(Fall out\) and lower \(Precision\).

Fig. 2 Fire hydrant bursts in DMAs Beta (top), Delta (middle) and Zeta (bottom). Left sub-figures are for Scenario A. Right sub-figures are for Scenario C. Top row is for the discharge at the inflow of the DMA. Middle row is for the pressure at the inflow of the DMA and the critical point. Lower row is for the MSE (error), the error threshold, the burst start and repair time and the raised alarms

4.2 Detection of Real Bursts

Table 5 shows that real burst detection performance varies greatly. \(Recal{l}_{e}\), \(Precisio{n}_{e}\) and \(f1 scor{e}_{e}\) range from very low to very high values. Similar ranges are exhibited in the timestamp-based metrics. Performance in DMA Epsilon is very good, with the highest \(f1 scor{e}_{e}\) of 66.7%, the lowest \(fall out\) of 0.2% and the highest \(Precision\) of 98.%. Performance is particularly low for DMA Delta, with an \(f1 scor{e}_{e}\) of 6.7%, the highest \(fall out\) of 12.4% and the lowest \(Precision\) of 12.2%.

Table 5 Performance of LSTM-based model on real bursts. For the detection of the real bursts a single flow sensor and two pressure sensors were utilized, as described in Section 2

It is also worth noting the high correlation between the number of bursts recorded in each DMA and the \(Precisio{n}_{e}\) metric, evidenced by a correlation coefficient of 0.848. A substantial correlation is also observed between the timestamp-based \(Precision\) and the number of bursts, with a coefficient of 0.750. Given that the datasets for the various DMAs are roughly equal in length (see Table 1), it is plausible that these correlations arise from varying degrees of public alertness, which typically plays a significant role in pipe burst identification.

Unreported bursts may explain the differences in detection performance across the DMAs investigated. This claim is supported by the land use cover of the different DMAs. Namely, DMAs Delta and Eta, which exhibit the worst performance, are mostly rural, whereas DMAs Alpha, Beta and especially Epsilon are heavily urbanized. This is an indication that several actual bursts in the rural DMAs may go completely unnoticed. This has a two-fold impact on our methodology. First, not all bursts are removed for training and validation, thus impairing the ability of the model to “learn” burst-free (i.e., normal) behavior. Second, the existence of multiple unregistered bursts in the testing subset leads to an overwhelming number of \(FP\)s, which should in fact be labeled as \(TP\)s.

4.3 Sensitivity Analysis

To study the effect of time resolution on the burst detection performance, the entire process of training, validation and testing is repeated for the urban DMA Beta and the rural DMA Eta. Different combinations of time resolution (15 min, 30 min and 60 min) and length of the input hydraulic feature time-series (1 to 7 days) are investigated. Two conditions apply to this analysis: (1) the length of the time-series is limited to a maximum of 250 time steps, so that the efficiency of the LSTM cells is not severely reduced by vanishing gradients; (2) an integer number of days is used, so as not to interfere with the 24-h seasonality of water consumption. Table 6 shows the event- and timestamp-based performance metrics for the two DMAs along with the original combination of length = 4 days and time resolution of 30 min used in the previous experiments.
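The two conditions restrict the admissible combinations of resolution and window length. The sketch below enumerates them under the stated limits; the function name is hypothetical, and which of these combinations were actually evaluated is reported in Table 6, not inferred here.

```python
MAX_STEPS = 250  # upper limit on the input sequence length


def valid_combinations(resolutions_min=(15, 30, 60), days=range(1, 8),
                       max_steps=MAX_STEPS):
    """List (resolution in minutes, days, time steps) within the length limit."""
    combos = []
    for res in resolutions_min:
        steps_per_day = 24 * 60 // res
        for d in days:
            steps = d * steps_per_day
            if steps <= max_steps:
                combos.append((res, d, steps))
    return combos


combos = valid_combinations()
max_days = {res: max(d for r, d, _ in combos if r == res) for res in (15, 30, 60)}
```

Under these limits, the 15-min resolution admits windows of at most 2 days, the 30-min resolution at most 5 days, and the 60-min resolution the full 7 days. The baseline of 4 days at 30 min and the alternative of 2 days at 15 min both correspond to 192 time steps.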

Table 6 Sensitivity analysis of input time-series length and time resolution

Table 6 shows that the impact of time resolution and/or length of the input time-series is not negligible. More specifically, coarser resolution leads to significantly higher values of \(Fall out\) and lower \(Precision\), which translates to less confidence in the alarms. This phenomenon can be attributed to the relative prominence of spurious exceedances of the error thresholds when compared to the same number of corresponding instances in finer resolution datasets.

Furthermore, at the 60-min resolution, the model cannot “decode” the short-term dynamics of bursts, because pressure and flow data are aggregated into 1-h intervals. Even though there is a slight increase in the values of both \(Recal{l}_{e}\) and \(Recall\) for coarser resolution, it may be preferable to sacrifice the detection of a handful of bursts for the sake of higher confidence, which is reflected in the increase of all the other metrics. Based on these results, the combination of 2-day long time-series at 15-min resolution emerges as superior to all the other combinations. This is supported by the better scores across all the performance metrics. However, a coarser resolution of 30 min provides more time to the sensors to relay their measurements.

Furthermore, we studied the sensitivity of the performance to the time-varying error threshold. This analysis was performed only for DMA Beta, which is the most heavily urbanized and is characterized by the highest number of registered bursts. The results are shown in Fig. 3 for the metrics \(Recal{l}_{e}\), \(f1 scor{e}_{e}\), \(Precisio{n}_{e}\) and \(Precision\).

Fig. 3 Sensitivity of the performance metrics to the error threshold

As expected, Fig. 3 shows that lower thresholds lead to higher sensitivity, with more bursts being detected and an overall higher \(Recal{l}_{e}\). However, this is accompanied by more false alarms impacting \(Precisio{n}_{e}\), which reduces significantly from over 80% to less than 50%. This trade-off is also seen in \(f1 scor{e}_{e}\), which has the largest value for the 99.9th percentile, for both curves. As for \(Precision\), lowering the error threshold reduces it, but not as much as \(Precisio{n}_{e}\).

4.4 Comparison to Other Burst Detection Methods

We assess our burst detection method through qualitative comparison with existing LSTM-based techniques. The lack of available code implementations and the differences in the case studies prevent direct comparisons. Wang et al. (2020) employed a pure LSTM model to detect both simulated and synthetic bursts in a single DMA. In detecting simulated bursts, their model was able to detect bursts after two time steps using 5-min resolution data. This is comparable to our findings (see Table 4), although we were also able to detect bursts at the first time step and at coarser resolutions. Lee and Yoo (2021) implemented a different LSTM model with flow data only, to detect a single burst. Their approach yielded inferior results (sensor-based \(Recall\in \left(46.46\%, 99.81\%\right)\) and \(Fall out\in \left(0.11\%, 29.88\%\right)\)) compared to our \(Recal{l}_{e}\in \left(16.7\%, 100\%\right)\) and \(Fall out\in \left(0.10\%, 12.4\%\right)\). In summary, our burst detection method exhibits promise compared to existing LSTM methods, with specific performance variations dependent on the dataset and methodology employed.

We also conducted a quantitative comparison against autoencoders (AEs), a DL architecture successfully used in the detection of other types of anomalies in WDS, namely cyber-physical attacks (Taormina and Galelli 2018). We evaluated the AEs on the same real burst dataset as our LSTM-based method. As demonstrated in Table 7 of the Appendix, our LSTM model outperforms the AE approach, with an overall improvement in \(f1\;score\) computed across all the DMAs. This highlights the LSTM's effectiveness in detecting bursts, most likely due to the incorporation of sequential inductive bias.

We did not evaluate the model on well-established benchmark datasets. Although such an evaluation would facilitate comparisons with existing methodologies in general, it would not enable direct comparisons with established LSTM-based techniques, because none of them has been evaluated on these benchmarks. These are the comparisons that matter most for our approach, since we aim to improve upon those techniques. Moreover, adapting previously benchmarked models to our setting is challenging due to their structural constraints, which often preclude the incorporation of inductive bias and the handling of very short datasets, as exemplified by the two-month records for the simulated bursts in this study.

5 Conclusions

This work presents a novel LSTM-based method for pipe burst detection in water distribution systems. The developed model uses LSTM networks to predict flow and pressure under normal operating conditions, and it exhibits elevated prediction errors when exposed to data from pipe burst incidents. The main strengths of the LSTM architecture are its ability to handle long temporal sequences and its inherent adaptability, which enables the integration and exclusion of information streams. Importantly, the technique leverages transfer learning to overcome the constraints arising from limited training datasets and a varying number of sensors.

Testing on real bursts in 10 DMAs in the UK revealed that the developed LSTM-based method exhibits varying performance, with \(Recal{l}_{e}\in \left[16.7\%, 100\%\right]\) and \(Fall out\in \left[0.1\%, 12.4\%\right]\), reflecting varying confidence in the correct identification of bursts and the incorrect classification of non-bursts, respectively. This inconsistency across DMAs is attributed to the sometimes poor quality of the burst records, which are crucial for training the model on burst-free conditions and for correlating alarms with actual bursts. Limited public awareness, especially in rural DMAs, and unnoticeable smaller bursts also affect the proposed approach. Finer data resolution, namely 15-min time steps, better captures abrupt discontinuities such as pipe bursts and enhances burst detection performance. For urban settings this increases \(Precisio{n}_{e}\) from 78.6% to 93.3%, and for rural settings the same metric increases from 3.0% to 10.3%. The sensitivity analysis shows that a variable error threshold that mirrors the daily water consumption pattern further improves the robustness of burst detection.

Testing on simulated fire hydrant bursts emphasizes the importance of burst-sensor proximity. As noted previously in the literature, distant bursts pose detection challenges. However, installing additional sensors in the DMA mitigates this issue and enables the timely detection of bursts corresponding to as little as 11% of the mean DMA inflow. The LSTM's inherent flexibility and transfer learning facilitate easy integration of extra data streams.

It is important to acknowledge the study's limitations, notably its reliance on burst-free pressure and flow datasets for training. Acquiring such datasets remains difficult, particularly in sparsely populated rural settings or in DMAs with few monitoring sensors. The sensitivity of the LSTM-based burst detection algorithm to sensor recalibration and replacement makes it necessary to carefully identify temporally consistent measurement periods suitable for model training and validation. In addition, alarms persist past the burst repair. This weakness can be circumvented, at an operational level, by temporarily “suspending” the model after alarms are raised. Finally, testing this methodology on simulated bursts taking place at nighttime is recommended to evaluate its robustness in more realistic conditions.
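The operational workaround of suspending the model after an alarm can be sketched as a simple post-processing filter. This is a minimal illustration under the assumption that the suspension lasts a fixed number of time steps. The function name `suspend_after_alarm` and the parameter `suspend_steps` are hypothetical, and in practice the suspension would more likely end once the repair is confirmed.

```python
def suspend_after_alarm(raw_alarms, suspend_steps):
    """Keep the first alarm, then mute the detector for `suspend_steps` steps.

    raw_alarms: iterable of booleans, one per time step.
    suspend_steps: assumed fixed suspension length (hypothetical parameter).
    """
    kept = []
    muted_until = -1  # index of the last muted time step
    for t, alarm in enumerate(raw_alarms):
        if t <= muted_until:
            kept.append(False)                # detector is suspended
        elif alarm:
            kept.append(True)                 # alarm is passed to the operator
            muted_until = t + suspend_steps   # start a new suspension window
        else:
            kept.append(False)
    return kept

# Illustrative raw alarm sequence: the alarm persists after the first detection.
raw = [False, True, True, True, False, True, False, False, True]
filtered = suspend_after_alarm(raw, suspend_steps=3)
print(filtered)
```

A fixed window is the simplest choice. It trades the risk of missing a second burst during the suspension against the benefit of not repeating alarms for a burst that is already being handled.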