The detection of anomalous and recurring pressure patterns is divided into three steps: detection of anomalous events (Fig. 1a), clustering of events (Fig. 1b) and visualization of recurrence history (Fig. 1c).
Access to current and historical pressure sensor data was provided by Vitens, a Dutch drinking water company. A known case of recurring anomalous pressure patterns followed by a pipe burst was investigated from 1/6/2012 to 1/6/2013, hereafter referred to as the 2013 data set. In addition, a recent data set from another pressure sensor was used, with measurements from 18/5/2017 to 17/11/2017, hereafter referred to as the 2017 data set. Both pressure sensors were situated close to water reservoirs.
As a preprocessing step, erratic measurements were removed. Resampling and linear interpolation in time were used to obtain a constant sampling interval of one second.
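The resampling step can be sketched as follows. This is a minimal illustration that assumes the erratic measurements have already been removed and that timestamps are given in seconds; the function name `resample_linear` and its arguments are hypothetical and do not come from the study.

```python
from bisect import bisect_left


def resample_linear(times, values, step=1.0):
    """Linearly interpolate irregular samples onto a regular time grid.

    times  -- strictly increasing timestamps in seconds
    values -- pressure readings, one per timestamp
    step   -- target sampling interval in seconds (one second in the text)
    """
    if len(times) != len(values) or len(times) < 2:
        raise ValueError("need at least two (time, value) pairs")
    grid, resampled = [], []
    n_steps = int((times[-1] - times[0]) // step)
    for k in range(n_steps + 1):
        t = times[0] + k * step
        # First index whose timestamp is >= t, clamped to the last sample.
        i = min(bisect_left(times, t), len(times) - 1)
        if times[i] == t:
            v = values[i]
        else:
            t0, t1 = times[i - 1], times[i]
            w = (t - t0) / (t1 - t0)
            v = values[i - 1] + w * (values[i] - values[i - 1])
        grid.append(t)
        resampled.append(v)
    return grid, resampled
```

For example, samples taken at 0, 0.5, 2.5 and 4 s are mapped onto the grid 0, 1, 2, 3, 4 s.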
Anomalous events were detected using a moving-window range statistic, defined as the difference between the maximum and minimum values within every ten-second moving window, divided by the window size of ten seconds. The ten-second range statistic was used instead of the derivative to avoid problems associated with noise in the pressure measurements. Measurements with a range statistic of more than 2 kPa/s were flagged as anomalous (Fig. 1a), since rapid pressure changes of this magnitude are most often caused by events that are relevant for the purposes of this study. Although quite simple, the range statistic with an absolute threshold was found to detect all relevant anomalous events. Since anomaly detection is an important and complicated process, a more elaborate anomaly detection method will most likely improve performance (Branisavljević et al. 2011; Mounce et al. 2014; Scozzari and Brozzo 2017). However, for illustrating our method on the aforementioned data sets, the current metric is sufficient and suitable.
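A minimal sketch of the range statistic and threshold is given here. It assumes a one-second sampling interval, so that a ten-second window holds ten samples, and it assigns each statistic to the last sample of its window; both choices, as well as the function names, are illustrative assumptions rather than details stated in the text.

```python
def range_statistic(pressure, window=10, dt=1.0):
    """Range statistic (max - min) / window duration for every full window.

    pressure -- pressure samples in kPa at a constant interval of dt seconds
    window   -- window length in samples (assumed: ten samples of one second)
    Returns one value per full window, in kPa/s.
    """
    stats = []
    for end in range(window, len(pressure) + 1):
        segment = pressure[end - window:end]
        stats.append((max(segment) - min(segment)) / (window * dt))
    return stats


def flag_anomalies(pressure, window=10, dt=1.0, threshold=2.0):
    """Indices of window-end samples whose range statistic exceeds threshold."""
    stats = range_statistic(pressure, window, dt)
    # Assumption: the statistic is attributed to the last sample of its window.
    return [k + window - 1 for k, s in enumerate(stats) if s > threshold]
```

A sudden 30 kPa drop, for instance, yields a statistic of 3 kPa/s for every window that spans the drop, which exceeds the 2 kPa/s threshold.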
The anomalous measurements were then combined into events, where anomalous measurements falling within a 15 min period were considered to be part of one event (Fig. 1a). Next, each event was extended with two minutes of preceding and two minutes of succeeding measurements to ensure that the entire anomaly and its context were captured as a single event.
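The grouping of anomalous measurements into events might be sketched as follows. The sketch interprets the 15 min criterion as a maximum gap between consecutive anomalous samples, which is one possible reading of the text, and assumes one sample per second; `group_events` and its parameters are hypothetical names.

```python
def group_events(anomaly_idx, n_samples, max_gap=900, pad=120):
    """Merge flagged sample indices into padded (start, end) events.

    anomaly_idx -- indices of anomalous samples (one sample per second assumed)
    n_samples   -- length of the full pressure series
    max_gap     -- assumed reading of the 15 min rule: largest gap, in samples,
                   between consecutive anomalies of the same event
    pad         -- context added before and after each event (two minutes)
    Returns a list of (start, end) index pairs, with end inclusive.
    """
    events = []
    for i in sorted(anomaly_idx):
        if events and i - events[-1][1] <= max_gap:
            events[-1][1] = i          # extend the current event
        else:
            events.append([i, i])      # start a new event
    # Add the context and keep the indices inside the series.
    return [(max(0, s - pad), min(n_samples - 1, e + pad)) for s, e in events]
```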
Recurrence of anomalous pressure patterns was defined as the repetition of similar anomalous events. Events were clustered in order to detect which events are similar and probably have the same cause. Clustering is an unsupervised method for grouping similar events based on the distances between them. For this purpose, events were represented by vectors, after which the distances between these vectors were calculated. Events separated by a small distance were deemed similar and were included in the same cluster. Each cluster corresponds to a specific recurring and anomalous pattern (Fig. 1b). The vectors assigned to each event were based either on event measurements (instance-based) or on each event’s characteristic features (feature-based) (Fulcher and Jones 2014).
In this study, clustering was performed using Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) (McInnes et al. 2017), which clusters events based on their density within a vector space. Unlike similar clustering methods, such as DBSCAN (Ester et al. 1996) and Mean Shift (Ray and Benammar 2002), HDBSCAN uses a hierarchical minimum density threshold and is better at detecting clusters of varying shapes. HDBSCAN also allows clustering with a precomputed distance matrix and can distinguish core samples from outliers.
Since the presented method is intended for real-time application, clustering needs to be performed anew whenever a novel event is detected. Clustering was performed over a moving window of the 150 most recent events each time a new anomalous event was detected (Fig. 1b). This moving window approach ensures real-time applicability and detection of distinct clusters for the different recurring patterns present in the investigated data. The window size can be adjusted if required. However, larger window sizes potentially result in merging of clusters due to a higher overall density of events, making locally denser areas harder to distinguish. Smaller window sizes might result in failure to detect recurring patterns with a low frequency of occurrence.
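The moving window of events can be illustrated with a bounded queue, as in the sketch below. The class `EventWindow` and the callable `cluster_fn` are hypothetical; `cluster_fn` stands in for the distance calculation and HDBSCAN clustering, which are not reproduced here.

```python
from collections import deque


class EventWindow:
    """Keep the most recent events and re-cluster whenever one is added."""

    def __init__(self, cluster_fn, size=150):
        # A deque with maxlen silently drops the oldest event when full.
        self.events = deque(maxlen=size)
        # Placeholder for the actual clustering step (assumption).
        self.cluster_fn = cluster_fn

    def add(self, event):
        """Append a newly detected event and cluster the current window."""
        self.events.append(event)
        return self.cluster_fn(list(self.events))
```

With `size=150`, the 151st event pushes the first one out of the window, so each clustering run sees at most 150 events.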
Distances for Instance-Based Clustering
In order to calculate the distance between two event vectors of different lengths, the vectors were clipped to equal lengths. Clipping was based on the maximum cross-correlation between both events (Fig. 2b). For every pair of event time series, the lag related to the maximum cross-correlation was removed (Fig. 2a), followed by clipping of the non-overlapping tails of both events (Fig. 2a) to obtain events of equal length (Fig. 2c).
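The lag removal and clipping can be sketched as follows. The sketch subtracts the mean of each event before computing the cross-correlation, which is an assumption made here to prevent the pressure offset from dominating the result; the function names are illustrative.

```python
def best_lag(a, b):
    """Lag of b relative to a that maximises the cross-correlation.

    A lag of k means that b[j] is aligned with a[j + k].
    Assumption: both series are mean-centred before correlating.
    """
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    a0 = [x - mean_a for x in a]
    b0 = [x - mean_b for x in b]
    best, best_score = 0, float("-inf")
    for lag in range(-(len(b) - 1), len(a)):
        lo, hi = max(0, lag), min(len(a), lag + len(b))
        score = sum(a0[i] * b0[i - lag] for i in range(lo, hi))
        if score > best_score:
            best, best_score = lag, score
    return best


def align_and_clip(a, b):
    """Remove the best lag and clip the non-overlapping tails of both events."""
    lag = best_lag(a, b)
    lo, hi = max(0, lag), min(len(a), lag + len(b))
    return a[lo:hi], b[lo - lag:hi - lag]
```

Both returned series have the same length and cover only the overlapping part of the two events.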
Optionally, Dynamic Time Warping (DTW) can then be applied in order to correct for temporal drift, which increases the accuracy of the succeeding distance calculations (Fig. 2d) (Aghabozorgi et al. 2015). In this study, DTW was limited to warping of up to 5% of the total event duration in both directions. After clipping and DTW, the Euclidean distance between events was calculated and corrected by dividing by the length of the events before being subjected to clustering.
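A possible implementation of the constrained warping and the length-corrected distance is sketched below. It assumes that the 5% limit acts as a band around the diagonal of the warping matrix (a Sakoe-Chiba band) and that the Euclidean distance is accumulated along the warping path; the text does not specify these details, so the sketch should be read as one plausible interpretation.

```python
import math


def banded_dtw_distance(a, b, band_frac=0.05):
    """Length-normalised distance between two equal-length events after DTW.

    band_frac -- maximum warping as a fraction of the event length, applied
                 in both directions (assumed to be a Sakoe-Chiba band)
    With band_frac = 0 this reduces to the Euclidean distance divided by n.
    """
    n = len(a)
    if n == 0 or n != len(b):
        raise ValueError("events must be non-empty and of equal length")
    band = int(band_frac * n)
    inf = float("inf")
    prev = [inf] * (n + 1)
    prev[0] = 0.0
    for i in range(1, n + 1):
        cur = [inf] * (n + 1)
        for j in range(max(1, i - band), min(n, i + band) + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            # Cheapest way to reach cell (i, j): from above, diagonal or left.
            cur[j] = cost + min(prev[j], prev[j - 1], cur[j - 1])
        prev = cur
    return math.sqrt(prev[n]) / n
```

Two events that differ only by a small temporal shift within the band obtain a distance of zero, whereas without warping the same pair obtains a positive distance.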
Distances for Feature-Based Clustering
For each event, 43 features were calculated (Appendix 1, Table 1). In each clustering window of 150 events, the features of these events were scaled by subtracting the median and dividing by the interquartile range, ensuring that scaling was robust to outliers.
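The scaling can be sketched as follows, with one row per event and one column per feature. The quartile interpolation method and the handling of constant features (left unscaled) are assumptions made for this illustration; `robust_scale` is a hypothetical name.

```python
from statistics import median, quantiles


def robust_scale(rows):
    """Scale each feature column by median subtraction and IQR division.

    rows -- list of feature vectors, one per event in the clustering window
    """
    scaled_columns = []
    for column in zip(*rows):
        med = median(column)
        q1, _, q3 = quantiles(column, n=4, method="inclusive")
        iqr = q3 - q1
        # Assumption: a constant feature (IQR of zero) is only centred.
        scale = iqr if iqr > 0 else 1.0
        scaled_columns.append([(x - med) / scale for x in column])
    return [list(row) for row in zip(*scaled_columns)]
```

For a feature with values 1, 2, 3, 4 and 100, the median is 3 and the interquartile range is 2, so the outlier does not affect the scaling of the other four values.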
The features were chosen to distinguish robustly between a limited number of recurring patterns. After scaling, the distances between the feature vectors of each event pair were calculated and the resulting distance matrix was subjected to the clustering method.
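A distance matrix of this kind could be built as in the sketch below. The Euclidean metric is an assumption, since the text does not name the metric used for the feature vectors; the resulting symmetric matrix has the form expected by clustering methods that accept precomputed distances.

```python
import math


def distance_matrix(vectors):
    """Symmetric matrix of pairwise distances between feature vectors.

    Assumption: Euclidean distance between the scaled feature vectors.
    """
    n = len(vectors)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = math.dist(vectors[i], vectors[j])
    return d
```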
Fingerprint graphs (Fig. 3) present an effective overview of the periods of recurrence for different types of patterns and their respective frequencies of occurrence. When a new anomalous event is detected, the clustering results of the corresponding 150-event window are added to the fingerprint graph as a vertical white slice. Each colored area depicts a recurring pattern, where each pattern’s height depicts its frequency of occurrence within the 150-event window and its length corresponds to the duration of the pattern recurrence (Fig. 1c).
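The data behind one vertical slice of a fingerprint graph might be computed as sketched here. The sketch assumes that events regarded as noise carry the label -1 and are not drawn, and that heights are expressed as fractions of the window; both are assumptions of this illustration, as is the name `fingerprint_column`.

```python
from collections import Counter


def fingerprint_column(cluster_ids, noise_label=-1):
    """Relative frequency of each cluster within one clustering window.

    cluster_ids -- cluster label of every event in the window
    noise_label -- label of unclustered events (assumed to be -1)
    Returns a dict that maps each cluster ID to its share of the window.
    """
    counts = Counter(c for c in cluster_ids if c != noise_label)
    total = len(cluster_ids)
    return {c: n / total for c, n in sorted(counts.items())}
```

Stacking such columns over successive windows gives the colored areas, whose heights follow the frequency of each pattern over time.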
The validation report lists the precision (fraction of true positives among detected positives), recall (fraction of true positives among actual positives) and F1-score (2 ∗ precision ∗ recall/(precision + recall)) for each true recurring pattern present in the manually labeled validation data (van Rijsbergen 1979). In order to calculate these scores, cluster ID numbers were mapped to the validation labels. Clusters mapping to the same pattern were deemed a single cluster for the purpose of score calculation only.
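The score calculation can be sketched as follows. The mapping from cluster IDs to validation labels is passed in as a dictionary, so that several clusters mapping to the same pattern are automatically treated as one; the function name and the convention of returning zero for undefined scores are assumptions.

```python
def pattern_scores(true_labels, cluster_ids, cluster_to_pattern):
    """Precision, recall and F1-score for each labeled recurring pattern.

    true_labels        -- manually assigned pattern label of every event
    cluster_ids        -- cluster ID assigned to every event
    cluster_to_pattern -- mapping of cluster IDs to validation labels;
                          unmapped clusters count as no pattern
    """
    predicted = [cluster_to_pattern.get(c) for c in cluster_ids]
    pairs = list(zip(true_labels, predicted))
    scores = {}
    for pattern in sorted(set(true_labels)):
        tp = sum(t == pattern and p == pattern for t, p in pairs)
        fp = sum(t != pattern and p == pattern for t, p in pairs)
        fn = sum(t == pattern and p != pattern for t, p in pairs)
        # Assumption: undefined scores (zero denominator) are reported as 0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[pattern] = (precision, recall, f1)
    return scores
```

In the usage below, clusters 0 and 1 both map to pattern A and are therefore scored as a single cluster.

```python
scores = pattern_scores(
    ["A", "A", "A", "B", "B", "A"],
    [0, 1, 0, 2, 2, 2],
    {0: "A", 1: "A", 2: "B"},
)
```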