1 Introduction

Managing infrastructure assets across vast geographical areas presents unique challenges due to their complexity, diversity, and scale. According to the National State of the Assets Report [4], 36% of public infrastructure assets are in poor or fair condition and require significant maintenance and investment to maintain service levels. Governments and local councils invest substantial amounts of money and time to ensure effective infrastructure management and deliver the expected levels of service to their communities. Most infrastructure inspection processes, however, are labor-intensive and time-consuming [23, 62]. To overcome these issues, recent approaches have focused on digitalizing processes across many government agencies. With the advent of advanced technologies, traditional manual practices are being transformed into fully or semi-automated processes, a shift that has been discussed in various academic and industry forums [23, 44]. The integration of digital technologies and data-driven solutions has led to more efficient, resilient, and sustainable infrastructure systems.

This study focuses on stormwater pipelines and, in particular, on their condition assessment process. Stormwater assets constitute 19% of total local government infrastructure in Australia [4] and therefore receive significant attention. As critical infrastructure, stormwater pipelines require adequate maintenance and regular inspections to prevent flooding and leakages. Current inspections rely on closed-circuit television (CCTV) technology with support from skilled personnel. However, these processes are tedious and costly [12, 61]. Although stormwater pipe networks span vast distances, only a small percentage is inspected annually under current approaches [52]. Determining the condition of stormwater systems is crucial not only for predicting future conditions but also for efficient financial planning of assets [15]. Numerous studies have utilized Deep Learning (DL) techniques incorporating computer vision-based defect detection methods, which have been widely employed in civil-related studies [10, 14, 21, 63]. However, few studies and practices have focused on stormwater pipe infrastructure. This is primarily due to the unique and challenging environmental conditions of stormwater pipes, such as variable lighting, debris accumulation, and turbulent flow conditions, which complicate the acquisition of consistent, high-quality data [26]. Additionally, the economic and logistical constraints of maintaining extensive stormwater networks have led to less frequent inspections and data collection [39]. Given the critical role stormwater pipes play in preventing urban flooding and infrastructure damage, further research is essential to develop robust and accurate defect detection models tailored to these unique conditions. To this end, this study proposes an evidence-based, semi-automated approach to facilitate cost-effective and efficient condition assessment of stormwater pipe networks.
This research harnesses instance segmentation, a cutting-edge computer vision technique, together with a real case study to enhance the condition assessment of stormwater pipe infrastructure. This enhancement is a primary contribution of this paper. Banyule Council was used as a case study to conduct a cost–benefit analysis of the proposed model, which underpins the evidence-based approach of this paper.

This paper is organized as follows: Sect. 2 discusses current practices in condition monitoring systems for stormwater pipelines through a comprehensive literature review. Section 3 details the methodology used in this research. Section 4 presents the results and analysis. Section 5 offers a discussion of the findings and future work, and Sect. 6 concludes the study. The proposed model serves as a tool for local councils in Australia to conduct condition monitoring assessments of stormwater pipelines in line with standard practices. It enables a proactive asset management framework, identifying maintenance needs, predicting the asset’s useful life, improving performance, and minimizing costs, ultimately contributing to resilient and safe infrastructure asset management.

2 Literature review

2.1 Condition monitoring approaches for stormwater pipelines

Stormwater pipe networks are vital components of urban infrastructure, given their primary function to convey the excess rainwater or runoff from storms. They help prevent flooding by quickly directing rainwater away from urban areas. Local councils and government agencies devote significant funds and resources to establish proactive maintenance strategies for critical infrastructure. This investment is geared towards enhancing service quality and preserving the assets’ optimal condition. Efficient and effective maintenance of these pipe networks, which exist both underground and above ground, is paramount for sustaining the level of services to the community and mitigating any risks of failures [61]. While many councils primarily employ a reactive maintenance strategy, there has been a recent focus towards implementing a more proactive maintenance approach. Such strategies are implemented to understand the structural and operational deterioration of the structure to make decisions, such as maintenance requirements, repair, replacement, and disposal strategies.

The conventional approach to pipeline inspection relies on manual procedures in which engineering personnel visually inspect the pipeline on-site. This method becomes unsafe when pipelines are underground or difficult to access [12, 20]. The method most commonly employed by councils worldwide for inspecting stormwater pipelines today is CCTV technology [61]. This approach can be regarded as a semi-automated process: engineering or skilled personnel visually inspect the pipeline on-site and assess it according to established guidelines, while CCTV footage is captured on-site and assessed off-site. Local councils mostly outsource these services, including the capture of CCTV footage, to specialized contractors [46, 61]. However, these conventional practices have several deficiencies. Experienced professionals must examine the images or the actual condition, which is time-consuming and subject to judgement that varies with expertise and experience [12]. The method therefore still requires professional involvement for assessing defects and making decisions. Such manual interpretation and evaluation, together with record keeping, is time-consuming and labor-intensive [13, 29]. Further, backlogs in assessing the recorded videos have been reported [24, 49].

Many technologies have recently been introduced for inspecting the condition of pipelines to address the limitations of current practices. Existing 360-degree cameras are now being replaced with high-resolution cameras, such as RGB cameras, to obtain high-resolution images [34]. This enables accurate assessment of defect measurements. Furthermore, technologies such as UAVs, robots, and Light Detection And Ranging (LiDAR) have been incorporated to capture images, reaching more areas with high efficiency and enabling non-contact inspection [35]. With the use of CCTV robots and UAVs for monitoring and inspecting infrastructure, a large number of images are generated annually [22], and there is a need to analyze this footage more effectively and efficiently. Computer vision technology can offer substantial benefits, including reducing time, improving cost-effectiveness, and minimizing human errors in judgement. Numerous studies have highlighted the importance of computer vision technologies in analyzing images captured through CCTV [6, 8, 19, 53]. Despite its complexity, this method offers faster processing times [20]. Various researchers have proposed algorithms and techniques to address existing challenges in this process. Specifically, DL algorithms, which leverage neural networks to learn autonomously from vast datasets, have emerged as a promising avenue for accurately detecting and assessing damage within stormwater pipelines, thereby facilitating prompt maintenance interventions and minimizing potential risks.

2.2 Vision-based DL applications for assessing the condition of pipe networks

The utilization of DL algorithms within the civil and infrastructure discipline is not yet widespread, despite their demonstrated effectiveness in defect identification tasks across various types of infrastructure. Structural health monitoring (SHM) of critical infrastructure, such as bridges, pavements, roads, pipelines (including sewer and stormwater systems), and building structures, represents a significant area of application for these advanced algorithms. Within this domain, numerous studies have explored the capabilities of DL algorithms to predict and identify a range of structural issues and defects [21, 59]. Image-based SHM using advanced computer vision technology has garnered interest for infrastructure such as concrete pavements, bridges, and sewer networks [43, 62, 63]. Researchers have focused on predicting concrete cracks, a common concern in many structures, as well as concrete spalling and fatigue cracks in steel elements [21]. Additionally, DL algorithms have been applied to forecast steel corrosion, identify asphalt defects, and detect various types of defects in concrete structures and pipelines [21]. These defects encompass a broad spectrum of issues, including but not limited to cracks, deformations, leaks, and material deterioration [44]. The use of DL algorithms for damage detection is of two kinds. The first uses raw measurement data, such as vibration and temperature, to predict the condition of infrastructure like bridges and roads [37]. The second uses computer vision (i.e., images) to predict the condition of the infrastructure. This study falls into the second category, employing computer vision technologies to detect the condition of the infrastructure. Image-based methods have been used in crack detection mainly to reduce complexity, and they have a higher potential for detecting thinner cracks [6].
Recent research papers on DL algorithms combined with computer vision technologies were reviewed, particularly those concerning sewer, water, and stormwater pipe inspections. Table 1 presents a summary of these studies.

Table 1 Summary of research papers on automated defect identification by DL and computer vision technology of sewer and stormwater pipe networks

As highlighted in Table 1, there is a scarcity of high-quality datasets specifically for stormwater pipes. Most existing datasets are tailored to sewer pipes and do not capture the unique characteristics and conditions of stormwater infrastructure. While defect types may be similar between sewer and stormwater pipes, sewer pipes are generally more controlled environments. In contrast, stormwater pipes are subject to a wider range of external factors, such as debris, and to greater variation in defect types. Applying models trained on sewer pipe datasets to stormwater pipes typically results in lower performance. The lack of available data for such experiments and the unrecognized need for monitoring to inform stormwater management, compared to other infrastructure, could be reasons for the limited research attention in this domain [33]. Given the critical role that both types of pipelines play in supporting urban development, it is imperative to give equal priority to both.

According to the review, studies have defined many types of defects for determining pipe condition during inspections. Cracks are the most commonly reported defect in pipe inspections. In addition, root intrusion, deformation, breakage, and infiltration are other types identified in the research [41]. The defects result from various internal and external factors, such as age, weather conditions, and use [21, 25, 28]. Sun et al. [46] divided defects in pipe networks into two categories: structural and functional defects. Structural defects are cracks, deformation, corrosion, leakages, joint eccentricity, open joints, protruding laterals, intruding sealing material, and intrusions. Functional defects are root intrusion, residual walls, obstacles, scum, attached deposits, and depositions. These defects, along with their measured sizes, are then used to evaluate the condition of the assets in terms of their structural integrity and serviceability. The assets are also graded on a scale to indicate their condition. This process helps in determining maintenance needs, deciding on replacements and repairs, and predicting long-term conditions [28, 41, 46]. Different countries, regions, and institutes use different grading scales to monitor the condition of pipelines, including the Pipeline Assessment Certification Program (PACP) of the National Association of Sewer Service Companies (NASSCO) in the United States, the Manual of Sewer Condition Classification by the Water Research Centre (WRc) in the UK, EN 13508-2 in Europe, and the WSA code in Australia [3, 46, 66].

DL frameworks with advanced algorithms have enabled accurate and efficient defect detection. Detecting defects using DL-based vision technologies involves methodologies that leverage neural networks to identify, classify, and predict defects. Several DL frameworks appear in previous studies. Convolutional neural networks (CNNs) have been applied in various construction applications [12, 57, 58]. The basic CNN steps include feature extraction from raw images; model training, in which the extracted features are fed forward to predict classes on the training dataset; loss calculation between predicted and actual values; model optimisation through error backpropagation; selection of optimal hyper-parameters (e.g., the number of layers, size of convolution kernels, activation function, and learning rate) based on the validation dataset; and final external validation on test data [28]. Building on CNNs, improved models such as Faster R-CNN have been developed to extract candidate object regions from the input images. R-CNN [5], Fast R-CNN [17], and Faster R-CNN [42] belong to the two-stage object detection approaches. Faster R-CNN introduces a region proposal network (RPN) whose proposals are refined in a second stage, leading to higher precision, especially for complex and small object detection tasks. Single-stage detection methods, such as the single-shot detector (SSD) and You Only Look Once (YOLO), tend to be even faster [28]. The YOLO series is renowned for its efficiency and effectiveness in various applications of computer vision, including defect identification in infrastructure. This series has been used to identify defects across a diverse array of infrastructure, such as underground pipelines [11], concrete structures [27], and bridges [48]. Its capabilities extend to tasks like image classification, object detection, and segmentation, which are integral to pinpointing and delineating defects.

The majority of DL-based applications for assessing pipe network condition have focused on object classification or detection methods, aiming to identify and localize defects within the images [5, 18, 31]. Only a few studies have adopted segmentation techniques, which offer the advantage of quantifying the scale of defects, including their width and extent [38]. Object segmentation is vital for quantifying damage, as well as for assessing the serviceability and structural integrity of infrastructure like pipelines [47]. The accuracy of the segmentation process directly influences the reliability of the damage assessment and, consequently, the decisions regarding necessary repairs or maintenance. Further, studies have used an image registration framework to extract crack development information and estimate the relative change of a given defect from images taken over a period [25]. It was observed that, following defect detection, pipelines could be graded on a condition scale of 1–5, as outlined in the WSA code. This grading process requires additional information about defects, such as their length, width, and other characteristics, which segmentation techniques can provide. Notably, this aspect has received minimal attention, especially in the context of stormwater pipe infrastructure networks. Although the current study shares similar motivations with previous research, the authors aim to address this gap by proposing a methodology to grade pipelines according to the coding standards, leveraging the detailed defect parameters obtained through segmentation methods.

It is important to acknowledge the drawbacks of YOLO compared to two-stage detectors like Faster R-CNN. YOLO can have difficulty accurately detecting tiny and overlapping objects at certain frame rates during video processing. Additionally, YOLO might have reduced localization precision since it lacks the refinement stage of two-stage detectors. It also faces challenges in handling complex scenes with many overlapping objects and might not handle class imbalance as effectively during training [56]. However, its ability to handle video streams efficiently makes it well suited to stormwater pipeline defect detection. A high-speed, low-latency model that can quickly process large volumes of CCTV footage supports timely maintenance and intervention, making it a suitable choice despite its limitations. Therefore, this study adopts the instance segmentation algorithm of the YOLOv8-seg model. The model is a multitask perception model that performs both detection and instance segmentation. Its head is designed to handle both tasks, producing bounding boxes for detection and pixel-wise masks for segmentation. As a single-stage detector, it eliminates the need for complex two-stage processes, making it simpler and more efficient to deploy. The model is also easy to train, requiring fewer resources and less time, which is beneficial for applications needing frequent retraining. YOLOv8-seg incorporates recent advances in computer vision and DL and has been reported to outperform earlier YOLO versions. Based on these benefits, this study selects YOLOv8-seg as the primary segmentation model.

3 Methodology

As shown in Fig. 1, the research aims to create a framework for a more efficient and cost-effective condition monitoring system for urban stormwater infrastructure networks. The investigation proceeds through three key steps toward this objective: (1) developing a grading scale for the stormwater pipelines, (2) developing a DL-based model to predict the condition of the pipelines, and (3) performing a cost–benefit analysis. Given the collaboration with Banyule Council, the council serves as the case study for this research project. Banyule City Council is a municipal authority situated in the northeastern suburbs of Melbourne, Victoria, and manages an extensive network of approximately 800 km of stormwater pipes. The steps are explained in detail below.

Fig. 1 Research process

3.1 Developing a grading scale

This stage aims to understand the existing process of the condition monitoring systems and how the infrastructure is graded based on detected defects. The local council’s current approach to condition monitoring is semi-automated, with their practice involving CCTV inspections of a 10-km segment of pipes annually. Figure 2 illustrates the steps involved in these current practices.

Fig. 2 Current process of condition monitoring of stormwater pipelines

Presently, condition monitoring is outsourced, and a contractor documents the condition observed from the CCTV videos. Engineering experts analyze the pipeline condition by identifying various types of defects and assigning values based on their severity. These defects are evaluated during on-site investigations while the CCTV videos are captured, and engineering experts then assess the condition off-site to provide a detailed evaluation of the underground pipelines. The grading of the pipelines is based on the WSA 05-2008 Conduit Inspection Reporting guidelines and expert judgement. According to the WSA code (2008), the structural defect scores for stormwater drains made of rigid materials are determined based on defects such as cracking, deformation, displaced joints, collapse, surface damage to concrete, and visible soil and voids. The WSA code also specifies the scale of defects that determines which grade the pipeline falls into. For example, a crack with a width of 1 mm would be classified as grade 1. The grading system ranges from 1 to 5, with 1 being the best and 5 being the worst. The values are determined based on the severity of the defects, which helps assess both the structural integrity and serviceability of the pipelines. Referring to the WSA code, this study focuses on identifying six primary types of defects crucial for evaluating the structural integrity of the pipeline. These include cracking, joint displacement, and breaking.

3.2 Data preparation

The data employed in this research were obtained from Banyule City Council. The study used approximately 900 video recordings from CCTV inspections carried out between 2017 and 2023. The selected videos provided a representative distribution of various defects. Images were extracted from the videos using the open-source FFmpeg software. The extracted frames have a resolution of 720 × 579 pixels. For our research, 2503 images were selected from the videos. However, this study intends to predict the defect classes without measuring their size.
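The frame extraction step could be scripted along the following lines. This is a minimal sketch, not the pipeline used in the study: only the use of FFmpeg comes from the text, while the sampling rate of one frame per second, the file paths, and the helper names `build_ffmpeg_command` and `extract_frames` are assumptions made for illustration.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(video_path, output_dir, fps=1.0):
    """Build an FFmpeg command that samples frames from one CCTV video.

    `fps` is an assumed sampling rate; the paper does not state one.
    """
    video = Path(video_path)
    pattern = Path(output_dir) / f"{video.stem}_%05d.jpg"
    return [
        "ffmpeg",
        "-i", str(video),        # input CCTV recording
        "-vf", f"fps={fps}",     # keep `fps` frames per second of video
        "-qscale:v", "2",        # high JPEG quality
        str(pattern),            # numbered output frames
    ]


def extract_frames(video_path, output_dir, fps=1.0):
    """Run FFmpeg on one video; requires the ffmpeg binary on PATH."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_ffmpeg_command(video_path, output_dir, fps), check=True)


# Hypothetical file names, shown only to illustrate the generated command.
print(" ".join(build_ffmpeg_command("videos/pipe_01.mp4", "frames")))
```

Separating command construction from execution keeps the sketch inspectable without FFmpeg installed; in practice, frames showing defects would still need to be selected from the sampled output, as described above.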

The next step involves identifying the images for the experiment. From the existing videos, it proved challenging to categorize images into the six categories without expert opinion. The videos, however, record the judgements of the operators and contractors who had previously identified defects on-site, and we followed their judgement when selecting defect types. Leveraging this pre-existing defect identification work by experienced stormwater pipe operators and inspectors enhances the accuracy of image identification. Consequently, assigning images to the above categories with a high accuracy rate becomes easier and less time-consuming. Once all defects are identified and the corresponding frames are extracted, the labelling process commences by placing bounding boxes around the specific defects.

The labelling process of images constitutes a crucial part of data preparation. For annotating defects within the extracted image frames, we utilized the Roboflow platform [1]. This platform streamlines the management and preparation of datasets for machine learning tasks by offering essential tools for annotating, enhancing, and transforming data to create datasets tailored for DL. Consequently, we annotated a total of 866 images for joint opening, 567 for breaking, 470 for cracking, 185 for holes in walls, 239 for spalling, and 176 for exposed rebar. The quality and the quantity of data significantly influence the training and subsequent performance of the DL model. We were able to maintain a balanced distribution of data across each category. Figure 3 displays the results of annotated images from each category.

Fig. 3 Images of each category with annotations

The subsequent step involves allocating the data into training, validation, and testing samples, a methodical step that plays a pivotal role in developing an effective model. By comparing the model’s predictions with the actual data in the validation set, signs of overfitting can be discerned. The portion of the dataset set aside as the testing set serves as the ultimate assessment of the trained model: after training and evaluation on the training and validation sets, the model is applied to entirely unseen data, the testing sample. Roboflow’s user-friendly interface streamlined the dataset distribution process. An 8:1:1 ratio was employed for the training, validation, and testing sets, respectively.
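For readers who want to reproduce an 8:1:1 split outside Roboflow, the allocation can be sketched in a few lines. This is an illustrative assumption, not the platform's internal procedure: the random seed, the file-name pattern, and the function name `split_dataset` are hypothetical, and the sketch neither stratifies by defect class nor groups frames by source video.

```python
import random


def split_dataset(items, ratios=(0.8, 0.1, 0.1), seed=42):
    """Shuffle items and split them into train/validation/test lists."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # reproducible shuffle
    n_train = int(len(shuffled) * ratios[0])
    n_val = int(len(shuffled) * ratios[1])
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]      # remainder goes to the test set
    return train, val, test


# 2503 is the number of selected images reported in Sect. 3.2.
image_ids = [f"frame_{i:04d}.jpg" for i in range(2503)]
train, val, test = split_dataset(image_ids)
print(len(train), len(val), len(test))  # 2002 250 251
```

Because consecutive frames from the same video can look nearly identical, a stricter variant would split by video rather than by frame to avoid leakage between the training and testing sets.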

3.3 Model development

The next stage is model development. To accurately record the attributes of identified defects and conduct thorough condition assessments, this study employs a segmentation approach. This model facilitates the quantification of defect size and details for subsequent analysis. The primary objective of this study is to train the model utilizing YOLOv8 to achieve both high accuracy and efficiency in the process.

The architecture of YOLOv8 achieves high accuracy by incorporating features such as CSP (Cross-Stage Partial connections), the PAN-FPN (Path Aggregation Network with Feature Pyramid Network) feature fusion method, and the SPPF (Spatial Pyramid Pooling-Fast) module. YOLOv8 adopts an anchor-free design with a decoupled head that handles the classification and bounding-box regression tasks independently, improving overall accuracy [45]. In the output layer, the sigmoid function is used as the activation function for the detection score, representing the probability of an object being contained within the bounding box. These features are designed to address some of the common challenges faced in earlier iterations of the YOLO series. CSP helps reduce the computational cost while maintaining, and sometimes even enhancing, the learning capability of the network. By splitting the feature map of a layer into two parts and then merging them after separate convolution processes, CSP enhances model efficiency and aids gradient flow during training. The PAN-FPN feature fusion method optimizes how features at different scales are integrated, enhancing the detector’s capability to identify objects at various scales with higher precision. The SPPF module improves the network’s ability to manage input images of varying sizes by pooling features at different scales and concatenating them, leading to a more robust feature representation [32].

In YOLOv8, bounding-box regression is refined using the Complete Intersection over Union (CIoU) loss [64] and Distribution Focal Loss (DFL) [30]. CIoU extends the plain IoU measure with penalties for the distance between box centres and for aspect-ratio mismatch. DFL addresses the ambiguity of box boundaries: rather than regressing each box edge as a single value, the network predicts a discrete distribution over candidate offsets, and DFL encourages that distribution to concentrate on the values closest to the target. The model’s overall loss is calculated as a weighted sum of these components and the classification loss, integrating their respective contributions to enhance overall detection and classification performance.
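To make the CIoU term concrete, the sketch below evaluates the commonly published form of the loss, 1 − IoU + ρ²/c² + αv, for a single pair of boxes. It is a plain-Python illustration under stated assumptions, not the YOLOv8 implementation: boxes are assumed to be given as (x1, y1, x2, y2) corner coordinates, and the function name `ciou_loss` and the `eps` stabiliser are illustrative choices.

```python
import math


def ciou_loss(box_pred, box_true, eps=1e-9):
    """CIoU loss for two axis-aligned boxes given as (x1, y1, x2, y2)."""
    px1, py1, px2, py2 = box_pred
    tx1, ty1, tx2, ty2 = box_true

    # Intersection over union
    inter_w = max(0.0, min(px2, tx2) - max(px1, tx1))
    inter_h = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = inter_w * inter_h
    pw, ph = px2 - px1, py2 - py1
    tw, th = tx2 - tx1, ty2 - ty1
    union = pw * ph + tw * th - inter + eps
    iou = inter / union

    # Squared centre distance (rho^2) over squared diagonal of the enclosing box (c^2)
    cw = max(px2, tx2) - min(px1, tx1)
    ch = max(py2, ty2) - min(py1, ty1)
    c2 = cw ** 2 + ch ** 2 + eps
    rho2 = ((px1 + px2 - tx1 - tx2) ** 2 + (py1 + py2 - ty1 - ty2) ** 2) / 4.0

    # Aspect-ratio consistency term v and its trade-off weight alpha
    v = (4.0 / math.pi ** 2) * (
        math.atan(tw / (th + eps)) - math.atan(pw / (ph + eps))
    ) ** 2
    alpha = v / (1.0 - iou + v + eps)

    return 1.0 - (iou - rho2 / c2 - alpha * v)


print(ciou_loss((0, 0, 2, 2), (0, 0, 2, 2)))  # close to 0 for identical boxes
print(ciou_loss((0, 0, 1, 1), (2, 2, 3, 3)))  # above 1 for disjoint boxes
```

Unlike plain IoU, the loss keeps growing as disjoint boxes move further apart, which gives the optimiser a useful gradient even when the prediction does not yet overlap the target.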

3.4 Model training

In this research, we evaluate the model performance using key metrics, such as precision, recall, and mean average precision (mAP) [61]. These metrics provide a quantitative assessment of the model’s accuracy and effectiveness, offering essential insights into its practical applicability in real-world scenarios.

Precision: assesses the proportion of true positive predictions (correctly identified defects) among all positive predictions. It is calculated as

$$\begin{array}{c}Precision=\left(\frac{\text{TP}}{\text{TP}+\text{FP}}\right)\times 100\%.\end{array}$$
(1)

Recall: quantifies the ratio of true positive predictions against the actual positives in the dataset, reflecting the model’s ability to capture all defects. The formula is:

$$\begin{array}{c}Recall=\left(\frac{\text{TP}}{\text{TP}+\text{FN}}\right)\times 100\%.\end{array}$$
(2)

Mean Average Precision (mAP): the segmentation mAP is calculated at an IoU threshold of 0.5 (mAP@0.5) and averaged over IoU thresholds from 0.5 to 0.95 (mAP@0.5–0.95). It gauges the model’s ability to create precise object segments in its predictions. This metric is represented by:

$$\begin{array}{c}mAP=\frac{1}{C}{\sum }_{i=1}^{C}\text{AP}\left(i\right).\end{array}$$
(3)

The F1 score is another crucial metric often used alongside recall and mean average precision (mAP) to evaluate the performance of object segmentation models. The F1 score provides a balance between precision and recall, giving a single score that gauges the accuracy of the model in identifying true positives while accounting for the FP and FN errors. The formula for the F1 score is given by the harmonic mean of precision and recall, calculated as follows:

$$\begin{array}{c}F1=2\times \left(\frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}\right)\end{array}$$
(4)

where TP denotes true positive samples, i.e., correctly identified defects; FP denotes false positive samples, i.e., negative samples erroneously identified as positive; FN denotes false negative samples, i.e., positive samples erroneously identified as negative; AP is the average precision of segmentation for a category; and C is the total number of segmentation categories.
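Equations (1)–(4) translate directly into code. The following minimal sketch mirrors the definitions above; the function names, the counts, and the per-class AP values are hypothetical and chosen only to illustrate the arithmetic, not taken from the study’s results.

```python
def precision(tp, fp):
    """Eq. (1): share of positive predictions that are correct, in percent."""
    return 100.0 * tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Eq. (2): share of actual positives that are detected, in percent."""
    return 100.0 * tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(p, r):
    """Eq. (4): harmonic mean of precision and recall."""
    return 2.0 * p * r / (p + r) if (p + r) else 0.0


def mean_average_precision(ap_per_class):
    """Eq. (3): average of the per-class AP values over C classes."""
    return sum(ap_per_class.values()) / len(ap_per_class)


# Hypothetical counts: 90 true positives, 10 false positives, 30 false negatives.
p = precision(tp=90, fp=10)
r = recall(tp=90, fn=30)
print(p, r, round(f1_score(p, r), 2))  # 90.0 75.0 81.82

# Hypothetical per-class AP values for three defect categories.
ap = {"cracking": 0.8, "breaking": 0.9, "joint opening": 0.7}
print(round(mean_average_precision(ap), 2))  # 0.8
```

The example shows why F1 is reported alongside precision and recall: a model with high precision but lower recall receives an F1 score between the two, pulled towards the weaker value.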

3.5 Cost–benefit analysis

The cost–benefit analysis was conducted primarily to assess the economic viability and efficiency of the proposed approach. The assessment drew on various sources, including software vendors and CCTV inspection contractors who publish relevant information on their websites and in online databases. Additionally, data for the current process were collected from the selected case study of Banyule Council’s stormwater pipe inspection. The cost–benefit ratio expresses the whole-lifecycle cost relative to the whole-lifecycle benefits. Figure 4 shows the entire workflow of this research.
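The ratio described above can be illustrated with a short sketch. The cost and benefit figures below are placeholders invented for the example and are not values from the Banyule case study; the function name `cost_benefit_ratio` is likewise an assumption, and no discounting is applied because the text does not specify one.

```python
def cost_benefit_ratio(lifecycle_costs, lifecycle_benefits):
    """Whole-lifecycle cost divided by whole-lifecycle benefit."""
    total_cost = sum(lifecycle_costs)
    total_benefit = sum(lifecycle_benefits)
    if total_benefit <= 0:
        raise ValueError("total benefit must be positive")
    return total_cost / total_benefit


# Placeholder figures (not from the case study): setup, licensing, and labour
# costs against the estimated savings on inspection effort.
costs = [20000, 5000, 5000]
benefits = [60000]
print(cost_benefit_ratio(costs, benefits))  # 0.5
```

With cost in the numerator, a ratio below 1 indicates that the lifecycle benefits exceed the lifecycle costs.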

Fig. 4 Workflow with the YOLOv8 network

4 Experiments and results

4.1 Model training and the results

YOLOv8-seg provides five pre-trained models of varying sizes: nano, small, medium, large, and extra-large. As shown in Table 2, these YOLOv8-seg models vary in their number of parameters (Params) and floating-point operations (FLOPs), indicating their complexity and computational requirements. Parameters represent the model’s learnable weights, reflecting its capacity to learn complex patterns. FLOPs quantify the computational effort needed for a single inference pass, impacting the model’s efficiency and speed. Higher Params and FLOPs, as seen in larger models, suggest better performance and accuracy but require more computational resources. Conversely, smaller models are more efficient for resource-constrained environments but may have lower performance. We trained all these models using the dataset to evaluate cost efficiency for practical deployment, focusing on a trade-off between model parameters, speed, and accuracy.

Table 2 Scales of different YOLOv8-seg variants

For the training process, we used Google Colab, a scalable, cloud-based platform for model training. YOLOv8-seg is designed for instance segmentation, which makes it well suited to defect detection within stormwater pipes. The hyperparameter configuration for all YOLOv8-seg variants comprised a learning rate of 0.01, a batch size of 16, a momentum of 0.937, a weight decay of 0.0005, a training duration of 25 epochs, and an image size of 800 × 800 pixels. Following training, the models are validated on the validation subset, and the separate testing samples are then used to further evaluate their performance and generalization.
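A training run with these hyperparameters might be configured as follows. This sketch assumes the open-source `ultralytics` Python package and a dataset description file exported in YOLO format; the paper does not state which tooling was used, and the `data.yaml` path and the helper name `train_variant` are hypothetical.

```python
# Hyperparameters reported in the text, using ultralytics argument names.
TRAIN_SETTINGS = {
    "epochs": 25,
    "imgsz": 800,
    "batch": 16,
    "lr0": 0.01,            # initial learning rate
    "momentum": 0.937,
    "weight_decay": 0.0005,
}

# Pre-trained weights for the nano, small, medium, large and extra-large variants.
VARIANTS = [
    "yolov8n-seg.pt",
    "yolov8s-seg.pt",
    "yolov8m-seg.pt",
    "yolov8l-seg.pt",
    "yolov8x-seg.pt",
]


def train_variant(weights, data_yaml="data.yaml"):
    """Fine-tune one YOLOv8-seg variant (requires `pip install ultralytics`)."""
    # Imported lazily so the settings can be inspected without the package.
    from ultralytics import YOLO

    model = YOLO(weights)
    return model.train(data=data_yaml, **TRAIN_SETTINGS)


if __name__ == "__main__":
    for weights in VARIANTS:
        print(f"Would train {weights} with {TRAIN_SETTINGS}")
        # train_variant(weights)  # uncomment when a GPU and the dataset are available
```

Keeping one shared settings dictionary for all five variants mirrors the comparison in the text, where only the model size changes between runs.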

As shown in Fig. 5, the YOLOv8s-seg model showed superior performance compared to the other variants. This can be attributed to the balance between model complexity and computational efficiency provided by the small variant. Larger variants, despite having more parameters and floating-point operations, did not achieve the best performance in this case. This may be due to several factors. First, larger models with more parameters are prone to overfitting, especially if the dataset does not have enough diversity to justify the increased complexity. This can lead to reduced generalization capability and lower performance on validation data. Additionally, larger models, with their greater number of parameters, may require more training epochs to converge and may not have reached their best solution within the same number of epochs. Lastly, the small variant has enough capacity to learn complex patterns without the excessive computational burden, making it more efficient in terms of both training time and resource use.

Fig. 5 Training results for different YOLOv8-seg variants: (a) bounding box detection accuracy and (b) mask segmentation accuracy

Based on these observations, YOLOv8s-seg provides the best trade-off between performance and computational efficiency, making it suitable for practical applications where resources are limited but high accuracy is still required. This selection ensures that the model not only performs well but can also be deployed effectively in real-world scenarios, maintaining both scalability and generalizability.

We ran the YOLOv8s-seg model on the prepared test dataset. Figure 6 shows example test results for each category. Table 3 presents the performance metrics obtained with YOLOv8s-seg, evaluating both bounding box and mask detections for the identified objects. The high performance for all classes in both bounding box and mask evaluations indicates that the model is effective across categories and can detect and segment objects with high precision and recall. The consistent performance across different metrics and IoU thresholds indicates the model’s robustness and reliability for practical applications. These findings highlight the potential of YOLOv8s-seg for object detection and segmentation tasks, making it a valuable tool for stormwater pipe network assessment.
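
To make the mask metrics concrete, the sketch below shows how intersection over union (IoU) between a predicted and a ground-truth mask can be computed, and how precision and recall follow once an IoU threshold is fixed. It is a simplified illustration, not the evaluation code used in this study: masks are plain nested lists, and each prediction is assumed to have already been matched to a distinct ground-truth defect.

```python
def mask_iou(pred, truth):
    """IoU between two binary masks given as equal-sized 2D lists of 0/1."""
    inter = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    # Two empty masks have no overlap to measure; report 0.0 by convention.
    return inter / union if union else 0.0


def precision_recall(ious, n_truth, threshold=0.5):
    """Precision and recall at one IoU threshold.

    `ious` holds the IoU of each prediction with its matched ground-truth
    defect (assumed one-to-one); `n_truth` is the number of labelled defects.
    """
    true_pos = sum(1 for iou in ious if iou >= threshold)
    precision = true_pos / len(ious) if ious else 0.0
    recall = true_pos / n_truth if n_truth else 0.0
    return precision, recall
```

For example, `precision_recall([0.9, 0.6, 0.3], n_truth=4)` counts two of three predictions as correct at a threshold of 0.5, and two of four labelled defects as found. Averaging precision over recall levels and classes at that threshold gives the mAP@0.5 style of metric.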

Fig. 6 Prediction from testing sample

Table 3 Testing results of the YOLOv8s-seg model

Although YOLOv8s-seg demonstrated high accuracy in detecting and segmenting pipeline defects in complex environments, some challenges were encountered that underscore the inherent difficulties of practical implementation. As shown in Fig. 7, occlusions caused by external coatings or debris and varying light conditions, such as low light or shadows in underground settings, occasionally led to detection failures. These occlusions often occur when pipelines are partially covered by insulation materials or adjacent infrastructure, or when textual descriptions within the image obscure critical views of the defects. Similarly, the lighting issues often stem from the limited light sensitivity of the camera, which can struggle to capture clear images in dimly lit areas or when shadows are present, further complicating accurate defect detection.

Fig. 7 Illustration of missed defect detections in pipeline monitoring due to occlusions and adverse lighting conditions

4.2 Cost–benefit analysis

The economic viability of the model is evaluated using Banyule Council as a case study. Currently, pipe inspections are performed by an outsourced contractor, who assesses the condition of the pipes during on-site CCTV footage capture and subsequently off-site according to the WSA code. The average cost, representing industry-standard expenses for conventional pipe inspections, was determined from supplier websites [16, 40, 50, 51, 55] and historical data from Banyule Council. The average annual cost of outsourcing CCTV footage inspection services is projected to be 55,000 AUD, inclusive of consultancy fees. The proposed model adopts a semi-automated approach, utilizing deep learning (DL) to assess CCTV footage of the pipeline. After the CCTV videos are captured, a pre-processing stage converts the videos into images, which are then fed to the developed model to identify defect types. This method eliminates the need for expert involvement. The proposed method includes the following lifecycle cost–benefit elements:

  1. Development Cost: The development of the model involves data collection, preparation (image extraction and labelling), and training of the AI algorithm. The prototype was developed by a research team comprising experts in AI, computer vision, and asset management. The cost represents the hours spent developing the model and is quantified as the labor cost of six months of work by a PhD-level researcher. As this is a small development that can be performed with low-capacity computers, it is assumed that there are no expenses for hardware and other physical equipment. The average cost is projected to be 50,000 AUD.

  2. Maintenance and Operational Costs:

    • Maintenance Cost: The annual maintenance cost is assumed to be 1,000 AUD, covering the cost of the research team fixing any errors in the model.

    • Operational Cost: The system is operated by in-house staff, with no requirement for additional hiring, resulting in zero operational costs.

    • Contractor Fee for CCTV Footage: The council needs to hire a CCTV contractor only to capture footage of its pipe network. This cost is derived from current costs. Currently, the contractor spends three months on the entire process (footage capture and assessment), with two months allocated to capturing pipe CCTV videos. Assuming that one-third of the work is assessment, the remaining footage capture costs approximately 36,000 AUD per annum.

  3. Operational Cost Savings: This is a benefit item resulting from not paying the entire current contractor cost. With the proposed method, the council pays only the CCTV footage cost, so the saving amounts to 19,000 AUD per annum (calculated as 55,000 AUD − 36,000 AUD).

  4. Labor Hour Savings: In the semi-automated process, image pre-processing and pipeline condition assessment are presumed to take one week, a saving of three weeks compared to the current method. This saving is quantified using the average weekly salary of managerial-level personnel and amounts to 6,900 AUD.

Figure 8 shows the lifecycle costs and benefits of the proposed method over a period of five years compared to the current approach. Considering the above costs and benefits, a cost–benefit ratio was quantified. Due to the preliminary development stage, the project life was considered to be 5 years, and a 4% real discount rate was applied to account for the time value of money [2].

$$\begin{aligned} \text{Cost–benefit ratio} &= \frac{\text{life cycle cost}}{\text{life cycle benefit}} \\ &= \frac{\text{Development cost} + \text{annual maintenance cost} + \text{CCTV contractor fee}}{\text{Labor hour savings} + \text{operational cost savings}} \\ &= \frac{50{,}000 + \sum_{i=0}^{5} 1000 + 36{,}000}{\sum_{i=0}^{5} 6900 + 19{,}000} \\ &= 5.9 \end{aligned}$$

Fig. 8 Summary of the cost–benefit analysis
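
A discounted cost–benefit ratio of this kind can be sketched as below. The timing convention is an assumption made for illustration: the development cost is taken as incurred up front, and every annual item is taken as recurring at the end of each year and discounted at the stated real rate. Because the result depends on these conventions, the sketch is not intended to reproduce the ratio reported above.

```python
def present_value(annual_amount, rate, years):
    """Present value of an amount recurring at the end of each year."""
    return sum(annual_amount / (1 + rate) ** t for t in range(1, years + 1))


def cost_benefit_ratio(development, annual_costs, annual_benefits,
                       rate=0.04, years=5):
    """Life cycle cost divided by life cycle benefit.

    Assumes the development cost falls in year 0 (undiscounted) and all
    other items recur annually over the project life.
    """
    life_cycle_cost = development + present_value(sum(annual_costs), rate, years)
    life_cycle_benefit = present_value(sum(annual_benefits), rate, years)
    return life_cycle_cost / life_cycle_benefit


# Figures from the cost and benefit items listed in this section (AUD).
ratio = cost_benefit_ratio(
    development=50_000,
    annual_costs=[1_000, 36_000],      # maintenance, CCTV contractor fee
    annual_benefits=[6_900, 19_000],   # labor hour savings, operational savings
)
```

A ratio above 1 means that life cycle costs exceed life cycle benefits. Varying `years` shows how quickly the one-off development cost is amortized as the project life is extended.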

According to the cost–benefit ratio, the costs outweigh the benefits over the assumed project life. However, compared to the previous approach, the council spends 51% less annually, which is a positive outcome. Beyond the quantified benefits, several non-quantified benefits are also identified. The method is efficient and fast, saving three weeks of labor hours. The model’s predictions are consistent and highly accurate, unlike human assessment, where variability exists. Currently, the local council inspects 10% of its pipeline network. With the proposed system, this coverage can be extended significantly because of the hours saved. The council can therefore increase the number of asset inspections annually, reducing reactive maintenance requirements and minimizing catastrophic events. This supports a sustainable asset management approach.

There are several limitations in identifying cost and benefit parameters, as several assumptions are made due to the unavailability of data. However, details from the selected case study, along with insights gained from several publications, facilitated the establishment of more realistic assumptions and values. The comprehensive cost analysis highlights significant annual cost reduction and improved efficiency. Further analysis will explore how these implications extend to real-world applications and the broader landscape of stormwater pipe management.

5 Discussion and future works

This study aims to highlight the effectiveness of leveraging DL technologies for managing infrastructure assets, with a focus on stormwater pipe networks. Condition monitoring systems are pivotal for assessing the structural integrity and serviceability of pipelines, and the WSA code specifies various types of defects for classifying the condition of pipelines. This study defined six key defect types for assessing the structural condition of the infrastructure. Based on the WSA code and the current assessment templates of the case study, a grading scale is proposed for these defects, as shown in Table 4. Based on this ranking, informed decisions regarding maintenance needs or replacement can be made effectively, contributing to improved asset management practices. Note that the model trained for this study focused on defect identification without considering the size of the defects. In line with the WSA code, the peak score was measured rather than the mean score. Further, manual counting of defects along each pipeline enabled the assignment of a structural score for the entire pipeline.
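
The difference between peak and mean scoring can be illustrated with a short sketch. The defect names and grade values below are hypothetical placeholders chosen for illustration; they are not the grading scale proposed in this study.

```python
from statistics import mean

# Hypothetical severity grades per defect type (higher = more severe).
# These values are illustrative only, not the study's grading scale.
EXAMPLE_GRADES = {
    "crack": 3,
    "joint_displacement": 2,
    "root_intrusion": 4,
}


def pipeline_score(detected_defects, grades):
    """Return (peak, mean) grade over the defects found along one pipeline."""
    scores = [grades[name] for name in detected_defects]
    if not scores:
        return 0, 0.0  # no defects detected
    return max(scores), mean(scores)


# Two cracks and one root intrusion detected along a pipeline.
peak, average = pipeline_score(
    ["crack", "crack", "root_intrusion"], EXAMPLE_GRADES
)
```

Under peak scoring, the single most severe defect determines the pipeline's grade, so one serious defect is not masked by several minor ones, as it can be under mean scoring.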

Table 4 Grading scale for stormwater pipe network using six defects

The study achieved an acceptable level of accuracy in predicting each category of defects, thus validating the efficacy of the model. Notably, using a real case study facilitated the assessment of pipe infrastructure conditions, enhancing the accuracy of model inputs and outputs. Defect annotation followed the expert judgments recorded in the CCTV videos, which allowed accurately labelled images to be used as input. To mitigate issues arising from low-quality images, images of similar pixel dimensions were used. The use of YOLOv8 and instance segmentation for defect identification proved instrumental in providing highly accurate model predictions.

The third stage of the process assessed the economic viability of the project by computing a cost–benefit ratio. Although the initial costs exceeded the benefits, the new approach reduced annual costs by 50%. However, only the quantifiable benefits were included in this analysis. Non-quantifiable benefits, such as increased efficiency, speed, high accuracy, and consistency, also contribute to the positive financial outcomes of the method.

A similar approach has been adopted by a few councils in Australia. For example, a similar system implemented by the City of Ryde was able to collect and review nearly 17 km of CCTV footage in under six months, representing a roughly 300% improvement in historic review turnaround. Automating the review process eliminated 400 h of manual CCTV review for council engineers and digitized over 6000 features, nodes, and defects within the cloud [52].

The study presents several limitations that offer opportunities for future research. First, the proposed methodology focuses on defect detection and segmentation. However, an effective condition monitoring system should be able to quantify defect sizes and grade them based on severity. Future research should incorporate Large Language Models to develop a grading query system, which can provide maintenance suggestions based on the grading scale [7]. Second, a more sophisticated model, integrating smart sensors and the Internet of Things (IoT), could significantly enhance the system's capabilities. For example, combining IoT-enabled vibration sensors with visual data would allow structural integrity to be monitored continuously and anomalies to be detected in real time. Additionally, smart temperature and humidity sensors can provide valuable data for assessing environmental conditions that may impact defect severity, thereby offering a more comprehensive and accurate evaluation [9]. Furthermore, integrating this model with asset management cloud applications could support strategic asset management plans. Despite these limitations, this study paves the way for more efficient pipe network assessments, reducing the time required for evaluations. The application of DL techniques for assessing stormwater infrastructure assets offers numerous benefits, including increased automation, accuracy, and efficiency. While challenges, such as data quality and model interpretability, persist, ongoing advancements in DL research and technology show great potential for improving asset management practices and bolstering the resilience of urban infrastructure systems.

6 Conclusion

Maintenance and condition monitoring of stormwater pipelines have become increasingly critical due to ageing infrastructure and the extensive span of these networks. Local councils allocate resources and funds annually to maintain these pipelines, aiming to provide efficient and effective services to their communities. Unmanaged pipelines pose significant risks, such as water leakage and flooding, threatening urban infrastructure. Many councils have turned to technology-based solutions to address daily challenges, and this study seeks to provide an evidence-based example by leveraging DL methods to automate condition monitoring, particularly for stormwater pipe infrastructure. While DL methods are gaining traction within civil and infrastructure disciplines, there remains limited research focused on advancing the capabilities of Structural Health Monitoring (SHM) systems. This study aims to transform the current approach to semi-automated processing by incorporating DL algorithms, ultimately yielding favorable outcomes for informed decision-making. The model achieves an average accuracy of around 90% in detecting various defect types. It also demonstrates a mean average precision (mAP@0.5) of 0.92 for bounding boxes and 0.90 for masks. This process is 60% less costly compared to current approaches. Overall, the integration of computer vision with image-based prediction enhances the efficiency and effectiveness of infrastructure inspection processes, ultimately contributing to safer and more resilient civil infrastructure systems.