1 Introduction

Road infrastructure plays a significant role in the transportation sector, and its maintenance is essential for safe and efficient transportation. Potholes are one of the most common issues that road authorities face, as they can lead to accidents and cause significant damage to vehicles. Potholes typically result from the natural wear and tear of roads. However, weather conditions such as rain, snow, and freeze–thaw cycles can exacerbate the problem by causing cracks to form and expand. Ensuring the timely detection and repair of potholes is crucial to preventing such issues [1,2,3,4].

Current methods for pothole detection typically rely on visual inspection conducted either by human operators or by systems employing cameras or LiDAR technology [5,6,7,8]. However, these approaches can be costly, time-consuming, and prone to inaccurate results. Advances in computer vision and machine learning techniques have therefore driven a significant shift towards automating this process. The development of automatic pothole detection methods has the potential to greatly enhance road safety and reduce the costs associated with manual inspections. One promising approach is to use deep learning algorithms, such as You Only Look Once (YOLO) [9], to detect potholes and road bumps in images captured by cameras mounted on vehicles. However, training these algorithms requires a large amount of labelled data, which can be challenging to obtain, especially in various weather conditions.

While automatic pothole detection often achieves high accuracy, current methods do not always reflect the real problem. Figure 1 (left) shows a close-up image taken while standing over a pothole, which is typically how data is acquired in most pothole datasets. However, this approach does not accurately reflect the reality of potholes on roads. The image on the right shows a more realistic scenario of pothole instances captured from a moving vehicle, representing how pothole detection tasks should be perceived. When the methods are evaluated in a manner that accurately reflects the problem, the detection performance suffers. This is because noise in images or videos, often exacerbated by low resolution, causes small potholes to appear as insignificant objects that blend into the background.

Fig. 1

Instances of potholes: The image on the right presents a more realistic example of a pothole detection task compared to the one on the left

One of the main challenges in pothole detection is the variability in weather conditions. Rain, snow, fog, and low light can affect road surface visibility, making pothole detection more difficult. For instance, rain or snow can obscure potholes or create shadows, affecting detection accuracy. Collecting annotated data for potholes in various weather conditions can be time-consuming and impractical due to the high costs and safety concerns involved. To address this issue, we propose using generative AI to create synthetic images under different weather and lighting conditions. By using Generative Adversarial Networks (GANs) to generate synthetic data, we can create a more diverse and extensive dataset, which can improve the performance of pothole detection algorithms in adverse weather conditions.

In this study, we developed an automated pothole detection system using advanced computer vision algorithms such as YOLOv8, Real-Time Detection Transformer (RT-DETR) [10] and our modification of YOLOv8. YOLOv8 continues to build on the achievements of its predecessors, introducing innovative features and enhancements. RT-DETR represents an advanced end-to-end object detection system, delivering real-time performance without compromising precision. In addition to these state-of-the-art (SOTA) architectures, our study has modified YOLOv8 by integrating self-attention through vision transformer blocks with hierarchical feature maps and a Global Attention Mechanism (GAM) into the neck scheme. The main objective is to assess the performance of selected architectures in handling unfavourable weather conditions.

The work contribution can be highlighted as follows:

  • We proposed using GANs to translate pothole images into different weather conditions to generate new synthetic data. The number of training and validation images was increased by a factor of 7. Augmented data plays a pivotal role in enhancing the model's robustness against various challenges, including alterations in lighting and weather conditions.

  • We introduced a vision transformer along with GAM modules into the YOLOv8 neck to enhance the effectiveness of capturing valuable latent features and the global context prior to reaching the detection head. To improve the receptive field and capture more valuable image context, the Spatial Pyramid Pooling-Fast (SPPF) has been replaced by Attention-Based Dense Atrous Spatial Pyramid Pooling (ADASPP), which employs multi-scale features with a parameter-free attention module. Because potholes are relatively small compared to the overall scene, we incorporated the additional detection head for very small objects. The use of depthwise convolutions and the reduction in the number of convolution filters in detection layers helped to compensate for increased model complexity.

  • We evaluated the effectiveness of the proposed system in detecting potholes captured under real-world adverse weather conditions, including rain, evening, and nighttime. The performance of the proposed system was compared with existing methods for pothole detection.

The rest of this paper is organized as follows: Sect. 2 provides an overview of the existing research on pothole detection and image-to-image translation, highlighting the key approaches and techniques used in these areas. Section 3 describes the GAN architecture used for image-to-image translation and the proposed modification of the YOLOv8 algorithm used for pothole detection. Section 4 outlines the dataset employed in this study, defines evaluation metrics, and discusses the process of model selection and training to optimize network architecture. Section 5 presents the quantitative results obtained from the experiments, along with a detailed analysis and discussion of the findings. Section 6 summarises the key contributions, draws conclusions based on the results and discussion, and suggests avenues for future research.

2 Related works

2.1 Pothole detection

With the rapid advancement of computer vision technology, there has been a concerted effort among researchers to explore diverse methods for detecting potholes in roads, aiming to facilitate timely repair and maintenance. Notably, deep learning-based object detectors have gained popularity as an effective approach for this task [11]. These detectors can be broadly classified into two subcategories: Region-based Convolutional Neural Networks (R-CNN) models, which employ a two-stage detection process, and single-stage models, which utilize uniform detection methods. Two-stage detectors, such as R-CNN [12], Fast R-CNN [13], and Faster R-CNN [14], typically employ a region proposal network or selective search in the initial stage to identify regions of interest (ROIs). In the subsequent stage, the selected ROIs are classified into specific object categories, and minimal bounding boxes are predicted for the detected objects. On the other hand, single-stage detectors like SSD [15], YOLO [9], and RetinaNet [16] eliminate the need for a region proposal network and directly perform detection on a dense sampling of all possible locations. Providing a balance between speed and accuracy, one-stage object detectors are a widely utilized tool for real-time applications.

Pena-Caballero et al. [17] evaluated object identification using YOLOv2 and YOLOv3, as well as semantic segmentation algorithms. They found that while segmentation achieved high accuracy, it also came with increased computational complexity. Ye et al. [2] proposed a CNN-based pothole detection method with pre-pooling before the first convolutional layer, achieving higher accuracy than conventional CNNs. Park et al. [18] proposed an automated pothole detection method using YOLOv4, YOLOv4-tiny, and YOLOv5 models. Their evaluation showed that YOLOv4-tiny performed the best, achieving a mean Average Precision at IoU 0.5 (mAP@0.5) of 0.787. However, limitations were noted, including reduced accuracy in detecting small, distant potholes. Salcedo et al. [19] introduced a series of deep learning models for developing a road maintenance prioritization system in India. Their system includes UNet for road segmentation to help determine fake or duplicate records. Object detection was conducted using EfficientDet and YOLOv5 models, resulting in mAP scores of 0.60 and 0.63, respectively, across three categories: single crack, crocodile crack, and pothole. The YOLOX algorithm was utilized in [20]. Experimental results indicate that the YOLOX-Nano achieved high accuracy in pothole detection while maintaining low computational costs and a compact model size of only 7.22 MB. Deepa and Sivasangari [21] introduced a hybrid deep learning framework that incorporates various stages, including histogram equalization-based image pre-processing, fuzzy c-means clustering-based segmentation, feature extraction, and classification using the Hybrid Deep Capsule autoencoder. The proposed model achieved an accuracy of 98.81% on the RDD2020 dataset, surpassing existing methods.

Despite promising results, the task of real-world pothole detection still presents challenges, primarily due to the small size of potholes relative to road images. This size discrepancy imposes limitations on training Convolutional Neural Networks (CNNs) with high-resolution images, primarily because of memory constraints. To overcome these challenges, Chen et al. [1] proposed resizing input images to fit the network and utilizing image patches from high-resolution images during network training. This approach involves a two-stage system: initially employing a localization network to locate the pothole instances in the low-resolution image, and then using a classification network based on candidate patches to determine the classes. Salaudeen and Celebi [3] used an Enhanced Super-Resolution GAN to improve the quality of road surface images and address the challenges associated with detecting small objects. Several studies have proposed deep CNN-based object detectors for detecting small objects, including potholes, in remote sensing imagery. For instance, Tayara et al. [22] introduced a convolutional regression neural network to detect vehicles from satellite imagery. Tang et al. [23] also proposed a modified Faster R-CNN detector that utilized a hyperregion proposal network to improve recall and employed a cascade-boosted classifier to reduce false detections. A YOLOv4-tiny model was used by Silva et al. [24] to detect potholes from aerial views captured by a flying drone with 95% accuracy. Despite the existing research, the literature lacks a thorough evaluation of how visual attention influences pothole detection.

Additionally, exploring novel state-of-the-art models could enhance pothole detection performance further. Xie et al. [25] proposed MADet, a one-stage detector that utilizes a feature-interaction alignment operation to enhance consistency between feature-prediction pairs through mutual-assistance learning. They also utilize joint optimization for predicting target bounding boxes, incorporating both anchor-based and anchor-free approaches. This enhances the detection of objects with diverse aspect ratios and addresses issues with object occlusion. In [26], a Separate Feature Refinement network (SFRNet) is proposed, featuring transformer-based branches for specific functions such as fine-grained classification and oriented localization. The end-to-end transformer RT-DETR, as proposed in [10], utilizes an efficient hybrid encoder for multi-scale feature processing and IoU-aware query selection to enhance object initialization. Furthermore, it enables flexible adjustment of inference speed without requiring retraining. RT-DETR-L achieved an impressive performance of 53.0% Average Precision (AP) on COCO val2017, with a processing speed of 114 frames per second (FPS) on a T4 GPU. YOLOv8 is the latest iteration in the YOLO series of object detection algorithms from the Ultralytics group. It introduces significant improvements over its predecessors, including enhanced model architecture and optimization techniques, which contribute to its superior performance on various benchmarks. YOLOv8 also offers better adaptability to different scales of objects, making it highly versatile for the detection of small objects. YOLOv8L achieved an AP of 52.9% with an image size of 640 pixels on the MS COCO dataset test-dev 2017. Furthermore, YOLOv8L exhibits a speed of 418 FPS on an NVIDIA A100 TensorRT, highlighting its efficiency and computational prowess for object detection tasks.

In pothole detection research, there is a lack of testing conducted under adverse weather and lighting conditions. Some studies use thermal images to detect potholes in challenging weather conditions, such as fog and at night [8, 27]. Although thermal images can provide valuable additional image features, further testing should be conducted using images captured from a moving vehicle. While our previous work [28] primarily provided an overview and comparison of the performance of one- and two-stage detection architectures for pothole detection under adverse conditions, it did not incorporate targeted improvements into these architectures. This gap has motivated us to explore the potential of generative networks. Our aim is to increase the diversity of available data and enhance the robustness of selected detection models: RT-DETR and YOLOv8. Additionally, we aim to incorporate effective multiscale processing and visual attention modules into the YOLOv8 model to evaluate their suitability for pothole detection in challenging visual conditions.

2.2 Generative adversarial networks

In 2014, Goodfellow et al. introduced the concept of GANs, which has since become a significant development in the field of unsupervised deep generative models [29]. The architecture of a typical GAN consists of two competing neural networks inspired by a two-player minimax game: a generator network and a discriminator network. The objective of the generator is to produce realistic samples that can fool the discriminator, while the discriminator aims to distinguish between real and fake samples (as depicted in Fig. 2).

Fig. 2

The architecture of a Generative Adversarial Network
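To make the two-player game concrete, the following minimal PyTorch sketch illustrates one GAN training step; the toy network sizes, data shape, and hyperparameters are illustrative assumptions only, not a recipe used in this work.

    import torch
    import torch.nn as nn

    # Toy generator and discriminator; practical GANs use far deeper (convolutional) networks.
    G = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
    D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))
    opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
    opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
    bce = nn.BCEWithLogitsLoss()

    def train_step(real):                      # real: (batch, 784) tensor of flattened images
        batch = real.size(0)
        fake = G(torch.randn(batch, 100))      # generator maps noise to a sample

        # Discriminator: push real samples towards label 1 and generated samples towards 0
        d_loss = bce(D(real), torch.ones(batch, 1)) + bce(D(fake.detach()), torch.zeros(batch, 1))
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()

        # Generator: try to make the discriminator assign label 1 to generated samples
        g_loss = bce(D(fake), torch.ones(batch, 1))
        opt_g.zero_grad(); g_loss.backward(); opt_g.step()
        return d_loss.item(), g_loss.item()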

The field of image-generation techniques has witnessed notable breakthroughs in other GAN architectures. In 2015, Radford et al. [30] proposed a Deep Convolutional Generative Adversarial Network (DCGAN), which utilizes transposed convolutional layers to upsample the input noise vector, resulting in the generation of high-resolution images. Progressive Growing of GANs (PGGAN), introduced by Karras et al. in 2018 [31], adopts a progressive training strategy, beginning with low-resolution image generation and progressively increasing the resolution. This approach yielded remarkable results in generating high-resolution images, including 1024 × 1024 face images. StyleGAN, proposed by Karras et al. in 2020 [32], presents a variation of GANs that introduces a style-based generator architecture. Unlike traditional GANs, StyleGAN employs two inputs: a learned constant vector and a style vector that controls the image's style. It also introduces a synthesis network for generating images at different resolutions. StyleGAN achieved impressive outcomes in generating high-quality images. In 2023, GigaGAN was introduced [33] as a one-billion-parameter model that excels in text-to-image generation in data-rich scenarios while offering rapid inference times. It takes only 0.13 s to generate a 512 px image and 3.66 s to generate a 4K image.

Notably, GANs have demonstrated promising results in challenging generative tasks such as text-to-photo translation, image generation, image composition, and image-to-image translation [29, 34, 35]. Image-to-image translation (I2I) has emerged as a fundamental problem in computer vision and computer graphics, encompassing a wide range of applications. The goal is to learn the mapping from an input image (X) to a specific target image (Y), such as mapping grayscale images to RGB images. The concept of image translation traces its roots back to Hertzmann et al.'s image analogies [36]. This approach proposes a non-parametric model that utilises pairs of images to achieve image transformations. The seminal work by Mirza and Osindero [37] introduced conditional GANs; Pix2Pix later exemplified this idea by learning mappings between a source and a target domain. This approach achieved impressive results in tasks such as scene translation, season transfer, and sketch-to-photo translation. Nevertheless, relying on paired images for training poses challenges and high costs in various scenarios.

To address the limitations associated with paired training data, Zhu et al. proposed CycleGAN, a framework that introduces a cycle consistency loss to enforce the translation of an image from the source domain to the target domain and back to the source domain. CycleGAN demonstrated consistent image generation and garnered substantial attention within the scientific community. Building upon CycleGAN, the Unsupervised Image-to-Image Translation Network (UNIT) replaces domain-specific latent spaces with a shared latent space across domains, further advancing the field of I2I translation.

While conditional GANs, CycleGAN, and UNIT have showcased impressive results, they predominantly focus on single-modal translations, neglecting the multi-modality inherent in I2I translation. BicycleGAN attempts to address this limitation by leveraging paired images during training, encouraging bijective consistency between latent and target spaces. BicycleGAN is known for generating images of high quality, featuring detailed and varied content. However, its main drawback is the significant computational resources it requires. More recent advancements include Multimodal Unsupervised Image-to-Image Translation (MUNIT) [38] and Diverse Image-to-Image Translation via Disentangled Representations (DRIT) [39]. These methods have introduced solutions for multi-modal, unpaired scenarios by learning disentangled representations with a domain-invariant content space and domain-specific attribute/style space. They have enabled the generation of diverse and high-quality images, even in scenarios where paired training data is unavailable. MUNIT is particularly suited for tasks where multiple output variations are desirable, offering a broader range of possible translations. As stated in the original paper [38], the MUNIT model achieves quality and diversity comparable to that of the fully supervised BicycleGAN, while also surpassing unsupervised models such as UNIT and CycleGAN. MUNIT’s unsupervised nature offers an advantage in scenarios where labelled data is scarce or expensive to obtain.

3 Materials and methods

In this section, we will explain the methods used to create synthetic data using GANs and the process of training the GAN model. Next, we will discuss the implementation of the YOLO and RT-DETR algorithms for testing the augmented dataset.

3.1 Image-to-image translation

Achieving successful I2I translation requires an understanding of the underlying features shared between the source and target representations. In the realm of I2I translation, the ability to distinguish between domain-independent (content) and domain-specific (style) features is of paramount importance. Domain-independent features capture the spatial structure of the underlying content and should be preserved during the translation process. Learning the mapping between two or multiple domains poses significant challenges. Firstly, acquiring a paired dataset for training may be difficult or impractical, making supervised learning approaches infeasible. Secondly, performing multi-modal translation, where a single input image maps to multiple output images, adds further complexity to the problem.

Here, MUNIT [38] emerges as a pioneering approach. It decomposes the image representation into a content space shared across domains and a style space that captures domain-specific characteristics. To translate an image to the target domain, its content code is combined with a style code randomly sampled from that domain, which enables versatile image translation and yields an array of diverse outputs.
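A minimal sketch of this translation step is given below; the encoder and decoder are crude placeholders standing in for MUNIT's actual networks, and the style modulation is a simplified stand-in for AdaIN.

    import torch
    import torch.nn as nn

    class ContentEncoder(nn.Module):                 # placeholder for MUNIT's content encoder
        def __init__(self):
            super().__init__()
            self.conv = nn.Conv2d(3, 64, 3, padding=1)
        def forward(self, x):
            return self.conv(x)

    class Decoder(nn.Module):                        # placeholder decoder: content code + style code -> image
        def __init__(self, style_dim=8):
            super().__init__()
            self.style_mod = nn.Linear(style_dim, 64)   # simplified stand-in for AdaIN modulation
            self.to_img = nn.Conv2d(64, 3, 3, padding=1)
        def forward(self, c, s):
            scale = self.style_mod(s).unsqueeze(-1).unsqueeze(-1)
            return torch.tanh(self.to_img(c * scale))

    enc_a, dec_b = ContentEncoder(), Decoder()
    x_a = torch.rand(1, 3, 256, 256)                 # clear-weather image from domain A
    c_a = enc_a(x_a)                                 # domain-invariant content code
    s_b = torch.randn(1, 8)                          # style code sampled from N(0, I)
    x_ab = dec_b(c_a, s_b)                           # A's content rendered in B's style

Sampling several style codes for the same content yields several distinct translations of one input image, which is the property exploited later to synthesise multiple weather variants of each clear-weather pothole image.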

MUNIT, along with similar frameworks, shares a common architectural foundation: a generator comprising a style encoder and a content encoder (Fig. 3). The content encoder employs strided convolutional layers enhanced by residual blocks and instance normalization to downsample the input and capture content features. In contrast, the style encoder utilizes strided convolutional layers, global average pooling, and a fully connected layer to generate style codes, omitting instance normalization to preserve style information. The decoder reconstructs the image from the content and style codes through residual blocks, upsampling, and convolutional layers. Multi-scale discriminators trained with the LSGAN objective further enhance the realism of the generated images.

Fig. 3

MUNIT architecture overview: Content and style encoders transform input data, while a decoder generates images through AdaIN processing and convolutional layers

A notable addition to the framework is the inclusion of the Domain-Invariant Perceptual Loss, which is a modified perceptual loss that exhibits greater domain-invariance and employs input images as references. By applying instance normalization to VGG features, domain-specific information is effectively removed, thereby improving the training process, particularly when working with high-resolution datasets.

The optimization process of MUNIT is underpinned by a comprehensive loss function, which includes the following components:

  • Bidirectional Reconstruction Loss: This loss enforces encoder-decoder pairs to act as inverses of each other, promoting reconstruction in both image → latent → image and latent → image → latent directions.

    • Image Reconstruction: Successful encoding and decoding should result in the accurate reconstruction of an image from the data distribution.

    • Latent Reconstruction: With a latent code comprising both style and content sampled during translation, successful decoding/encoding should allow for the faithful reconstruction of this latent code.

  • Adversarial Loss: MUNIT leverages an adversarial (GAN) objective to align the distribution of translated images with the target domain's image distribution. Essentially, images generated by MUNIT should be indistinguishable from real images in the target domain (a schematic composition of these loss terms is sketched below).
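For illustration, the following sketch shows how these terms might be combined into a single generator objective; the loss weights and tensor names are placeholder assumptions rather than MUNIT's exact configuration.

    import torch
    import torch.nn.functional as F

    def munit_generator_loss(x, x_recon, c, c_recon, s, s_recon, d_fake_logits,
                             w_img=10.0, w_content=1.0, w_style=1.0):
        loss_img = F.l1_loss(x_recon, x)          # image -> latent -> image reconstruction
        loss_content = F.l1_loss(c_recon, c)      # latent content reconstruction
        loss_style = F.l1_loss(s_recon, s)        # latent style reconstruction
        # LSGAN-style adversarial term: translated images should be scored as real (1)
        loss_adv = torch.mean((d_fake_logits - 1.0) ** 2)
        return w_img * loss_img + w_content * loss_content + w_style * loss_style + loss_adv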

Recent advancements have propelled MUNIT to the forefront as a crucial framework for addressing the challenge of multimodal unsupervised I2I translation. However, it is important to note that this approach necessitates the use of multiple encoders and decoders for each domain, imposing significant computational demands. Furthermore, there is room for improvement in the conventional method of sampling style codes from a standard distribution during the translation process.

3.2 Pothole detection

Object detection is a crucial task in computer vision, involving identifying and localising objects within an image. The YOLO family of object detection algorithms has gained significant attention due to its real-time performance and accuracy. This work focuses on two versions of the YOLO algorithm: the original YOLOv8 and our modification of YOLOv8, designed specifically for processing small objects in adverse visual conditions. Additionally, the performance of the detection transformer RT-DETR was evaluated.

3.2.1 YOLOv8

YOLOv8 [40], developed by Ultralytics, represents the most recent iteration of the YOLO series (refer to Fig. 4). The YOLOv8 backbone is composed of a chain of Conv blocks (convolution, batch normalization, SiLU activation) and C2f blocks (a faster CSP bottleneck with two convolutions). It also includes a Spatial Pyramid Pooling-Fast (SPPF) module to enhance the computational efficiency of the network. SPPF employs a series of max-pooling operations with kernels of identical size, as opposed to the parallel max-pooling operations with varying kernel sizes implemented in SPP.
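The following sketch illustrates this idea: three chained 5 × 5 max-pooling layers reproduce the 5/9/13 receptive fields of SPP's parallel pools at lower cost. Batch normalization and SiLU activations of the actual Conv blocks are omitted for brevity, so this is a simplified approximation of the Ultralytics module rather than a drop-in copy.

    import torch
    import torch.nn as nn

    class SPPF(nn.Module):
        def __init__(self, c_in, c_out, k=5):
            super().__init__()
            c_hidden = c_in // 2
            self.cv1 = nn.Conv2d(c_in, c_hidden, 1)
            self.cv2 = nn.Conv2d(c_hidden * 4, c_out, 1)
            self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)

        def forward(self, x):
            x = self.cv1(x)
            y1 = self.pool(x)                     # receptive field roughly 5 x 5
            y2 = self.pool(y1)                    # roughly 9 x 9
            y3 = self.pool(y2)                    # roughly 13 x 13
            return self.cv2(torch.cat([x, y1, y2, y3], dim=1))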

Fig. 4

The network structure of the YOLOv8

The neck follows a Path Aggregation Network-Feature Pyramid Network (PAN-FPN) structure similar to that of YOLOv5 [41]. The difference is that YOLOv8 uses C2f blocks instead of C3 blocks, and the number of Conv blocks has also been reduced.

The decoupled head of YOLOv8 processes classification and regression tasks independently. An anchor-free approach for generating object proposals is utilized, meaning that predefined anchor boxes are not used. For real-time detectors that require Non-Maximum Suppression (NMS) post-processing, anchor-free detectors are more efficient in terms of inference time than anchor-based detectors with the same level of accuracy, because they require significantly less post-processing time [10]. YOLOv8 utilizes the Complete IoU (CIoU) [42] and Distribution Focal Loss (DFL) [43] loss functions for the bounding box loss and binary cross-entropy (BCE) for the classification loss.
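For reference, a simplified sketch of the CIoU term for axis-aligned boxes is given below; the DFL and BCE components are omitted, and implementation details (broadcasting, gradient handling of alpha) differ from the Ultralytics code.

    import math
    import torch

    def ciou(box1, box2, eps=1e-7):
        # Boxes are given as (x1, y1, x2, y2); intersection over union first
        inter_w = (torch.min(box1[..., 2], box2[..., 2]) - torch.max(box1[..., 0], box2[..., 0])).clamp(0)
        inter_h = (torch.min(box1[..., 3], box2[..., 3]) - torch.max(box1[..., 1], box2[..., 1])).clamp(0)
        inter = inter_w * inter_h
        w1, h1 = box1[..., 2] - box1[..., 0], box1[..., 3] - box1[..., 1]
        w2, h2 = box2[..., 2] - box2[..., 0], box2[..., 3] - box2[..., 1]
        union = w1 * h1 + w2 * h2 - inter + eps
        iou = inter / union

        # Squared centre distance, normalised by the diagonal of the smallest enclosing box
        cw = torch.max(box1[..., 2], box2[..., 2]) - torch.min(box1[..., 0], box2[..., 0])
        ch = torch.max(box1[..., 3], box2[..., 3]) - torch.min(box1[..., 1], box2[..., 1])
        c2 = cw ** 2 + ch ** 2 + eps
        rho2 = ((box1[..., 0] + box1[..., 2] - box2[..., 0] - box2[..., 2]) ** 2 +
                (box1[..., 1] + box1[..., 3] - box2[..., 1] - box2[..., 3]) ** 2) / 4

        # Aspect-ratio consistency term
        v = (4 / math.pi ** 2) * (torch.atan(w2 / (h2 + eps)) - torch.atan(w1 / (h1 + eps))) ** 2
        alpha = v / (1 - iou + v + eps)
        return iou - rho2 / c2 - alpha * v       # the bounding box loss is typically 1 - CIoU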

3.2.2 RT-DETR

Baidu's Real-Time Detection Transformer [10] (RT-DETR) is a SOTA object detector that offers high accuracy and real-time performance (refer to Fig. 5). Unlike traditional real-time detectors, RT-DETR eliminates the need for NMS post-processing, making detection simpler and faster during inference.

Fig. 5

The network structure of the RT-DETR

In RT-DETR, the efficient hybrid encoder was redesigned with real-time performance in mind. It uses the final three stages of the HGNetv2 backbone as input for its encoding process. The multiscale features are processed using Attention-based Intra-scale Feature Interaction (AIFI) and a CNN-based Cross-scale Feature-fusion Module (CCFM). The IoU-aware query selection process chooses a fixed number of image features from the encoder output: the model selects the top K encoder features based on their classification score, and the prediction boxes corresponding to these features have high classification and IoU scores. By utilizing IoU-aware query selection during training, the model can generate encoder features of better quality. Finally, a transformer decoder with auxiliary prediction heads iteratively refines the object queries, producing boxes and confidence scores with increased accuracy.
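A schematic sketch of the top-K gathering step behind this query selection is given below; the tensor names and the value of K are illustrative assumptions, and the IoU-aware part lies in how the classification targets are defined during training rather than in the gathering itself.

    import torch

    def select_queries(enc_features, cls_logits, k=300):
        # enc_features: (batch, tokens, dim); cls_logits: (batch, tokens, num_classes)
        scores = cls_logits.sigmoid().max(dim=-1).values     # best class score per encoder token
        topk = scores.topk(k, dim=1).indices                 # indices of the K highest-scoring tokens
        idx = topk.unsqueeze(-1).expand(-1, -1, enc_features.size(-1))
        return enc_features.gather(1, idx)                   # (batch, k, dim) initial object queries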

3.2.3 Modification of YOLOv8

The modification proposed in this work for YOLOv8 (refer to Fig. 6) is aimed at improving the detection of small objects under adverse visual conditions. The architecture we started with, called YOLOv8-P2, performs the detection of extra-small, small, medium, and large objects. We decided to apply this four-detection head model due to the presence of primarily small-sized potholes in the dataset. At the end of the backbone, multi-level feature extraction was performed using the Attention-based Dense Atrous Spatial Pyramid Pooling (ADASPP) block to enhance the convolutional receptive field in the feature pyramid computation. The CST/GAM blocks were used for the enhanced extraction of both local and global context from existing feature maps. Depthwise Convolutions (DWConv) perform a separate convolutional operation (using a depthwise kernel) for each input channel. This approach reduces the number of parameters without significantly compromising accuracy.
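As an illustration, one common depthwise-separable form of such a block is sketched below; the exact block used in the modified network may differ in kernel size and normalization.

    import torch.nn as nn

    def dw_conv(c_in, c_out, k=3, stride=1):
        # Per-channel spatial filtering (groups=c_in) followed by a 1x1 channel-mixing convolution
        return nn.Sequential(
            nn.Conv2d(c_in, c_in, k, stride, padding=k // 2, groups=c_in, bias=False),
            nn.Conv2d(c_in, c_out, 1, bias=False),
            nn.BatchNorm2d(c_out),
            nn.SiLU(),
        )

    # A standard 3x3 convolution from 256 to 256 channels needs 256*256*9 ≈ 590k weights,
    # whereas the depthwise-separable version needs 256*9 + 256*256 ≈ 68k, roughly 8.7x fewer.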

Fig. 6

The network structure of the YOLOv8 modification

3.2.4 ADASPP

As mentioned in previous subsections, the original YOLOv8 employs the SPPF module for multi-level feature extraction. In this study, we have replaced the original SPPF module with a solution aimed at improving the detection of small objects and extracting more salient features under low visibility conditions. We proposed an attention-based module called ADASPP to enhance the convolutional receptive field. This module is based on Atrous Spatial Pyramid Pooling (ASPP) that utilizes multiple parallel filters with varying dilation rates. Figure 7(a) shows the implementation of the ASPP module from the YOLO Air repository [44]. In addition to adaptive average pooling, ASPP incorporates atrous convolutions with dilation rates of 1, 6, 12, and 18.

Fig. 7

(a) The ASPP module designed for multi-scale feature extraction; (b) the modified ADASPP module with dense connections and attention for improved multi-level feature extraction

As described in [45], densely connected atrous convolutions in DenseASPP involve a larger receptive field in the computation of the feature pyramid. This fact led us to modify the ASPP so that it now contains skip connections with both previous dilated blocks and input feature maps. In the dense atrous block (shown as DAB in Fig. 7(b)) with dilation of 3, 6 and 12, a 1 × 1 convolution is applied before the dilated layer to reduce the size of the feature map to half of its original size. In the ADASPP, the adaptive average pooling has been replaced by a Simple Parameter-Free Attention Module (SimAM) to capture salient image contexts without significantly increasing the model's complexity. SimAM dynamically emphasizes relevant features and computes 3-D attention weights for the feature map without adding extra parameters to the network.
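As an illustration of the parameter-free attention employed in ADASPP, a minimal SimAM sketch following the published formulation is given below; the regularisation constant is an assumed default.

    import torch

    def simam(x, e_lambda=1e-4):
        # x: (batch, channels, height, width)
        b, c, h, w = x.shape
        n = h * w - 1
        d = (x - x.mean(dim=(2, 3), keepdim=True)) ** 2       # squared deviation of each activation
        v = d.sum(dim=(2, 3), keepdim=True) / n               # channel-wise variance estimate
        e_inv = d / (4 * (v + e_lambda)) + 0.5                # inverse energy of each neuron
        return x * torch.sigmoid(e_inv)                       # 3-D attention weights, no extra parameters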

3.2.5 Swin transformer

In contemporary image analysis, extracting maximal information from images is crucial. To address this necessity, we incorporated the Convolutional Swin Transformer (CST) blocks (shown in Fig. 8(a)) capable of extracting latent features within image data. The latent features are hidden within the image and have the potential to offer valuable insights for recognition purposes. A distinctive feature of CST is the integration of Swin Transformer units, as exemplified in Fig. 8(b), where essential components such as Layer Normalization (LN) and Multi-Layer Perceptron (MLP) modules seamlessly converge. It is worth noting that the initial unit employs the window-based self-attention (WSA) mechanism, while its counterpart adopts the shifted window self-attention (SWSA) paradigm. In other words, Swin Transformer employs attention calculations within a local window rather than across the entire image to optimize time complexity. In the subsequent layer, the window shifts and crosses previous windows to expand the receptive field. This SWSA paradigm results in a network configuration that effectively emphasizes the intrinsic significance present in image data.

Fig. 8

(a) The CST blocks, crucial for extracting hidden features in image data; (b) the integration of Swin Transformer units, along with Layer Normalization and MLP modules

The procedural operation of this algorithm is outlined in Eqs. (1)-(4), where x represents the feature mapping of each layer after processing. Following the works discussing the benefit of LN for CNN [46, 47], we decided to remove LN from the Swin Transformer layers in our implementation.

$${x}^{i}=WSA \left(LN\left({x}^{i-1}\right)\right)+ {x}^{i-1}$$
(1)
$${x}^{i+1}=MLP \left(LN\left({x}^{i}\right)\right)+ {x}^{i}$$
(2)
$${x}^{i+2}=SWSA \left(LN\left({x}^{i+1}\right)\right)+ {x}^{i+1}$$
(3)
$${x}^{i+3}=MLP \left(LN\left({x}^{i+2}\right)\right)+ {x}^{i+2}$$
(4)
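A compact sketch of one such unit, with LN removed as in our implementation, is shown below. A generic multi-head attention layer stands in for the window-based (WSA) and shifted-window (SWSA) attention, so the windowing itself is not reproduced here.

    import torch
    import torch.nn as nn

    class SwinUnitNoLN(nn.Module):
        def __init__(self, dim, heads=4, mlp_ratio=4):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)   # placeholder for WSA/SWSA
            self.mlp = nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
                                     nn.Linear(dim * mlp_ratio, dim))

        def forward(self, x):                  # x: (batch, tokens, dim)
            x = self.attn(x, x, x)[0] + x      # Eq. (1) / (3) with the LN term dropped
            x = self.mlp(x) + x                # Eq. (2) / (4) with the LN term dropped
            return x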

3.2.6 Global attention mechanism

GAM is an advanced attention mechanism used in deep learning models to enhance global feature interactions and reduce message diffusion. It incorporates a sequential channel-spatial attention mechanism, modifying the processing of the Channel Attention Mechanism (CAM) and Spatial Attention Mechanism (SAM) from the Convolutional Block Attention Module (CBAM) submodule [48].

GAM is shown in Fig. 9. In this context, the input feature mapping, denoted as F1 ∈ R^(C×H×W), represents the intermediate output from previous layers. MC represents the channel attention map, MS represents the spatial attention map, and ⊗ signifies element-wise multiplication. The CAM mechanism performs 3D permutation with a multilayer perceptron, operating on F1 to produce an intermediate state F2. This step captures enhanced feature interactions and relevant patterns while suppressing irrelevant information (Eq. (5)). Further SAM convolutional spatial attention processing results in F3 as the output (Eq. (6)). These improved features can be used in subsequent layers or downstream tasks, significantly enhancing feature interactions and representation learning. The effectiveness of GAM led to improved performance across various deep-learning applications.

Fig. 9

Frame of Global Attention Mechanism

$${F}_{2}={M}_{C}\left({F}_{1}\right)\otimes {F}_{1}$$
(5)
$${F}_{3}={M}_{S}\left({F}_{2}\right)\otimes {F}_{2}$$
(6)
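For illustration, Eqs. (5) and (6) can be realised roughly as follows; the reduction ratio and layer choices follow common GAM implementations and are assumptions rather than the exact configuration used in our network.

    import torch
    import torch.nn as nn

    class GAM(nn.Module):
        def __init__(self, c, reduction=4):
            super().__init__()
            self.channel_mlp = nn.Sequential(nn.Linear(c, c // reduction), nn.ReLU(),
                                             nn.Linear(c // reduction, c))
            self.spatial = nn.Sequential(
                nn.Conv2d(c, c // reduction, 7, padding=3), nn.BatchNorm2d(c // reduction), nn.ReLU(),
                nn.Conv2d(c // reduction, c, 7, padding=3), nn.BatchNorm2d(c))

        def forward(self, f1):                                    # f1: (batch, C, H, W)
            # Channel attention: permute to (B, H, W, C), pass through an MLP, permute back
            mc = self.channel_mlp(f1.permute(0, 2, 3, 1)).permute(0, 3, 1, 2).sigmoid()
            f2 = mc * f1                                          # Eq. (5)
            ms = self.spatial(f2).sigmoid()                       # convolutional spatial attention
            f3 = ms * f2                                          # Eq. (6)
            return f3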

4 Experimental preparation

In this section, we delve into the details of the experimental preparation for our research on pothole detection using I2I translation. We discuss the datasets and pre-processing steps, evaluation metrics, the development of synthetic data, and the experimental setup.

4.1 Datasets and pre-processing

The Berkeley Diverse Driving Dataset (BDD100k) [49] was utilized for I2I conversion in this work. BDD100k contains a wide range of traffic object categories, including lane types, traffic lights and signs, and drivable areas. The dataset comprises images captured under diverse lighting and weather conditions, as detailed in Table 1. To perform image translation, we used the train subfolder (70,000 images) and the validation subfolder (10,000 images) as our new training and test subsets. Table 1 provides a summary of the available training images, including their description. Using the labels available in JSON format, we were able to extract 69,863 unique instances belonging to three categories: weather, scene, and time of day. Due to the scarcity of data in the Foggy subset, additional images were included from datasets available on Roboflow [50, 51].

Table 1 Description of training subfolder from BDD100k dataset sorted into the corresponding categories

We evaluated the object detection task using a publicly available pothole dataset that was collected with a focus on adverse visual conditions [52]. The dataset comprises 1052 full HD images captured in clear weather and 1047 images affected by adverse weather or low light. Image annotations contain two classes: pothole and manhole cover.

4.2 Evaluation metrics

Precision and recall were used to evaluate the performance of detection models. Precision is the ratio of true positives (TP) to all predicted objects (Eq. (7)), while recall is the ratio of TP to the total number of objects in the dataset (Eq. (8)). TP represents the correctly identified objects, while FP refers to objects wrongly detected as potholes. FN is defined as objects that the detector failed to identify as potholes.

$$precision = \frac{TP}{TP+FP}$$
(7)
$$recall = \frac{TP}{TP+FN}$$
(8)

Average precision (AP) is a commonly used evaluation metric in object detection that provides a comprehensive assessment of the detection model's performance. Intersection over Union (IoU) measures the overlap between the predicted and ground truth bounding boxes and determines whether a detection counts as correct. AP is calculated by computing the area under the precision-recall curve at a specific IoU threshold. In practice, AP is approximated by averaging the interpolated precision values at a fixed set of recall levels, as in the 11-point interpolation of Eq. (9).

$$AP = \frac{1}{11} \sum_{r \in \{0, 0.1, \dots, 1\}} p_{\mathrm{interp}}(r)$$
(9)

Mean average precision (mAP) is a variant of AP that provides an assessment of the detection model's performance across all classes. In this study, we used mAP@0.5 to evaluate the model's performance at a single IoU detection threshold of 0.5. Additionally, we used mAP@[0.5:0.95], which is averaged over several IoU thresholds from 0.5 to 0.95 with a step of 0.05.

$$mAP= \frac{1}{n} {\sum }_{i=1}^{n}{AP}_{i}$$
(10)
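A small sketch implementing Eqs. (9) and (10) on a toy precision-recall curve is shown below; the example values are arbitrary and serve only to illustrate the 11-point interpolation.

    import numpy as np

    def ap_11_point(recall, precision):
        # Eq. (9): average the interpolated precision at recall levels 0, 0.1, ..., 1
        ap = 0.0
        for r in [i / 10 for i in range(11)]:
            mask = recall >= r
            ap += (precision[mask].max() if mask.any() else 0.0) / 11
        return ap

    def mean_ap(ap_per_class):
        # Eq. (10): mAP is the mean of the per-class AP values
        return sum(ap_per_class) / len(ap_per_class)

    recall = np.array([0.1, 0.4, 0.6, 0.8])
    precision = np.array([1.0, 0.9, 0.7, 0.5])
    print(ap_11_point(recall, precision))       # approximately 0.65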

4.3 Development of synthetic data

The unavailability of a sufficient number of paired images depicting varying visual conditions, from clear to adverse, led us to consider the scenario of unsupervised I2I translation. For the task of unsupervised translation, numerous datasets of images captured under adverse visual conditions are already available. The BDD100k dataset was chosen for its extensive collection of traffic scenery images, which include diverse lighting and weather conditions such as clear, rainy, snowy, overcast, partly cloudy, and foggy. Additionally, the dataset offers a variety of time-of-day settings (daytime, dawn/dusk, night) and locations (city streets, highways, residential areas, etc.). In our work, we employed the MUNIT model for generating synthetic pothole images, primarily because of MUNIT's capability to learn a diverse set of high-quality image translations without requiring paired images, which contributes to the overall variation of training samples.

Our objective was to leverage the training images of potholes captured under clean weather and transfer them to low-light, foggy, or rainy conditions. The MUNIT models, trained on the diverse weather conditions present in the BDD100k dataset, were utilized individually to conduct I2I translation on the clear data from the pothole dataset. This strategy allows us to simulate pothole scenarios under various conditions, thus enhancing the dataset's diversity and realism.

The default configuration of MUNIT was used in our experiments. However, the training iterations were reduced to 300,000, the batch size was increased to 4, and the resize parameter for the shortest image side was increased to 640. The visual quality of the generated images was assessed through human judgment.

After a visual inspection of the generated images, it was found that the different nature of rainy and nighttime images from BDD100k prevented the model from being properly trained for our conditions. Therefore, new training images of night and rainy environments from our region were collected to match the required conditions. A more realistic appearance of the translated images was achieved. Thus, in the final experiments, images depicting dawn/dusk, foggy, and overcast conditions from the BDD100k dataset were utilized. Figure 10 depicts the process of translating a single input image into two distinct output images, selected from a total set of ten for each condition. Moreover, examples from the target domain are presented. The generated dataset underwent thorough evaluation to guarantee the realism and diversity of the synthetic images. Instances of newly generated low brightness and visually corrupted images were excluded from the dataset.

Fig. 10

The I2I translation process displaying unpaired images from the target domain alongside a single input that is transformed into multiple outputs

Moreover, for the generation of more realistic rainy images, the physically based rendering (PBR) approach implemented in [53, 54] was used. Here, disparity images are required for the computations. Disparity, which encodes image depth information, is computed from a pair of stereo images as the offset between corresponding left and right pixels. If a stereo image is not available, monocular depth estimation can be utilized via deep learning. In this work, disparity images from the monocular pothole images were generated by Monodepth2 [55], the result of which can be seen in Fig. 11. Rainy images generated by MUNIT were used before the application of PBR to obtain a more realistic appearance.

Fig. 11

The disparity image extracted by Monodepth2

In our experiments with pothole detection, the clear-weather pothole data was divided into 70-15-15 train-validation-test partitions. The remaining images of real-world adverse conditions were used for the evaluation of the models. As mentioned earlier, the generative method was employed to generate additional samples featuring diverse weather and lighting conditions based on the clear-weather data. Only one iteration of images, selected from a total set of ten variations, was chosen for each adverse condition. Up to a seven-fold increase in the size of the training and validation subsets was achieved by translating images into dawn/dusk, foggy, night, overcast, and rainy conditions, as well as rain with 10 mm and 25 mm intensity, as shown in Fig. 12.

Fig. 12

Demonstration of the adaptability of a trained MUNIT network in diverse lighting and weather conditions

4.4 Experimental setup

In this study, we conducted a series of experiments to test how well our proposed model works. To train deep learning models effectively, we used a set of well-known software tools, including Anaconda3, CUDA version 11.3 for faster model computations, cuDNN8 for optimized neural network calculations, and PyTorch 1.11.0 as our main training platform. The hardware consisted of a high-performance Intel Core i9 12900HX CPU and an NVIDIA GeForce RTX 3090 Ti 24 GB GPU.

For our assessment, we used models from the Ultralytics repository version 8.0.138 as our starting point [40]. These models were pretrained on the COCO dataset [56], which is widely used as benchmark data in computer vision tasks. During training, we limited the number of training epochs to a maximum of 500, with the first three epochs dedicated to the initial warm-up. To improve the learning process, we employed SGD optimization with an initial learning rate of 0.001.

The images from the pothole dataset often contain many small objects. To balance real-time performance and detection accuracy, we standardized the input image size to 1088 px. This size ensures that our models can be used efficiently on devices with limited resources without losing important image details. Furthermore, we kept the settings for the various parameters consistent across all training processes to ensure that our results could be compared accurately. The key parameter settings for the training process are summarized in Table 2.

Table 2 Summary of parameter settings
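For reproducibility, a training run with the settings quoted above could be launched through the Ultralytics API roughly as follows; the dataset description file is a hypothetical placeholder and any argument not listed keeps its default value.

    from ultralytics import YOLO

    model = YOLO("yolov8l.pt")        # COCO-pretrained starting point from the Ultralytics repository
    model.train(
        data="pothole.yaml",          # hypothetical dataset description file (train/val paths, class names)
        epochs=500,                   # upper limit on training epochs used in this study
        warmup_epochs=3,              # initial warm-up
        optimizer="SGD",
        lr0=0.001,                    # initial learning rate
        imgsz=1088,                   # standardized input image size
    )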

5 Results and discussion

In this section, we present the results of our comprehensive analysis of pothole detection using SOTA architectures, such as YOLOv8, RT-DETR, and our modified YOLOv8. YOLOv8 and RT-DETR were specifically chosen due to their prominent performance in the field of object detection. One of the objectives of this research is to enhance the diversity of the available dataset to bolster the robustness of detection models. We evaluate two training conditions, with and without data augmentation using GANs, to demonstrate the effectiveness of the proposed approach across different scenarios. Our focus is also on improving the detection of small objects, particularly potholes, under adverse visual conditions (rain, sunset, evening, and night). The integration of efficient multiscale processing and visual attention mechanisms into the YOLOv8 model is therefore pursued with the intention of enhancing model accuracy. We discuss the impact of modifications made to YOLOv8, such as ADASPP and CST/GAM blocks for enhanced context extraction and DWConv for parameter reduction.

5.1 Performance comparison of pothole detection

Table 3 presents the pothole detection accuracy achieved by the different models under the evaluated conditions. The precision, recall, mAP@0.5, and mAP@[0.5:0.95] scores were evaluated. We investigated the impact of generative-based augmentation on model performance. Incorporating augmentation into the training process led to improvements in all evaluation metrics for each model. Notably, the evening and nighttime subsets showed the most significant improvement, with up to an 11% and 19% increase in mAP@0.5, respectively. The least significant change was observed in the rain and sunset subsets, with a 1–3% change in mAP@0.5 for the rain subset and a 0–0.7% change for the sunset subset. It is worth noting that the expected improvement for the rain subset did not fully materialize despite augmenting the data for rainy conditions.

Table 3 Pothole detection accuracy for different models and visual conditions

The best results among all models and conditions are indicated in bold. The proposed modification to YOLOv8 achieved the highest mAP@0.5 and mAP@[0.5:0.95] scores on the clear data subset, with values of 0.867 and 0.391, respectively, outperforming both the original YOLOv8 and RT-DETR. The incorporation of the proposed modification in YOLOv8 yielded an 8.4% improvement pre-augmentation and a 5.3% improvement post-augmentation when compared to the original YOLOv8. In terms of detection accuracy under adverse conditions, the proposed model demonstrated a pre-augmentation improvement of 4.2%, 2.1%, 3.6%, and 3.4% (averaging 3.3%) in the rain, sunset, evening, and night subsets, respectively. Post-augmentation, the model achieved improvements of 6.3%, 2.7%, 3.4%, and 2.9% (averaging 3.8%) in the same subsets.

Figure 13 compares the performance of models under various adverse test conditions, including rain, sunset, evening, and night. It also provides a clearer illustration of how data augmentation impacts detection accuracy for each specific data subset. The impact of adverse visual conditions on detection accuracy can be indirectly assessed by examining subsets where conditions other than clear weather are present, which consistently results in reduced mAP@0.5 scores for all models. The artificially generated data notably enhanced detection performance under conditions of low visibility, such as during the evening and night. It is also evident that the self-attention induced by transformers and global attention blocks is beneficial for detecting small objects under adverse conditions.

Fig. 13

Performance comparison of YOLOv8l, RT-DETR-l and modified YOLOv8l under adverse visual conditions and with/without data augmentation

The radar chart (see Fig. 14) visualizes the parameters of the tested architectures using augmented data. In terms of accuracy, the modified YOLOv8 has reached the level of RT-DETR-l, outperforming the original YOLOv8. In terms of model size and computational requirements, the RT-DETR-l model stands out due to its low number of parameters (32.8 million), low computational cost (125.1 GFLOPs), and the fastest inference, approximately 50 FPS. The parameters of the proposed modification to YOLOv8 (41.15 M parameters, 159.8 GFLOPs, ~35 FPS) closely follow those of the original architecture (43.6 M parameters, 165.4 GFLOPs, ~34 FPS).

Fig. 14

Comparison of parameters and performance among models

Figure 15 compares the performance of the models under various adverse test conditions, including rain, sunset, evening, and night. The same axis scales are kept to allow a visual comparison of the differences between individual subsets. It is evident that the self-attention induced by the transformer and global attention blocks is beneficial for detecting small objects under adverse conditions. In our experiments, both the modified YOLOv8 and RT-DETR are characterized by improved accuracy in poor visibility.

Fig. 15

Comparison of model performance under different adverse conditions

5.2 Ablation study

In our ablation study, we aimed to understand the contribution of each component to the performance of the modified YOLOv8 model. We started with the YOLOv8-P2 architecture, which already implements an additional detection head for small objects. The following components were then progressively added to the model: DWConv blocks, ADASPP, GAM, and CST blocks. Table 4 summarizes the results of the ablation experiments.

Table 4 Results of ablation experiment on modified YOLOv8 model

Summarizing the subcategory and overall results in Table 4, the following conclusions can be drawn:

  • Although the YOLOv8-P2 utilizes an additional detection head for very small objects, it has fewer parameters than YOLOv8 (43.6 million parameters). This reduction is achieved through a decrease in convolutional filters within the detection layers.

  • The addition of DWConv blocks in the YOLOv8-P2 backbone reduced model parameters without a significant decrease in performance. This demonstrates the efficiency of DWConv in reducing model complexity while maintaining accuracy.

  • The introduction of ADASPP further improved the model's accuracy, especially in terms of precision and recall. ADASPP enhances the model's ability to capture image context at different scales, which is crucial for small object detection.

  • The CST and GAM blocks, designed to extract both local and global context from feature maps, contributed to significant improvements in precision, recall, and mAP@0.5. These blocks help the model better understand the spatial relationships within the image, aiding in accurate object localization.

  • When all modifications were incorporated, the modified YOLOv8 model achieved a final accuracy that closely matched that of the RT-DETR transformer detector. This indicates that our approach effectively addresses the challenges posed by small object detection.

The detection outcomes are illustrated in Fig. 16. As shown, the proposed model enhances the detection capabilities of the original YOLOv8. The performance of the modified YOLOv8 is very similar to that of RT-DETR. However, the modified YOLOv8 shows better performance in detecting distant potholes during rainy and sunset conditions, identifying more potholes than RT-DETR. Furthermore, the most notable improvement is observed in the night example, where the modified YOLOv8 successfully detects potholes, surpassing RT-DETR in this specific instance.

Fig. 16

Analysis of the detection performance of YOLOv8, RT-DETR, and modified YOLOv8 under varied environmental conditions

This study offers insights for developing automated pothole detection systems using GAN-generated synthetic data, which can be applied to enhance road safety and efficiency. Trained deep-learning algorithms can accurately identify potholes in various weather conditions, which could aid in road maintenance and reduce accidents and vehicle damage. The approach could also be expanded to detect other road defects like cracks and bumps. Despite the enhanced robustness of the model, potential biases inherent in the area-specific dataset could affect the model's performance in real-world scenarios. Moreover, differences in road categories might affect the fidelity of the generated images. For example, the scenery of buildings closely encompassing a road in the training images can contribute to the generation of artifacts when depicting an open rural road. Therefore, for the chosen road category, appropriate data would have to be selected when creating the synthetic images. In future experiments, the proposed work could be enhanced by utilizing a dataset that includes a balanced representation of various road types and geographic locations.

6 Conclusion

Potholes present a persistent road hazard, often resulting in accidents and vehicle damage. However, detecting potholes in poor visibility conditions is a challenging task that requires innovative solutions. In this study, we assessed the accuracy and efficiency of detection models, such as YOLOv8, RT-DETR, and our modified version of YOLOv8, under degraded visibility conditions. We adopted two approaches to address the impact of poor visibility on detection accuracy. First, we enriched the dataset by incorporating artificially generated images utilizing MUNIT for I2I transfer, thereby bolstering the models' robustness. This sevenfold increase in data yielded improved results, especially in low-light scenarios captured during evenings and nights, with gains of up to 11% and 19% in mAP@0.5 across all models. In most cases, the augmentation improved the performance of the models under varying visual conditions.

Our second approach involved the incorporation of self-attention mechanisms through transformer modules and global attention within the YOLOv8 detection pipeline. This architectural refinement, together with an additional detection head for very small objects and a multi-scale feature module called ADASPP, also targeted the challenge of identifying potholes under degraded visibility. The ablation study demonstrated that all included modules contribute to enhancing the detection accuracy of the model. When compared to the original YOLOv8, the implementation of the suggested modifications led to an 8.4% increase in accuracy (mAP@0.5) prior to the application of synthetic augmentation.

The proposed modification to YOLOv8 delivered accuracy on par with the detection transformer RT-DETR, all while keeping the same level of computational efficiency as the original YOLOv8. The proposed model achieved mAP@0.5, precision, and recall of 0.867, 0.909, and 0.796, respectively, while also maintaining a real-time inference speed of approximately 35 FPS on images with an input size of 1088 pixels. It is important to acknowledge that the RT-DETR-l model tested in this study surpasses ours in terms of model size and computational complexity. Therefore, in future experiments, we will focus on exploring innovative techniques to enhance the computational efficiency of the proposed architecture.

The limitation of this study lies in the fact that MUNIT inherently demands significant computational resources. In addition, the proposed method necessitates training a model for each condition individually, which is time-consuming. To address these challenges, future work could explore optimizing the architecture or employing more efficient training methods to reduce the computational burden. Additionally, investigating the trade-offs between model complexity and performance on larger datasets could provide valuable insights into making these techniques more scalable and applicable in various contexts. Our future efforts will also focus on exploring innovative approaches to image augmentation by leveraging synthesized images. Moreover, we will consider integrating these models with existing road infrastructure management systems to provide real-time feedback to traffic management systems, assisting in monitoring and improving road safety.