1 Introduction

In the modern era, Advanced Driver Assistance Systems (ADAS) play an important role in helping drivers by providing information about the surrounding environment. Object detection is an important feature of such systems. However, erroneous street object detection in rain, or when the vehicle’s front camera is blurred, can cause severe accidents. Street object detection recognizes object categories and estimates the exact location of each object by providing a bounding box in each image. Efficient image processing techniques are essential to translate an input image into a targeted output image [1]. In traditional methods, the output image is obtained by stitching pieces from a database of images. In recent works, however, the mapping is learned directly using machine learning (ML) techniques. Since these methods require no external database of images, computation is often faster than in traditional methods [2].

Recent advances in deep learning based computer vision techniques have shown the power and promise of semantic image synthesis [1, 2]. Semantic image synthesis is a technique for creating a photorealistic image from a semantic label map [1,2,3]. Existing techniques have shown promising results when using semantic information as a guide. However, some issues remain, particularly in the production of small-scale objects, owing to the loss of spatial resolution associated with deep network operations such as convolution, normalization, and down-sampling. In conditional image synthesis, photorealistic images are generated by conditioning on ground truth input images and hence exhibit more flexibility [1]. In this work, one of our focuses is a specific form of conditional image synthesis, which transforms a semantic segmentation map into a photorealistic image in a more realistic way.

Various methods have been proposed in the literature to detect street objects [4,5,6,7,8,9,10,11]. However, the detection accuracy is still unsatisfactory for complex backgrounds, rain, or a blurred vehicle camera. Besides, at night, not all objects in an image can be detected because of the low background brightness. Although Faster R-CNN [7] and MobileNetSSDv2 [11] models are popularly used for street object detection, there is still scope to improve their performance. Recent studies on generative adversarial network based photorealistic images show power and promise in object detection applications [12, 13]. Motivated by these facts, in this research a deep learning based framework is proposed and investigated for street object detection from synthesized and processed semantic images. We generate photorealistic images from semantic images using a conditional GAN (CGAN). We then emphasize improving the quality of the generated images using neural style transfer [14] and the ESRGAN [3] concept to increase the object detection accuracy of street object detectors such as the Faster R-CNN and MobileNetSSDv2 models. The main contributions of this research are as follows:

  • A deep learning based framework is proposed and investigated for street object detection from synthesized and processed semantic images.

  • A customized CGAN model is used to generate photorealistic images from semantic images. The brightness of the generated images is increased using the neural style transfer concept, and the resolution of the style-transferred images is improved using ESRGAN based super-resolution to enhance sharpness and texture recovery. To the best of our knowledge, this integrated form of synthesized and processed semantic images might not have been used before as input for street object detection.

  • The objects are detected from the synthesized and processed images using two separate popular street object detectors, namely the Faster R-CNN and MobileNetSSDv2 models.

  • A rigorous comparative performance analysis is conducted between the Faster R-CNN and MobileNetSSDv2 based versions of the proposed framework, as well as against recent techniques, considering both processed and unprocessed input images. An mAP of around 32.6% is obtained for the Cityscapes dataset using the proposed framework, whereas in most cases mAPs in the range of 20–28% are reported for this dataset.

2 Related Works

Different approaches have been reported for image synthesis using statistical distribution concepts, variational autoencoder (VAE) models, GANs, diffusion models, etc. [1,2,3, 15,16,17,18]. Compared with statistical distribution or VAE based synthetic data training methods, the GAN is considered effective for our image generation task as it provides an opportunity to modify the output samples during training. Thus, more photorealistic images can be generated, which further helps the detectors identify objects more accurately at their exact positions. In contrast, statistical distribution based synthetic data training methods carry a risk of overfitting. Furthermore, the supervised learning of GANs, with the benefit of generator–discriminator synchronization, is expected to capture more complex insights of the input and generate more realistic data than unsupervised learning based VAEs [19]. Both GANs and diffusion models have proven their power in generating high-quality images, although their underlying architectures and training approaches are different. These models have become competitors in the field of generative modeling, each with specific advantages and uses. Diffusion models offer a natural framework for data synthesis, and their training process is significantly more stable [20]. However, compared to GANs, stable diffusion models are more computationally complex and need longer training durations with careful hyper-parameter tuning. Furthermore, stable diffusion models usually excel at generating images conditioned on text descriptions, inpainting, outpainting, and image-to-image translation guided by a text prompt, which are quite different from our image synthesis objectives. In contrast, the suitability and applications of conditional GANs are well matched with our objectives. That is why we have chosen the conditional GAN for the semantic image synthesis task in this study.

Traditional object detection techniques typically rely on hand-crafted features. There are three primary steps: locating proposals, extracting proposal features, and classification. In the proposal location step, the objective is to search for areas in the image that may or may not contain objects, using techniques like selective search [21], edge search [22], and ideas like multi-scale combinatorial grouping and boxing [23]. To acquire multiscale information, input images are rescaled to different sizes with various object aspect ratios. During feature extraction, multiscale windows slide through these images, and the object’s location is collected from the sliding window to capture semantic information. The features are usually extracted from low-level visual descriptors such as color, texture, and shape [24, 25]. The categorical classification is performed by traditional ML classifiers such as support vector machines (SVM) [26] and AdaBoost [27]. Although traditional techniques performed admirably on a number of benchmark datasets, these methods still have significant shortcomings for practical applications. In the proposal generation step, a huge number of proposals are created, many of them redundant; during classification, these create a lot of false positives. Moreover, feature descriptors are meticulously hand-crafted from low-level visual information, which makes capturing semantic information harder under complex circumstances.

Street object detection techniques based on CNNs are becoming more popular owing to their proven performance. The existing approaches are mainly divided into two groups: (i) proposal-based approaches, such as R-CNN [5], Fast R-CNN [6], and Faster R-CNN [7], and (ii) proposal-free approaches, such as You Only Look Once (YOLO) [8] and the Single Shot Multi-Box Detector (SSD) [9]. Comparatively, CNN models take longer because a huge number of parameters need to be trained. The YOLO detector, on the other hand, is ineffective at identifying dense multiple targets and huge objects [10]. Hence, Faster R-CNN has gained research attention for object detection. Girshick et al. [5] suggested an approach that employs selective search to extract just 2000 regions from the image, termed region proposals, in order to avoid processing a very large number of regions. However, training the network still requires a significant amount of time, since 2000 region proposals per image must be evaluated. Moreover, selective search is a fixed algorithm with no opportunity for learning, which may result in poor candidate region proposals. Thereafter, Girshick [6] introduced a new strategy known as Fast R-CNN, in which the input image is fed to the network to produce a convolutional feature map instead of feeding region proposals. Hence, Fast R-CNN takes less time to detect objects, as there is no need to repeatedly feed 2000 region proposals to the CNN; instead, the convolution is performed only once per image to produce a feature map. Later, Ren et al. [7] developed the Faster R-CNN method, which removes the selective search strategy and enables the model to learn the region proposals, since time-consuming procedures affect network performance. The image is fed into a CNN model, similar to Fast R-CNN, to create a feature map. A separate network is used to predict the region proposals instead of applying a selective search algorithm over the convolutional feature map. To classify the image inside the proposed region and estimate the offset values for the bounding boxes, the inferred region proposals are reshaped using a pooling layer. Beside the R-CNN methods, the SSD detector is better suited to identifying a large number of objects of various scales and types in the shortest possible time, since feature maps of various scales are considered. One of the fundamental parts of autonomous driving assistance systems is real-time object detection that is well suited to embedded systems with limited computational capacity. In the majority of methods, the requirement to reduce computational complexity is normally neglected [11], and they require a powerful GPU for real-time detection. However, Chiu et al. [11] presented a lightweight object detection model called MobileNetSSDv2. To further enhance detection accuracy and stability, they additionally use a feature pyramid network (FPN) with the proposed object detection model. Recent object detectors such as deformable DETR and the fully convolutional one-stage detector (FCOS) show remarkable performance in object detection [28, 29]. In a study by Wang et al. [30], the deformable DETR algorithm performed well on the COCO dataset, whereas Faster R-CNN showed better results on the Cityscapes dataset.
Therefore, the Faster R-CNN detector might be suitable for the object detection task in our proposed framework, targeting improved accuracy. Besides, for a rigorous comparative study, we also test another popular detector, MobileNetSSDv2, for its lower computation time. In this study, we use the Faster R-CNN and MobileNetSSDv2 object detectors to investigate the effect of synthesized and processed semantic images on detection accuracy relative to their unprocessed counterparts, and we compare the proposed framework’s performance with other recent street object detection techniques.

3 Methodology

The framework of our proposed method is shown in Fig. 1. It consists of two modules: (a) realistic image generation using CGAN and (b) object detection. The realistic image is generated from the semantic image by the CGAN generator, and we use PatchGAN [1] as the discriminator to identify real/fake images and give feedback to the generator. In module (b), we use the neural style transfer concept to improve the brightness of the generated images in order to improve object detection accuracy. Then resolution enhancement and texture recovery are performed using ESRGAN. Finally, we use the Faster R-CNN method to detect the objects.

Fig. 1 Proposed framework of deep learning based street object detection system

3.1 Generation of Realistic Image Using CGAN

Generative Adversarial Networks (GANs) have proven capable of creating photorealistic images [10, 31, 32] that are almost indistinguishable from the originals. A GAN [33] has a generator and a discriminator part. Figure 2 shows both parts of the GAN and CGAN, which are trained in an adversarial way to reach a balance in the network.

Fig. 2 General architecture of GAN and CGAN

The generator G is trained to create an output image that is indistinguishable from the ground truth real image by an adversarially trained discriminator D, which is trained to detect the generator’s fakes. Thus, GANs learn a loss function that tries to decide whether the generated image is real or fake, while concurrently training a generative model to minimize this loss. A conditional GAN instead learns a structured loss, which penalizes the joint configuration of the output. GANs learn a mapping \(\text{G}:\text{Z}\to \text{Y}\) from a random noise vector Z to an output image Y. Conditional GANs, on the other hand, learn a mapping from an observed image X and a random noise vector Z to Y, \(\text{G}: \{\text{X},\text{Z}\} \to \text{Y}\). Figure 2 shows the general architecture of the GAN and CGAN. The main difference between the two models is that the CGAN uses real data, noise, and labels c to generate images, whereas the GAN does not use labels.

3.1.1 Generator Model

A conditional GAN’s goal can be stated by a loss function as follows:

$${\mathcal{L}}_{cGAN}\, (G, D) = {E}_{x,y} [log\, D(x, y)] + {E}_{x,z} [log(1 - D(x, G(x, z))]$$
(1)

where the generator G produces an output image G(x, z) from the input image x and noise vector z that should be close to the ground truth image y, while the discriminator D compares the pair (x, y) against (x, G(x, z)) and identifies each image as real or fake. G seeks to minimize this loss, whereas the adversarial discriminator D tries to maximize it, i.e. \({G}^{*} = arg\, {min}_{G}\, {max}_{D}\, {\mathcal{L}}_{cGAN}\left(G, D\right).\)

Different loss functions are used to measure the similarity between the generated image and the real image, and hence the effectiveness of the loss function is crucial for the GAN’s performance. The L2 loss is often used in GAN based image synthesis. In an L2 loss sense, the generator must not only mislead the discriminator but also stay close to the ground truth output. However, the L2 loss performs poorly when there are anomalies in the dataset, since squaring the differences results in a considerably higher error when the sample contains outliers. The L1 loss is therefore preferred, as it is less affected by outliers. We exploit this, using the L1 distance rather than L2 as the criterion, since L1 also encourages less blurring:

$${\mathcal{L}}_{L1} \left( G \right) = E_{x,y,z } \left[\|y - G (x,z)\|_{1}\right].$$
(2)

Our ultimate goal is to:

$$G^{*} = arg\, min_{G}\, max_{D} \, \left[ {{\mathcal{L}}cGAN \left( {G, D} \right) + \lambda {\mathcal{L}}_{L1} \left( G \right)} \right].$$
(3)

Without z, the model could still learn a mapping from x to y, but it would yield deterministic results and fail to fit any distribution other than a delta function [1]. Earlier conditional GANs recognized this and provided Gaussian noise z along with x as input to the generator [34]. We found this method ineffective, as the generator simply learned to disregard the noise [35]. Instead, for our final models we provide noise only in the form of dropout, applied at several layers of the generator during both training and testing. Despite the dropout noise, the output of our networks shows only moderate stochasticity.
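To make the objective concrete, the following is a minimal PyTorch sketch of the combined cGAN + L1 losses in Eqs. (1)–(3); the generator and discriminator modules are assumed to be defined elsewhere, and the function names are ours, not those of the original implementation.

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()  # adversarial loss on raw discriminator logits
l1 = nn.L1Loss()              # L1 reconstruction term of Eq. (2)
LAMBDA = 100                  # weight of the L1 term in Eq. (3)

def generator_loss(disc_fake_logits, fake_img, target_img):
    # The generator tries to make D label its output as real (second term of Eq. 1)
    adv = bce(disc_fake_logits, torch.ones_like(disc_fake_logits))
    rec = l1(fake_img, target_img)          # Eq. (2)
    return adv + LAMBDA * rec               # Eq. (3)

def discriminator_loss(disc_real_logits, disc_fake_logits):
    # D labels (x, y) pairs as real and (x, G(x, z)) pairs as fake (Eq. 1)
    real = bce(disc_real_logits, torch.ones_like(disc_real_logits))
    fake = bce(disc_fake_logits, torch.zeros_like(disc_fake_logits))
    return real + fake
```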

3.1.2 Discriminator Model

On image generation problems, the L2 and L1 losses produce blurry results [36]. Although these losses do not promote high-frequency sharpness, they capture low frequencies well in many circumstances. For such cases, when an L1 term is used we do not need an additional mechanism to enforce low-frequency accuracy. This motivates relying on the L1 term for low-frequency correctness while restricting the GAN discriminator to model only high-frequency structure (Eq. 3). To model high frequencies, it is sufficient to restrict attention to local image patches. As a result, we use a discriminator named PatchGAN that penalizes structure only at the patch scale. The discriminator attempts to determine whether each N × N patch in an image is real or fake. This discriminator is run convolutionally across the image, and the responses are averaged to obtain the final output of D.

We show that N can be substantially smaller than the full image size while still producing high-quality results. This is beneficial because a smaller PatchGAN has fewer parameters, runs faster, and can be applied to images of any size.

3.1.3 Photorealistic Image Generation Using CGAN

We use a CGAN to generate photorealistic images from semantic maps. Figure 3 shows the overall generation process of the photorealistic image. The architectures of the generator and discriminator of our CGAN are based on the concept used in [37]. However, we have customized the architecture considering our problem statement and to obtain optimum performance. Both the generator and the discriminator employ Convolution–BatchNorm–ReLU modules; Convolution–BatchNorm–Dropout–ReLU layers with a 50% dropout rate are also used. All convolutions use 4 × 4 spatial filters with stride 2. Convolutions in the encoder and discriminator down-sample by a factor of two, whereas those in the decoder up-sample by a factor of two. We add skip connections, following the basic structure of a “U-Net” [38], to provide the generator a way to bypass the bottleneck. We create skip connections between each layer i and layer n − i, where n is the total number of layers. Each skip connection simply concatenates all channels at layer i with those at layer n − i.

Fig. 3 Generator and discriminator architecture of CGAN

At first, we provide an input semantic image of size 256 × 256 × 3 to the generator. After decreasing the width and height and increasing the depth over several sequential layers, the encoder output size becomes 4 × 4 × 1024. Since we use a ‘U-Net’ architecture, the encoder layers feed their outputs to the corresponding decoder layers. In the decoder, because of the concatenation, the size changes in the reverse direction, with width and height increasing and depth decreasing. Finally, the generated image has the same size as the input semantic image.
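A minimal PyTorch sketch of such a U-Net generator with skip connections is given below; the depth and channel widths are illustrative (the bottleneck here is 4 × 4 × 512 rather than the 4 × 4 × 1024 mentioned above) and do not reproduce the exact customized architecture.

```python
import torch
import torch.nn as nn

def down(in_ch, out_ch, norm=True):
    # Conv-BatchNorm-LeakyReLU block with 4x4 filters and stride 2 (halves H and W)
    layers = [nn.Conv2d(in_ch, out_ch, 4, stride=2, padding=1, bias=not norm)]
    if norm:
        layers.append(nn.BatchNorm2d(out_ch))
    layers.append(nn.LeakyReLU(0.2))
    return nn.Sequential(*layers)

def up(in_ch, out_ch, dropout=False):
    # TransposedConv-BatchNorm-(Dropout)-ReLU block (doubles H and W)
    layers = [nn.ConvTranspose2d(in_ch, out_ch, 4, stride=2, padding=1, bias=False),
              nn.BatchNorm2d(out_ch), nn.ReLU()]
    if dropout:
        layers.insert(2, nn.Dropout(0.5))   # dropout provides the noise z
    return nn.Sequential(*layers)

class UNetGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        chs = [64, 128, 256, 512, 512, 512]          # illustrative channel widths
        self.encoder = nn.ModuleList()
        prev = 3
        for i, c in enumerate(chs):
            self.encoder.append(down(prev, c, norm=i > 0))
            prev = c
        self.decoder = nn.ModuleList()
        for i, c in enumerate(reversed(chs[:-1])):
            self.decoder.append(up(prev, c, dropout=i < 3))
            prev = c * 2                              # skip concatenation doubles channels
        self.final = nn.ConvTranspose2d(prev, 3, 4, stride=2, padding=1)

    def forward(self, x):
        skips = []
        for enc in self.encoder:
            x = enc(x)
            skips.append(x)
        skips = skips[:-1][::-1]                      # connect layer i to layer n - i
        for dec, skip in zip(self.decoder, skips):
            x = torch.cat([dec(x), skip], dim=1)
        return torch.tanh(self.final(x))
```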

Two inputs are given to the discriminator: one is the input image paired with the target image, which it should classify as real; the other is the input image paired with the generated image (the output of the generator), which it should identify as fake. These two images are concatenated to form a 256 × 256 × 6 tensor in the PatchGAN, and three back-to-back sequential layers drastically reduce the height and width while increasing the depth. A ZeroPadding layer then increases the height and width, giving 34 × 34 × 256. After a Conv–BatchNorm–LeakyReLU block, the output has the final shape 30 × 30 × 1, with each element of the 30 × 30 map classifying a 70 × 70 area of the input image.
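The PatchGAN discriminator described above can be sketched as follows, assuming a PyTorch implementation that mirrors the reported tensor shapes (256 × 256 × 6 input, 30 × 30 × 1 output, roughly a 70 × 70 receptive field per patch score).

```python
import torch
import torch.nn as nn

class PatchGANDiscriminator(nn.Module):
    """Sketch of a 70x70 PatchGAN: each of the 30x30 output logits scores one patch."""
    def __init__(self):
        super().__init__()
        def down(in_ch, out_ch, norm=True):
            layers = [nn.Conv2d(in_ch, out_ch, 4, stride=2, padding=1, bias=not norm)]
            if norm:
                layers.append(nn.BatchNorm2d(out_ch))
            layers.append(nn.LeakyReLU(0.2))
            return nn.Sequential(*layers)
        self.net = nn.Sequential(
            down(6, 64, norm=False),                        # 256x256x6 -> 128x128x64
            down(64, 128),                                  # -> 64x64x128
            down(128, 256),                                 # -> 32x32x256
            nn.ZeroPad2d(1),                                # -> 34x34x256
            nn.Conv2d(256, 512, 4, stride=1, bias=False),   # -> 31x31x512
            nn.BatchNorm2d(512),
            nn.LeakyReLU(0.2),
            nn.ZeroPad2d(1),                                # -> 33x33x512
            nn.Conv2d(512, 1, 4, stride=1),                 # -> 30x30x1 patch logits
        )

    def forward(self, condition, image):
        # Concatenate the semantic input and the real or generated image channel-wise
        return self.net(torch.cat([condition, image], dim=1))
```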

3.2 Object Detection Process

Many deep CNN-based object detectors, such as Faster R-CNN [7], YOLO [8], and SSD [9], have recently been developed and shown to outperform traditional object detectors. However, there is still scope to increase the mAP of these detectors, and we believe that the mAP can be increased by improving the brightness and sharpness of the synthesized semantic images.

3.2.1 Neural Style Transfer to Increase the Brightness of the Generated Images

If the generated images are blurred or the background brightness is poor, not all objects may be detected properly. We use neural style transfer to detect objects properly in such scenarios. Style transfer is a widely popular technique for changing the style of an input image or applying different effects, as in applications like DeepArt [14] or Prisma [39].

To change the style of an image, a style image relevant to the content image must be provided. Here, the content image is our GAN-generated image, and the style image can be a sharp image captured through a car’s windshield with high background brightness, so that all objects can be detected clearly. For the style transfer task, we found that most papers use the popular Visual Geometry Group (VGG) architecture [14]; we also use VGG-19 for our task. VGG can be trained with random weights and biases or with pre-trained weights and biases. Here, style transfer from the style image to the content image is performed one image at a time, rather than transferring style across the whole set of generated images.

Every layer of units may be thought of as a set of image filters that extract a particular characteristic from the source image. As a result, the output of a particular layer consists of so-called feature maps. As the input content image moves up the network’s processing hierarchy, it is converted into representations that care more about the actual content of the image than its exact pixel values. The feature responses in the higher layers of the network are therefore referred to as the content representation. Here, the conv4_2 layer of the pre-trained VGG-19 network is used as the content analyzer. We employ correlations between the various filter responses over the spatial extent of the feature maps to produce a description of the style of the style image. The input image is thereby converted into a stationary, multi-scale representation that captures texture information. The Gram matrix is a representation of the correlations between feature maps. Layers 1 through 5 are used to calculate the Gram matrices, with a different style weight constant for each layer. These constants may be thought of as hyperparameters that adjust the amount of style.

Figure 4 shows the neural style transfer process using the VGG-19 model for brightness improvement. At first, the features related to the content and the style are retrieved and saved. The style image, a, is passed through the model, and all of the included layers compute and store its style representations AL. The content image, p, is passed through the model, and the content representation Pl is saved in one layer. The network then computes the style features GL and content features FL of a random white-noise image x. The element-wise mean squared difference between GL and AL on each layer included in the style representation is calculated to give the style loss, Lstyle. To determine the content loss, Lcontent, the mean squared difference between FL and Pl is also calculated. The total loss, Ltotal, is then the sum of the content loss and the style loss. Using error back-propagation, its derivative with respect to the pixel values is calculated. If an image with higher brightness is used as the style image, the brightness of the generated picture is increased.
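A minimal sketch of this loss computation with torchvision’s pre-trained VGG-19 is given below; the chosen layer indices correspond to conv1_1–conv5_1 (style) and conv4_2 (content), and the per-layer style weights and the overall style/content weighting are illustrative hyperparameters rather than the exact values used in our experiments.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg19

# Indices into vgg19().features: conv1_1, conv2_1, conv3_1, conv4_1, conv5_1 (style)
# and conv4_2 (content); the weights per style layer are illustrative.
STYLE_LAYERS = {0: 1.0, 5: 0.8, 10: 0.5, 19: 0.3, 28: 0.1}
CONTENT_LAYER = 21

vgg = vgg19(weights="DEFAULT").features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def extract_features(img):
    # img: ImageNet-normalized (1, 3, H, W) tensor
    feats, x = {}, img
    for idx, layer in enumerate(vgg):
        x = layer(x)
        if idx in STYLE_LAYERS or idx == CONTENT_LAYER:
            feats[idx] = x
    return feats

def gram_matrix(feat):
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)   # correlations between feature maps

def total_loss(generated, content_img, style_img, style_weight=1e6, content_weight=1.0):
    # `generated` is the image being optimized (requires_grad=True)
    g = extract_features(generated)
    c = extract_features(content_img)
    s = extract_features(style_img)
    content_loss = torch.mean((g[CONTENT_LAYER] - c[CONTENT_LAYER]) ** 2)
    style_loss = sum(w * torch.mean((gram_matrix(g[i]) - gram_matrix(s[i])) ** 2)
                     for i, w in STYLE_LAYERS.items())
    return content_weight * content_loss + style_weight * style_loss
```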

Fig. 4 Neural style transfer using VGG19 architecture

3.2.2 Resolution Improvement and Texture Recovery Using ESRGAN

The enhanced super-resolution concept is used to increase the sharpness of the GAN-generated and neural style transferred images. It also helps recover the original texture from the semantic image. Moreover, it reduces undesired artifacts that could lead the network to wrong detections. We use image interpolation to convert our image from one pixel grid to another. We adopt the concept of Enhanced Super-Resolution Generative Adversarial Networks (ESRGAN) [3, 40], which improves the sharpness of the generated image.

Figure 5 shows the basic architecture of the resolution improvement process. The basic blocks are usually residual blocks [41], dense blocks [42], or residual-in-residual dense blocks (RRDB) as reported in [3]. We use the RRDB as the basic block since it makes the GAN model easier to train. In the GAN training phase, batch normalization creates artifacts in deeper networks, which might cause wrong object detections. Therefore, we do not use batch normalization, which provides more stable and improved performance while also reducing computation time and storage space. Figure 6 shows the strategy of replacing the batch normalization layer with the RRDB in the basic SRGAN. This type of dense network increases the performance of the GAN. We use these dense blocks in the main path to take full advantage of the network. Instead of trying to distinguish between a genuine image and a created image, our discriminator attempts to estimate the likelihood that one image is more realistic than the other. Because the discriminator feeds this information back to the generator in a timely manner, the generator is able to construct more photorealistic images, and realistic feature details can also be recovered with its aid. Consequently, both the created picture and the ground truth image provide information to our generator, and sharper edge details are created.
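A minimal sketch of the BN-free residual-in-residual dense block is shown below; the channel width, growth rate, and residual scaling factor follow common ESRGAN conventions and are assumptions rather than our exact configuration.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Five 3x3 convolutions with dense connections and no batch normalization."""
    def __init__(self, ch=64, growth=32):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv2d(ch + i * growth, growth if i < 4 else ch, 3, padding=1)
            for i in range(5)])
        self.lrelu = nn.LeakyReLU(0.2)

    def forward(self, x):
        feats = [x]
        for i, conv in enumerate(self.convs):
            out = conv(torch.cat(feats, dim=1))   # dense connection: concat all previous outputs
            if i < 4:
                out = self.lrelu(out)
            feats.append(out)
        return x + 0.2 * feats[-1]                # residual scaling

class RRDB(nn.Module):
    """Residual-in-residual dense block used in place of BN-based residual blocks."""
    def __init__(self, ch=64):
        super().__init__()
        self.blocks = nn.Sequential(DenseBlock(ch), DenseBlock(ch), DenseBlock(ch))

    def forward(self, x):
        return x + 0.2 * self.blocks(x)
```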

Fig. 5 Basic architecture of the resolution improvement process

Fig. 6 Replacing BN with RRDB

Resizing is used to increase or decrease the total number of pixels in an image. This can be done by image interpolation, which maps the image from one pixel grid to another. It reduces unwanted artifacts and noise from the GAN architecture and also increases perceptual quality. A peak signal-to-noise ratio (PSNR) oriented model provides smoother results with low-frequency details. We first train the PSNR-based model and then fine-tune the GAN-oriented model. Finally, the corresponding weights of these two models are interpolated:

$$\varTheta_{G}^{INTERP} = \left( 1-\alpha \right)\varTheta_{G}^{PSNR} + \alpha\, \varTheta_{G}^{GAN}$$
(4)

where \(\varTheta_{G}^{INTERP}\), \(\varTheta_{G}^{PSNR}\), and \(\varTheta_{G}^{GAN}\) are the parameters of the interpolated, PSNR-oriented, and GAN-oriented networks, respectively, and α is the interpolation parameter. This balances the two models. If we had interpolated the output images instead of the models’ weight parameters, the results would have been blurry with undesired artifacts [3].
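A minimal sketch of this network interpolation (Eq. 4) over saved checkpoints might look as follows; the checkpoint file names are hypothetical.

```python
import torch

def interpolate_weights(psnr_ckpt_path, gan_ckpt_path, alpha):
    """Blend the PSNR-oriented and GAN-oriented generator weights as in Eq. (4)."""
    psnr_state = torch.load(psnr_ckpt_path, map_location="cpu")
    gan_state = torch.load(gan_ckpt_path, map_location="cpu")
    return {k: (1 - alpha) * psnr_state[k] + alpha * gan_state[k] for k in psnr_state}

# alpha is swept from 0 to 1 in steps of 0.2 (see Sect. 4); file names are placeholders
# interp_state = interpolate_weights("rrdb_psnr.pth", "rrdb_gan.pth", alpha=0.6)
# generator.load_state_dict(interp_state)
```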

3.2.3 Faster R-CNN Model for Object Detection

We use the generated and processed images as input to the Faster R-CNN [7] model for object detection to investigate the performance of the detector. The Faster R-CNN is made up of two parts, as shown in Fig. 7: a deep convolutional neural network identifies the candidate object regions in the first module, and the Fast R-CNN detector classifies the proposed regions in the second module. The Faster R-CNN’s innovation is that it replaced the slow selective search approach with a fast neural network, introducing the Region Proposal Network (RPN) in the process. An image is input to the RPN, which generates a series of rectangular object proposals, each with an objectness score.

Fig. 7 Structure of Faster R-CNN

As shown in Fig. 7, the Faster R-CNN two-stage detector consists of three main parts: shared bottom convolutional layers, a region proposal network (RPN), and a region-of-interest (ROI) based classifier. Initially, a feature map is generated from the input image by the shared bottom convolutional layers. Based on that, the RPN creates candidate object proposals, and using a feature vector obtained by ROI pooling, the ROI-wise classifier predicts the category label. The RPN loss and ROI classifier loss are aggregated into the training loss:

$$L_{det} = L_{rpn} + L_{roi} .$$
(5)

Furthermore, both the RPN and ROI classifier losses include two terms: a classification loss that measures the accuracy of the prediction, and a regression loss on the box coordinates for improved localization.
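As an illustration of how these terms combine in practice, the sketch below uses torchvision’s off-the-shelf Faster R-CNN (not the exact model trained in this study): in training mode it returns the RPN and ROI classification/regression losses, whose sum corresponds to Eq. (5).

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

model = fasterrcnn_resnet50_fpn(weights="DEFAULT")  # ResNet-50 + FPN backbone
model.train()

images = [torch.rand(3, 600, 800)]                  # a dummy processed input image in [0, 1]
targets = [{"boxes": torch.tensor([[100., 120., 300., 350.]]),
            "labels": torch.tensor([1])}]

loss_dict = model(images, targets)
# loss_dict holds loss_objectness / loss_rpn_box_reg (RPN) and
# loss_classifier / loss_box_reg (ROI head); Eq. (5) is their sum.
loss_det = sum(loss_dict.values())
loss_det.backward()
```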

We simultaneously predict multiple region proposals at each sliding-window position, where k is the maximum number of proposals that may be made at each position. The k proposals are parameterized relative to k reference boxes called anchors. We employ three scales and three aspect ratios, resulting in k = 9 anchors at each sliding position.

The anchor-based approach is built on a pyramid of anchors and is therefore more economical. It classifies and regresses bounding boxes with respect to anchor boxes of different sizes and aspect ratios, while using filters (sliding windows) of a single size and images and feature maps of a single scale. We examined the effectiveness of this technique for handling various scales and sizes through experiments.
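A minimal sketch of generating the k = 9 reference anchors from three scales and three aspect ratios is shown below; the specific scale and ratio values are illustrative assumptions.

```python
import itertools
import torch

def make_anchors(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Generate the k = 3 x 3 = 9 reference box shapes (w, h) used at each sliding position.

    Each anchor has area scale**2 and aspect ratio h/w equal to `ratio`.
    """
    anchors = []
    for scale, ratio in itertools.product(scales, ratios):
        h = scale * (ratio ** 0.5)
        w = scale / (ratio ** 0.5)
        anchors.append((w, h))
    return torch.tensor(anchors)

print(make_anchors().shape)  # torch.Size([9, 2])
```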

3.2.4 MobileNetSSDv2 Model for Object Detection

MobileNet-v2 is usually used as the backbone network in most lightweight network configurations. We use our generated and processed images as input to the MobileNetSSDv2 architecture [11] for object detection to investigate the performance of this detector. MobileNet-SSDv2’s design, shown in Fig. 8, contains a standard convolutional layer and 17 inverted residual modules. Each inverted residual module includes a 1 × 1 convolutional layer, a 3 × 3 depth-wise (Dwise) separable convolutional layer, batch normalization (BN), and ReLU6 activation functions, as illustrated in Fig. 8. The output feature subset is combined with the input matrix without being scaled down. The benefit of the inverted residual structure is that it effectively prevents gradient vanishing, allowing gradient information to be accurately transferred to the deeper network layers and resulting in improved training during backpropagation.
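A minimal sketch of such an inverted residual module is given below; the expansion factor and layer ordering follow the standard MobileNet-v2 convention and are assumptions rather than the exact configuration of the deployed network.

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """MobileNetV2-style inverted residual: 1x1 expand -> 3x3 depth-wise -> 1x1 project."""
    def __init__(self, in_ch, out_ch, stride=1, expand=6):
        super().__init__()
        hidden = in_ch * expand
        self.use_skip = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 1, bias=False), nn.BatchNorm2d(hidden), nn.ReLU6(),
            nn.Conv2d(hidden, hidden, 3, stride=stride, padding=1,
                      groups=hidden, bias=False),                               # depth-wise conv
            nn.BatchNorm2d(hidden), nn.ReLU6(),
            nn.Conv2d(hidden, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch),   # linear projection
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_skip else out   # combine with the input without down-scaling
```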

Fig. 8 Structure of MobileNetSSDv2 model

In the standard MobileNetSSD, a 1/16-scale convolution operation down-samples the extracted features, and detection of tiny objects in the image is weak and unstable at this level. Instead, we retrieve image features at 1/16 and 1/32 scales in our network design to address this problem. After the backbone network, four inverted residual modules are added to extract feature maps at scales of 1/64, 1/128, 1/256, and 1/512. Finally, the FPN module enhances these local features at various scales.

4 Experiments and Results

We have used the Cityscapes [43] dataset, a large-scale dataset containing a diversified set of stereoscopic video sequences collected in outdoor scenes from 50 major cities, with 5000 frames of high-quality pixel-level annotations and a larger set of 20,000 frames with coarse annotations. As a result, the dataset is substantially larger than earlier efforts. The project website contains information on the annotated classes as well as samples of the annotations. Some key features of the Cityscapes dataset are: (i) images from 50 different cities, (ii) captured in different seasons of the year (summer, fall, etc.), (iii) images with low- and high-light backgrounds, (iv) varying weather conditions, and (v) static and randomly moving objects.

These diverse classes make the dataset suitable for street object detection, and many researchers use Cityscapes because of this diversity. It consists of 30 classes grouped into 8 categories, as listed in Table 1.

Table 1 Group corresponding classes of Cityscape dataset

The selection of this particular dataset has two reasons. Firstly, datasets like COCO-Stuff [44] and Pascal VOC [45] have many unsuitable labels, such as baseball bat, glove, fruits, foods, furniture, animals like zebra and giraffe, clothes, etc., which are irrelevant for this task. Secondly, Cityscapes uses a semantics-maintaining labeling scheme that seeks to focus class-dependent feature maps on the input label in order to provide semantically consistent images. For instance, vehicles like car, truck, bus, motorcycle, bicycle, and caravan share the same semantic color. Therefore, the generator should concentrate on learning from all semantic classes rather than learning particular semantic labels, which, to the best of our knowledge, is usually overlooked in other datasets in general GAN-based generation efforts.

Although it produces substantially sharper results, the CGAN alone, i.e. with λ = 0 in Eq. (3), presents visual artifacts in some applications. These artifacts are diminished when both terms are combined with λ = 100. The model is trained for 100 epochs. During training, we print a dot every ten steps to show progress, generate images after refreshing the display every 1000 steps, and save a checkpoint every 5000 steps. We design a function to import an image that limits the maximum dimension to 512 pixels during neural style transfer. To observe the results of our optimization, we must first perform an inverse preprocessing step, and the pixel values must remain within the range 0–255 [46].
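A small helper of the kind described, limiting the longest side to 512 pixels and clipping the optimized output back into the 0–255 range, might look as follows; the function names are our own.

```python
import numpy as np
from PIL import Image

def load_image(path, max_dim=512):
    """Load an image and scale its longest side down to max_dim pixels."""
    img = Image.open(path).convert("RGB")
    scale = max_dim / max(img.size)
    if scale < 1.0:
        img = img.resize((round(img.width * scale), round(img.height * scale)),
                         Image.BICUBIC)
    return np.asarray(img, dtype=np.float32) / 255.0   # network input in [0, 1]

def deprocess(optimized):
    """Inverse preprocessing: map the optimized array back to valid 0-255 pixels."""
    return np.clip(optimized * 255.0, 0, 255).astype(np.uint8)
```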

All super-resolution experiments are carried out with a ×4 scaling factor between low-resolution (LR) and high-resolution (HR) images, following ESRGAN [3]. We generate LR images by down-sampling HR photos with the MATLAB bicubic kernel function. The mini-batch size is set to 16, and the cropped HR patch has a spatial size of 128 × 128. The interpolation parameter α ranges from 0 to 1 in steps of 0.2. We have chosen the mentioned parameters and hyper-parameters considering the best performance of the models found on a trial-and-error basis. The experiments are implemented on an NVIDIA GeForce RTX 3080 Ti GPU with 12 GB of memory.

4.1 Result Analysis on CGAN Based Image Generation

We investigate different ResNet-based discriminators that have been utilized in recent unconditional GANs and find that they produce similar outcomes at the cost of more GPU memory. We found that omitting any loss term from the pixel-to-pixel (pix2pix) loss function resulted in worse generation outcomes, as also reported by Isola et al. [1]. A GAN’s loss requires more sophisticated interpretation than that of a basic classification or regression model. We had to make sure that neither the generator nor the discriminator model has “won”: if either the generator loss or the discriminator loss drops below a certain threshold, it means that one model is dominating the other and the combined model is failing to train. The value log(2) = 0.69 corresponds to a perplexity of 2 and is a reasonable point of reference for these losses. The generator loss is a sigmoid cross-entropy loss between the output images and an array of ones. In our experiment, we observed that at the initial stage of training the generator loss was low because the discriminator was not yet trained accurately, so the generator could easily fool it. As training continued, the discriminator acquired the ability to differentiate between real and fake images. After 100 epochs, the generator GAN loss was around 0.3, which indicates that the generator alone was not performing well, as shown in Fig. 9; it produces artifacts.

Fig. 9 a Generator GAN loss in this study, b generator GAN loss in pix2pix [1] method

To address this issue, we also use the generator L1 loss. The L1 loss, or mean absolute error (MAE), is the difference between the generated and target images, as shown in Fig. 10. In this way, the created picture can structurally resemble the target image. The generator L1 loss should decrease as training continues, as the generator is trained to produce photorealistic rather than blurry images.

Fig. 10 a Generator L1 loss in this study, b generator L1 loss in pix2pix [1] method

The CGAN alone can produce sharp images, but the constructed images are affected by artifacts, so we use the CGAN together with L1 to reduce this type of artifact. In the pix2pix [1] method, shown in Fig. 11b, the generator provided deterministic output, so the range of the generator total loss became too high.

Fig. 11 a Generator total loss in this study, b generator total loss in pix2pix [1] method

The discriminator loss is the sum of the generated loss and the real loss. The generated loss is a sigmoid cross-entropy loss between the generated images and an array of zeros (zeros denoting fake), while the real loss is a sigmoid cross-entropy loss between the real images and an array of ones (ones denoting real). A discriminator loss value of less than 0.69 indicates that the discriminator performs better than random on the combined set of real and generated pictures. A value of less than 0.69 for the generator GAN loss indicates that the generator is fooling the discriminator better than random. Figure 12a shows that the discriminator loss fluctuates between 0.5 and 1 throughout the entire training stage. Thus, as the epochs increased, the discriminator was trained well to identify the real image, and feedback was sent to the generator to produce photorealistic images in our model. In contrast, the graph in Fig. 12b almost saturates near a value of 0.5 after a certain number of training epochs. In our method, the generator and discriminator are trained simultaneously, and it is found that neither dominates the other, which was the goal of our GAN design.

Fig. 12 a Discriminator total loss in this study, b discriminator total loss in pix2pix [1] method

The generated output samples from the CGAN are shown in Fig. 13. It is observed that the generated outputs are quite similar to the ground truth images.

Fig. 13 Representation of ground-truth and generated image using Cityscape [43] dataset

The Fréchet inception distance (FID) evaluation metric compares the distance between feature vectors computed for ground truth and generated pictures. It is a common evaluation metric for GANs used to assess the quality of generated images. Furthermore, the inception score (IS) is a measure widely used to automatically judge the quality of pictures generated by a GAN model. In this evaluation process, the generated and ground truth images are passed through the Inception-v3 model and the similarity between the two sets of images is compared [47]. As the GAN generator is trained to produce photorealistic images with an increasing number of epochs, the FID score decreases with training, indicating that the generated images become more similar to the originals. The perfect score is 0, but this cannot be achieved, as the generated image may contain blur or artifacts, or the colors may change, since we only provide the semantic images as input. Our objective in this work was to reduce the FID as much as possible; we have trained the Inception-v3 model for 5 epochs, with 23 iterations in each epoch. Epoch-wise results are given in Table 2.
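For reference, the FID between the two sets of Inception-v3 activations reduces to the Fréchet distance between two Gaussians fitted to those activations; a minimal sketch, assuming the activation matrices have already been extracted, is shown below.

```python
import numpy as np
from scipy import linalg

def frechet_inception_distance(act_real, act_gen):
    """act_real, act_gen: (N, d) Inception-v3 activations for real and generated images."""
    mu1, mu2 = act_real.mean(axis=0), act_gen.mean(axis=0)
    sigma1 = np.cov(act_real, rowvar=False)
    sigma2 = np.cov(act_gen, rowvar=False)
    covmean = linalg.sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):          # discard tiny imaginary parts from numerics
        covmean = covmean.real
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```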

Table 2 Evaluation of IS and FID of the CGAN generated images using inception-v3 model

From Table 2, we observe that the FID score decreases with an increasing number of epochs. At epoch 3, the FID increases because the generator loss increased, so the generated image was far from the target. The IS measure has been shown to correlate well with human realism evaluation of the pictures produced from the Cityscapes dataset.

The IS captures two desired properties of a GAN model in a single measurement: (i) objects should be clearly visible in the created photos, and (ii) the generating method should produce a wide variety of pictures from all of the distinct classes. If a generative model meets both of these criteria, we expect a substantial KL-divergence between the conditional and marginal label distributions, which produces a larger IS. From Table 2, it is observed that the IS increases with an increasing number of epochs, as the generator produces more realistic and sharper images over time. At epoch 3, the IS decreases because the generator loss increased, so the generated image was far from the target. Therefore, a trade-off is necessary for epoch optimization, which is found at epoch 3 as shown in Fig. 14.

Fig. 14 IS and FID measurement

4.2 Performance Evaluation of Object Detection

We use the Faster R-CNN and MobileNet-SSDv2 models for object detection. From the obtained detection results, we find that Faster R-CNN can detect multiple objects accurately at the same time. On the other hand, MobileNet-SSDv2’s detection accuracy is not as good as that of Faster R-CNN, although it trains much faster: its training time is almost one fifth of that of the R-CNN model.

The detected outputs of Faster R-CNN and MobileNet-SSDv2 are shown in Fig. 15, where the input image is taken from the GAN generator. We observed that in some images not all objects could be seen, either because the background was poorly lit or because the generated images were blurry. Using neural style transfer, we were able to improve the object detection accuracy. At first, we gave the GAN-generated image as the content image and a random image taken through a car windshield as the style image to the VGG-19 model described in the methodology section. From Fig. 16, we found that the truck on the road was not detected by the Faster R-CNN model, whereas it was detected successfully at the correct location from the neural style transferred form. Moreover, we observed that object detection from the style transferred image took less time, and we found similar observations with other images. From Fig. 17, we found that the car on the road was wrongly detected as a wheel by the MobileNet-SSDv2 model. However, the style transfer had no effect on MobileNet-SSDv2; we found the same type of result when testing with other images. So we can say that MobileNet-SSDv2 does not work well with style transferred input. Moreover, we observed that object detection from the style transferred image took almost the same time.

Fig. 15 Object detection using Faster R-CNN and MobileNet-SSDv2 along with their detection inference time

Fig. 16 Result comparison after neural style transferred input in Faster R-CNN detector network

Fig. 17 Result comparison after neural style transferred input in MobileNet-SSDv2 network

Although the neural style transfer approach improved the brightness of the image produced by the GAN generator, it creates some artifacts, and the sharpness of the image was not increased much. To enhance the sharpness of the GAN-constructed image, we used image enhancement based on the ESRGAN super-resolution concept. After applying the ESRGAN network, we obtained super-resolution images with maximum texture recovery. We used a pixel-based loss, which helps the generator avoid undesired local optima. With ESRGAN, most of the super-resolved images provide a peak signal-to-noise ratio (PSNR) in the range of 25–30.

After obtaining the super-resolved image from the ESRGAN model, we passed it to the Faster R-CNN and MobileNet-SSDv2 detectors to observe how well the models can detect objects from the super-resolved image. We observed that image quality improvement through preprocessing of the generated images is of great importance for object detection, because it recovers the textures of the images and improves the sharpness of the output image.

From Fig. 18, after image sharpness enhancement we found a remarkable result when using the Faster R-CNN model for street object detection. The Faster R-CNN works notably well on the processed images, with a reduction of detection time by 2.9 s. In contrast, MobileNet-SSDv2 does not increase the accuracy rate significantly, although it provides the exact locations of the detected objects, and its detection time is also reduced. As the generated image provides sharpness and high-fidelity results, more features are extracted than before, so the object detection accuracy improves and objects are detected at more exact locations. Looking at the detector inference time, it took just 1/8 of the previous time to detect objects, and we found quite similar results with different input images. From Fig. 19, we observed that after enhancement the object detection accuracy of MobileNet-SSDv2 did not improve, but the objects were detected at more exact locations than before because of better feature recovery and sharpness of the generated image; the inference time for object detection was also reduced. Although MobileNet-SSDv2 can be trained very fast, for blurry images and images with low background brightness its performance did not increase significantly in our synthesized and processed image case study. However, the performance improvement of the Faster R-CNN is found to be significant in our investigation for the synthesized and processed input scenario. We found that after producing the GAN-generated image, some of the semantic pixels are washed away along with the artifacts when compared with the real image, which might be the reason for the rather low confidence levels of some objects, as shown in Figs. 15, 16, 17, 18 and 19. Consequently, a dilemma arises between two options: either preserve all the semantic pixels, in which case artifacts would also remain in the generated image and could mislead the detectors, or allow some semantic pixels to be removed along with the artifacts. As our main objective was to correctly detect all the objects at the right positions and remove artifacts such as rain, fog, insects, etc., we chose the second option, at the cost of the confidence level falling below 30% for some objects. Protecting all the semantic pixels of the real image from being washed away while simultaneously increasing the confidence level and removing the artifacts is a potential research direction as a future perspective of this study.

Fig. 18 Result comparison after image enhancement in Faster R-CNN model

Fig. 19 Result comparison after image enhancement in MobileNet-SSDv2 model

Table 3 presents a summary of the results obtained for the different case studies in our investigation. In particular, we achieve a +72.49% performance improvement using instance-level labeling, and the inference time is also reduced by 87.87% compared to the existing Faster R-CNN. Additionally, the proposed Faster R-CNN based framework (Neural Style Transfer + Image Enhancement + Faster R-CNN) improves the accuracy over the (Neural Style Transfer + Image Enhancement + MobileNet-SSDv2) variant by +193.69%, achieving 32.6% in terms of AP. The impact of the processed images on detection inference time is shown in the rightmost column of Table 3, where it can be seen that the detector inference time is reduced significantly with the processed input quality, especially for the Faster R-CNN detector. This analysis of detection time across the mentioned scenarios helps demonstrate the conclusive effectiveness of Faster R-CNN over MobileNet-SSDv2 as the detector component of the proposed framework. For the unprocessed input image, Faster R-CNN takes 5 times longer to detect than MobileNet-SSDv2, whereas this ratio is reduced to 1.6 when the proposed processed images are used. Furthermore, the total time required for our complete framework with the mentioned computational resources is 1.046 s, where the times required for image synthesis, neural style transfer, image enhancement, and object detection with Faster R-CNN are 0.58 s, 0.0023 s, 0.064 s, and 0.4 s, respectively. Therefore, even though synthesis and image enhancement are additional tasks beside the detection task, the enhanced input features provided to the detector by this preprocessing significantly reduce the total time along with improving accuracy.

Table 3 In-depth findings from our baseline studies for the instance-level semantic labeling task expressed as region-level average precision (AP) scores AP50% for a 50% overlap value

We have also compared the results of the proposed framework with related state-of-the-art approaches, as summarized in Table 4. It is observed in both Tables 3 and 4 that the detection results for different targets in the street scene differ, owing to the variation in how frequently each target appears during training. Taking the Car and Truck classes as examples: since cars appear more frequently than trucks during training, the generator gets better feedback from the discriminator to generate more photorealistic car outputs than truck outputs, and thus the detectors also provide better accuracy in detecting cars. Similar findings are reported for Domain Adaptive Faster R-CNN [48], pixel-level encoding and depth layering [49], and InstanceCut [50]. Owing to the enhanced input features, the average precision (AP) of each object is increased over the unprocessed form, as shown in Table 3, and also compared with different state-of-the-art works on the same dataset in Table 4. Compared with the state-of-the-art methods in Table 4, our proposed framework identifies Car, Motorcycle, and Bicycle, which are usually the most frequent classes in the training data, with greater APs, along with a comparatively better mAP of 32.6 over all objects. Compared with multiscale combinatorial grouping (MCG) [23] based FRCN [43], mAP improves by roughly 152.71–262.2%, and compared with ground truth (GT) segments [43], mAP improves by roughly 12.07–26.36% in this study. The proposed framework boosts the Domain Adaptive Faster R-CNN [48] model by 18.11% and the baseline Faster R-CNN [23] model by +73.40%. It is also noticeable that, even though the accuracies of the frequent objects improve more, the accuracy improvement of the proposed framework is also good for the other classes, and hence the suggested method can lessen the domain difference between various object classes.

Table 4 Quantitative analysis of different methods on the Cityscapes [43] dataset for the instance-level semantic labeling task expressed as region-level average precision scores AP50% for a 50% overlap value

5 Conclusion

A deep learning based framework is proposed and investigated for street object detection from synthesized and processed semantic images. At first, we generated photorealistic images from semantic maps using the CGAN’s generator; the discriminator tried to distinguish real from fake images and provided feedback to the generator. However, objects were not detected and located accurately in blurry, low-brightness, and artifact-affected scenarios. To solve these problems, we used the neural style transfer technique to improve brightness and the ESRGAN concept to enhance sharpness with maximum texture recovery. Then we applied the generated images to the Faster R-CNN and MobileNet-SSDv2 object detection models in order to investigate the impact of the synthesized and processed input on their performance. Faster R-CNN detected objects more accurately than MobileNet-SSDv2, but it took much more time because of its computational complexity. More features are retrieved than from the unprocessed counterpart, since the resulting images provide sharpness, smoothness, and high-fidelity results; as a result, object detection accuracy increased remarkably with shorter detection time. In conclusion, the result analyses show that the proposed synthesized and processed concept is impactful for improving the performance of the Faster R-CNN model in terms of accuracy and detection time compared with the unprocessed counterpart. The proposed framework and the findings of this study will be useful for the research and development of street object detection systems that work accurately in rain and low-light/night scenarios with comparatively less time. However, convolutional neural network (CNN) models still take significant time because a huge number of parameters need to be trained; this computation increases cost, and deep models require high-capacity GPUs. In a self-driving car, the computational complexity and time requirements should be as low as possible, since real-time object detection is desirable. Hence, reducing computation time is a major concern of our future work. We also plan to use diffusion models, separately or in a combined way as GAN with diffusion (GANDI), for image synthesis and to investigate their performance in future work. Furthermore, we plan to use video datasets as input, because in the real world the model should handle real-time object detection.