1 Introduction

Deep Learning (DL) models have made massive progress in the field of computer vision (Simonyan and Zisserman 2014). This progress is mainly driven by training different convolutional neural network (CNN) architectures on various tasks such as image classification, image segmentation, and object detection. The layers of these neural networks work mainly on preserving the spatial characteristics of images (Simonyan and Zisserman 2014; Srivastava et al. 2021; Szegedy et al. 2015). Many types of DL models have been developed and improved so far, with different layers and connections. Careful handling of the training data is crucial for such architectures to achieve state-of-the-art performance. A DL model’s performance is typically improved by increasing the quality, diversity, and amount of training data (Shorten and Khoshgoftaar 2019). However, the unavailability of training data is a limitation for achieving the desired performance (Yousri et al. 2021; Perez and Wang 2017). This unavailability can take the form of difficulty in finding large datasets or a lack of the annotated data required to train DL architectures in a supervised manner. Accordingly, an efficient way to acquire more annotated data is to enlarge the available dataset itself. This is called data augmentation (DA), and there are different methods for performing it (Perez and Wang 2017; Han et al. 2018). DA methods automatically and artificially inflate the size of the available training data, including both the data and their labels. Data warping and oversampling are commonly used for augmentation (Khosla and Saini 2020; Wong et al. 2016; Shorten and Khoshgoftaar 2019). Data warping transforms the existing images mainly through geometric and color transformations (Khosla and Saini 2020). Rotation and flipping of images are common forms of geometric transformation, while color shifting is the most common form of color transformation (Zheng et al. 2020). On the other hand, oversampling augmentation encompasses creating synthetic instances by methods like feature space augmentation and generative adversarial networks (GANs).

In the context of automatic lane detection, there has recently been an increasing tendency toward treating this task as a semantic segmentation task. Thus, many DL models based on various architectures have been developed to perform it robustly. The main benefit of the DL approach for lane detection is that DL models can detect lanes accurately under road conditions that traditional computer vision techniques cannot handle (Yousri et al. 2021; Zou et al. 2019). Yet, to achieve this robustness, it is crucial to train such DL models on large amounts of data covering various road conditions, such as different road lighting and lane types. In automatic lane detection, simple data augmentation techniques such as flipping, cropping, and rotation are commonly used either to increase the data size or to overcome data imbalance (Zou et al. 2019). However, when applying data augmentation, it is crucial not to use methods that can eliminate important features of the lane lines, such as their bright colors or their frontal heading. Moreover, methods like GANs cannot be applied directly to detection tasks because they produce only new images without their corresponding masks (Shorten and Khoshgoftaar 2019).

In this paper, perspective transformation (PT) is utilized to inflate the training data size by increasing the number of images taken by a frontal camera mounted on top of the autonomous vehicle. Using PT, we can mimic various images taken at different angles and positions of the same scene from just one image (Wang et al. 2020). This can be beneficial in the context of data augmentation, especially since the method is easy to implement. Furthermore, three architectures based on fully convolutional networks (FCNs) are trained on the ego-lane detection task in various experiments to study the impact of using PT. Finally, ensemble learning is exploited during the testing stage. Accordingly, the main contributions of this work can be listed as follows:

  • Employing the perspective transformation as a data augmentation method to mimic realistic images taken at different camera angles for the road scenes without eliminating essential features.

  • Using the same perspective transformations to generate the corresponding labels for the augmented images.

  • Investigating the effect of perspective transformation-based data augmentation on the performance of three different state-of-the-art architectures.

  • Adopting a stacking ensemble approach while testing the developed models to achieve the best possible performance in complex and challenging scenes.

The remaining sections of the paper are organized as follows. Section 2 reviews the related work. Section 3 describes the methodology of using the perspective transform as a data augmentation method and introduces the adopted ensemble learning approach. Section 4 presents the conducted experiments, results, and discussion. Finally, the conclusion is given in Sect. 5.

2 Related work

This paper investigates the effect of using the perspective transform as a data augmentation method while training deep learning models on the ego-lane detection task. Accordingly, this section focuses on briefly reviewing the previous related work that considered deep learning models for the lane detection task and the different data augmentation methods used to overcome the limitation of data unavailability.

2.1 Deep learning for lane detection

In Chao et al. (2019), a robust multi-lane detection algorithm was proposed in which an architecture based on a fully convolutional network (Neven et al. 2018) was used for lane boundary feature extraction. Then, a Hough transform variant and the least squares method were used to fit the lane lines accurately, as done previously in Sun and Chen (2014). The algorithm was evaluated on the TuSimple and Caltech datasets, showing robust lane detection. Another work (Mendes et al. 2016) presented a detection system using a CNN-based architecture designed to be easily converted into an FCN after training to allow the use of a large contextual window. This methodology was compared with other state-of-the-art methods and showed high performance except in scenes with extreme lighting conditions. Furthermore, an architecture based on up-convolutional layers was proposed in Oliveira et al. (2016), while another one called RBNet was developed in Chen and Chen (2017). Both architectures showed robust performance in lane segmentation when tested on the KITTI benchmark dataset. A remaining limitation of Chen and Chen (2017) was obtaining a road segmentation model robust to all weather conditions.

An end-to-end lane detection method based on LaneNet was introduced in Wang et al. (2018). LaneNet is an architecture inspired by SegNet (Badrinarayanan et al. 2017); however, LaneNet has two decoders: one acts as a segmentation branch that detects lanes in a binary mask, and the other is an embedding branch. The network outputs a feature map; clustering and curve fitting using H-Net are then applied to produce the final output. In Gad et al. (2020), an encoder-decoder network based on the SegNet architecture was trained to produce a binary segmentation map by labeling each pixel as a lane or non-lane pixel. The authors used the TuSimple dataset to test their approach and then provided a real-time evaluation. Viewing lane detection as a semantic segmentation task, the authors in Zou et al. (2019) were inspired by the success of the encoder-decoder architectures of SegNet and U-Net (Ronneberger et al. 2015). Accordingly, they built their network by embedding a ConvLSTM to detect lane lines from the information of many frames rather than a single one.

In Yousri et al. (2021), five state-of-the-art FCN-based architectures: SegNet, Modified SegNet, U-Net, ResUNet (Diakogiannis et al. 2020), and ResUNet++ (Jha et al. 2019) were trained for host-lane detection. The performance evaluation of the trained models was done visually and quantitatively by treating lane detection as a binary semantic segmentation task. The output results showed robust performance while testing the models on challenging road scenes with various lighting conditions and dynamic scenarios.

2.2 Data augmentation

Image classification and segmentation have a wide range of data augmentation techniques, varying from simple ones to learned strategies. Image data augmentation techniques can generally be classified into three main approaches: basic image manipulations, deep learning, and meta-learning techniques (Shorten and Khoshgoftaar 2019). Basic image manipulations are preferred in supervised learning, which requires images and their corresponding labels. These manipulations include geometric transformations, kernel filters, color space transformations, mixing, and random erasing. Sharpening and blurring filters are widely used to augment images (LeCun 2015).

The authors of Zhang et al. (2022) proposed an approach to enhance image pixels using a convolutional neural network (CNN), namely a robust deformed denoising convolutional neural network (RDDCNN), which extracts the noise features and combines the rectified linear unit with batch normalization to improve the learning ability of the RDDCNN and eventually produce a clean image. In Tian et al. (2023), the authors used wavelet transforms and multistage convolutional neural networks for image denoising. In situations where the noise level is uncertain, the first stage dynamically adapts the network parameters, which can be advantageous. The second stage uses residual blocks and wavelet transformations to suppress the noise, while the third stage removes redundant features using residual layers. A limitation of this work is the difficulty of obtaining clean reference images. The authors of Tian et al. (2022) proposed a heterogeneous super-resolution convolutional neural network architecture to generate higher-quality images. This architecture contains a complementary convolution block and a symmetric group convolution block in parallel, improving the internal and external relations among channels. On the other hand, color space transformation allows altering the color distribution among images. There are many ways of representing digital images, such as the grayscale representation, which is considered simple. However, color is a significant distinctive feature for some tasks. A study in Jurio et al. (2010) examined the performance of the image segmentation task on HSV (Hue, Saturation, and Value) and other color space representations.

Rotation, flipping, and cropping are considered standard geometric transformation augmentation (GTA) techniques. Simple rotational and flipping affine transformations are commonly used to increase the size of the training data by changing the orientation of the image content. The main advantage of such simple techniques is their easy implementation and their ability to augment the data together with the corresponding annotations. Accordingly, GTA is the preferred data augmentation technique for supervised learning. In vision-based lane detection, much previous research adopted the rotational and flipping processes, as in Zou et al. (2019); Yousri et al. (2021); Jaipuria et al. (2020). The GAN was proposed in Goodfellow et al. (2014) and is considered a powerful data augmentation method that tries to produce unprecedented images. In Li et al. (2022), the authors used a least squares GAN to augment a small dataset, which increased their testing accuracy by 3.57%, although class imbalance remained because of some rare instability incidents. In Ghafoorian et al. (2018), the authors employed a generative adversarial network consisting of a generator and a discriminator for the lane detection task. In their approach, an embedding-loss GAN was proposed to semantically segment driving scenes. The advantage of this model was that the output lanes were thin and accurate. Many variants of GAN have been proposed to improve the quality of the output synthetic images (Radford et al. 2015). Yet, GAN approaches are hard to apply directly to road scenes because the generated images have no corresponding annotations.

3 Proposed approach

3.1 Perspective transformation

In real-world road scenes, all objects are three-dimensional. This means that any image captured from such scenes is a geometrical representation mapped from 3D to 2D (Brynte et al. 2022). This mapping is done through an arbitrary viewing window and camera lens placed in the 3D space of the real world, and the change in the geometrical description is performed through a geometric transformation. In digital systems, a 2D image mapped from the real world is represented by discrete values (x, y) on the image plane (Marzougui et al. 2020).

When a camera captures a scene, or the human eye views it, the objects of the real world are mapped with sizes different from their actual ones. Distant objects appear smaller than near objects due to the concept of perspective (Do 2013). By transforming the perspective, we can render the same scene from different positions and orientations. Accordingly, perspective transformation can be very beneficial when dealing with images in computer vision, as many warped versions can be generated from just one image. In this approach, perspective transformation is used as a data augmentation technique to mimic various frontal road images taken of the same scenes from different viewpoints. Consequently, the number of images used to train the FCN-based architectures on the ego-lane detection task, together with their corresponding labels, is increased. In short, perspective transformation is a projection mapping that turns the projection of an image into another, unique visual plane. This pointwise transformation can be formulated as follows (Li et al. 2017; Do 2013):

$$\begin{aligned} \begin{bmatrix} x'_{dst_{i}} & y'_{dst_{i}} & w'_{i} \end{bmatrix} = \begin{bmatrix} x_{src_{i}} & y_{src_{i}} & w_{i} \end{bmatrix} \begin{bmatrix} m_{11} & m_{12} & m_{13} \\ m_{21} & m_{22} & m_{23} \\ m_{31} & m_{32} & m_{33} \end{bmatrix} \end{aligned}$$
(1)

and,

$$\begin{aligned} P^{\;dst}_{i} = (x'_{dst_{i}},\, y'_{dst_{i}}), \quad P^{\;src}_{i} = (x_{src_{i}},\, y_{src_{i}}), \quad i = 1, 2, 3, 4 \end{aligned}$$
(2)

where \({P}^{\;src}_{i}\) is a source point of the quadrangle vertices on the original image, represented by the coordinates \({x}_{src_{i}}\) and \({y}_{src_{i}}\), and \({P}^{\;dst}_{i}\) is the corresponding destination point of the quadrangle vertices on the warped image, represented by the coordinates \(x_{dst_{i}} = x'_{dst_{i}} / w'_{i}\) and \(y_{dst_{i}} = y'_{dst_{i}} / w'_{i}\). The 3\(\times \)3 matrix in Eq. (1) is the perspective transformation matrix, or map matrix, which turns \({P}^{\;src}_{i}\) into the new point \({P}^{\;dst}_{i}\) in homogeneous coordinates. The perspective matrix has only eight free elements, since generally \(m_{33} = w_{i} = 1\) after normalization. By setting the source and destination points, the perspective transformation matrix can be calculated: to find the eight unknown elements, eight equations are needed, so four pairs of points represented as quadrangle vertices are used, \(P_{1}=(0, 0)\), \(P_{2}=(W, 0)\), \(P_{3}=(0, H)\), \(P_{4}=(W, H)\),

where H and W are the height and width of the image. Figure 1 shows the four points of the quadrangle vertices on an image. By changing the coordinates of these points, the source and destination points can be defined. To perform a perspective transformation of an input image, each output pixel is obtained by multiplying with the given perspective matrix.
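For illustration, the map matrix can be obtained from the four point correspondences with a library routine such as OpenCV's cv2.getPerspectiveTransform, which solves the eight equations described above; this is only a sketch, and the destination coordinates below are placeholders rather than the exact values used in this study.

```python
import numpy as np
import cv2

# Example image dimensions (the study uses 1280x720 frames)
H, W = 720, 1280

# Source quadrangle vertices P1..P4: the four corners of the original image
src = np.float32([[0, 0], [W, 0], [0, H], [W, H]])

# One example set of destination vertices defining a warped viewpoint
# (illustrative coordinates only, not the exact values of Table 1)
dst = np.float32([[80, 40], [W - 40, 0], [0, H], [W - 80, H - 20]])

# getPerspectiveTransform solves the eight linear equations for
# m11..m32 (with m33 normalized to 1) and returns the 3x3 map matrix of Eq. (1)
M = cv2.getPerspectiveTransform(src, dst)
print(M.shape)  # (3, 3)
```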

Fig. 1
figure 1

The four points of the quadrangle vertices on an image

As the aim is to train supervised deep learning models, the same perspective matrix must be applied to both the road image and its corresponding label. Figure  2 represents a sample image and its corresponding label from the nuScenes dataset.
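A minimal sketch of this step is given below, assuming OpenCV is used for warping: the same map matrix is applied to the road image and to its segmentation mask, and the file names are hypothetical placeholders, not paths from the dataset.

```python
import numpy as np
import cv2

def warp_pair(image, mask, M):
    """Apply one perspective map matrix to a road image and its ego-lane mask."""
    h, w = image.shape[:2]
    # The output keeps the input size; points mapped outside the original
    # boundary are filled with zero (black) pixels.
    warped_image = cv2.warpPerspective(image, M, (w, h))
    # Nearest-neighbour interpolation keeps the label mask strictly binary.
    warped_mask = cv2.warpPerspective(mask, M, (w, h), flags=cv2.INTER_NEAREST)
    return warped_image, warped_mask

# Example usage with placeholder file names
image = cv2.imread("road_frame.png")                              # hypothetical frame
mask = cv2.imread("road_frame_label.png", cv2.IMREAD_GRAYSCALE)   # hypothetical label
h, w = image.shape[:2]
src = np.float32([[0, 0], [w, 0], [0, h], [w, h]])
dst = np.float32([[80, 40], [w - 40, 0], [0, h], [w - 80, h - 20]])  # illustrative
M = cv2.getPerspectiveTransform(src, dst)
aug_image, aug_mask = warp_pair(image, mask, M)
```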

Fig. 2
figure 2

Showing a sample road image and its corresponding ego-lane segmentation mask: a Separated; b Combined together

In this study, after applying the perspective transformation, the output image has the same size as the input image; any points falling outside the original image boundary are filled with black pixels of zero value and will need further cropping. In Table 1, the destination points of the eight adopted perspectives are illustrated. For a better realization of the destination points, Figure 3 shows them as quadrangle vertices. Many other destination points can be used for additional perspectives; however, only these eight perspectives are used in the augmentation process in this study. The perspective transformation of the input image is performed by multiplying each output pixel with the given perspective matrix, and the output image keeps the input size of \(1280\times 720\).

Table 1 Illustration of the destination points of the eight adopted perspectives

Figure 4 shows the eight warped versions of the same road frame of Figure 2 after applying the perspective transformation augmentation (PTA) method. The warped images obviously give a better realization of the ego-lane features without missing any needed information. Hence, this is the main motivation behind adopting PTA as a data augmentation method in the context of ego-lane detection.
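Putting the previous sketches together, the augmentation over a set of frames could be organized as follows. This is a hedged sketch: the destination quadrangles stand in for the eight perspectives of Table 1 with illustrative coordinates only, and the augmentation factor selects how many of them are applied per image.

```python
import numpy as np
import cv2

H, W = 720, 1280
SRC = np.float32([[0, 0], [W, 0], [0, H], [W, H]])

# Placeholder destination quadrangles standing in for the eight adopted
# perspectives (illustrative coordinates, not the values of Table 1)
DST_SETS = [
    np.float32([[60, 0], [W, 0], [0, H], [W - 60, H]]),
    np.float32([[0, 0], [W - 60, 0], [60, H], [W, H]]),
    # ... remaining perspectives
]

def augment_dataset(pairs, factor=2):
    """Warp each (image, mask) pair with `factor` of the adopted perspectives."""
    augmented = []
    for image, mask in pairs:
        for dst in DST_SETS[:factor]:
            M = cv2.getPerspectiveTransform(SRC, dst)
            warped_img = cv2.warpPerspective(image, M, (W, H))
            warped_msk = cv2.warpPerspective(mask, M, (W, H), flags=cv2.INTER_NEAREST)
            augmented.append((warped_img, warped_msk))
    return augmented
```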

Fig. 3
figure 3

Visualization of the destination points of the eight adopted perspectives

Fig. 4
figure 4

The eight perspectives used in this study

3.2 Used architectures

In this work, ego-lane detection is treated as a semantic segmentation task with lane and off-lane classes. Existing CNNs are powerful visual networks capable of learning hierarchies of features, and FCN-based architectures have become the cornerstone of deep learning applied to semantic segmentation. A fully convolutional network is an architecture that performs only convolution operations, by either downsampling or upsampling, without dense layers. Many developed FCN-based architectures can be trained in a supervised manner to produce a pixel-to-pixel semantic segmentation map, making them easy to train end-to-end. SegNet (Badrinarayanan et al. 2017) and U-Net (Ronneberger et al. 2015) are successful architectures that can perform the semantic segmentation task efficiently. Both networks are based on the encoder-decoder architecture, where the encoder and decoder blocks are fully convolutional.

In SegNet, each encoder produces a set of feature maps by performing convolution with a filter bank. Each convolution operation is followed by batch normalization and an element-wise rectified linear unit (ReLU) applied before max-pooling. The decoder stage of SegNet consists of a set of upsampling and convolution layers, where each upsampling layer corresponds to a max-pooling layer in the encoder part. The resulting upsampled maps are then convolved to produce a high-dimensional feature representation. Finally, a softmax classifier is used for pixel-wise prediction to produce the final segmentation.

On the other hand, the architecture of U-Net has a contracting path and an expansive path. The contracting path represents the encoder part, where each convolution operation is followed by the ReLU activation function and a max-pooling operation for downsampling, without any batch normalization (Ozturk et al. 2020). At each downsampling step, the number of feature channels is doubled. In contrast, decoding in the expansive path consists of upsampling the feature maps followed by up-convolution operations. At each step, the feature maps are concatenated with the corresponding cropped feature map from the contracting path, and the number of feature channels is halved while the spatial resolution is restored. A convolution operation is performed at the final layer to map to the segmentation output.

The architecture of U-Net is symmetric, easy to implement, fast, and produces a precise segmentation of images. Accordingly, many recent architectures have built on U-Net with some modifications or added blocks. According to Jha et al. (2019), ResUNet++ significantly outperforms U-Net. The ResUNet++ architecture is based on ResUNet (Diakogiannis et al. 2020), the Deep Residual U-Net, which takes advantage of both deep residual learning and U-Net. The ResUNet++ architecture contains one stem block followed by three encoder blocks, Atrous Spatial Pyramid Pooling (ASPP), and three decoder blocks. For experimenting with PT as a data augmentation method while training DL models in a supervised manner on the ego-lane detection task, SegNet, U-Net, and ResUNet++ are utilized here.
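To make the encoder-decoder pattern concrete, the sketch below shows a heavily reduced U-Net-style network in Keras, used here only as an example framework; the layer counts, filter numbers, and input size are illustrative assumptions and do not reproduce the exact SegNet, U-Net, or ResUNet++ configurations of this study.

```python
from tensorflow.keras import layers, Model

def tiny_unet(input_shape=(128, 256, 3)):
    """A drastically reduced U-Net-style encoder-decoder (illustrative only)."""
    inputs = layers.Input(shape=input_shape)

    # Encoder: convolution + downsampling, doubling the feature channels
    c1 = layers.Conv2D(16, 3, padding="same", activation="relu")(inputs)
    p1 = layers.MaxPooling2D()(c1)
    c2 = layers.Conv2D(32, 3, padding="same", activation="relu")(p1)
    p2 = layers.MaxPooling2D()(c2)

    # Bottleneck
    b = layers.Conv2D(64, 3, padding="same", activation="relu")(p2)

    # Decoder: upsampling + concatenation with the encoder feature maps
    u2 = layers.UpSampling2D()(b)
    u2 = layers.Concatenate()([u2, c2])
    c3 = layers.Conv2D(32, 3, padding="same", activation="relu")(u2)
    u1 = layers.UpSampling2D()(c3)
    u1 = layers.Concatenate()([u1, c1])
    c4 = layers.Conv2D(16, 3, padding="same", activation="relu")(u1)

    # 1x1 convolution with sigmoid for the binary lane / non-lane map
    outputs = layers.Conv2D(1, 1, activation="sigmoid")(c4)
    return Model(inputs, outputs)
```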

3.3 Ensemble learning

Lane detection is a challenging semantic segmentation task due to the diversity of lane types, road illumination conditions, and possible dynamic scenarios. Accordingly, training just one model that can robustly and efficiently detect the ego-lane under all possible road conditions is difficult (Yousri et al. 2021; Chao et al. 2019). Another challenging aspect of the lane detection task is the class imbalance within the images, where the lane class makes up only a small portion of the image, which can lead to further misclassifications that must be considered. Here we can take advantage of ensemble learning, a powerful machine learning technique that combines the results of several learning algorithms to achieve better generalization than a single algorithm (Li and Yang 2017). Several models can be trained separately, and their predictions can then be combined, averaged, or subjected to a majority vote to produce a better overall prediction that relies on multiple realizations rather than just one (Dietterich 2002; Cutler et al. 2012). Bagging, boosting, and stacking are the main classes of ensemble learning methods (Dietterich 2002). Stacking, the type used in this study, trains different models on the same data and learns how to best combine their predictions.

In the context of ego-lane detection, the main motivation for adopting an ensemble of models is to overcome the challenges addressed earlier and hence achieve better and more accurate pixel classification. However, to achieve effective ensemble learning, it is important to carefully select and train the models and then pick the ensembling technique according to the problem (Sagi and Rokach 2018). The network architectures to be trained should be selected with diversity in mind. Moreover, an essential criterion in the selection is to guarantee that the trained models output the same predictive type. In this work, three state-of-the-art semantic segmentation architectures, SegNet, U-Net, and ResUNet++, are used in ensemble learning, and their predictions are merged and averaged to produce the final prediction.
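A sketch of this averaging step at test time could look as follows, assuming each trained model exposes a Keras-style predict method and outputs per-pixel lane probability maps of identical shape; the 0.5 threshold is an assumption, not a value stated in the paper.

```python
import numpy as np

def ensemble_predict(models, image_batch, threshold=0.5):
    """Average the per-pixel lane probabilities of several trained models."""
    probs = [m.predict(image_batch) for m in models]   # each: (N, H, W, 1)
    mean_prob = np.mean(probs, axis=0)                 # stacked and averaged prediction
    return (mean_prob > threshold).astype(np.uint8)    # final binary ego-lane mask
```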

4 Experiments and results analysis

This section presents the steps for validating the reliability of the proposed augmentation approach, including the implementation details, the datasets used to evaluate the performance of the FCN-based models, and the quantitative and qualitative analysis of the obtained results. The phases of the proposed approach were implemented in Python and executed on Google Colaboratory (Carneiro et al. 2018).

4.1 Datasets description

This work considers FCN-based architectures that can be trained in a supervised manner to produce a pixel-to-pixel semantic segmentation map, identifying each output pixel as lane or non-lane. Accordingly, images with their corresponding ego-lane/host-lane ground truth annotations are crucial for the training and validation of such architectures. In this study, two datasets are involved:

  1. The extended version of the nuScenes dataset (Caesar et al. 2020) that was developed and used in Yousri et al. (2021), where some of the frontal camera frames were labeled based on a proposed sequence of traditional computer vision techniques for the ego-lane detection task.

  2. The open-source benchmark KITTI dataset (Geiger et al. 2013), whose lane estimation benchmark comes with ground truth annotations for the ego-lane.

The extended version of the nuScenes dataset is the only data chosen for training the FCN-based architectures. The reason for excluding KITTI from the training stage is the unavailability of a sufficient number of labeled frames. The nuScenes dataset originally contains 1.4 million RGB (Red Green Blue) images taken via six different cameras mounted on the autonomous vehicle. However, the extended version developed in Yousri et al. (2021) contains only a subset of the frontal images of the original nuScenes dataset along with their corresponding ego-lane annotations. The images within this extended version cover diverse road illumination conditions, lane line types, and dynamic scenarios. The reliability of the generated labels was qualitatively validated in Yousri et al. (2021) by visual examination. Accordingly, the PTA method is applied to this richly annotated dataset to train the three FCN architectures. For the testing stage, both nuScenes and KITTI are utilized, as this expands the diversity of the testing experiments. Using KITTI in performance testing has two advantages. First, many previous studies tested their lane detection methods on the KITTI benchmark, so we can easily compare our results with theirs. Second, achieving high performance on images from another dataset intuitively assures that no overfitting occurred due to the PTA augmentation.

4.2 Training stage setup

In this study, all the FCN-based models are trained in a supervised, end-to-end manner to detect the ego-lane under various road conditions. In the training stage of all the experiments conducted later, we randomly split the training data into a training set and a validation set with a ratio of 0.9 : 0.1. The number of training epochs is set to 100 and is fixed across all the experiments for the different architectures and data sizes. To improve the overall performance of the deep learning models, it is crucial to obtain the optimal parameters during the training stage. This can be done by carefully defining loss function(s) suitable for the semantic segmentation task. In this approach, a hybrid loss function combining the Binary Cross-Entropy and the Dice loss is utilized; the two terms can be mathematically presented as in Jadon (2020):

$$\begin{aligned} {L}_{BCE}\,(v,{\hat{v}}) = -\left( v \log ({\hat{v}}) + (1-v) \log (1-{\hat{v}}) \right) \end{aligned}$$
(3)
$$\begin{aligned} {L}_{DICE}\,(v,{\hat{v}}) = 1 - \frac{2v{\hat{v}}+1}{v+{\hat{v}}+1} \end{aligned}$$
(4)

where v is the actual (ground-truth) value and \({\hat{v}}\) is the value predicted by the model.
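A minimal Keras-style sketch of this hybrid loss is given below; the equal weighting of the two terms and the smoothing constant are assumptions made for illustration.

```python
import tensorflow as tf

def dice_loss(y_true, y_pred, smooth=1.0):
    # Eq. (4): 1 - (2*v*v_hat + 1) / (v + v_hat + 1), computed per image
    intersection = tf.reduce_sum(y_true * y_pred, axis=[1, 2, 3])
    union = tf.reduce_sum(y_true, axis=[1, 2, 3]) + tf.reduce_sum(y_pred, axis=[1, 2, 3])
    return 1.0 - (2.0 * intersection + smooth) / (union + smooth)

def hybrid_bce_dice_loss(y_true, y_pred):
    # Eq. (3) + Eq. (4); equal weighting of the two terms is assumed here
    bce = tf.keras.losses.binary_crossentropy(y_true, y_pred)
    bce = tf.reduce_mean(bce, axis=[1, 2])
    return bce + dice_loss(y_true, y_pred)
```

Such a loss could then be passed to a model's compile step and the network trained for the 100 epochs with the 0.9 : 0.1 split described above; optimizer and learning-rate choices are not specified here.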

4.3 Results and discussion

During the training stage of the three deep learning architectures, the images are resized to a resolution of \(265 \times 128\). The hypothesis we want to test in the conducted experiments is that the PTA method can boost the performance of FCN-based models on the ego-lane detection task more than the standard GTA. The \(90^{\circ }\) rotation in both directions is adopted in this study as the standard GTA technique. The reasons for choosing it are:

  1. It is commonly used with the lane detection task.

  2. It is easy to implement.

  3. It eliminates a very important feature of the ego-lane lines: their frontal and central position in the image.

Accordingly, no overfitting is likely to occur, unlike with flipping, which nearly reproduces the same road scene. The first step in investigating the impact of using the perspective transform for augmenting images in the lane detection task is to compare models trained on data augmented using PTA against the other adopted augmentation method. The experiments include three different deep learning architectures to broaden and strengthen the evaluation. The conducted experiments included training the three detectors, SegNet, U-Net, and ResUNet++, for 100 epochs on:

  • Three different data sizes.

  • Three different augmented datasets using PTA, GTA, and both of them.

By performing these 27 experiments, we can comprehensively evaluate the effect of using PTA as an augmentation method. In the PTA trials, every image was augmented by a factor of 2 using the adopted perspectives presented earlier. As ego-lane detection is treated here as a two-class semantic segmentation task, the dice coefficient is the main evaluation metric. Table 2 shows the validation results of the training stage.
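For reference, the dice coefficient of a predicted binary mask against the ground truth can be computed per image as sketched below; the smoothing constant guarding against empty masks is an assumption.

```python
import numpy as np

def dice_coefficient(pred_mask, gt_mask, smooth=1.0):
    """Dice coefficient between a predicted binary mask and the ground truth."""
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    return (2.0 * intersection + smooth) / (pred.sum() + gt.sum() + smooth)
```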

Table 2 The validation results of the trained models

Looking at the table, we can observe that all the models trained on the dataset augmented using the PTA method achieve higher performance across the different data sizes. It is also noticeable that, with the PTA method, the validation dice of ResUNet++ outperforms both U-Net and SegNet by nearly \(0.3\%\) and \(1.2\%\), respectively, for a data size of 2000 before augmentation. As the data size increases, the quantitative variations among the validation results decrease due to the reduced impact of augmentation in general (Shorten and Khoshgoftaar 2019). Scarce data provides too few samples to train the models sufficiently, and thus the importance of augmentation appears in this context. This behavior was addressed before in Diakogiannis et al. (2020), which noted that FCNs use contextual information to increase their performance; the impact of the PTA method supports this, as it does not eliminate important information from the images, unlike the standard GTA methods. The behavior of ResUNet++ during the experiments shows that it is noticeably affected by the augmentation process and that it outperforms the other two architectures. Jha et al. concluded in their study (Jha et al. 2021) that data augmentation has a considerable effect on ResUNet++ performance in segmentation tasks. Accordingly, we investigated the effect of the PTA method on the ResUNet++ architecture by using training data augmented with different augmentation factors from the adopted perspectives. Figure 5 shows the validation dice coefficient and loss of ResUNet++ trained on a dataset of 3000 images augmented by factors of 2, 4, 6, and 8.

Fig. 5
figure 5

Validation loss and validation dice coefficient curves while training ResUNet++ on data augmented using PTA by different factors

For testing the models, different road conditions, as shown in Table 3, were chosen to investigate quantitatively to what extent the PTA method boosts the detection performance of the three models trained on data of size 6000 augmented by a factor of 2. The road environment is dynamic and challenging in nature due to the diversity of possible scenarios, including changes in illumination, road condition, and lane line types. Consequently, testing the performance of the models in such diverse driving situations is crucial to evaluate their robustness. For every category in Table 3, 200 images were used in testing. The testing set contains all the condition classes needed to investigate the performance of the models in the most complex and harsh environments. The results in Table 3 show that the models trained on images augmented using the PTA method outperform those augmented using the standard GTA method. During the performance analysis on the testing sets, ResUNet++ showed the most reliable ego-lane detection performance in most testing classes, while U-Net achieved the highest dice coefficients for the double lane lines and the crowded and curved road conditions. This suggests that U-Net can adapt more robustly when detecting the ego-lane in scenes with fine and complex details. Furthermore, the testing results show that the models still perform insufficiently in conditions like dark, rural, and crowded scenes. The average processing time for lane detection is 21.3 ms/frame for sunny conditions, 21.9 ms/frame for cloudy weather, and 23.1 ms/frame for dark conditions.

Table 3 A comparative table for evaluating the performance of the different trained models tested on various challenging conditions

Even though high dice coefficients were achieved using PTA in the previous testing trial, the robustness of ego-lane detection still needs to be improved. A high-performance detector should be able to cope with nearly all the driving situations that are likely to occur on a daily basis. Bad illumination, as in dark night scenes and the shadows of rural roads, is very challenging. Harsh rainy scenes are also a serious limitation for accurate lane detection due to the poor lighting, the distracting water droplets, and the wet road with its mirror-like, distorting effect. Dynamic road scenarios affect the robustness of lane detection as well; these can take the form of ego-lane-occluding obstacles such as a preceding car or pedestrians. Moreover, the distortion can come from complex lane positions or lane lines that are hard to adapt to, such as curvy, zigzag, or multi-line ones. Ensemble learning can provide an excellent solution for achieving the best possible results in these harsh road environments. By stacking and averaging the predictions of the models used in the previous testing experiment, a more robust performance in the ego-lane detection task can be attained, as shown in Table 4.

Table 4 The quantitative results after testing based on the ensemble learning approach

The results show that the ego-lane detection performance has been noticeably boosted. The ensemble learning approach increased the testing results on the challenging rural scenes by nearly \(7\%\). Also, for the dark and rainy scenes, the dice coefficient reached 0.986 and 0.987, respectively. Moreover, the ensemble prediction is robust in the dynamic scenarios of occluded and crowded road environments. This indicates the reliability of ensemble predictions of models trained on data augmented using the PTA method. The usefulness of PTA could also be evaluated by training new models with other data augmentation techniques; however, this is better done in contexts other than lane detection, such as object detection, because for lane detection, augmentation methods like adding random noise or color space transformation can dramatically affect the models. The affine transformation (rotation) used in this study is commonly adopted, and the PTA method yields enhanced results for the lane detection task. Thus, this study can be considered a weak supervision approach, as it utilizes a limited number of images with their corresponding weak labels to efficiently train FCN-based models.

In the context of semantic segmentation, visual examination is an excellent way to evaluate the overall performance qualitatively. The prediction is supposed to represent a precise segmentation of the input image into two different parts at both the coarse and fine levels. Accordingly, qualitative evaluation is important in the context of the ego-lane detection task, as it reveals the prediction quality that is crucial for advanced driver assistance systems (ADAS). This study aims to achieve the most precise detection after training the models on a large dataset augmented using PTA. Thus, we evaluate the performance at the fine level, verifying that the models can precisely handle the details under challenging road conditions. There should not be a noticeable gap between the green prediction and the ego-lane lines during the visual evaluation. Figure 6 shows the high performance of the ensemble prediction on very challenging scenes, including bad illumination, rain, shadows, different complex lane orientations, and various dynamic scenarios. As future work, time series prediction as in Xing et al. (2022); Xiao et al. (2022); Xing et al. (2022) could be taken into consideration.

Fig. 6
figure 6

Samples from the ensemble prediction results based on nuScenes dataset

4.4 Comparing with others

The KITTI Lane benchmark (Geiger et al. 2013) is commonly used for evaluating ego-lane detection methods. Accordingly, we utilize it to compare the result of our ensemble learning approach with other state-of-the-art lane detection methods. The KITTI Lane benchmark contains 95 sample images collected in various urban scenes with ground truth. The comparison is done in terms of the dice coefficient, which is equivalent to the F1-measure or MaxF in the binary segmentation context. The proposed approach is compared with FCN_LC (Mendes et al. 2016), Up-Conv-Poly (Oliveira et al. 2016), RBNet (Chen and Chen 2017), and NVLaneNet (Wang et al. 2020), as shown in Table 5. The highest dice coefficient among these methods is 91.86%, achieved by NVLaneNet (Wang et al. 2020), while the proposed approach achieves a dice coefficient of 96.04%.

Table 5 Evaluating results based on KITTI Lane benchmark

5 Conclusion

A data augmentation approach based on perspective transformation is developed in this paper to train different FCN-based architectures on ego-lane detection in a supervised manner. Through several experiments training three FCN-based models, SegNet, U-Net, and ResUNet++, we demonstrate that perspective transformation data augmentation effectively enhances the performance of the models. The results are evaluated quantitatively and qualitatively by testing the models on various illumination and road conditions, different lane line types, and diverse dynamic scenarios. Realistic road images, including the necessary labels, can be generated through perspective transformation augmentation without missing important features. Finally, ensemble learning is used to boost the ego-lane detection task in the most challenging road conditions. The performance is compared with previous lane detection methods on the KITTI Lane benchmark dataset, where the proposed approach achieves a dice coefficient of 96.04%, exceeding the other reported results.