1 Introduction

Semantic segmentation is pixel-wise classification based on labeled data [54, 55]. It is the basic technique for extracting meaningful information from an image [7, 53, 92]: once this information is obtained, each region can be assigned to a specific object class [9, 59, 81]. It remains one of the most challenging problems in computer vision and has diverse applications in fields such as education, medicine, weather prediction, and climate-change prediction [4]. Deep learning-based segmentation methods are also widely used in related vision tasks such as video object segmentation [53,54,55, 83]. Experimental results have shown that dataset preparation and preliminary changes, i.e., pre-processing, strongly affect the results; before training, the data should be made suitable for the intended operations, and the choice of dataset depends on the type of operation to be performed [46]. The related work in this area is technically demanding [38, 105], and many surveys have provided thorough overviews of semantic segmentation [2]. The open challenges and new approaches have made this field worth sustained attention. Recent research and development show that accurate classification leads to better semantic segmentation results [39, 58, 113]. New methods appear continuously; we therefore first cover the basic concepts and then examine how deep learning algorithms yield the most efficient results [96]. Region-based segmentation, graph-based segmentation, image segmentation [26, 117], instance segmentation [56], and semantic segmentation share the same foundations but differ in procedure. Figure 1 shows the trend in segmentation research over recent years, and Fig. 2 summarizes the state-of-the-art techniques that can be used for semantic segmentation. Related work in this field using deep learning is progressing every day.

Fig. 1

Segmentation trends in recent years

Fig. 2

Traditional and deep learning based techniques for semantic segmentation

1.1 Key challenges

An algorithm that performs well on one input dataset will not necessarily produce the same results on another [102]. The primary reason is that different datasets do not undergo the same operations before the training and testing phases; the second reason is that during the machine learning process we often assume the whole dataset is free of ambiguity and will therefore yield the most efficient and accurate result [122]. Sometimes the dataset has few samples, perhaps fewer than 1000 images in total; in that case the model will be under-fitted [90, 106, 107]. Overfitting and underfitting often occur when we have too many or too few samples, respectively [16]. Since an algorithm does not perform equally well on all types of datasets, this paper examines what can be changed to improve efficiency [37]. For example, consider a satellite image dataset [116], i.e., images of some region captured from a satellite, on which predictive analysis is needed, such as evaluating climatic change in a specific area [121]. The training data would consist of previously captured satellite pictures, examined against a map; any mapping service, such as Apple Maps, Baidu Maps, or Google Maps, can serve as a reference [76]. After assembling the image dataset for that specific area, deep learning algorithms can be applied to evaluate the semantic segments.
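When the dataset is small, data augmentation is a common mitigation. Below is a minimal sketch using torchvision transforms; the library choice and parameter values are our own illustrative assumptions, and for segmentation the same geometric transforms must also be applied to the label masks:

```python
import torchvision.transforms as T

# Illustrative augmentation pipeline for a small (< 1000 image) dataset;
# parameter values are placeholders, not tuned settings from this survey.
train_transforms = T.Compose([
    T.RandomHorizontalFlip(p=0.5),                 # mirror street scenes
    T.RandomRotation(degrees=10),                  # small viewpoint jitter
    T.ColorJitter(brightness=0.2, contrast=0.2),   # lighting variation
    T.RandomResizedCrop(size=(360, 480), scale=(0.8, 1.0)),
    T.ToTensor(),
])
```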

1.2 Classification

Consider classifying the objects that belong to a forest region [91], where trees are the required masked area. The regions containing trees can be highlighted, and the next step is to attach a label to each such area [70, 71]. In this way, it is easy to see which parts of a country have more or less forest, which matters because deforestation contributes to the climatic changes we are facing. Hence, using a semantic segmentation technique, one can quickly obtain predictions for a specific period, such as when an area will grow more trees and enjoy better weather. Semantic segmentation has many more applications, for instance in medicine for the automatic diagnosis of schizophrenia using CNN-LSTM models and EEG signals [22, 69, 75].

1.3 Object detection

Object detection [10] is another essential part of semantic segmentation: after classification, edges are drawn and the object is detected [42]. A helpful example is detecting diseases from a medical training dataset [72]. The most recent research in this area targets COVID-19 detection [21, 73]: given lung samples, images of infected lungs are used for training, after which the differences between infected and healthy lungs can be evaluated [105]. The most popular modality is the chest X-ray scan. In this example, the whole dataset is examined to determine which images show a particular disease, the infected area is spotted and masked, and the difference between the input lung image and a healthy lung image is then computed for the evaluation [62]. A good, accurate algorithm therefore gives more efficient results. Similarly, climatic change in a specific area can be detected for weather prediction; the difficulty there is assembling the primary dataset and capturing the behavioral changes over time, so data collection itself is a critical step before applying deep learning algorithms [40]. Our purpose is to deliver a core understanding of all these concepts. We first used supervised learning algorithms [37], observed that they do not give the most efficient results, and then moved to more efficient algorithms. The algorithms tested at the beginning use simple, lighter images and already available libraries; afterwards, we tried the more complicated algorithms.

2 Datasets for semantic segmentation

This research area is becoming more popular by the day. We have tried the algorithms on the CamVid [113], Pascal VOC [76], and COCO [100] datasets. The dataset used throughout is CamVid, which works for most techniques [57]; for masking, we used the COCO dataset. The Cambridge-driving Labeled Video Database (CamVid) consists of 367 training pairs, 101 validation pairs, and 233 test pairs with 32 semantic classes, and is considered one of the most advanced datasets for real-time semantic segmentation [63]. The other dataset we experimented on is COCO, the Microsoft Common Objects in Context dataset, a large image dataset explicitly designed for image processing tasks such as object detection, person keypoint detection, caption generation, segmentation, and many other prevalent problems. It has around 80 classes and more detailed objects inside the images. Moreover, we also used some small datasets: a balloons dataset, shapes [19] (to detect rectangles, circles, triangles, etc.), and nucleus [74, 105] (for the medical field). The purpose of using these datasets is to check whether a given network architecture can work on other datasets with the same accuracy [123, 124] and loss.
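To make the dataset handling concrete, here is a minimal sketch of a CamVid-style dataset wrapper in PyTorch; the directory layout and the joint transform are assumptions for illustration, not the dataset's official interface:

```python
import os
from PIL import Image
from torch.utils.data import Dataset

class CamVidDataset(Dataset):
    """Minimal sketch: each image in `image_dir` is assumed to have a
    same-named RGB label map in `label_dir`."""
    def __init__(self, image_dir, label_dir, transform=None):
        self.image_dir, self.label_dir = image_dir, label_dir
        self.files = sorted(os.listdir(image_dir))
        self.transform = transform  # joint transform on (image, label)

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        name = self.files[idx]
        image = Image.open(os.path.join(self.image_dir, name)).convert("RGB")
        label = Image.open(os.path.join(self.label_dir, name))
        if self.transform:
            image, label = self.transform(image, label)
        return image, label
```

With the split above, such a wrapper would be instantiated once each for the 367 training, 101 validation, and 233 test pairs.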

Many other semantic segmentation datasets exist. Mapillary Vistas [61] contains around 25,000 high-resolution images with 66 defined semantic classes, plus instance-specific labels for 37 of them. Its classes mainly cover road scenes, as in the Cityscapes dataset, but it has more annotations than Cityscapes. Moreover, SemanticKITTI is a first-class dataset for understanding the semantics of scenes [5]; it is based on the KITTI Vision Benchmark, with all sequences and frames covering the full angular range (Table 1).

Table 1 State-of-the-art algorithms on publicly available datasets

BDD, the Berkeley DeepDrive dataset, is a detailed urban-scene dataset for autonomous driving [33, 85, 91]. It supports a complete set of tasks, including object detection, multi-object tracking, semantic segmentation, instance segmentation, segmentation tracking, and lane detection. For semantic segmentation, BDD has 19 classes, though its samples are not very practical for urban-scene semantic segmentation. WildDash 2 is also a primary dataset for semantic segmentation, but it has limited material, i.e., too few training and testing samples to fulfill most algorithms' requirements, so it is advisable to prefer other, more highly organized and well-managed datasets [110].

WildPASS is a panoramic semantic segmentation dataset captured in cities. It has two versions of the same data: WildPASS, with 500 panoramas from 25 cities, and WildPASS2K, with 2000 labeled panoramas from 40 cities. Although the variation is acceptable, this dataset is not recommended for the complex scenarios that arise in urban scenes [104].

3 Algorithms: Classical vs. deep learning

3.1 GrabCut algorithm: Region-based segmentation

GrabCut is an image segmentation method [67] that extracts an object from a defined area or region. The image information is separated into two regions, background and foreground, from which a segment is produced. For the region-based algorithm, we used GrabCut [62] via OpenCV in Python; the steps are as follows:

1. Take an input image.
2. Define the foreground and background separately so they can be passed to the GrabCut function.
3. Map the defined rectangle onto the foreground image, which shows the segment separately.
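A minimal sketch of these three steps with OpenCV's grabCut; the file name and rectangle coordinates are illustrative placeholders rather than the exact values used in our experiments:

```python
import cv2
import numpy as np

# Step 1: take an input image (path is a placeholder).
img = cv2.imread("camvid_frame.png")

# Step 2: initial mask plus the background/foreground models that
# grabCut refines internally.
mask = np.zeros(img.shape[:2], np.uint8)
bgd_model = np.zeros((1, 65), np.float64)
fgd_model = np.zeros((1, 65), np.float64)

# Step 3: rectangle (x, y, w, h) roughly enclosing the foreground;
# as noted in the results below, the output is sensitive to this choice.
rect = (50, 50, 450, 290)
cv2.grabCut(img, mask, rect, bgd_model, fgd_model, 5, cv2.GC_INIT_WITH_RECT)

# Zero out pixels labeled definite or probable background.
fg = np.where((mask == cv2.GC_BGD) | (mask == cv2.GC_PR_BGD), 0, 1)
segmented = img * fg[:, :, np.newaxis].astype(img.dtype)
cv2.imwrite("segmented.png", segmented)
```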

3.1.1 Results

The result of this experiment depends on the rectangle defined in Step 3. The algorithm therefore cannot be called robust: changing the rectangle argument changes the result (Fig. 3).

Fig. 3

Segmentation results using GrabCut on a CamVid dataset image

3.2 K-map image segmentation

This algorithm is designed to get the required segment of an image using a K-map. It finds the necessary mass based on the neighborhood, after which the required section of the image is obtained from RGB values. The input images should not be high-resolution: the algorithm works quickly only on low-resolution images, easily finding all the segments based on colors and neighborhoods (Table 2). The main steps, with a library-based sketch after the list, are:

  • First, take an input image; any image can be used.

  • Apply a Gaussian filter to smooth the image.

  • Build a graph over the pixels for constructing segments.

  • Segment the graph by merging neighborhoods; in this case we used 4- or 8-connectivity, i.e., merging based on neighborhood similarity.

  • Once the segments have been created, generate an output image with all the details mapped onto the plot.

  • Return a graph with the edges of the input image.

  • To segment the graph, apply a threshold, which yields the output as a segmented image.
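This pipeline closely matches Felzenszwalb's classic graph-based segmentation, for which scikit-image provides a ready implementation; the following sketch is our illustration, with placeholder parameter values:

```python
from skimage import color, io, segmentation

# Load a (preferably low-resolution) input image.
img = io.imread("camvid_frame.png")

# Graph-based segmentation: sigma sets the Gaussian pre-smoothing,
# scale the merging threshold, min_size the smallest allowed segment.
labels = segmentation.felzenszwalb(img, scale=100, sigma=0.8, min_size=50)

# Paint each segment with its mean RGB color for visualization
# (bg_label=-1 so segment 0 is not treated as background).
out = color.label2rgb(labels, img, kind="avg", bg_label=-1)
io.imsave("segments.png", out.astype("uint8"))
```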

Table 2 CamVid dataset image and segmentation result

3.2.1 Results

This algorithm is efficient and finds all segments of the image based on colors. The only drawback we observed is that on a high-resolution image it becomes slow and introduces delay. There is also an ambiguity of classes: because no classes are defined for this algorithm, the road, which should receive a single color, is segmented into many parts, and the person in the original image should be masked as one object, as should the car beside the person. Keeping all of this in mind, we turn to the deep learning approach, where we will see how classes can be categorized and read by the machine (Fig. 4).

Fig. 4

CamVid dataset image and K-mapping segmentation result

3.3 Deep learning approaches

The algorithm we tried is based on stacked down- and up-sampling convolutional blocks and produces the semantic segments as a color map. Current state-of-the-art methods include many approaches to the semantic segmentation problem. Encoder-decoder models and, more recently, fully transformer-based network models are very popular and give promising results [80]. Modern research has applied fully transformer-based architectures, while other work adopts CNN-based semantic segmentation models; hybrid network models are also a practical approach to these problems [95].

For semi-supervised segmentation, consistency regularization is a popular topic. This method enforces consistent predictions and intermediate features. Input perturbation techniques randomly augment the input pictures and apply a consistency constraint between the predictions for the augmented images, so that the decision function lies in a low-density region. Feature perturbation techniques use multiple decoders and enforce consistency among the decoder outputs. Furthermore, the GCT technique performs network perturbation by employing two segmentation networks with the same structure but different initializations, enforcing consistency between the perturbed networks [104].
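As an illustration of the input-perturbation variant, a minimal PyTorch sketch follows; `model` and `augment` are placeholder names for a segmentation network and a photometric augmentation (photometric so the two predictions stay spatially aligned), not components defined by the cited works:

```python
import torch
import torch.nn.functional as F

def consistency_loss(model, images, augment):
    """Unsupervised consistency term: the prediction for an augmented
    view should match the (detached) prediction for the clean image."""
    with torch.no_grad():
        p_clean = F.softmax(model(images), dim=1)    # (N, C, H, W)
    p_aug = F.softmax(model(augment(images)), dim=1)
    return F.mse_loss(p_aug, p_clean)
```

In practice this term is added to the supervised cross-entropy loss computed on the labeled subset.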

Weakly Supervised Semantic Segmentation (WSSS) with image-level labels has made significant progress in generating pseudo masks and training a segmentation network [43]. Recent WSSS approaches rely on class activation maps (CAMs) to locate objects by identifying the image pixels that are useful for classification. CAMs can create helpful pseudo masks, but they highlight only the most discriminative portions of an object. A great deal of work has gone into solving this problem, employing tactics such as region erasing, region monitoring, and region growing to complete the CAMs. Other approaches develop the CAM iteratively: PSA and IRN, for example, propagate local responses across neighboring semantic entities using random walks. As stated previously, this problem stems from a mismatch between classification and segmentation. Many researchers have noticed this and are investigating ways to close the gap with extra supervision, such as CAM consistency, accumulated feature maps, cross-image semantics, sub-categories, saliency maps, and multi-level feature map requirements. These strategies are straightforward, yet they produce positive results [20].
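For reference, a CAM is obtained by weighting the final convolutional feature maps with the classifier weights of the target class; the sketch below is our generic illustration, not the code of any specific cited method:

```python
import torch

def class_activation_map(features, fc_weight, class_idx):
    """features: (C, H, W) final conv feature maps of a classifier;
    fc_weight: (num_classes, C) weights of its global-pooling classifier.
    Returns an (H, W) activation map normalized to [0, 1]."""
    cam = torch.einsum("c,chw->hw", fc_weight[class_idx], features)
    cam = torch.relu(cam)              # keep positive evidence only
    return cam / (cam.max() + 1e-8)    # normalize for thresholding
```

Thresholding such a map yields the pseudo mask which, as noted above, tends to cover only the most discriminative object parts.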

Where a context-based mechanism is desired, OCNet (Object Context Network for Scene Parsing) is a good baseline choice. Logically, it contains a ResNet-FCN and an object context module; after the classifier, it upsamples the image again to parse the scene and produce the mask [109]. It has several variations, such as Base-OC, Pyramid-OC, and ASP-OC (Atrous Spatial Pooling). Besides OCNet, there are significantly more mature network models such as RFNet or ACNet, which use asymmetric convolution blocks to strengthen the kernel structure; this also saves computation time. Moreover, SETR (Segmentation Transformer) is a recent transformer-based architecture that reaches a strong mIoU of 50.28% on the ADE20K dataset and 55.83% on Pascal Context, and also gives promising results on the Cityscapes dataset [36, 77]. Other recent transformer-based semantic segmentation models, i.e., Trans4Trans (Transformer for Transparent Object Segmentation) and SegFormer (Semantic Segmentation with Transformers), are significantly lighter architectures that provide multi-scale features [99, 114]. SegFormer minimizes the need for complex decoders: technically, it aggregates the learned features from all layers into a maximized, enriched representation. [99] also scales this basic approach and reports robust results of up to 84.0% mIoU on the Cityscapes dataset.

An omni-supervised learning framework has also been designed for efficient CNNs; it incorporates different data sources and thereby improves reliability in unseen areas. In this setting, the traditional CNN is extended with an unsupervised framework to take advantage of both labeled and unlabeled panoramas [103]. Researchers now plan to take a panoramic panoptic segmentation approach for better scene understanding.

Traditional semantic segmentation is based on RGB images, which is not a reliable way to handle complex outdoor scenarios. Polarization sensing can be adopted as an efficient approach to these issues: from the information provided by optical sensors, exact information can be obtained regardless of the materials involved [47, 98].

Another challenging aspect of semantic segmentation is 3D segmentation in computer vision, which is applied in autonomous driving, medical image analysis, and robotics. It usually relies on handcrafted features to address the shortcomings of 2D-based segmentation [31]. Several datasets and benchmarks are popular in 3D semantic segmentation, including ShapeNet, PSB, COSeg, and ScanNet [60].

Another fascinating aspect of semantic segmentation is mapping the associated pixels in night scenes, where illumination is negligible: real objects are not fully visible, and network models must look deep into the details. A re-weighting methodology has previously been adopted to cope with the resulting false predictions [94], and publicly available datasets can be used for this subject. Such scenarios are comparable to medical images, where the images are almost grey and features are not clearly visible. The dataset we used for our experiments is CamVid, which contains images of roughly 960x720 pixels. The network architecture we follow is based on the core UNET. There are 32 semantic classes, each with an RGB color defined by CamVid; the classes are categorized by the characteristics of the objects in the input. The moving objects comprise animals, pedestrians, children, rolling carts, bicyclists, motorcycles, cars, trucks, and trains. In addition, there are Road (shoulder, lane markings drivable and non-drivable), Ceiling (sky, tunnel, archway), and Fixed Objects (building, wall, tree, fence, sidewalk, parking block, pole, traffic cone, bridge, sign, text, traffic lights), respectively. An exactly labeled image is shown in Figs. 5 and 6.
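Since CamVid ships its labels as RGB color maps, training requires converting them into per-pixel class indices. A minimal sketch follows; the mapping shown is an illustrative subset of the 32 classes, whose full definition comes with the dataset's label-color file:

```python
import numpy as np

# Illustrative subset of CamVid's 32 RGB class colors.
CLASS_COLORS = {
    (128, 128, 128): 0,  # Sky
    (128, 0, 0): 1,      # Building
    (64, 0, 128): 2,     # Car
    (64, 64, 0): 3,      # Pedestrian
}

def rgb_to_class_index(label_img):
    """Convert an (H, W, 3) RGB label image into an (H, W) index map."""
    h, w, _ = label_img.shape
    index_map = np.full((h, w), 255, dtype=np.uint8)  # 255 = unlabeled
    for rgb, idx in CLASS_COLORS.items():
        mask = np.all(label_img == np.array(rgb), axis=-1)
        index_map[mask] = idx
    return index_map
```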

Fig. 5

Comparison of all state-of-the-art network architectures by class IoU and category IoU on the Cityscapes dataset

Fig. 6

Comparison of all state-of-the-art network architectures for every class on the Cityscapes dataset

3.3.1 Network architecture

This model was developed by Olaf Ronneberger et al. [66] for image segmentation on biomedical data. We tried the UNET model on the CamVid dataset, taking input images of size 960x720, all in PNG format. PNG is recommended when pre-processing data or building a dataset under particular environmental conditions and camera quality, because images in other formats are less suitable for the operations performed in a deep neural network pipeline. The model segments by first downsampling and then upsampling the image matrix through convolutional blocks [116]. With each max-pooling layer, the image goes from high to low resolution; here the relevant part of the image, i.e., the required features, is extracted. The image size decreases, but the depth, context, or receptive field grows [30]. The downsampling, however, loses the information about where each matrix value was originally located. To solve this problem, a decoding stage takes the previously downsampled information and passes it to transposed convolutional layers for upsampling. The parameters of each transposed convolution are set so that the image's height and width are doubled while the number of channels is halved. We thus recover the required dimensions together with the spatial information, which increases accuracy in return. Figure 9 shows the results of the tested algorithm on the CamVid dataset.
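A compact PyTorch sketch of this encoder-decoder idea, written as a two-level toy version of UNET rather than the exact configuration used in our experiments:

```python
import torch
import torch.nn as nn

def double_conv(in_ch, out_ch):
    # Two 3x3 convolutions with ReLU, as in the original U-Net blocks.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )

class MiniUNet(nn.Module):
    """Encoder halves the spatial size via max-pooling; the decoder
    doubles it via transposed convolution (channels halved) and
    concatenates the matching encoder features (skip connection)."""
    def __init__(self, n_classes=32):  # 32 CamVid classes
        super().__init__()
        self.enc1 = double_conv(3, 64)
        self.enc2 = double_conv(64, 128)
        self.pool = nn.MaxPool2d(2)
        self.up = nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2)
        self.dec1 = double_conv(128, 64)  # 64 skip + 64 upsampled
        self.head = nn.Conv2d(64, n_classes, kernel_size=1)

    def forward(self, x):
        e1 = self.enc1(x)              # full-resolution features
        e2 = self.enc2(self.pool(e1))  # half resolution (bottleneck)
        d1 = self.up(e2)               # back to full resolution
        d1 = self.dec1(torch.cat([d1, e1], dim=1))  # skip connection
        return self.head(d1)           # per-pixel class logits
```

The concatenation in the decoder is the skip connection that restores the spatial detail lost to max-pooling.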

UNet has also been applied to an aerial drone dataset [59] with VGG [48] as the network backbone. The drone dataset is freely available and has 23 semantic classes. Figure 7 shows the segmentation results after applying UNET to this dataset; UNet outperforms in most situations. Table 3 shows the accuracy and loss values for both datasets over 5 epochs (Figs. 8, 9, 10, 11 and 12).

Fig. 7

True labels for CamVid dataset

Fig. 8

CamVid dataset labeled image with color mapping

Fig. 9

Accuracy and loss curve for UNET on CamVid dataset

Fig. 10

Experimental results for UNET on the CamVid dataset

Fig. 11

Experimental results after applying VGG-UNET on the aerial dataset

Fig. 12

Experimental results after applying VGG-UNET on aerial dataset

Table 3 Experimental results for the CamVid and Drone datasets using VGG-UNet

3.4 Masking: Instance segmentation using object detection

Another famous and highly useful approach for datasets like VOC2012 [49], CamVid, and COCO is Mask RCNN, an extension of Faster RCNN built mainly for object detection. The experiment was done using ResNet-101 as the backbone. ResNet-101 [39] is a CNN with 101 layers. One loads a network pre-trained on ImageNet, i.e., on over a million images spanning 1000 classes, with all input images of size 224x224. A ResNet pre-trained on ImageNet can then be evaluated within this detection framework (Figs. 13, 14, 15 and 16) and validated on datasets containing similar object classes. For example, a pre-trained weight file configured for the COCO dataset can validate the COCO dataset. Moreover, when we tested it on VOC, it showed good accuracy of around 98-99% on every object class. After detecting the objects belonging to different classes, it masks each instance. It ultimately proved fruitful, with class-wise object detection on every image when testing on the CamVid dataset.
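For orientation, torchvision ships a pre-trained Mask R-CNN whose backbone is ResNet-50-FPN rather than the ResNet-101 used in our experiments, so the sketch below only approximates the same inference pipeline; the image path and score threshold are placeholders:

```python
import torch
import torchvision
from torchvision.transforms import functional as TF
from PIL import Image

# COCO-pre-trained Mask R-CNN (ResNet-50-FPN backbone).
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

img = Image.open("street_scene.jpg").convert("RGB")  # placeholder path
x = TF.to_tensor(img)

with torch.no_grad():
    pred = model([x])[0]  # dict with boxes, labels, scores, masks

# Keep instances above a confidence threshold; each comes with a box,
# a COCO class label, and a soft per-pixel mask in [0, 1].
keep = pred["scores"] > 0.7
binary_masks = pred["masks"][keep].squeeze(1) > 0.5  # (N, H, W) bool
labels = pred["labels"][keep]
```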

Fig. 13

Masking on the COCO dataset

Fig. 14

Results for Mask-RCNN applied on the COCO dataset

Fig. 15

Results for Mask-RCNN applied on VOC2012

Fig. 16

Results for Mask-RCNN applied on the CamVid dataset

4 Experimental conditions

The experiments were performed on adequate hardware. The conventional algorithms and the Mask-RCNN experiments ran on a 2.2 GHz dual-core Intel Core i7 (Turbo Boost up to 3.2 GHz) with 4 MB shared L3 cache; the UNET experiments were accelerated with NVIDIA Tesla P100 GPUs. Selecting the system or hardware is also a key aspect of customizing semantic segmentation algorithms and analyzing their performance [113]: efficiency affects segmentation results because semantic segmentation relies on non-local information. Sometimes we focus on the feature-map level rather than the image level, but spatial information has already been lost by the last feature map, so the residuals must be preserved to manage dependencies between pixels that are far apart within the image.

5 Evaluation metrics

Performance evaluation can vary from problem to problem when building a deep learning neural network. Traditional methods like KNN [81], decision trees with boosting [85], SVM [53], conditional random fields, or any statistics-based approach mainly use accuracy or precision as performance evaluation metrics. For deep learning, there are more performance metrics across classification, object detection, and semantic segmentation [89].

  • Normally, we evaluate final performance based on accuracy as mIoU (mean Intersection over Union). mIoU is obtained by comparing the ground-truth values with the output map produced by passing the image through the derived model; a minimal computation is sketched after this list.

  • The second measure is usually the time it takes to process an image through the CNN, expressed in frames per second (FPS) or, inversely, as latency. The third is the number of network parameters the derived network must learn.

  • The storage space needed to save all the network parameters, also known as network size.

  • The computational cost on the GPU in use, sometimes expressed as execution time relative to the GPU's operating frequency; the higher this value, the more efficiently the GPU works.

  • Utilization of hardware resources, mainly CPU, GPU, and memory.

  • The amount of power the system consumes.

  • The memory bandwidth used, i.e., the rate of bytes transferred to and from memory (which is sometimes shared and sometimes dedicated).

  • Training metrics also matter: which environment, IDE, and deep learning libraries are used.
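A minimal sketch of the mIoU computation over index maps, as our generic implementation of the standard confusion-matrix formula (255 is assumed here as the void label):

```python
import numpy as np

def mean_iou(pred, target, num_classes, ignore_index=255):
    """pred, target: (H, W) integer class maps. Returns mean IoU over
    the classes that actually appear."""
    valid = target != ignore_index
    pred, target = pred[valid], target[valid]
    # Confusion matrix: rows = ground truth, columns = prediction.
    cm = np.bincount(
        num_classes * target.astype(int) + pred.astype(int),
        minlength=num_classes ** 2,
    ).reshape(num_classes, num_classes)
    intersection = np.diag(cm)
    union = cm.sum(axis=0) + cm.sum(axis=1) - intersection
    iou = intersection / np.maximum(union, 1)  # avoid division by zero
    return iou[union > 0].mean()
```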

6 Performance evaluation

We experimented with both the classical algorithms and advanced deep learning algorithms and concluded that the traditional algorithms do not work well on real-world images, as they attach no meaning to individual pixels. Models such as UNET, SegNet, and DeepLabv3+ are the right choices today [51], because they deliver exact pixel information without losing the pixel values. In encoder-decoder structures such as UNET and SegNet, skip connections on the decoder side recover the information lost during the max-pooling operations on the encoder side; losing information here refers to the pixel locations discarded during feature extraction. DeepLab V3 has three further updated versions with improvements. UNet also has improved versions, namely Residual U-Net and Fully Dense UNet. UNet helps detect low-level features and was initially proposed for medical image segmentation. Residual UNet was introduced to improve the performance of the UNet architecture: a series of residual blocks is stacked together, which mitigates the degradation problem through skip connections that, as in UNet, help propagate the low-level features. Dense UNet, obtained by modifying the UNet architecture with dense blocks, reduces artifacts by allowing each layer to learn features at various spatial scales. Table 4 shows the comparative data of JPANet, composed of three different lightweight backbone networks, against other models on the CamVid test set. JPANet not only achieves 67.45% mIoU but also obtains 294 FPS on 360 × 480 low-resolution inputs. The data in Table 4 once more prove the effectiveness of the JPANet model, and Fig. 17 visualizes the comparison on the CamVid test set (see also Fig. 18).

Fig. 17

Comparison of all state-of-the-art network architectures by mIoU (in %) on the CamVid dataset

Fig. 18

Hardware architecture

Table 4 Comprehensive performance comparison on the CamVid test set

UNET is easily applied in almost every field, especially in biomedicine (for medical image datasets) and in Industry 4.0 problems such as detecting defects in hot-rolled steel strips, surfaces, or roads [79]. Mask-RCNN has also advanced in recent years, e.g., Mask Scoring RCNN; for real-time scene understanding, Mask RCNN is a good choice. Performance also matters: in some dense networks like YOLOv5 or Fully Dense UNET, the network parameters are abundant, so when selecting a network model for real-time applications, one should favor lightweight, computationally fast architectures.
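A quick way to compare the "weight" of candidate models before deployment is to count trainable parameters; a small helper sketch (any `nn.Module` can be passed in):

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    """Number of trainable parameters, a first proxy for network size
    (roughly 4 bytes each for a float32 model)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
```

Comparing this count together with measured FPS makes the lightweight-versus-accuracy trade-off discussed above concrete.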

Table 5 shows that JPANet achieved the best scores in 18 of the 19 classes, because JPANet emphasizes the importance of shallow spatial information. Its improvement on small-object samples is the most pronounced: for instance, JPANet's accuracy on the traffic signal and traffic sign is 24.6% and 19.8% above ESPNet, respectively. Besides, JPANet also pays attention to extracting multi-scale semantic information, so it improves the segmentation of large targets to a certain extent as well: its accuracy on the sidewalk and car is 1.7% and 1.2% above the state-of-the-art ERFNet, respectively (Table 5).

Table 5 Per-class experimental results on the Cityscapes dataset

7 Conclusion and future challenges

In this paper, a comprehensive overview of semantic segmentation algorithms and their grouping has been given. Several classical and machine learning algorithms were compared and examined, and by adding some parameters we calculated the efficiency of the models. Semantic segmentation is becoming a hot topic in every field, whether in industrial projects or in medicine; most of the time, it helps in examining the critical details of a particular application after careful data pre-processing. Future aspects of this research area include the following:

  • Annotation is a big challenge for datasets with very few training samples, as in surface-defect datasets, some medical diagnostics, or soft robotics, where the field is quite new and publicly available datasets are scarce.

  • Few-shot segmentation can be used to solve sparsely annotated dataset problems; the same problem can also be tackled with data augmentation techniques.

  • Small-object detection and segmentation is another essential aspect of semantic segmentation that many researchers want to solve these days. Small does not mean tiny, but such objects are not as straightforward as nearer ones; typically, aerial images and long-distance scenes are the main subjects for examining small-object classification or segmentation.

  • Weather conditions could be a significant point of discussion in urban-scene datasets for semantic segmentation.

  • Moreover, lighting effects and their control are crucial aspects that need to be addressed.

  • Computationally heavy segmentation approaches are becoming obsolete, because robust, low-computation network architectures such as Trans4Trans or SegFormer are introduced almost daily and have already begun to replace traditional encoder-decoder-based architectures.

So, to cover the whole picture in every specific field and to understand the fundamental challenges, we must clearly understand how features are extracted, whether for detecting large objects or for detecting smaller ones whose appearance varies with distance and lighting conditions.