1 Introduction

Detecting pedestrians promptly and accurately in a natural environment is a vital goal in artificial intelligence systems and an active subject in computer vision. It is also a fundamental building block in applications such as intelligent transportation systems (ITS), traffic control monitoring, visual search, models of human behaviour, pedestrian tracking, pose estimation, pedestrian detection on social networks, face detection, semantic segmentation and, recently, monitoring the social distance of pedestrians during the COVID-19 pandemic [1,2,3,4].

Pedestrian detection is commonly achieved through three main families of methods: 1) handcrafted feature-based methods [5], 2) deep learning methods, particularly Convolutional Neural Networks (CNNs) [6, 7], and 3) hybrid methods [8]. In hybrid methods, feature extraction is typically done by deep learning, while classification and localization are implemented with algorithms such as Support Vector Machine (SVM) or AdaBoost. An alternative hybrid strategy is to deploy handcrafted methods to generate proposals and deep learning methods to classify and localize pedestrians. Handcrafted feature-based methods fall into two major classes: channel feature-based methods [9] and deformable part-based methods [10]. The main challenges in pedestrian detection can be divided into four categories: occlusion, domain adaptation, scale variance, and real-time detection. Detecting small-scale and occluded pedestrians in a live stream is the most pressing issue in this field.

Pedestrian detection involves three significant steps: 1) proposal generation, 2) proposal classification, and 3) post-processing, and detection methods are defined in terms of these steps. First, proposal generation aims to produce a set of bounding boxes that are likely to contain pedestrians. The two leading approaches are Region Proposal Network (RPN) and Sliding Window (SW) algorithms. Second, proposal classification divides the generated regions into positive (pedestrian) and negative (background) classes based on the extracted features. It is noteworthy that in deep learning approaches, the first and second steps are merged into a unified architecture, so that localization and classification are obtained simultaneously. Third, in the post-processing step, redundant bounding boxes are discarded: a single pedestrian is often surrounded by several overlapping boxes, of which only the best one should be kept. The most popular approach in the post-processing step is non-maximum suppression (NMS) [11]. Producing distinctive feature maps is essential, since it makes the classification step considerably easier.
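
As an illustration of the post-processing step, the following is a minimal sketch of greedy NMS in plain Python. It is not taken from [11]; the box format (x1, y1, x2, y2), the function names, and the IoU threshold of 0.5 are assumptions made for this example only.

```python
from typing import List, Tuple

# Assumed box format for this sketch: (x1, y1, x2, y2) with x2 > x1 and y2 > y1.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: List[Box], scores: List[float], iou_threshold: float = 0.5) -> List[int]:
    """Greedy NMS: visit boxes from highest to lowest score and keep a box
    only if it does not overlap an already kept box by more than the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Two heavily overlapping boxes around one pedestrian and one separate box.
boxes = [(10, 10, 50, 100), (12, 12, 52, 102), (200, 20, 240, 110)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)
```

In this toy case the second box overlaps the first with an IoU of about 0.87 and is therefore suppressed, while the third box does not overlap either and is kept.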

Since supplementary information can benefit the pedestrian detection process, segmentation methods have lately been applied to it. Before researchers made use of CNNs in segmentation, Random Forest (RF) and Conditional Random Field (CRF) were employed in the learning process. Fundamentally, image segmentation in pedestrian detection is classified into two categories, namely semantic and instance segmentation [12]. These segmentation methods are commonly regarded as multi-task learning, owing to the use of a separate network that performs semantic segmentation alongside pedestrian detection. For example, in [3], instance segmentation has been implemented by adopting the Faster R-CNN feature map. Because semantic information about the image background is employed to detect pedestrians, these methods are more precise. At the same time, the use of semantic information must not lead to false positives (FP). Semantic segmentation methods involve a considerable amount of computation, as they contain complex detection and segmentation networks.

Generally, modern segmentation methods can be considered either proposal-based [13] or mask-based [14]. Proposal-based methods comprise a two-phase detection in which each region produces a proposal that is later segmented as a mask. In these methods, pedestrian localization and classification are more accurate. However, each proposal may contain several pedestrians, so segmentation must be conducted precisely, which is not easy to achieve. As a result, occluded and non-occluded pedestrians might not be distinguished from one another. Mask-based methods do not have this problem, and they are commonly employed to detect small-scale pedestrians or pedestrians in crowded scenes [12].

In semantic segmentation, classification is conducted by means of a superpixel approach. Different semantic segmentation methods have been proposed; however, most of them suffer from the same issues, namely downsampling and spatial invariance [15]. To solve the former issue, the atrous convolution algorithm has been proposed, while CRF has been used to extract more precise semantic information, which helps to resolve the latter issue [16]. The major problem in proposing a CNN-based semantic segmentation method is the necessity of providing pixel-wise ground truth images for the learning process, as supervised learning algorithms require pixel-level labelled images. Some algorithms therefore provide weakly-supervised semantic segmentation [17]. These methods do not depend entirely on pixel-level labels; labelling can be carried out at the image level, bounding box level, scribble level or point level. Another problem arises in real-time applications of CNN-based semantic segmentation methods. Due to their complex architecture, they involve a great deal of computation, which restricts some practical applications such as ADAS and robot sensing.

In this paper, a novel semantic segmentation approach based on convolutional neural networks is presented. The proposed structure overcomes the limitations of open datasets as well as the deficiencies of the conventional semantic segmentation networks with the class of U-Nets [18]. The proposed network provides more flexibility for deploying the down-sampled batches of images and convolutional kernels, by introducing an innovative combination of conditional batch normalization and sampling blocks followed by a new supervision strategy based on a list of new skip connections. Generally, the major contributions of this paper can be summarized as follows:

  • The proposed network can detect low-scale pedestrians as target objects.

  • Its special supervision prevents the loss of information and therefore mis-training of the network.

  • The semantically segmented pedestrians can be given to a part detection network for possible occlusion detection.

  • The implementation speed of the proposed architecture is high enough for possible real-time processing of image data, especially for surveillance and live supervision purposes.

This paper is organized as follows: in Sect. 2, a literature review of the previously proposed strategies for semantic segmentation of target objects, including pedestrians, is presented. Next, Sect. 3 contains a detailed description of the proposed method for semantic segmentation of pedestrians from popular datasets. The implementation and comparative results are presented in Sect. 4. Finally, this paper is concluded under Sect. 5.

2 Literature review

The detection of meaningful objects, such as pedestrians, with the aid of neural networks has seen notable advances. In this section, an overview of the methods and algorithms deployed for pedestrian detection is presented in chronological order.

2.1 Generic pedestrian detection

Deep neural networks, although powerful in detection, can be computationally expensive. Deep learning-based pedestrian detection algorithms originated from the region-based CNN (R-CNN) detectors. They can be categorized into two-stage detectors and single-stage detectors. In the first phase, two-stage detectors estimate proposals. In the second phase, each proposal is passed to classification and bounding box regression.

Two-stage pedestrian detection algorithms originated from the R-CNN detectors. In Fast R-CNN [6], to increase the network speed, the whole image enters the convolutional neural network at once, and a region-of-interest pooling layer is used. However, the network speed is still low due to the deployment of the selective search algorithm. Faster R-CNN [7] was designed to reduce the number of hyperparameters. Tesema et al. exploited region proposal networks (RPNs) as fundamental detectors of pedestrians [8]. Moreover, they deployed a naive classifier to refine the pedestrian detection outcome. Zhang et al. presented an anchor region proposal network that detects different parts of the human body as well as heads, and endeavors to integrate them to attain higher accuracy. They also utilized NMS as post-processing to improve the detection results [19].

A significant problem with two-stage detectors is their slow pace, which inspired researchers to explore single-stage detectors. These detectors predict object bounding boxes and object classes directly and do not need intermediate object proposals. YOLO [20] and SSD [21], as examples of single-stage detectors, have high operation speed, though accuracy is their weak point. Later, RetinaNet [22] employed a novel object detection loss function named Focal Loss to deal with the data imbalance between the background (no object) and the other classes. Despite being a single-stage detector, RetinaNet is more accurate than Faster R-CNN and is comparable in speed to other single-stage detectors. Wei-Yen Hsu et al. proposed the ratio-and-scale-aware YOLO method, which is based on YOLOv3 but improves on it considerably. They proposed a new feature map that transforms each positive instance into a feature vector to encode both density and diversity information simultaneously [20]. Besides, the occlusion-sensitive hard example mining method and occlusion-sensitive loss were designed by Jin Xie et al. [23]. Their methods mine hard instances according to the occlusion level and allocate higher weights to the detection errors occurring at heavily occluded pedestrians. Additionally, Yi Tang et al. designed the first architecture that enhanced pedestrian detection performance with a state-of-the-art framework that not only augmented pedestrian information automatically but also investigated the loss function policy [24]. Recently, Glenn Jocher et al. proposed the most efficient version of the YOLO algorithm, named YOLOv5 [25]. This version can detect target objects faster and with higher accuracy than the previously proposed versions. It deploys a genetic algorithm for finding the best anchors and uses mosaic augmentation for improving the accuracy of the training procedure.

2.2 Pedestrian detection based on semantic segmentation

Semantic segmentation methods include U-Net [18], EncNet [26], Gated shape CNN [27], and DeepLab [16], among others. To support multi-scale detection and enlarge the receptive field, methods such as DeepLab V2 and V3 [28] have been proposed. Other methods, such as PSPNet [29], utilize global context information to improve the segmentation process. In addition, some methods such as ParseNet [30] have employed large-scale kernels for convolution and designed a network including boundary refinement.

In research by Alavianmehr et al. [31], a new combined region and semantic segmentation CNN approach for accurate pedestrian detection and localization in static images is designed. The primary process of CNN-based methods includes two steps: proposal extraction and CNN classification. The proposed framework is a mixture of modern CNNs such as YOLO and semantic segmentation networks like Fully Convolutional Networks (FCNs) [15], particularly those with a structure similar to that of U-Nets. Huazhen Chu et al. introduced an effective segmentation method named Part Mask R-CNN, which they applied to every body part of the pedestrian to model the different body parts and to produce part annotations from the existing database annotations [32]. Qiming Li et al. designed a new efficient anchor-free network based on Conditional Random Fields (CRFs) for multi-scale pedestrian detection [33]. To address the partial occlusion and scale problems in pedestrian detection, Peiyu Yang et al. developed an effective Fully Convolutional Network (FCN) [34].

Successful training of FCNs requires thousands of annotated training images, and augmentation alone cannot provide reliable training for FCNs, especially in the case of special imaging modalities. To address these issues, a U-shaped architecture named U-Net was proposed. U-Net supplies a symmetric structure for semantic segmentation. This structure has its deficiencies. For instance, it cannot detect multi-scale target objects very accurately, especially in the case of low image contrast. Another deficiency of U-Net is that the greater its contracting depth, the higher its complexity. To overcome these deficiencies and add flexibility and scalability to the U-Net structure, networks such as U-Net++ [35] and U-Net3+ [36] have been proposed. These two networks keep the original framework of U-Net and add some novelties to its structure that compensate for the shortcomings of the conventional U-Net architecture: for example, full- and multi-scale supervision, which is essential for multi-scale semantic segmentation, and embedded skip connections that provide integration for the connected components of the detected objects.

In addition, the advantages provided by state-of-the-art networks like U-Net3+ should be made compatible with the application of pedestrian detection. The method proposed in this paper presents a new structure that provides multi-scale pedestrian detection based on the combination of multi- and full-scale supervision. Moreover, the proposed structure can detect low-scale pedestrians, which opens a path toward occlusion detection.

3 Proposed method

In this section, the proposed framework for detecting pedestrians from popular datasets is explained. The proposed network has a novel structure so that it can be adapted to both online and real-time applications. For clarity, the description of the method is divided into two parts. First, we examine the proposed pre-processing stage, which supports feature extraction for fine-grained classification. Subsequently, the distinct parts of the proposed semantic segmentation architecture are explained in detail.

3.1 Pre-processing: image augmentation

As we propose a new structure for semantic segmentation of pedestrians, we should provide sufficient training and validation data to obtain an acceptably trained network with a high volume of trainable parameters, even in the case of pre-trained backbones. One of the strategies for providing enough training images is augmentation. Image augmentation overcomes the under-fitting issue caused by the low volume of input data, so that all the trainable parameters can converge during training. For this purpose, we deployed a list of common augmentation filters such as rotation, mirroring, and reflection. Figure 1 illustrates the deployed augmentation for a sample traffic image.

Fig. 1

Applying sample augmentation to a sample image: a Original image, b Mirroring = Vertical (image x-axis), c Rotation = 180°, and d Rotation = 45°
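
The mirroring and 180° rotation shown in Fig. 1 can be sketched in a few lines of plain Python. This is only an illustrative sketch: the nested-list image representation and the function names are assumptions, and the 45° rotation is omitted because it requires interpolation. For semantic segmentation, the same geometric transform presumably has to be applied to the image and to its pixel-wise annotation so that they stay aligned.

```python
from typing import Callable, List, Tuple

# Assumed representation for this sketch: a 2-D grid of pixel values or labels.
Image = List[List[int]]


def flip_vertical(img: Image) -> Image:
    """Mirror about the image x-axis: the order of the rows is reversed."""
    return [row[:] for row in img[::-1]]


def flip_horizontal(img: Image) -> Image:
    """Mirror about the image y-axis: every row is reversed."""
    return [row[::-1] for row in img]


def rotate_180(img: Image) -> Image:
    """Rotate by 180 degrees: reverse the rows and every row."""
    return [row[::-1] for row in img[::-1]]


def augment_pair(image: Image, mask: Image) -> List[Tuple[Image, Image]]:
    """Apply each transform to the image and to its annotation mask alike."""
    transforms: List[Callable[[Image], Image]] = [
        flip_vertical,
        flip_horizontal,
        rotate_180,
    ]
    return [(t(image), t(mask)) for t in transforms]


image = [[1, 2], [3, 4]]
mask = [[0, 1], [0, 0]]
for aug_image, aug_mask in augment_pair(image, mask):
    print(aug_image, aug_mask)
```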

3.2 The proposed network

In this section, we introduce the proposed CNN with its novel structure, which contains the feature extraction and selection procedures alongside semantic segmentation. Semantic segmentation is the process of assigning each pixel to a particular label. It does not differentiate between separate instances of the same object. Instance segmentation, on the other hand, differs from semantic segmentation in that it labels every instance of a particular object in the image separately.

Figure 2 depicts the general block diagram of a pedestrian detection system based on the semantic segmentation approach. The application of machine learning (ML) and deep learning techniques in the segmentation of images has grown throughout the years.

Fig. 2

General block diagram of a sample automatic object detection system

In the context of this paper, the semantic segmentation approach is deployed for the detection of pedestrians. Pedestrian detection targets advanced driver assistance systems (ADAS), and it has a prominent role in traffic surveillance strategies. The common structure for semantic segmentation purposes is the fully convolutional network (FCN). More recently, U-shaped networks, named U-Nets, have also been deployed for semantic segmentation. These networks are mainly used for semantic segmentation of biomedical objects, because the focus of their application is on the extraction of detailed feature maps of the target objects. Such U-Net architectures are designed for extracting multi-scale target objects. However, the architecture of networks like U-Net++ and U-Net3+ has some deficiencies, especially for detecting real-world objects like pedestrians, so we propose a new architecture that is able to detect pedestrians in the relevant dataset images with more accuracy and less complexity.

Our proposed network can extract the fine features associated with pedestrians as target objects in a semantic segmentation manner. The node structures of different U-Nets and of our proposed network are illustrated in Fig. 3.

Fig. 3

a Comparative illustrations of sample structures of the cutting-edge U-Nets [35], and b illustration of the node structure of the proposed network, named Butterfly Network (BF-Net) because of its resemblance to a butterfly

As shown in Fig. 3, the proposed BF-Net does not have the skip-connection complexity of U-Net++ and U-Net3+. The most important advantage of the proposed network over its competitors is that it has more flexibility to segment multi-scale target objects due to its specific supervision. The full details of the proposed supervision strategy and its advantages are described in the next subsections.

3.3 Specific node structure of the proposed BF-Net

Because the node structure of the proposed network resembles a butterfly, we call it the Butterfly Network, abbreviated as BF-Net. The proposed BF-Net comprises four types of nodes. Two of these, named \({\chi }_{Co}^{i}\) and \({\chi }_{Ex}^{i}\), are like the contracting (encoding) and expanding (decoding) nodes in conventional U-Nets. The new nodes are \({\chi }_{Se}^{i}\) and \({\chi }_{Sc}^{i}\), which are sub-expanding and sub-contracting nodes.

The sub-expanding nodes \({\chi }_{Se}^{i}\), where i indexes the ith downsampling (contracting) layer along the encoding direction, N stands for the number of main contracting/expanding nodes, and l is the number of sub-contracting/sub-expanding nodes, are defined by the following formula:

$$\chi_{Se}^{i} = \left\{ \begin{array}{ll} \chi_{Co}^{N}, & i = l \\ K\left(\left[\underbrace{C\left(D\left(\chi_{Co}^{j}\right)\right)_{j=1}^{N-l+i-1},\; C\left(\chi_{Co}^{N-l+i}\right)}_{scales:\,1^{st} \sim (N-l+i)^{th}},\; \underbrace{C\left(U\left(\chi_{Se}^{j}\right)\right)_{j=i+1}^{l}}_{scales:\,(N-l+i+1)^{th} \sim (l)^{th}}\right]\right), & i = 1, \ldots, l-1 \end{array} \right.$$
(1)

where the function C(.) stands for the convolution operation, and the function K(.) represents a convolution layer followed by a conditional batch normalization (CBN) and a ReLU activation. The function D(.) stands for a max-pooling layer with a pooling size of \(2^{(N-l+j-1)}\), and U(.) represents a bilinear up-sampling layer with a rate of \(2^{(l-j-1)}\). The operand [.] represents channel-dimension splicing and fusion. In addition to the sub-expanding nodes \({\chi }_{Se}^{i}\), the \({\chi }_{Sc}^{i}\) nodes are defined as follows:

$$\chi _{{Sc}}^{i} = \left\{ {\begin{array}{*{20}c} {\chi _{{Se}}^{1} ,} & {\,i = 1} \\ {H(\chi _{{Sc}}^{{i - 1}} )\;,\;} & {i = 2, \ldots ,l} \\ \end{array} } \right.$$
(2)

where H(.) applies the same contracting procedure as the one used on the main contracting path. Finally, the feature map aggregation for the \({\chi }_{Ex}^{i}\) nodes is done based on the following equation:

$$\chi _{{Ex}}^{i} = \left\{ {\begin{array}{*{20}c} {\chi _{{Sc}}^{l} ,} & {i = N} \\ {K\left( {\left[ {\underbrace {{C\left( {D\left( {\chi _{{Co}}^{j} } \right)} \right)_{{j = 1}}^{{N - l}} ,C\left( {D\left( {\chi _{{Sc}}^{j} } \right)} \right)_{{j = 1}}^{{i + l - N - 1}} ,~C\left( {\chi _{{Sc}}^{{i + l - N}} } \right)}}_{{scales:1^{{st}} \sim \left( {i + l - N} \right)^{{th}} }},\underbrace {{C\left( {U\left( {\chi _{{Ex}}^{j} } \right)} \right)_{{j = i}}^{{i - l}} }}_{{scales:\left( {i + l - N + 1} \right)^{{th}} \sim N^{{th}} }}} \right]} \right),} & {i = 1, \ldots N - 1} \\ \end{array} } \right.$$
(3)

Given the definitions presented for the different nodes of the proposed BF-Net, its schematic structure can be illustrated as shown in Fig. 4 for an initial convolutional kernel size of 16, as mentioned in the node structure.

Fig. 4

Structure of the proposed BF-Net with N = 4 direct down/up sampling layers and 2 sub-up/sub-down sampling ones for the semantic segmentation application. The striped ellipsoids and arrows resemble the node structure of the proposed BF-Net shown in Fig. 3b. All the skip connections

3.4 Proposed Multi-scale supervision

Deep supervision was introduced in U-Net++, and full-scale deep supervision was proposed in U-Net3+, which adds some improvements to the final supervision results. However, full-scale deep supervision faces a serious issue.

Because the supervision for the last downsampling layer in the structure of U-Net3+ must be done with the down-sampled ground truth image, small-size or low-scale target objects may completely disappear due to downsampling. In that case, the supervision may lead to undesirable training results. This issue becomes worse when multi-scale target objects are in very close contact with each other, so that full-scale supervision may not take place efficiently. In this paper, a new approach toward conducting modified full-scale deep supervision is proposed.
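
The disappearance of low-scale targets under ground truth downsampling can be reproduced with a toy example. The sketch below is illustrative only: the source does not state how the ground truth is down-sampled, so strided nearest-neighbour sampling, the 16 × 16 mask, and the 2 × 2 pedestrian region are all assumptions.

```python
from typing import Dict, List

Mask = List[List[int]]


def make_mask(height: int, width: int, top: int, left: int,
              box_h: int, box_w: int) -> Mask:
    """Binary ground truth with one rectangular foreground region."""
    mask = [[0] * width for _ in range(height)]
    for r in range(top, top + box_h):
        for c in range(left, left + box_w):
            mask[r][c] = 1
    return mask


def downsample_nearest(mask: Mask, factor: int) -> Mask:
    """Assumed scheme: keep every `factor`-th pixel in both directions."""
    return [row[::factor] for row in mask[::factor]]


def foreground_pixels(mask: Mask) -> int:
    return sum(sum(row) for row in mask)


# A 2 x 2 pixel "pedestrian" inside a 16 x 16 annotation.
mask = make_mask(16, 16, top=5, left=9, box_h=2, box_w=2)
surviving: Dict[int, int] = {
    factor: foreground_pixels(downsample_nearest(mask, factor))
    for factor in (1, 2, 4, 8)
}
print(surviving)
```

In this toy case the target still occupies one pixel at a downsampling factor of 2 but vanishes entirely at factors 4 and 8, so a loss computed at those scales would receive no positive supervision for it.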

For this purpose, we apply a bilinear upsampling followed by a 2D convolution with an appropriate kernel to \({\chi }_{Sc}^{max}\), so that it can be concatenated with the matching \({\chi }_{Ex}^{i}\) in the main expanding path. Subsequently, the concatenation of the blocks after resizing is applied to the outputs of all \({\chi }_{Ex}^{i}\) nodes plus the \({\chi }_{Sc}^{N}\) and \({\chi }_{Se}^{l}\) nodes. Simultaneously, the outputs from each main expanding path are passed through appropriate upsampling and convolutional kernels, so that they can be concatenated with each other as well as with the up-sampled sub-expanding \({\chi }_{Sc}^{max}\). Equation (4) gives the detailed formulas for the proposed supervisions (Sup.) for two different configurations of BF-Net:

$$Sup. = \left\{ {\begin{array}{*{20}l} {K\left( {\left[ {C_{3} \left( {U_{3} \left( {\left[ {C_{2} \left( {U_{2} \left( {\left[ {C_{1} (U_{1} (\chi _{{{\text{Ex}}}}^{4} )),\chi _{{{\text{Ex}}}}^{3} } \right]} \right)} \right),\chi _{{{\text{Ex}}}}^{2} ,\chi _{{{\text{Sc}}}}^{1} } \right]} \right)} \right),\chi _{{{\text{Ex}}}}^{1} } \right]} \right),} & {{\text{for}}\;{\text{BF}} - {\text{Net}}_{{(4 - 2)}} } \\ {K\left( {\left[ {C_{4} \left( {U_{4} \left( {\left[ {C_{3} \left( {U_{3} \left( {\left[ {C_{2} \left( {U_{2} \left( {\left[ {C_{1} (U_{1} (\chi _{{Ex}}^{5} )),\chi _{{Ex}}^{4} ,\chi _{{{\text{Sc}}}}^{1} } \right]} \right)} \right),\chi _{{Ex}}^{3} } \right]} \right)} \right),\chi _{{{\text{Ex}}}}^{2} } \right]} \right)} \right),\chi _{{{\text{Ex}}}}^{1} } \right]} \right),} & {{\text{for}}\;{\text{BF}} - {\text{Net}}_{{(5 - 1)}} } \\ \end{array} } \right.$$
(4)

An important note about the sample supervisions in Eq. (4) is that the computational complexity of the proposed supervision grows with the number of main paths in BF-Nets. In other words, the higher the number of main paths, the more complicated the computation of the proposed supervision. According to Eq. (4), whenever the number of contracting layers increases, for example from 3 to 4 phases, and so does the number of expanding layers, the computational complexity increases as well. The number of main expanding or main contracting layers might change depending on the dataset and the real-time requirements of the application. Consequently, the proposed BF-Net is simple, flexible, expandable, and powerful.

3.5 Hybrid loss function

After concatenating all the mentioned outputs, a block containing a dropout layer followed by a 1 × 1 convolution, with the number of kernels equal to the number of classes in the given dataset, an adaptive max-pooling, and a suitable activation function such as sigmoid is applied to the concatenated output. A hybrid loss is then calculated, defined as:

$$L\left(G, P\right)=-\frac{1}{N}\sum_{c=1}^{C}\sum_{n=1}^{N}\left({g}_{n,c}\log {p}_{n,c}+\frac{2{g}_{n,c}{p}_{n,c}}{{g}_{n,c}^{2}+{p}_{n,c}^{2}}\right)$$
(5)

where \({g}_{n,c}\in G\) and \({p}_{n,c}\in P\) stand for the ground truth label and the predicted value for class c at the nth pixel of each batch of images, respectively, C is the number of classes, and N is the number of pixels inside each batch. This equation is defined only for the images of a batch, so the total loss function for the proposed BF-Net is defined as:

$${L}_{total}={\sum }_{j=1}^{e}{\omega }_{j}\times L(G,{P}^{j})$$
(6)

where e stands for the number of main expanding path blocks plus the first sub-contracting block (for instance, according to Fig. 5, e is equal to 5), and \({\omega }_{j}\) denotes the weight assigned to the jth loss term calculated with Eq. (5).
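
To make Eqs. (5) and (6) concrete, the following plain-Python sketch evaluates them literally as written. It is not the authors' implementation: the data layout (one list of per-class values per pixel), the small constant EPS that guards the logarithm and the division, and the example weights are assumptions introduced for illustration.

```python
import math
from typing import List, Sequence

# Assumption: small constant to avoid log(0) and 0/0; it is not part of Eq. (5).
EPS = 1e-7

# Assumed layout: values[n][c] is the value for pixel n and class c.
PixelClassGrid = Sequence[Sequence[float]]


def hybrid_loss(g: PixelClassGrid, p: PixelClassGrid) -> float:
    """Eq. (5): negative mean over pixels of a log term plus a Dice-like term."""
    n_pixels = len(g)
    total = 0.0
    for g_n, p_n in zip(g, p):
        for g_nc, p_nc in zip(g_n, p_n):
            log_term = g_nc * math.log(p_nc + EPS)
            dice_term = 2.0 * g_nc * p_nc / (g_nc ** 2 + p_nc ** 2 + EPS)
            total += log_term + dice_term
    return -total / n_pixels


def total_loss(g: PixelClassGrid, predictions: List[PixelClassGrid],
               weights: Sequence[float]) -> float:
    """Eq. (6): weighted sum of the hybrid loss over the e supervised outputs."""
    return sum(w * hybrid_loss(g, p_j) for w, p_j in zip(weights, predictions))


# Two pixels, one class: the first pixel is a pedestrian, the second is not.
g = [[1.0], [0.0]]
good = [[1.0], [0.0]]
bad = [[0.1], [0.9]]
print(hybrid_loss(g, good), hybrid_loss(g, bad))
print(total_loss(g, [good, bad], weights=[0.5, 0.5]))
```

Under these assumptions, a prediction that matches the ground truth yields a lower loss than a poor one, and the weights \({\omega }_{j}\) simply control how much each supervised output contributes.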

Fig. 5

General overview of a conditional batch normalization (CBN) block for embedding in BF-Net

To accomplish the full supervision, the concatenated output of the resized outputs of the mentioned nodes must be passed through a block comprising a dropout layer, a 1 × 1 convolution, an adaptive max-pooling, and finally a sigmoid function before the loss function is calculated.

3.6 Deployment of conditional batch normalization (CBN)

Another contribution of the proposed BF-Net is the deployment of conditional batch normalization (CBN) instead of conventional batch normalization (BN). The basic idea for embedding batch normalization layers after convolutional layers is to provide faster convergence and keep heterogeneity of the processed data in each convolutional layer to prevent the optimization failures.

The conventional BN, \(BN\left({G}_{i,c,w,h}|{\gamma }_{c},{\beta }_{c}\right)\), has two parameters, \({\gamma }_{c}\) and \({\beta }_{c}\), that are learned during the training procedure [37]. A BN reduces the internal covariate shift by normalizing the feature maps belonging to each input mini-batch. However, it is very difficult to initialize a network so that it is insensitive to the initial values of these two parameters. Accordingly, researchers suggested conditional batch normalization (CBN), which estimates two change parameters, \(\Delta {\gamma }_{c}\) and \(\Delta {\beta }_{c}\), on top of the fixed primary values, so that the target neural network is initialized to produce outputs with a mean equal to zero and a very small variance [38]. In that case, the definition of conditional batch normalization is as follows:

$$CBN\left({G}_{i,c,w,h}|{\widehat{\gamma }}_{c},{\widehat{\beta }}_{c}\right)={\widehat{\gamma }}_{c}\frac{{G}_{i,c,w,h}-{\mathbb{E}}_{B}[{G}_{.,c,.,.}]}{\sqrt{ {\sigma }_{B}^{2}\left[{G}_{.,c,.,.}\right] + \epsilon }}+{\widehat{\beta }}_{c}$$
(7)

where \({\left\{{G}_{i,.,.,.}\right\}}_{i=1}^{N}\) stands for the N samples of a mini-batch B, and \({G}_{i,c,w,h}\) is the value of the \({c}^{th}\) feature map of the \({i}^{th}\) sample at location (\(w,h\)). \({\mathbb{E}}_{B}\) and \({\sigma }_{B}^{2}\) denote the mean and variance over the mini-batch, and \(\epsilon\) is a small fixed constant that acts as a stabilizing and regulating coefficient. \({\widehat{\gamma }}_{c}\) and \({\widehat{\beta }}_{c}\) are defined as follows:

$${\widehat{\gamma }}_{c}= {\gamma }_{c}+\Delta {\gamma }_{c}$$
(8)
$${\widehat{\beta }}_{c}={\beta }_{c}+\Delta {\beta }_{c}$$
(9)

where the \(\Delta (.)\) terms are the outputs of latent multilayer perceptron (MLP) layers. Accordingly, the CBN is able to tune the individual feature maps based on different inputs, which helps to boost the generalization ability of the network on heterogeneous data [39].
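
Equations (7) to (9) can be sketched with NumPy as follows. This is a minimal illustration rather than the authors' implementation: the tensor layout (samples, channels, width, height), the per-sample shape of the \(\Delta\) terms, and the function name are assumptions, and the \(\Delta\) values are passed in directly instead of being produced by an MLP.

```python
import numpy as np


def conditional_batch_norm(G, gamma, beta, delta_gamma, delta_beta, eps=1e-5):
    """Sketch of Eqs. (7)-(9).

    G:            feature maps, assumed shape (N, C, W, H)
    gamma, beta:  base scale and shift per channel, shape (C,)
    delta_gamma,
    delta_beta:   conditioning offsets per sample and channel, shape (N, C);
                  here they stand in for the MLP outputs (an assumption)
    """
    # Mini-batch statistics per channel, as in conventional batch normalization.
    mean = G.mean(axis=(0, 2, 3), keepdims=True)
    var = G.var(axis=(0, 2, 3), keepdims=True)
    normalized = (G - mean) / np.sqrt(var + eps)

    gamma_hat = gamma[None, :] + delta_gamma  # Eq. (8)
    beta_hat = beta[None, :] + delta_beta     # Eq. (9)

    # Broadcast the per-sample, per-channel parameters over spatial positions.
    return gamma_hat[:, :, None, None] * normalized + beta_hat[:, :, None, None]


rng = np.random.default_rng(0)
G = rng.normal(loc=3.0, scale=2.0, size=(4, 3, 5, 5))
no_change = np.zeros((4, 3))
out = conditional_batch_norm(G, np.ones(3), np.zeros(3), no_change, no_change)
print(out.mean(axis=(0, 2, 3)), out.var(axis=(0, 2, 3)))
```

With all \(\Delta\) terms set to zero, the sketch reduces to conventional batch normalization; non-zero offsets shift or rescale individual feature maps depending on the conditioning input.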

The overall view of CBN block inside the structure of a sample BF-Net is illustrated in Fig. 5.

CBNs are more likely to be embedded inside a residual building block of each contracting/expanding path of a BF-Net. Having described the full details of the proposed network, we next evaluate the performance of the proposed method in comparison with other cutting-edge U-Nets.

3.7 BF-net training

To implement the proposed BF-Net, we train it according to all the points mentioned in the previous subsections. For instance, we follow the node structures for the down- and up-sampling paths as introduced in Sect. 3.3. Moreover, we use the proposed multi-scale supervision alongside the hybrid loss function for the training procedure. The specific structure of the CBNs also plays an important role in the appropriate training and validation of the network. The optimization algorithm selected for training BF-Nets is Adam, since it combines the best properties of the AdaGrad and RMSProp algorithms. We set the number of epochs, the initial learning rate, and the patience term for the early stopping of the training procedure to 100, \(10^{-5}\), and 20, respectively.
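
The interaction between the epoch budget and the patience term can be sketched as follows. Only the three numerical values come from the text above; the choice of the validation loss as the monitored quantity, the function name, and the synthetic loss curve are assumptions made for illustration, and the learning rate is listed only as the value that would be handed to the Adam optimizer in an actual training loop.

```python
from typing import Sequence

MAX_EPOCHS = 100    # number of epochs stated in the text
INITIAL_LR = 1e-5   # initial learning rate stated in the text (for Adam)
PATIENCE = 20       # early-stopping patience stated in the text


def epochs_run(val_losses: Sequence[float],
               max_epochs: int = MAX_EPOCHS,
               patience: int = PATIENCE) -> int:
    """Return how many epochs are executed before training stops.

    Assumption: training stops once the monitored validation loss has not
    improved for `patience` consecutive epochs, or after `max_epochs`.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return min(len(val_losses), max_epochs)


# Synthetic curve: the loss improves for 10 epochs and then stagnates.
stagnating = [1.0 - 0.05 * i for i in range(10)] + [0.6] * 90
print(epochs_run(stagnating))
```

For this synthetic curve, training halts 20 epochs after the last improvement instead of using the full budget of 100 epochs.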

4 Implementation and comparative results

In this section, the performance evaluation of the proposed network is presented. In addition, the efficiency of the proposed network for pedestrian detection is compared with other state-of-the-art U-Nets. Accordingly, this section comprises two subsections. In the first subsection, the validation datasets for semantic segmentation of pedestrian images are introduced. The next subsection is devoted to the qualitative and quantitative comparative results.

4.1 Pedestrian datasets

In this paper, two datasets containing images of pedestrians and their pixel-wise annotations as ground truths are used for assessing the pedestrian detection performance of the proposed network and comparing it with the performance of other semantic segmentation networks. One of the deployed datasets is Cityscapes [40, 41].

This dataset comprises of a large and diverse set of stereo video sequences recorded in street scenes from 50 different cities, with high quality pixel-level annotations of 5000 frames in addition to a larger set of weakly annotated frames. The other dataset including pixel-wised annotation of pedestrians is PennFudanPad [42]. The brief introduction to all the deployed datasets for reporting the implementation results of this paper are shown in Table 1.

Table 1 Datasets for evaluating and comparing the performance of the pedestrian detection implemented by the proposed method and the other state-of-the-art networks

4.2 Qualitative and quantitative comparison results

Under this section, the qualitative and quantitative comparisons between the pedestrian detection outcomes of the proposed BF-Net and the state-of-the-art networks such as ResNet-50 & -101-based U-Nets are presented.

To compare the proposed network clearly with other state-of-the-art networks such as U-Net3+, we apply both BF-Net and U-Net3+ to the same image from the introduced datasets.

Figure 6 exhibits both qualitative and quantitative results for the semantic segmentation of pedestrians conducted by the ResNet-50-based U-Net3+ (left) and the proposed ResNet-50-based BF-Net (right) on a sample Cityscapes image.

It is worth mentioning that the true positive (TP), false negative (FN), false positive (FP), and true negative (TN) counts are calculated by applying a threshold of 0.5 to the output of the SoftMax layer.
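The counting step can be sketched as follows. This is an illustrative example rather than the authors' implementation: the function names are hypothetical, the inputs are flattened lists of per-pixel pedestrian probabilities and binary ground-truth labels, and we assume a pixel counts as "pedestrian" when its probability is greater than or equal to the threshold.

```python
def confusion_counts(probs, truth, threshold=0.5):
    """Count TP, FP, FN, TN for per-pixel probabilities and binary labels.

    Assumption: a pixel is predicted as pedestrian when prob >= threshold.
    """
    tp = fp = fn = tn = 0
    for p, g in zip(probs, truth):
        predicted = p >= threshold
        if predicted and g:
            tp += 1
        elif predicted and not g:
            fp += 1
        elif g:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall_dice(tp, fp, fn):
    """Derive precision, recall, and Dice (F1) from the counts.

    Returning 0.0 for an empty denominator is a convention chosen here.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = 2 * tp + fp + fn
    dice = 2 * tp / denom if denom else 0.0
    return precision, recall, dice
```

Because precision is TP / (TP + FP), lowering FP at a fixed TP raises precision, and the Dice score rises with it.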

As can be inferred from the results shown in Fig. 6, the proposed BF-Net reduces the number of false positives (FPs), which increases the precision of pedestrian detection and, subsequently, its Dice score.

Fig. 6 Qualitative comparison of the ResNet-50-based U-Net3+ (left) and the proposed BF-Net (right) on a sample image from the Cityscapes dataset. Cyan areas: true positive (TP); Yellow areas: false negative (FN); Purple areas: false positive (FP)

Further qualitative results for the semantic segmentation of pedestrians by the proposed BF-Net and the other competitive U-Nets are depicted in Fig. 7. The results shown in this figure demonstrate the strong ability of the proposed network to detect very small-scale pedestrians compared with the other state-of-the-art U-Nets. This ability is due to the proposed supervision strategy described in the previous section. It is noteworthy that substituting the conventional BN blocks with CBN blocks also helps improve the segmentation results, especially for distracting objects such as riders and manikins.

Fig. 7 Sample qualitative comparison between the results of detecting very small-scale pedestrians with different semantic segmentation networks: (a) original image from the Cityscapes dataset at its original size, (b) its associated binary ground truth, and the pedestrian detection results for (c) U-Net, (d) U-Net++, (e) U-Net3+, and (f) BF-Net. The images containing segmentation results are cropped and resized to show the results more clearly. Cyan areas: true positive (TP); Yellow areas: false negative (FN); Purple areas: false positive (FP)

Further quantitative comparison results are illustrated in Fig. 8. The implementation results are reported from two aspects. The first aspect is the precision-recall curves, which are reported for the proposed BF-Net and the state-of-the-art U-Nets built on the ResNet-50 and ResNet-101 backbones.

Fig. 8 Comparison of BF-Net and the three other U-Nets implemented with two different ResNet-based backbones: (a) precision-recall curves with highlighted AP values for images of the Cityscapes dataset, and (b) precision-recall curves with highlighted AP values for images of the PennFudanPed dataset

As the curves in Fig. 8 show, the BF-Net achieves higher performance than the conventional U-Net, U-Net++, and U-Net3+ for both introduced backbones.
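For readers less familiar with how such curves and their AP values are obtained, the following sketch shows one common construction. It is an assumption-laden illustration, not the evaluation code used in this paper: the function names are hypothetical, ties between scores are not treated specially, at least one positive label is assumed, and AP is computed as the step-wise area under the precision-recall curve (other AP definitions exist).

```python
def pr_curve(scores, labels):
    """Return (recall, precision) points by lowering the threshold one score at a time.

    `scores` are confidence values and `labels` are binary ground truths.
    Assumes at least one positive label.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    total_pos = sum(1 for g in labels if g)
    tp = fp = 0
    points = []
    for i in order:
        if labels[i]:
            tp += 1
        else:
            fp += 1
        points.append((tp / total_pos, tp / (tp + fp)))
    return points


def average_precision(points):
    """Step-wise area under the curve: sum of precision times recall increment."""
    ap, prev_recall = 0.0, 0.0
    for recall, precision in points:
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap
```

A model that ranks every pedestrian pixel above every background pixel reaches an AP of 1.0 under this definition, and each misranked background pixel lowers the precision at the recall levels that follow it.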

After comparing the performance of the proposed BF-Net with the state-of-the-art U-Net architectures, it is important to compare it with two other state-of-the-art networks, i.e., Mask R-CNN and DeepLabv3+, using the same ResNet-50 backbone.

It is worth mentioning that we selected ResNets because, in most of the other methods and references, these backbones lead to the best results with an acceptable level of complexity. Therefore, we based all the implementations and simulations associated with our proposed method on the ResNet-50 and ResNet-101 backbones. Accordingly, a fair comparison between the efficiency of our proposed method and that of the other state-of-the-art methods can take place [43].

The precision-recall curves for pedestrian detection by BF-Net, Mask R-CNN, and DeepLabv3+, applied to the images of both the Cityscapes and PennFudanPed datasets, are illustrated in Fig. 9.

Fig. 9 Comparison between BF-Net, Mask R-CNN, and DeepLabv3+ based on ResNet-50: (a) precision-recall curves with highlighted AP values for the images of the Cityscapes dataset, and (b) precision-recall curves with highlighted AP values for the images of the PennFudanPed dataset

The other aspect of the quantitative comparison is the inference time plots shown in Fig. 10. In these plots, the inference time, the complexity, and the F1-score (Dice score) are illustrated simultaneously for all the compared networks. It can be observed that both BF-Nets achieve higher performance with a similar level of complexity (the size of the depicted circles) and an inference speed slightly slower than the other networks with similar backbones.

Table 2 compares U-Net, U-Net++, U-Net3+, and the proposed BF-Net with the ResNet-50 and ResNet-101 backbones in terms of segmentation results measured by the Dice coefficient and IoU (mean ± std) for the images in the Cityscapes dataset. For each evaluation, we investigate all possible confidence thresholds to report the best Jaccard Index (J.I.) score. A larger Jaccard Index represents better performance. The J.I. mainly assesses the degree of overlap between the predicted set P and the ground truth label set G, as shown in Eq. (10).

Fig. 10 Inference time, complexity (based on the number of parameters), and F1-score of the proposed BF-Net and the comparative U-Nets for: (a) the Cityscapes dataset, and (b) the PennFudanPed dataset. The inference time is measured as the time taken to process test images from the Cityscapes dataset on a single NVIDIA GeForce RTX 3080 GPU with 8 GB of dedicated memory

Table 2 Comparison of pedestrian detection performance for test images in Cityscapes dataset between the competitive state-of-the-art U-Nets and:

$$J.I.=\frac{|\mathrm{P}\cap \mathrm{G}|}{|\mathrm{P}\cup \mathrm{G}|}$$
(10)

The F1-measure is the same as the Dice coefficient:

$$\mathrm{Dice}\_\mathrm{Coefficient}=\frac{2\times |\mathrm{P}\cap \mathrm{G}|}{|\mathrm{P}|+|\mathrm{G}|}$$
(11)
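Eqs. (10) and (11), together with the threshold search mentioned above, can be illustrated with a short sketch. It rests on several assumptions: P and G are represented as Python sets of pixel indices, the product term in Eq. (11) is read as the size of the intersection of P and G, two empty sets are scored as 1.0 by convention, and the helper `best_jaccard` with its candidate threshold list is hypothetical rather than the authors' procedure.

```python
def jaccard(pred, truth):
    """Eq. (10): size of the intersection over size of the union."""
    union = len(pred | truth)
    return len(pred & truth) / union if union else 1.0  # convention for empty sets


def dice(pred, truth):
    """Eq. (11): twice the intersection over the sum of the set sizes."""
    total = len(pred) + len(truth)
    return 2 * len(pred & truth) / total if total else 1.0  # convention for empty sets


def best_jaccard(probs, truth, thresholds):
    """Try each confidence threshold and return (threshold, J.I.) for the best one.

    `probs` maps a pixel index to its predicted pedestrian probability.
    """
    def score(t):
        predicted = {pixel for pixel, p in probs.items() if p >= t}
        return jaccard(predicted, truth)

    best_t = max(thresholds, key=score)
    return best_t, score(best_t)
```

For non-empty sets the two measures are linked by Dice = 2J / (1 + J), so a ranking of networks by one measure agrees with the ranking by the other on a single image.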

Table 3 contains the comparative results between the performance of the proposed BF-Net and the other competitive U-Nets in terms of the Dice coefficient and IoU (mean ± std) for the images in the PennFudanPed dataset.

Table 3 Comparison of pedestrian detection performance for the test images in PennFudanPed dataset between the competitive state-of-the-art U-Nets and:

Finally, Table 4 presents the comparative results of pedestrian semantic segmentation by BF-Net, Mask R-CNN, and DeepLabv3+ with the ResNet-50 backbone for test images from both the Cityscapes and PennFudanPed datasets.

Table 4 Comparison of pedestrian detection performance for the test images in PennFudanPed dataset between the competitive state-of-the-art U-Nets and:

5 Discussion and conclusion

The semantic segmentation of pedestrians is a crucial preliminary step in various domains related to intelligent traffic systems, especially safe and secure automatic driving systems and traffic surveillance systems. The proposed BF-Net provides a reliable and efficient architecture for pedestrian semantic segmentation with an acceptable level of computational complexity, so that it could be deployed in real-time processing. In this section, the strengths and limitations of the proposed BF-Net are discussed in a separate subsection, which also covers the impact of this architecture on the pedestrian semantic segmentation community. The main contributions of the proposed architecture compared with the other state-of-the-art networks are summarized in the conclusion and future work subsection.

5.1 Discussion

The proposed BF-Net architecture provides semantic segmentation of pedestrians with higher performance and lower complexity than the other state-of-the-art architectures, especially the ones with architectures like traditional U-Net.

As the results in Tables 2–4 show, the efficiency of the proposed BF-Net, measured by the Dice coefficient (F1-score) and the Jaccard index (IoU), increases while the computational complexity, measured by the number of trainable parameters, decreases. In other words, the efficiency and the computational complexity are inversely proportional, i.e., the higher the number of trainable parameters, the less efficiently the BF-Net performs. This relationship can be formulated as follows:

$$\text{No. of Trainable Params.} \propto \frac{1}{\text{Efficiency}}$$
(12)

Besides the mentioned advantages, the proposed BF-Net has some limitations. First, based on the precision-recall curves, the proposed BF-Net outperforms the other U-Nets only within a specific range of thresholds. The reason could be related to the specific architecture of the nodes and the way the weights and biases are trained. Second, the BF-Net is not faster than the conventional U-Net and U-Net++. In general, for specific applications such as real-time and online surveillance, the pedestrian segmentation scheme should be modified so that the deployed strategy achieves a trade-off between complexity and processing speed.

Most of the deep learning architectures for semantic segmentation mentioned in [43] suffer from deficiencies such as the incompatibility of their convolutional blocks with pre-trained backbones. This problem occurs mostly because semantic segmentation networks have no fully connected layers. Encoder-decoder models such as U-Nets have the same issue. The proposed BF-Net tries to solve this issue by deploying multi-scale supervision, like U-Net3+, to increase the segmentation accuracy for small-scale objects. On the other hand, by concatenating the supervisions and producing a single one, like U-Net++, the BF-Net succeeds in overcoming the complexity of U-Net3+. The proposed BF-Net is also compatible with different pre-trained backbones that may have many convolutional and residual blocks, thanks to its flexible direct paths and sub-paths. In addition, the proposed BF-Net can create a compromise between the convolutional blocks of the direct paths and sub-paths by deploying appropriate skip connections.

5.2 Conclusion

In this paper, we proposed a new deep neural network architecture named BF-Net for pedestrian semantic segmentation. The novelty of the proposed method is that it can be deployed for segmenting pedestrians, especially those that appear at several scales and with different appearances in a series of sequential frames. The proposed network extracts all the available feature maps for the pedestrians as target objects, so that the occlusion issue can be observed and detected in the studied datasets. The implementation results for detecting pedestrians in the images of the Cityscapes and PennFudanPed datasets demonstrate that the proposed method has a high ability to detect and even predict the existence of pedestrians in both stationary images and live video streams. Since the proposed skip connections of BF-Net can be embedded into the feature pyramid network (FPN) of Mask R-CNN, the next step, as future work, would be to replace the plain skip connections of the FPN with the redesigned skip connections of BF-Net.