1 Introduction

Pedestrian detection is a widely studied object detection problem that finds extensive applications in domains such as intelligent video surveillance [1], intelligent transportation [2], and autonomous driving systems [3, 4]. It also serves as a fundamental technology supporting tasks like pedestrian pose estimation [5, 6] and pedestrian re-identification [7, 8]. The accuracy of pedestrian detection algorithms directly impacts the performance of these tasks.

In real-world scenarios, the diversity of pedestrian poses, variations in scale, and environmental factors such as occlusion present significant challenges to detection algorithms, necessitating robust and precise solutions.

Traditional pedestrian detection methods, such as Viola–Jones [9], histogram of oriented gradients (HOG) [10], and scale-invariant feature transform (SIFT) [11], rely on manually designed features and template matching techniques. While these methods are straightforward to implement, their performance and generalization capacity are limited by the constraints of handcrafted feature engineering, often resulting in suboptimal outcomes when faced with complex scenarios.

In recent years, considerable progress has been made in object detection techniques. State-of-the-art algorithms, such as multi-anchor faster R-CNN [12], multiscale attention fusion [13], and multimodal detectors [14], leverage convolutional neural networks (CNNs) to improve pedestrian detection accuracy. Among these approaches, the YOLO (you only look once) series has consistently maintained a prominent position, striking a favorable trade-off between detection precision and processing speed.

However, YOLO models exhibit certain limitations that hinder their performance in pedestrian detection; in particular, they struggle to handle the diverse scales of pedestrians and the imbalance between positive and negative samples. The inherent flexibility and variability of human poses cause pedestrians to appear at various scales within images (Fig. 1a), making detection across these scales a complex task. Furthermore, scenes with dense pedestrian crowds introduce additional difficulties due to occlusions between individuals (Fig. 1b): occlusions obstruct the complete visibility of pedestrians, further complicating their accurate detection. In addition to scale and occlusion challenges, the imbalance between positive and negative samples during target regression adversely impacts the precision of object localization, because the learning process is skewed towards the dominant class, resulting in suboptimal pedestrian localization. Effectively addressing these limitations therefore remains a pivotal research problem, and resolving these challenges is essential for developing more accurate and reliable pedestrian detection systems.

To overcome these challenges, this paper presents HF-YOLO, a model designed to improve pedestrian detection for small-scale and occluded targets. The model fuses features across multiple hierarchical levels, integrating high-resolution features so that high-level semantic information and low-level localization cues are captured jointly. Additionally, a dedicated small object detection layer is introduced to improve detection accuracy across varying scales and under occlusion.

Fig. 1
figure 1

a and b showcase the scenarios of pedestrians in real-world environments. In Figure (a), the red and green boxes depict pedestrians of varying scales. In Figure (b), the blue and orange boxes illustrate instances of occlusion. (Color figure online)

The contributions of our work can be summarized as follows:

  • We propose HF-YOLO, a novel pedestrian detection model that incorporates the HardSwish activation function within its convolutional blocks, enhancing feature representation capability and introducing greater non-linearity.

  • We employ feature fusion across multiple hierarchical levels. This novel approach effectively tackles the challenges associated with small-scale pedestrian detection and occlusions.

  • To address the issue of imbalanced high- and low-quality samples in the regression loss function, we introduce a balancing factor. This factor redirects the model’s focus towards accurate regression for high-quality samples, overcoming the bias introduced by the larger number of low-quality samples.

  • Extensive evaluations are conducted, comparing our proposed model against six baselines. The results demonstrate the superior performance of our model, surpassing existing approaches in terms of detection accuracy and reliability.

The paper’s structure is as follows: In Sect. 2, we review prior research on pedestrian detection. Section 3 provides a detailed explanation of the HF-YOLO model. In Sect. 4, we present the dataset used and discuss the experimental results. Finally, in Sect. 5, we conclude the paper.

2 Related Work

Currently, object detection algorithms can be broadly classified into two major categories: Anchor-based algorithms and Anchor-free algorithms. Anchor-based algorithms encompass both two-stage approaches (e.g., RCNN [15], Fast R-CNN [16], Faster R-CNN [17]) and one-stage approaches (e.g., SSD [18] and YOLO series [19,20,21]). In Anchor-based algorithms, a set of predefined anchor boxes are generated on the image at different scales and aspect ratios. These anchor boxes serve as reference regions for detecting objects. Convolutional neural networks are employed to extract features from the image, followed by classification and regression operations performed on each anchor box. Finally, a non-maximum suppression algorithm is applied to filter out redundant detection boxes and obtain the final detection results.

In contrast, Anchor-free methods, such as CenterNet [22] and FCOS [23], adopt a different paradigm. These approaches treat objects as single points during model construction. After feature extraction, specific points (e.g., center points) are selected as key representations for object detection. The classification and regression tasks are decoupled into separate branches, allowing for independent prediction of object presence and spatial localization. Subsequently, a non-maximum suppression technique is applied to eliminate overlapping detections and produce the final detection results. While Anchor-free methods alleviate the need for anchor box generation, they often require a larger number of candidate points during detection. Consequently, these methods necessitate more extensive training data and sophisticated training strategies to achieve optimal detection performance in various scenarios.

The development of object detection has progressed from traditional methods to the era of deep learning-based detection. Early object detection algorithms relied on handcrafted image feature descriptors tailored for specific tasks, aiding in discerning target positions or categories. However, these manually crafted features lacked effective image representations, necessitating the design of more intricate feature representations to obtain high-quality image features. As detection techniques advanced, methods rooted in deep learning emerged, leveraging convolutional neural networks (CNNs) and related architectures to acquire robust and sophisticated feature representations. Yu et al. extracted multiple features from input images and an image database, constructed multiple hypergraph Laplacian operators, and formulated sparse codes [24]; simultaneously, they preserved the locality of the obtained sparse codes by employing manifold learning on the hypergraph. Cao et al. introduced a discriminative region of interest (RoI) pooling scheme that samples from various sub-regions of a proposal and performs adaptive weighting to obtain discriminative features [25]. Yu et al. designed a novel fine-grained image recognition framework [26], introducing a feature selection module to address noise and high dimensionality in features; they incorporated weight vectors with sparse constraints and an improved "ReLU" operator, and the feature selection model achieved higher accuracy and a larger compression ratio. Woo et al. optimized the network from both channel and spatial perspectives, further enhancing the model’s feature extraction effectiveness in both dimensions [27]. Lv et al. proposed the RTDETR model, which utilizes Transformers to process the features extracted from the last layer of the backbone, enhancing the model’s ability to differentiate features among various objects [28].

While significant advancements have been made in the field of general object detection algorithms, their direct application to pedestrian detection has proven to be unsatisfactory [29]. Zhang et al. proposed a novel occlusion-aware R-CNN detection algorithm [30], building upon the faster R-CNN framework. This approach incorporates an occlusion processing unit, which effectively captures five distinct partial features of pedestrians. These part-level features are subsequently combined with the global features of the target, employing a weighted summation technique to obtain a comprehensive pedestrian detection outcome. Liu et al. introduced a density prediction module and an adaptive non-maximum suppression (NMS) method in order to address the challenges associated with pedestrian detection [31]. The adaptive NMS method dynamically adjusts the NMS threshold based on the density of the objects being detected. In scenarios with dense pedestrian presence, the NMS threshold is heightened to ensure a high recall rate. Conversely, in cases with sparser object distributions, the NMS threshold is lowered to alleviate the issue of redundant detection boxes. This adaptive approach effectively resolves the predicament of losing highly overlapped targets or generating false positives due to a fixed threshold. Chu et al. proposed a multi-instance prediction method to handle severe overlap among multiple targets [32]. Instead of predicting a single instance, their method predicts a set of highly overlapped instances for each proposal box. This approach reduces the false negative rate by designing a loss function that minimizes the distance between predicted boxes and ground truth boxes, supervising the learning of instance set prediction.

Xia et al. argued that incorporating multi-scale information can improve the robustness of the network without losing information [33]. They introduced multi-scale dilation residual modules into the backbone network to increase the receptive field and capture more global and higher-level semantic features. Li et al. utilized an improved Res2Net as the backbone network to enhance the model’s multi-scale representation capability for pedestrians [34]. Addressing the limited ability of a single feature extraction block to extract semantic information at different levels, Wang et al. proposed Three ResNet Blocks [35]. This module integrates three different basic blocks, each of which extracts pedestrian information, to enhance the information flow in the network structure and improve the accuracy of detection results.

These related works demonstrate the efforts made to overcome the limitations and challenges in pedestrian detection. The proposed methods introduce novel techniques, such as occlusion processing, adaptive NMS, and multi-instance prediction, to enhance the performance of pedestrian detection models. Additionally, advancements in backbone networks, such as Res2Net [36], further contribute to the improvement of multi-scale representation and feature extraction capabilities.

3 HF-YOLO Model for Pedestrian Detection

3.1 Model Overview

The YOLO series models are a prominent category of one-stage object detection methods. These models combine the tasks of object classification and localization regression by utilizing anchor boxes. This integration allows YOLO models to achieve high efficiency, flexibility, and good generalization performance, making them highly popular in both academia and industry. Wang et al. proposed a real-time detection model called YOLOv7 [37], an advanced version of the YOLO series of real-time object detectors; building on the success of its predecessors, it introduces further optimizations to improve detection performance.

Fig. 2
figure 2

The HF-YOLO model structure mainly consists of three parts: backbone (feature extraction), neck (feature fusion), and head (detection)

To address the challenges associated with pedestrian detection in complex environments, we propose the HF-YOLO model, which builds upon the foundation of YOLOv7. The HF-YOLO architecture, illustrated in Fig. 2, comprises three essential components: Backbone, Neck, and Head.

The backbone component incorporates encapsulated convolutional blocks (CBH), MaxPooling, ELAN, and SPP modules for efficient feature extraction, as depicted in Fig. 3. CBH is defined as a sequence of operations:

$$\begin{aligned} X_{out}=HardSwish(BN(Conv(X_{in}))) \end{aligned}$$
(1)

where \(X_{in}\) and \(X_{out}\) denote the input and output feature maps, respectively. A \(3\times 3\) convolution is applied to \(X_{in}\), followed by batch normalization (BN) and the HardSwish activation function.
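For illustration, a minimal PyTorch sketch of the CBH block in Eq. (1) is given below; the kernel size, stride, and bias-free convolution are assumptions rather than details taken from the paper.

```python
import torch.nn as nn

class CBH(nn.Module):
    """Conv-BatchNorm-HardSwish block, a sketch of Eq. (1)."""
    def __init__(self, in_ch, out_ch, k=3, s=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.Hardswish()

    def forward(self, x_in):
        # X_out = HardSwish(BN(Conv(X_in)))
        return self.act(self.bn(self.conv(x_in)))
```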

Specifically, the ELAN module combines concepts from VoVNet [38] and CSPNet [39], utilizing a gradient path strategy to control the shortest and longest gradient paths in each layer. This enables different computational units to learn diverse information, maximizing parameter utilization efficiency. Additionally, the ELAN module ensures stable model learning by directly propagating information to update the weights of each computational unit, thereby mitigating degradation issues during training. Its gradient path design strategy promotes efficient parameter utilization, enabling the network to achieve higher accuracy without the need for additional complex architectures.

Spatial pyramid pooling (SPP) can generate fixed-size outputs, effectively addressing the issue of repetitive feature extraction in convolutional neural networks, while also reducing computational costs.
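A possible realization of the SPP module, assuming the CBH class sketched above and the two-branch structure described in Fig. 3 (max pooling with 5\(\times \)5, 9\(\times \)9, and 13\(\times \)13 kernels in parallel with a CBH branch), is shown below; the channel widths and the final projection are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    """Spatial pyramid pooling, sketched after the structure shown in Fig. 3.
    Channel widths and the final 1x1 projection are assumptions."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        hidden = in_ch // 2
        self.branch1 = CBH(in_ch, hidden, k=1)     # CBH followed by the pooling pyramid
        self.pools = nn.ModuleList(
            [nn.MaxPool2d(k, stride=1, padding=k // 2) for k in (5, 9, 13)]
        )
        self.branch2 = CBH(in_ch, hidden, k=1)     # parallel residual-style CBH branch
        self.fuse = CBH(hidden * 5, out_ch, k=1)   # merge the concatenated branches

    def forward(self, x):
        y = self.branch1(x)
        pyramid = [y] + [pool(y) for pool in self.pools]   # 5x5, 9x9, 13x13 pooling
        return self.fuse(torch.cat(pyramid + [self.branch2(x)], dim=1))
```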

Fig. 3
figure 3

CBH consists of a convolution (Conv), batch normalization (BN), and the HardSwish activation function. The ELAN module employs a multi-branch structure, incorporating residual connections after each convolutional operation; this alleviates the gradient vanishing associated with increased model depth. The SPP module is composed of two parallel branches: the first applies a CBH operation followed by max pooling with kernel sizes of 5\(\times \)5, 9\(\times \)9, and 13\(\times \)13 and a residual connection, while the second consists of a residual connection with a CBH operation. Finally, the two branches are concatenated to obtain the output

Following the backbone, the neck component focuses on feature fusion and information propagation. It integrates various techniques, such as feature concatenation, to combine features from different levels of the backbone. This facilitates the integration of low-level localization information and high-level semantic information, empowering the model to effectively handle pedestrians of diverse scales and occlusion levels.

The Head of the HF-YOLO model consists of four detection layers, which perform object detection on feature maps with sizes of 20 \(\times \) 20, 40 \(\times \) 40, 80 \(\times \) 80, and 160 \(\times \) 160, respectively. These detection layers enable the model to capture pedestrians at different scales and generate precise detection results.

By combining these components, the HF-YOLO model addresses the challenges of pedestrian detection by enhancing feature extraction, facilitating feature fusion, and enabling accurate detection across various scales.

3.2 Optimizing Feature Representation with the HardSwish Activation Function

In convolutional neural networks, the output of each layer is obtained by applying a linear transformation to the input from the previous layer. However, without an activation function, the network’s output remains a linear combination of the inputs, regardless of its depth. Activation functions play a crucial role in introducing non-linearity to the data, enabling the network to capture complex patterns and effectively represent non-linear mappings between input and output domains. By incorporating non-linear transformations, activation functions enhance the expressive power of the network, facilitating the learning of intricate and abstract representations. This, in turn, enables the network to effectively model and extract features from high-dimensional data.

In model architectures, it is common to use Conv-BatchNorm-LeakyReLU (CBL) convolutional blocks, where convolutional operations (Conv), batch normalization (BN), and activation functions (LeakyReLU) are encapsulated together. The inclusion of activation functions introduces non-linearity, allowing the model to learn and represent intricate features. The LeakyReLU activation function is an adaptation of ReLU that introduces a small slope for negative values, addressing the issue of vanishing gradients encountered in ReLU:

$$\begin{aligned} LeakyReLU(x)= \left\{ \begin{array}{ll} x, &{} x\ge 0\\ 0.1x, &{} x < 0 \end{array} \right. \end{aligned}$$
(2)

However, because it is piecewise linear, LeakyReLU may not be optimal for capturing complex patterns in intricate data. The HardSwish activation function is defined as:

$$\begin{aligned} HardSwish(x)= \left\{ \begin{array}{ll} 0, &{} x\le -3\\ x, &{} x\ge 3\\ \frac{x(x+3)}{6}, &{} otherwise \end{array} \right. \end{aligned}$$
(3)

The most prominent distinction between HardSwish and LeakyReLU is the non-monotonic bump for \(x < 0\). As shown in Fig. 4, the HardSwish [40] activation function is smooth and continuous, which facilitates gradient computation and optimization and mitigates problems such as gradient explosion and vanishing gradients, thereby contributing to an accelerated convergence rate of the model. Compared to LeakyReLU, HardSwish introduces stronger non-linearity while maintaining computational efficiency, amplifying the representational capacity of neural networks. Consequently, replacing the activation function with HardSwish yields the Conv-BatchNorm-HardSwish (CBH) convolutional block, which is used to build the ELAN and SPP modules.
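For a quick numerical comparison, the snippet below transcribes Eqs. (2) and (3) element-wise; the sampled input range is purely illustrative.

```python
import torch

def leaky_relu(x, slope=0.1):
    # Eq. (2): identity for x >= 0, slope 0.1 for x < 0
    return torch.where(x >= 0, x, slope * x)

def hard_swish(x):
    # Eq. (3): 0 for x <= -3, x for x >= 3, x*(x+3)/6 in between
    return x * torch.clamp(x + 3, min=0, max=6) / 6

x = torch.linspace(-5, 5, steps=11)
print(leaky_relu(x))
print(hard_swish(x))   # note the non-monotonic dip on (-3, 0)
```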

Fig. 4
figure 4

Plots of the LeakyReLU and HardSwish activation functions; HardSwish exhibits a non-monotonic bump for x less than 0

3.3 Fusion of High-Resolution Features

The detection of pedestrians in an image presents challenges due to the inherent variability in their distances from the camera, resulting in diverse scales. This scale variability can lead to scale imbalance and impact the performance of detection algorithms. Additionally, occluded pedestrians often lack discriminative features, which can result in false positives or false negatives during the detection process. Detection algorithms leverage shallower feature maps for precise spatial information and deeper feature maps for semantic context. By fusing feature maps from different layers, it becomes possible to exploit the complementary characteristics of both shallow and deep features. This fusion strategy enables the algorithm to capture richer contextual information, achieve multi-scale receptive fields, enhance detection capacity for objects of varying scales and occluded targets, and ultimately improve object localization precision.

During the feature extraction process, successive down-sampling operations are employed to derive informative feature representations. As the feature map’s resolution decreases, it encapsulates higher-level semantic information pertinent to the targets. However, challenges arise, particularly in scenarios involving occluded pedestrians, where multiple individuals might share a common feature representation. Consequently, this can lead to the exclusion of occluded targets, hampering detection accuracy.

To bolster HF-YOLO’s detection performance in handling diverse object scales and occluded pedestrians, we enhance the feature fusion module, illustrated in Fig. 5. In the top-down path, higher-level features are processed with operations such as convolution and upsampling and then fused, via the Concatenation (Concat) operation, with the corresponding features extracted from the backbone network, down to the P2 layer. This fusion yields C4, C3, and C2, formulated as follows:

$$\begin{aligned} C4=Cat[U(CBH(P5)), CBH(P4)] \end{aligned}$$
(4)
$$\begin{aligned} C3=Cat[U(CBH(E(C4))), CBH(P3)] \end{aligned}$$
(5)
$$\begin{aligned} C2=Cat[U(CBH(E(C3))), CBH(P2)] \end{aligned}$$
(6)

where Cat concatenates the feature maps along the channel dimension, CBH denotes the Conv-BN-HardSwish operation that adjusts the number of channels, U enlarges the feature map to twice its original size, and E represents the feature extraction operation. This fusion facilitates the propagation of high-level semantic information.

Fig. 5
figure 5

The features of {P2,P3,P4,P5} extracted from the backbone were fused from top to bottom

Furthermore, we employ a bottom-up fusion process to propagate the localization information from the lower layers to the higher layers. This process generates N2, N3, N4, N5, where the features from the lower layers carry important positional details. Its expression is as follows:

$$\begin{aligned} N2=E(C2) \end{aligned}$$
(7)
$$\begin{aligned} N3=E(Cat[CBH(N2),C3]) \end{aligned}$$
(8)
$$\begin{aligned} N4=E(Cat[CBH(N3),C4]) \end{aligned}$$
(9)
$$\begin{aligned} N5=E(Cat[CBH(N4),P5]) \end{aligned}$$
(10)

Following two iterations of feature fusion operations, the amalgamation of high-level semantic information and low-level localization details serves to bolster the efficacy of feature extraction for occluded targets.
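To make the data flow of Eqs. (4)-(10) concrete, a functional sketch is given below. The callables cbh, cbh_down, and extract stand in for the CBH block, a stride-2 CBH used to match resolutions in the bottom-up path, and the ELAN-style extraction E; they are placeholders, and channel bookkeeping is omitted, so this illustrates data flow rather than a drop-in module.

```python
import torch
import torch.nn.functional as F

def hf_yolo_neck(P2, P3, P4, P5, cbh, cbh_down, extract):
    """Data-flow sketch of Eqs. (4)-(10); `cbh`, `cbh_down`, and `extract`
    are placeholder callables for CBH, stride-2 CBH, and the extraction E."""
    up = lambda x: F.interpolate(x, scale_factor=2, mode="nearest")   # U
    cat = lambda a, b: torch.cat([a, b], dim=1)                       # Cat

    # top-down path: propagate high-level semantics (Eqs. 4-6)
    C4 = cat(up(cbh(P5)), cbh(P4))
    C3 = cat(up(cbh(extract(C4))), cbh(P3))
    C2 = cat(up(cbh(extract(C3))), cbh(P2))

    # bottom-up path: propagate low-level localization cues (Eqs. 7-10)
    N2 = extract(C2)
    N3 = extract(cat(cbh_down(N2), C3))
    N4 = extract(cat(cbh_down(N3), C4))
    N5 = extract(cat(cbh_down(N4), P5))
    return N2, N3, N4, N5   # 160x160, 80x80, 40x40, 20x20 detection inputs
```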

Based on the fused feature maps, we construct a detection layer with a dimension of 160 \(\times \) 160 to enhance the detection capability, particularly for small-sized targets. This increase in spatial resolution aids in capturing finer details and improving localization accuracy. The fusion process described above is visually depicted in Fig. 5.

3.4 Bounding Box Regression Loss Function

The bounding box regression loss function plays a crucial role in object detection algorithms as it encompasses accurate localization of objects. By comparing the predicted bounding boxes with the ground truth boxes, the bounding box regression loss value is computed, allowing the model to continuously optimize and improve the accuracy of object localization.

In object detection models, the complete IoU (CIoU) [41] metric is widely utilized for computing regression losses. The CIoU metric incorporates three essential geometric factors into the loss calculation: the overlapping area, the center-point distance, and the aspect ratio of the bounding boxes. This comprehensive metric provides a more accurate measure of the spatial discrepancy between predicted and ground truth bounding boxes. By considering these geometric factors, CIoU facilitates a more precise evaluation of localization accuracy, enabling the optimization of object detection models for improved precision and robustness.

The formula for CIoU is as follows:

$$\begin{aligned} L_{C I o U}=1-I o U+\left( \frac{\rho ^{2}\left( b, b^{g t}\right) }{c^{2}}+\alpha v\right) \end{aligned}$$
(11)

where IoU represents the intersection over union of the predicted and ground truth bounding boxes A and B, calculated using the following formula:

$$\begin{aligned} I o U=\frac{|A \cap B|}{|A \cup B|} \end{aligned}$$
(12)

where b represents the predicted bounding box’s center coordinates, \(b^{g t}\) represents the ground truth bounding box’s center coordinates, \(\rho ^{2}(b,b^{g t})\) is the squared distance between the center points of the two boxes, and \(c^{2}\) is the squared diagonal length of the minimum enclosing rectangle of the two boxes. v is a parameter that measures aspect-ratio consistency, w and h denote the width and height of the predicted box, and \(w^{gt}\) and \(h^{gt}\) denote the width and height of the ground truth box, defined as follows:

$$\begin{aligned} v=\frac{4}{\pi ^2 } \left( \arctan \frac{w^{gt} }{h^{gt} } - \arctan \frac{w}{h} \right) ^2 \end{aligned}$$
(13)

\(\alpha \) is a weight parameter defined as follows:

$$\begin{aligned} \alpha =\frac{v}{\left( 1-IoU \right) +v} \end{aligned}$$
(14)

In comparison to other regression loss functions, CIoU loss has demonstrated notable advancements in terms of convergence speed and detection accuracy. However, one inherent limitation of CIoU loss is its neglect of the inherent imbalance within the regression samples. During the training process, there is often a scarcity of high-quality samples accompanied by an abundance of low-quality samples. As a result, the contribution of gradients derived from the former category to the overall regression gradient is diminished. To overcome this challenge and further enhance the precision of pedestrian target localization, we propose the utilization of F-CIoU.

F-CIoU introduces the IoU as a balancing factor to weight the CIoU loss, as shown in Eq. 15, where \(\beta \) is assigned a value of 0.5. With F-CIoU, higher-quality samples with larger IoU values incur greater losses and correspondingly higher weights, so the model places increased emphasis on the regression of high-quality samples, thereby promoting superior localization accuracy. F-CIoU thus addresses the imbalance in the regression samples, enabling the model to give appropriate attention to different sample categories based on their quality.

$$\begin{aligned} L_{F-CIoU}=IoU^{\beta } \left( 1-IoU+\left( \frac{\rho ^{2}\left( b,b^{gt} \right) }{c^{2}} +\alpha v \right) \right) \end{aligned}$$
(15)
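A self-contained sketch of the F-CIoU computation in Eqs. (11)-(15) is given below; the corner (x1, y1, x2, y2) box format and the small eps terms are implementation assumptions, not details from the paper.

```python
import math
import torch

def f_ciou_loss(pred, target, beta=0.5, eps=1e-7):
    """Sketch of the F-CIoU loss (Eq. 15): the CIoU loss weighted by IoU**beta.
    `pred` and `target` are (N, 4) tensors of boxes in (x1, y1, x2, y2) format."""
    # intersection over union (Eq. 12)
    inter_w = (torch.min(pred[:, 2], target[:, 2]) - torch.max(pred[:, 0], target[:, 0])).clamp(0)
    inter_h = (torch.min(pred[:, 3], target[:, 3]) - torch.max(pred[:, 1], target[:, 1])).clamp(0)
    inter = inter_w * inter_h
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # squared center distance rho^2 and enclosing-box diagonal c^2 (Eq. 11)
    rho2 = (((pred[:, :2] + pred[:, 2:]) / 2 - (target[:, :2] + target[:, 2:]) / 2) ** 2).sum(dim=1)
    enc_w = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    enc_h = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    c2 = enc_w ** 2 + enc_h ** 2 + eps

    # aspect-ratio term v and weight alpha (Eqs. 13-14)
    w_p, h_p = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    w_t, h_t = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    v = (4 / math.pi ** 2) * (torch.atan(w_t / (h_t + eps)) - torch.atan(w_p / (h_p + eps))) ** 2
    alpha = v / (1 - iou + v + eps)

    ciou = 1 - iou + rho2 / c2 + alpha * v
    return (iou ** beta) * ciou   # IoU^beta re-weights the loss toward high-quality samples
```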

Following the aforementioned enhancements, we have derived the HF-YOLO model. The HF-YOLO model integrates the proposed improvements, including the novel feature fusion module, the utilization of the HardSwish activation function, and the introduction of the F-CIoU loss function. These enhancements collectively contribute to the improved performance of the HF-YOLO model in handling objects of various scales and occluded pedestrians, ultimately leading to enhanced object detection accuracy.

4 Experiment

In this section, we will provide an overview of the dataset used and the experimental setup employed in our study. We will then introduce the evaluation metrics utilized to assess the performance of the proposed HF-YOLO model. Additionally, we will present the experimental evaluations conducted on activation functions and the bounding box regression loss function. Subsequently, we will compare the results obtained from our proposed model with existing models through comparative analysis. Finally, we will discuss the ablation experiments performed to investigate the individual contributions of different components in the HF-YOLO model.

4.1 Experimental Setup

The experimental setup of this study is described in Table 1. We used the SGD optimizer to train the HF-YOLO model, with an input image size of 640\(\times \)640 pixels and a batch size of 64. The initial learning rate was set to 0.001 and updated with a cosine annealing schedule, and a momentum factor of 0.9 was employed during optimization. The total training duration comprised 300 epochs, during which the model was iteratively updated and fine-tuned using the training data.
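A minimal PyTorch sketch of this training configuration is shown below; `model` and `train_loader` are placeholders, and the assumption that the model call returns its training loss directly is purely illustrative.

```python
import torch

# sketch of the Sect. 4.1 setup; `model` and `train_loader` are placeholders
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300)

for epoch in range(300):                   # 300 training epochs
    for images, targets in train_loader:   # batches of 64 images at 640x640
        loss = model(images, targets)      # assumed to return the training loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()                       # cosine-annealed learning rate
```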

Table 1 Experimental environment

4.2 Datasets

The dataset used in this study consisted of two main sources: the CrowdHuman dataset and the WiderPerson dataset.

The CrowdHuman dataset [42] is composed of images primarily obtained from Google search. It comprises complex scenes with dense pedestrian crowds. The training set of this dataset contains approximately 470,000 instances, with an average of around 23 people per image. The annotation information includes three types: full body, visual body, and head. For this study, the full body annotations were selected for training the model.

The WiderPerson dataset [43] is specifically designed for outdoor pedestrian detection. It consists of 13,382 images, and the annotations include five categories: pedestrian, cyclist, partially visible person, crowd, and ignore region. In this study, the first three categories (pedestrian, cyclist, partially visible person) were combined and treated as the "person" class.

To create the dataset for our experiments, we extracted a total of 16,000 images from the CrowdHuman and WiderPerson datasets. These images were then partitioned into training, validation, and testing subsets following an 8:1:1 ratio. All experimental results were obtained on the test set.
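The 8:1:1 partition can be reproduced with a sketch like the following; the function name and the fixed seed are illustrative and not taken from the paper.

```python
import random

def split_8_1_1(image_paths, seed=0):
    """Shuffle and split image paths into train/val/test subsets at an 8:1:1 ratio."""
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)
    n_train = int(0.8 * len(paths))
    n_val = int(0.1 * len(paths))
    return (paths[:n_train],
            paths[n_train:n_train + n_val],
            paths[n_train + n_val:])
```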

4.3 Evaluation Metrics

The performance evaluation of the model in this study encompasses four key metrics: precision, recall, mean Average Precision (mAP), and detection time.

Precision quantifies the accuracy of positive predictions, serving as a measure of false detections. It is computed using the formula:

$$\begin{aligned} Precision=\frac{TP}{TP+FP} \end{aligned}$$
(16)

where TP represents the number of correctly detected objects by the model, and FP denotes the count of objects falsely detected by the model.

Recall, on the other hand, gauges the probability of identifying positive samples within the predicted results, indicating the extent of missed detections. The calculation is given by:

$$\begin{aligned} Recall=\frac{TP}{TP+FN} \end{aligned}$$
(17)

where FN represents the number of positive samples overlooked by the model.

The average precision (AP) is determined by computing the area under the precision-recall curve, which provides an assessment of the detection quality for each class. The mean average precision (mAP) corresponds to the average AP values across multiple classes and is derived as follows:

$$\begin{aligned} AP=\int _{0}^{1} Precision(Recall) \, dRecall \end{aligned}$$
(18)
$$\begin{aligned} mAP=\frac{1}{m} {\textstyle \sum _{i=1}^{m}} AP_{i} \end{aligned}$$
(19)
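The snippet below sketches Eqs. (16)-(19) numerically; it integrates a sampled precision-recall curve with the trapezoidal rule, whereas detection benchmarks typically use interpolated precision over matched detections, so this is a simplified illustration.

```python
import numpy as np

def precision_recall(tp, fp, fn):
    # Eqs. (16) and (17)
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(precisions, recalls):
    # Eq. (18): area under the precision-recall curve (trapezoidal approximation)
    order = np.argsort(recalls)
    return np.trapz(np.asarray(precisions)[order], np.asarray(recalls)[order])

def mean_average_precision(ap_per_class):
    # Eq. (19): mean of the per-class AP values
    return sum(ap_per_class) / len(ap_per_class)
```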

4.4 Experimental Results and Analysis

4.4.1 Experimental Comparison of Activation Functions

To ascertain the effectiveness of the HardSwish activation function, a comparative evaluation was performed, contrasting it with alternative activation functions, including SiLU [44], GELU [45], and LeakyReLU. The results of the experiments, presented in Table 2, unequivocally establish the superiority of HardSwish in terms of performance. These findings provide compelling evidence of its ability to introduce a higher level of non-linearity to the model, thereby augmenting its capacity to represent intricate patterns and complex relationships in the data.

Table 2 Comparison of experimental results of activation function

4.4.2 Comparative Experiment of Bounding Box Regression Loss Functions

To assess the effectiveness of the F-CIoU loss function, a comparative analysis was conducted, comparing it with alternative loss functions, including CIoU, SIoU [46], and WiseIoU [47]. The experimental results, presented in Table 3, yield valuable insights.

The SIoU loss incorporates an additional angle loss component to address the issue of predicted boxes exhibiting undesired wandering behavior during training. As a result, it achieves a modest improvement of 0.41% in recall compared to the CIoU loss. However, this improvement comes at the cost of a slight decline in both average precision and precision measures.

The WiseIoU loss, which introduces an attention-based regression loss, aims to enhance object detection performance. However, when compared to the CIoU loss, it does not yield any substantial improvement in model performance.

In contrast, the F-CIoU loss, by incorporating an IoU weighting factor, effectively mitigates the issue of regression imbalance. It outperforms the original loss function by achieving a notable improvement of 0.88% in mean average precision (mAP) and a significant 3.12% increase in recall. These results clearly demonstrate the effectiveness of the F-CIoU loss in optimizing the model’s performance by addressing the regression imbalance and improving detection accuracy.

Table 3 Comparative experimental results of boundary box regression loss function

4.4.3 Experimental Results in Comparison with Other Models

The HF-YOLO model, incorporating the proposed enhancements, was evaluated against several state-of-the-art detection models, including SSD, RTDETR, YOLOv3tiny, YOLOv4tiny, YOLOv5s, YOLOv6n, YOLOv7tiny, and YOLOv8n, in a series of comparative experiments.

Table 4 presents compelling evidence supporting the superiority of our HF-YOLO model across various crucial metrics such as mean average precision (mAP), precision, and recall. The results showcase a clear advancement compared to all other models examined. Specifically, HF-YOLO demonstrates superior performance over SSD and RTDETR, registering an increase of 16.3% in detection accuracy compared to SSD and 1.5% compared to RTDETR. Moreover, our model achieves a reduction in detection time of 4.5 ms compared to SSD and 14.4 ms compared to RTDETR.

Although YOLOv3tiny and YOLOv4tiny exhibit the quickest detection times, they compromise on mAP, precision, and recall metrics. Meanwhile, our HF-YOLO model, while maintaining a comparable detection time to YOLOv5s, YOLOv6s, YOLOv7tiny, and YOLOv8n, significantly outperforms them across all other evaluated metrics. These outcomes underscore the practical applicability and efficacy of our proposed model in real-world object detection scenarios.

Table 4 Experimental results in comparison with other models

4.4.4 Results and Analysis of Ablation Experiments

To evaluate the efficacy of each module improvement, a series of ablation experiments were conducted to analyze the impact of different modules. The ablation experiment design, outlined in Table 5, utilized YOLOv7tiny as the baseline model. The following improvements were investigated:

  • Improvement\(\textcircled {1}\): Replacement of LeakyReLU with HardSwish activation function.

  • Improvement\(\textcircled {2}\): Fusion of high-resolution features and addition of small object detection layer.

  • Improvement\(\textcircled {3}\): Modification of regression loss function from CIoU to F-CIoU.

By conducting these ablation experiments and comparing the results with the baseline model, the effectiveness of each improvement can be assessed.

Table 5 Ablation experiment design

The experimental results, presented in Table 6, demonstrate the outcomes of the conducted ablation experiments.

In experiment\(\textcircled {1}\), involving the replacement of the activation function, the introduction of HardSwish resulted in increased non-linearity, leading to improved feature representation capabilities. Compared to the original model, there was a noticeable enhancement of 0.62% in mean average precision (mAP) and 1.69% in Precision. The detection time remained consistent at 7.7ms, indicating the effectiveness of the HardSwish activation function in improving model performance.

Building upon the findings of Experiment\(\textcircled {1}\), Experiment\(\textcircled {2}\) introduced additional detection layers, enhancing the model’s ability to detect pedestrians of varying scales. This led to a significant improvement of 2.64% in mAP and 1.71% in Recall, with the detection time increasing by 0.8ms to reach 8.5ms, compared to the original model.

Experiment\(\textcircled {3}\) focused on improving the regression loss function. By doing so, the detection rate of high-quality samples was enhanced, resulting in a substantial increase in Recall to 71.71%. The mAP was elevated to 81.60%, and the Precision reached 86.25%. The detection time remained consistent at 8.5 ms.

Table 6 Results of ablation experiment

The mAP comparison between HF-YOLO and YOLOv7tiny is presented in Fig. 6, where it can be observed that HF-YOLO exhibits higher accuracy, achieving an improvement of 3.52% in mAP.

Fig. 6
figure 6

Comparison of mAP between HF-YOLO and YOLOv7tiny

4.4.5 Visualization of Detection Results

Figure 7 serves as a visual validation of the proposed algorithm’s efficacy, displaying detection results for SSD, RTDETR, YOLOv3tiny, YOLOv4tiny, YOLOv5s, YOLOv6s, YOLOV7tiny, YOLOv8n, and HF-YOLO algorithms. The purpose of this experiment is to showcase the superior performance of the HF-YOLO algorithm.

Fig. 7
figure 7

In contrast to the detection outcomes of SSD, RTDETR, and other algorithms within the YOLO series, the HF-YOLO algorithm consistently demonstrates accurate detection of pedestrian targets, exhibiting proficiency in scenarios characterized by both small-scale targets and occlusions

The results indicate a significant difference between our proposed model and the others. In the first set of images, SSD, YOLOv3tiny, YOLOv6s, and YOLOv8n display conspicuous shortcomings in detecting small targets. Moreover, excluding our improved algorithm, the other detection outcomes suffer from imprecise localization. In the second and third sets of images, which depict more complex scenes with partially occluded pedestrians, our algorithm consistently exhibits commendable detection capabilities.

Relative to the results of YOLOv7tiny, our enhanced algorithm notably diminishes false positives and elevates the accuracy of target localization. Furthermore, it consistently achieves precise target detection even in complex scenes.

5 Conclusion

In conclusion, this paper presents HF-YOLO, an advanced pedestrian detection model that effectively addresses the challenges associated with pedestrian detection, including scale variations and occlusion. By leveraging feature fusion from shallow and deep layers, HF-YOLO enhances the model’s detection performance, resulting in improved accuracy and robustness. Moreover, to tackle the issue of imbalance in bounding box regression, HF-YOLO incorporates a balancing factor that prioritizes the accurate localization of high-quality samples. This ensures that the model focuses on optimizing the detection of relevant objects and improves overall detection efficacy. Experimental evaluations conducted in this study provide compelling evidence of the effectiveness of the proposed HF-YOLO model. It outperforms the baseline model, demonstrating significant improvements in detection accuracy and overall performance. The results highlight the potential of HF-YOLO as an advanced pedestrian detection solution, with practical applications in real-world scenarios.