1 Introduction

Crowd counting, which aims to estimate the number of persons in dense and complex scenes, is receiving increasing academic attention due to its great value in real-world applications such as public safety [1, 2], intelligence gathering [3, 4], emergency management [5, 6], and data analysis [7, 8]. Moreover, it can support social distancing monitoring to curb the spread of the COVID-19 virus in public places by providing reliable crowd size information [9].

Motivated by the recent successful use of Convolutional Neural Networks (CNNs) in various computer vision tasks, most crowd counting methods adopt CNN-based architectures. Despite the impressive advancements that have been achieved, these approaches suffer from two intractable problems. On the one hand, the size of individuals in an image tends to vary with their distance from the camera, which leads to the scale variation problem [10], as illustrated at the top of Fig. 1. On the other hand, background regions in a crowd image often have an appearance or color similar to the foreground regions, which hinders accurate identification of crowd regions [11], as illustrated at the bottom of Fig. 1.

Fig. 1

Two problems that impact the accuracy of crowd counting. Top: Scale variation. Bottom: Complex background

To address the above problems, numerous solutions have been put forward from different technical perspectives [12, 13], such as multi-column architectures [14, 15], adaptive feature pyramids [16, 17], and multiple CNNs with different sizes of receptive fields [18, 19]. In addition, considering the low quality of the input images, some methods pre-process the input images by introducing meta-heuristic-based image processing algorithms [20,21,22,23,24,25] and image optimization algorithms [26, 27], thus improving the processing efficiency of the crowd images. Though these schemes can mitigate these problems to some extent, it is still challenging to achieve high-precision crowd counting in extremely complex scenarios due to the presence of various background interferences. To be specific, firstly, the influence of background regions is not fully considered. These methods do not explicitly suppress the weight of the background regions in the feature map, so features irrelevant to the crowd can easily be extracted from the background regions [28]. In this case, it is necessary to further calibrate the weights of different regions in the feature map, thus suppressing the representation ability of the background regions. Secondly, the local correlation is ignored. Due to the presence of scale variation, different crowd regions in the density map contain different scale information, a property we refer to as local correlation [29]. However, the Euclidean loss commonly used by existing methods does not effectively capture local correlation, so the generated crowd density maps have poor local continuity. This motivates a new loss function that explicitly learns local correlations over different regions.

According to the above analysis, we propose a Hierarchical Region-Aware Network (HRANet) based on a multi-level architecture, which can adaptively recalibrate the weights of different local regions to better focus on the crowd regions. In our implementation, to capture the internal differences within different regions of crowd images, we design a Region-Aware Module (RAM), which can flexibly extract region adaptive features at each level. Additionally, to further suppress the weight of background regions in these features, we implement a region-aware attention mechanism in the Region Recalibration Module (RRM) to capture the relative influence of region features on different image locations. Based on this influence, the RRM can recalibrate the weights of different regions. By combining the above two components, our HRANet can effectively suppress the influence of background regions on the counting results. In addition, we design a Region Awareness Loss (RAL) to force our model to learn global and local correlations within different regions. Specifically, we divide the ground-truth and the generated density map into sub-regions of different sizes and compute the structural similarity between corresponding sub-regions, so that our method produces locally consistent density maps and reduces false identifications.

In general, the contributions of our work are threefold:

  1. We design a Region-Aware Module (RAM), which captures the internal variation of the crowd image to adaptively extract contextual information of different regions.

  2. We design a Region Recalibration Module (RRM), which adopts a novel region-aware attention mechanism to capture the relative influence of regional features on different image locations, thus recalibrating the weights of local regions at each level.

  3. We design a Region Awareness Loss (RAL), which forces the proposed model to learn global and local correlations within regions of different sizes to produce locally consistent density maps.

The organization of the rest of the paper is as follows. Section 2 illustrates the related work associated with our method. Section 3 describes our method in detail. Section 4 performs detailed comparative experiments. Section 5 analyzes the superiority of our method. Section 6 concludes the whole paper.

2 Related work

2.1 Crowd counting

Motivated by the successful use of deep learning, many CNN-based methods [30, 31] have been widely developed in the area of crowd counting and have made significant gains. In this study, we focus on two categories of approaches associated with our work: multi-column based methods and multi-level based methods.

2.1.1 Multi-column based method

Multi-column based methods usually adopt multiple branches with different sized convolution kernels to encode different receptive fields, so that they can capture multi-scale features. As a pioneering work that employs multiple columns, MCNN [32] extracts multi-scale features by utilizing three branches with convolution kernels of different sizes. In the same way, CP-CNN [33] adopts two branches to extract the local and global contextual information in feature maps, respectively, and further combines them to generate the density maps. DeepCount [34] learns density information from multiple branches to train a density-aware regressor. DSSINet [29] uses a message-passing mechanism to refine the feature information between three branches. DADNet [35] introduces dilated CNN with different rates on the multi-column structure to capture more multi-scale contextual information. McML [36] introduces a statistical network, so that it can estimate the mutual information between different scales.

In general, these multi-column based methods can effectively extract multi-scale features from different branches to accommodate crowds at different scales. However, the network architectures of the branches in these methods are almost identical, which inevitably leads to significant information redundancy.

2.1.2 Multi-level based method

Multi-level based methods exploit the hierarchical structure of CNN to capture multi-scale features from within the backbone network. For example, ANF [37] introduces Conditional Random Fields (CRFs) to aggregate multi-level features. It also incorporates non-local attention mechanisms as inter- and intra-layer attention to expand the receptive fields between different levels. MBTTBF [38] introduces a bi-directional feature fusion method in the multi-level network to facilitate the full fusion of multi-level features. TEDNet [39] proposes a decoder with dense skip connections in the multi-level structure to hierarchically aggregate multi-level features. SaCNN [40] extracts multi-level features from the backbone FCN hierarchically and resizes them to the same size for generating crowd density maps.

These methods effectively extract features at different scales from within the network, which significantly reduces the computational cost of the model. However, the low-level features they extract contain more edge information, which may interfere with the final counting results.

2.2 Loss function

Loss functions aim to measure the difference between the predicted results and the real data, thus guiding the model to achieve better optimization results. The loss function commonly used in crowd counting is the MSE loss [41], which mainly evaluates the variability of the data. After that, considering the local similarity between different regions caused by scale variation, some studies [42, 43] began to design different similarity preserving metrics [44, 45] and similarity losses [28, 39] to reduce the difference between the ground-truth and the estimated density map. DSSINet [29] designs a Dilated Multiscale Structural Similarity (DMS-SSIM) loss to learn local correlations within regions of different sizes. Specifically, the DMS-SSIM is computed by measuring the structural similarity between the multi-scale region centered at a given pixel on the estimated density map and the corresponding region on the ground-truth. TEDNet [39] proposes a combinatorial loss to learn local correlations and spatial correlations in density maps. CFANet [28] designs a novel structural loss (SL), which improves both the structural similarity between the estimated density map and the ground-truth and the counting accuracy.

Fig. 2

The architecture of Hierarchical Region-Aware Network (HRANet). The HRANet consists of four main components: Encoder, Region-Aware Module (RAM), Region Recalibration Module (RRM), and Decoder. First, the encoder is used to extract the multi-level feature representation of the given image. Then the RAM extracts adaptive contextual features of different regions at each level. After that, the RRM is used to further recalibrate the feature weights on different regions. Finally, multiple decoders are used to aggregate the multi-level features hierarchically to generate the final crowd density map

3 Proposed method

In this section, we first provide the overview of the proposed Hierarchical Region-Aware Network (HRANet) and then describe each module in detail.

3.1 Overview

In our implementation, the HRANet adopts an encoder-decoder architecture based on UNet [46], whereby we can obtain a high-resolution estimated density map while retaining more spatial detail. As illustrated in Fig. 2, our HRANet consists of four components: encoder, RAM, RRM, and decoder.

To flexibly capture deeper features, we adopt the first 13 layers of a pre-trained VGG-16 as the encoder by virtue of its structure, which is well suited to subsequent feature fusion. Given a crowd image I, the encoder outputs the multi-level features \(\{{F_{3}^{1}}, {F_{4}^{1}}, {F_{5}^{1}} \}\), which contain contextual information at different scales.
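As a rough illustration, such a multi-level encoder could be assembled in PyTorch as in the sketch below; the stage boundaries and which pooling layers are retained are our assumptions, since only the first 13 convolutional layers and three pooling layers are specified.

```python
import torch.nn as nn
from torchvision import models

class VGG16Encoder(nn.Module):
    """Multi-level encoder built from the first 13 conv layers of VGG-16.

    The stage boundaries and which pooling layers are kept are assumptions;
    the paper only states that the first 13 conv layers and 3 pooling layers are used.
    """
    def __init__(self):
        super().__init__()
        vgg = models.vgg16(pretrained=True).features
        self.stage3 = vgg[:16]     # conv1_1 ... conv3_3 (+ two max-pooling layers)
        self.pool3 = nn.MaxPool2d(kernel_size=2, stride=2)
        self.stage4 = vgg[17:23]   # conv4_1 ... conv4_3
        self.stage5 = vgg[24:30]   # conv5_1 ... conv5_3

    def forward(self, x):
        f3 = self.stage3(x)               # F_3^1, 256 channels
        f4 = self.stage4(self.pool3(f3))  # F_4^1, 512 channels
        f5 = self.stage5(f4)              # F_5^1, 512 channels (assumed same resolution as F_4^1)
        return f3, f4, f5
```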

The feature \({F_{i}^{1}}\) at each level can only encode a single receptive field size, which limits its expressive capacity. To remedy this, we design the RAM to extract wider contextual information and the RRM to recalibrate the weights of this contextual information, thus improving the expressive capability of feature \({F_{i}^{1}}\). To be specific, the RAM is responsible for capturing the internal differences within different regions of the multi-level features \({F_{i}^{1}}\) and generating region adaptive features \({F_{i}^{2}}\). Subsequently, the RRM is utilized to further capture the relative influence of \({F_{i}^{2}}\) on each image location to recalibrate the weights of different regions, generating the final output features \({F_{i}^{3}}\).

Based on the final output features, we apply multiple decoders to aggregate them hierarchically. Each decoder is composed of a bilinear up-sampling operation followed by a \(1\times 1\) convolutional operation:

$$\begin{aligned} F_{i}^{3}=Conv\left( Concat\left( F_{i}^{3},U_{bi}\left( F_{i+1}^{3}\right) \right) \right) \end{aligned}$$
(1)

where \(U_{bi}\) represents the bilinear up-sampling operation, Concat() represents the concatenation operation, and Conv() represents the \(1\times 1\) convolutional operation.

Finally, we apply a set of convolutional layers to regress the aggregated features into the crowd density map. The detailed structure of these layers is: C(128, 3)-U-C(64, 3)-C(1, 3), where C(m, n) represents a convolutional layer with m channels and a kernel size of n, and U represents the bilinear interpolation up-sampling operation.
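For illustration, a hedged PyTorch sketch of the hierarchical aggregation in (1) together with the regression head C(128, 3)-U-C(64, 3)-C(1, 3) is shown below; the channel widths of \(F_{i}^{3}\) and the activations between layers are assumptions not stated above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Decoder(nn.Module):
    """Hierarchical aggregation of (1) followed by the head C(128,3)-U-C(64,3)-C(1,3)."""
    def __init__(self, channels=(256, 512, 512)):
        # channels: assumed widths of F_3^3, F_4^3, F_5^3
        super().__init__()
        c3, c4, c5 = channels
        self.fuse4 = nn.Conv2d(c4 + c5, c4, kernel_size=1)  # Conv(Concat(F_4^3, Up(F_5^3)))
        self.fuse3 = nn.Conv2d(c3 + c4, c3, kernel_size=1)  # Conv(Concat(F_3^3, Up(F_4^3)))
        self.head = nn.Sequential(
            nn.Conv2d(c3, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
            nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, 3, padding=1),
        )

    @staticmethod
    def _up(x, ref):
        # bilinear up-sampling to the spatial size of the lower-level feature
        return F.interpolate(x, size=ref.shape[2:], mode='bilinear', align_corners=False)

    def forward(self, f3, f4, f5):
        f4 = self.fuse4(torch.cat([f4, self._up(f5, f4)], dim=1))
        f3 = self.fuse3(torch.cat([f3, self._up(f4, f3)], dim=1))
        return self.head(f3)  # estimated density map
```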

3.2 Region-aware module

RAM aims to adaptively extract contextual features on different regions. As illustrated in Fig. 3, the RAM consists of two branches, which are responsible for extracting global contextual features and region adaptive features, respectively.

Fig. 3

The details of Region-Aware Module (RAM). \(\oplus\) represents the channel concatenation operation

In the first branch, we squeeze the number of channels of \({F_{i}^{1}}\) with a \(1\times 1\) convolutional layer while preserving the representation power of the global contextual features, thereby obtaining the global contextual features \(\widetilde{F_{i}^{2}}\). In the second branch, we design a region-aware block, consisting of a \(3 \times 3\) average pooling layer followed by a \(1\times 1\) convolutional layer, to capture the internal differences within different regions of the crowd image. Specifically, we first divide the multi-level features \({F_{i}^{1}}\) into different regions using the average pooling operation. Then, the pooled features are passed through the \(1\times 1\) convolution to generate the region-aware features \(\hat{F_{i}^{2}}\), which contain adaptive contextual information of different regions.

Afterwards, we combine the global contextual features \(\widetilde{F_{i}^{2}}\) and the region-aware features \(\hat{F_{i}^{2}}\) using the channel concatenation operation. Finally, the concatenated features are passed through a \(3 \times 3\) depthwise convolution to obtain the region features \({F_{i}^{2}}\):

$$\begin{aligned} {F_{i}^{2}}=Dconv\left( Concat\left( \widetilde{F_{i}^{2}}, \hat{F_{i}^{2}}\right) \right) \end{aligned}$$
(2)

where Concat() is the concatenation operation, Dconv() is the \(3 \times 3\) depthwise convolutional operation.
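A minimal PyTorch sketch of the RAM is given below; the number of output channels per branch and the stride of the \(3 \times 3\) average pooling are our assumptions, since they are not specified above.

```python
import torch
import torch.nn as nn

class RegionAwareModule(nn.Module):
    """Region-Aware Module (RAM): global branch + region-aware branch, fused by (2)."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        # Branch 1: 1x1 conv squeezing channels -> global contextual features
        self.global_branch = nn.Conv2d(in_channels, out_channels // 2, kernel_size=1)
        # Branch 2: region-aware block = 3x3 average pooling + 1x1 conv
        self.region_branch = nn.Sequential(
            nn.AvgPool2d(kernel_size=3, stride=1, padding=1),  # stride-1 pooling keeps resolution (assumption)
            nn.Conv2d(in_channels, out_channels // 2, kernel_size=1),
        )
        # 3x3 depthwise convolution fusing the concatenated branches, Eq. (2)
        self.fuse = nn.Conv2d(out_channels, out_channels, kernel_size=3,
                              padding=1, groups=out_channels)

    def forward(self, f1):
        g = self.global_branch(f1)   # global contextual features
        r = self.region_branch(f1)   # region-aware features
        return self.fuse(torch.cat([g, r], dim=1))  # region features F_i^2
```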

3.3 Region recalibration module

To suppress the influence of background features on counting results and further improve the representation ability of multi-level features, RRM is designed to recalibrate the weights of different regions in the feature map. The RRM is implemented with a Region-aware Attention Mechanism (RAAM), as shown in Fig. 4.

Fig. 4

The details of the Region-Aware Attention Mechanism (RAAM). \(\oplus\) represents the channel concatenation operation, \(\ominus\) represents the subtraction operation, and \(\otimes\) represents the element-wise multiplication operation

Taking region features \({F_{i}^{2}}\) and multi-level features \({F_{i}^{1}}\) as input, RAAM first captures the relative influence of region features on each image location by computing the contrast features \(\widetilde{F_{i}^{3}}\):

$$\begin{aligned} {\widetilde{F_{i}^{3}}}=F_{i}^{2}-F_{i}^{1} \end{aligned}$$
(3)

Then, we send these contrast features \(\widetilde{F_{i}^{3}}\) into a \(1 \times 1\) convolutional network followed by a sigmoid activation layer to calculate weights \({W_{i}}\) of different regions:

$$\begin{aligned} {{W_{i}}}=Sig\left( Conv\left( \widetilde{F_{i}^{3}}\right) \right) \end{aligned}$$
(4)

where Sig() is the sigmoid activation function, Conv() is the \(1 \times 1\) convolutional operation.

To obtain the recalibration features \(\hat{F_{i}^{3}}\), an element-wise multiplication is performed between the region features \({F_{i}^{2}}\) and the weights \(W_{i}\):

$$\begin{aligned} \hat{F_{i}^{3}}=W_{i}\otimes F_{i}^{2} \end{aligned}$$
(5)

where \(\otimes\) denotes element-wise multiplication between the weights \(W_{i}\) and the region features \({F_{i}^{2}}\).

Finally, we combine the recalibrated feature \(\hat{F_{i}^{3}}\) and the multi-level feature \({F_{i}^{1}}\) by element-wise summation to obtain the final feature \({F_{i}^{3}}\), which contains rich global information and locally adaptive region information:

$$\begin{aligned} {F_{i}^{3}}=\hat{F_{i}^{3}} + F_{i}^{1} \end{aligned}$$
(6)
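The RRM can be sketched in PyTorch as follows; we assume that \(F_{i}^{1}\) and \(F_{i}^{2}\) share the same channel count so that the subtraction in (3) and the summation in (6) are well defined.

```python
import torch.nn as nn

class RegionRecalibrationModule(nn.Module):
    """Region Recalibration Module (RRM) with the region-aware attention of (3)-(6)."""
    def __init__(self, channels):
        super().__init__()
        self.weight_conv = nn.Conv2d(channels, channels, kernel_size=1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, f1, f2):
        contrast = f2 - f1                            # Eq. (3): contrast features
        w = self.sigmoid(self.weight_conv(contrast))  # Eq. (4): region weights W_i
        recalibrated = w * f2                         # Eq. (5): recalibration features
        return recalibrated + f1                      # Eq. (6): final features F_i^3
```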

3.4 Loss function

Fig. 5

The details of the Region Awareness Loss (RAL). \(\oplus\) represents the summation operation

During training, we adopt MSE loss [41, 47] to measure the difference between the estimated density map and ground-truth, which is expressed as

$$\begin{aligned} L_{den}=\frac{1}{N} \sum \limits _{i=1}^{N}\left\| D_{i}-D_{i}^{gt}\right\| _{2}^{2} \end{aligned}$$
(7)

where N is the number of images, \(D_i\) is the density map of the given image I, and \(D_{i}^{gt}\) is the ground-truth.

In addition, considering the local correlation within regions of different sizes in the density map, we also introduce a Region Awareness Loss (RAL) at each level to guide the proposed model to learn the scale diversity of local regions, as shown in Fig. 5.

Specifically, we first apply a \(1 \times 1\) convolution to the output features of each level to obtain the prediction \(X_i\) of this level, and up-sample \(X_i\) so that it has the same size as the ground-truth Y. Then the ground-truth Y is divided into three layers of \(1\times 1\), \(2 \times 2\), and \(4 \times 4\) sub-regions by applying different average pooling layers, and the same operation is performed on \(X_i\). Finally, the structural similarity loss between \(X_i\) and Y is computed separately on each sub-region, and \(L_{ral}\) is obtained by aggregating the losses over all layers:

$$\begin{aligned} L_{ral}= \frac{1}{N} \sum \limits _{i=1}^{N} \sum \limits _{j=1}^{S} \frac{1}{k_{j}^{2}}\left( 1-{SSIM}\left( P_{{ave }}\left( Y ; k_{j}\right) , P_{{ave }}\left( X_{i} ; k_{j}\right) \right) \right) \end{aligned}$$
(8)

where \(P_{ave}\) denotes the average pooling operation, \(k_{j}^{2}\) denotes the number of sub-regions, S is the number of average pooling layers, and N is the number of images. SSIM denotes the structural similarity measure [48].
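A simplified PyTorch sketch of the RAL is shown below; the pooling configuration for the \(1\times 1\), \(2 \times 2\), and \(4 \times 4\) sub-regions and the use of a single-window SSIM (rather than the sliding-window variant of [48]) are our simplifying assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

def global_ssim(x, y, c1=1e-4, c2=9e-4):
    """Single-window SSIM over an entire map (simplified; no sliding window)."""
    mu_x, mu_y = x.mean(dim=(2, 3)), y.mean(dim=(2, 3))
    var_x = x.var(dim=(2, 3), unbiased=False)
    var_y = y.var(dim=(2, 3), unbiased=False)
    cov = ((x - mu_x[..., None, None]) * (y - mu_y[..., None, None])).mean(dim=(2, 3))
    ssim = ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
    return ssim.mean()

class RegionAwarenessLoss(nn.Module):
    """Region Awareness Loss (RAL) of (8), with pooling scales k_j in {1, 2, 4}."""
    def __init__(self, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales

    def forward(self, pred, gt):
        # pred: a level prediction X_i; gt: the ground-truth density map Y
        if pred.shape[2:] != gt.shape[2:]:
            pred = F.interpolate(pred, size=gt.shape[2:], mode='bilinear', align_corners=False)
        loss = 0.0
        for k in self.scales:
            p = F.avg_pool2d(pred, kernel_size=k) if k > 1 else pred
            g = F.avg_pool2d(gt, kernel_size=k) if k > 1 else gt
            loss = loss + (1.0 - global_ssim(p, g)) / (k * k)
        return loss
```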

Therefore, the final loss is the summation of all the above losses.

$$\begin{aligned} L_{final}=L_{den}+\lambda \sum \limits _{i=3}^{5}{L_{ral}^{i}} \end{aligned}$$
(9)

where \(\lambda\)=0.01 is the scale-specific weight, which is used to balance the influence of \({L_{ral}^{i}}\).
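For completeness, the combined objective of (9) could be computed as in the following sketch, reusing the RegionAwarenessLoss class sketched above; the form of the per-level predictions is an assumption.

```python
import torch.nn as nn

# Combined objective of (9), reusing the RegionAwarenessLoss class sketched above.
mse_loss = nn.MSELoss()
ral_loss = RegionAwarenessLoss()  # from the earlier sketch
lam = 0.01                        # scale-specific weight lambda

def final_loss(density_pred, level_preds, gt_density):
    # level_preds: per-level predictions X_3, X_4, X_5 (assumed to be a list of tensors)
    ral_terms = sum(ral_loss(x, gt_density) for x in level_preds)
    return mse_loss(density_pred, gt_density) + lam * ral_terms
```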

4 Experiments

In this section, we validate the performance of the proposed method on five publicly available crowd counting datasets from several aspects, and compare it with existing methods to demonstrate its effectiveness.

4.1 Implementation details

Ground-truth generation We use a Gaussian kernel (normalized to 1) to generate the ground-truth by blurring each head annotation in the crowd image. For sparse crowd scenes, we follow CSRNet [49] and use a fixed Gaussian kernel to produce the ground-truth. For dense crowd scenes, we use an adaptive Gaussian kernel to produce the ground-truth F(x):

$$\begin{aligned} F(x)=\sum \limits _{i=1}^{N} \delta \left( x-x_{i}\right) * G_{\sigma _{i}}(x), \quad \text { with } \sigma _{i}=\beta \bar{d}_{i} \end{aligned}$$
(10)

where x is the location of a pixel in the crowd image and N represents the number of head annotations. \({x_{i}}\) is the position of the i-th head annotation in the ground-truth delta map \(\delta\), and \(\bar{d}_{i}\) is the average distance to its k nearest neighbors. We convolve \({\delta \left( x-x_{i}\right) }\) with a Gaussian kernel of parameter \(\sigma _{i}\), where \(\beta\) is a constant. In our implementation, we set \(\beta\)=0.3 and \(\sigma _{i}\) = 3.

Evaluation metrics The mean absolute error (MAE) and the mean square error (MSE) are used to evaluate the counting accuracy and the robustness of our method, respectively. They are defined as:

$$\begin{aligned} M A E= & {} \frac{1}{N}\sum \limits _{i=1}^{N}\left| z_{i}-\hat{z}_{i}\right| \end{aligned}$$
(11)
$$\begin{aligned} M S E= & {} \sqrt{\frac{1}{N}\sum \limits _{i=1}^{N}\left| z_{i}-\hat{z}_{i}\right| ^{2}} \end{aligned}$$
(12)

where N is the total number of crowd images, \(z_i\) is the ground-truth count of the i-th image, and \(\hat{z}_{i}\) is the prediction of our model for the i-th image. MAE measures the accuracy of the crowd estimation and MSE measures its robustness.
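The two metrics of (11) and (12) can be computed from per-image counts as in this short sketch (the example counts are purely illustrative).

```python
import numpy as np

def mae_mse(pred_counts, gt_counts):
    """MAE of (11) and MSE of (12), computed from per-image people counts."""
    pred = np.asarray(pred_counts, dtype=float)
    gt = np.asarray(gt_counts, dtype=float)
    mae = np.mean(np.abs(gt - pred))
    mse = np.sqrt(np.mean((gt - pred) ** 2))
    return mae, mse

# Illustrative example: predicted counts obtained by summing each estimated density map.
print(mae_mse([105.2, 48.7, 312.0], [100, 50, 300]))
```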

To more intuitively evaluate the quality of the crowd density maps generated by our method, we adopt the Peak Signal-to-Noise Ratio (PSNR) and the Structural Similarity Index (SSIM) [48] as evaluation criteria. PSNR measures the error between corresponding pixels, while SSIM measures the structural similarity between the ground-truth and the estimated density map.

Implementation In our implementation, the encoder of our model consists of the first 13 convolutional layers and 3 pooling layers of VGG-16, pre-trained on ImageNet. The other convolutional layers are randomly initialized from a zero-mean Gaussian distribution with a standard deviation of 0.01. Our up-sampling operations are based on bilinear interpolation. We use the Adam optimizer to train our network for 200 epochs, with the learning rate initially set to 1e-5.
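A hedged sketch of the training setup described above is given below; the names model, train_loader, and compute_loss are hypothetical placeholders for the assembled HRANet, the data loader, and the objective of (9).

```python
import torch
import torch.nn as nn

def init_non_pretrained(module):
    """Zero-mean Gaussian initialization (std 0.01) for layers not taken from VGG-16."""
    if isinstance(module, nn.Conv2d):
        nn.init.normal_(module.weight, mean=0.0, std=0.01)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

def train(model, train_loader, compute_loss, epochs=200, lr=1e-5, device="cuda"):
    """Adam training loop with the settings reported above (200 epochs, initial lr 1e-5)."""
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for image, gt_density in train_loader:
            image, gt_density = image.to(device), gt_density.to(device)
            loss = compute_loss(model(image), gt_density)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```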

4.2 Datasets

ShanghaiTech [32] This dataset is one of the largest publicly available benchmarks. It contains 1198 images with 330,165 head annotations and is composed of Part A and Part B. Part A contains 482 images randomly collected from the Internet, where the crowd scenes are relatively dense. Part B contains 716 images taken from a commercial street, where the crowd scenes are relatively sparse.

UCF_CC_50 [47] This dataset is the first publicly available and extremely challenging benchmark. It contains only 50 highly dense crowd images, and the number of annotations per image varies from 94 to 4543. Since the dataset is small, we perform 5-fold cross-validation following CSRNet [49] for the comparison experiments.

UCF-QNRF [50] This dataset is one of the largest publicly available crowd image datasets, consisting of 1535 crowd images with 1,251,642 annotations. Since the number of people per image ranges widely from 49 to 12,865, it is extremely challenging.

WorldExpo’10 [51] This dataset is a publicly available large-scale crowd counting benchmark. It includes 1,132 annotated video sequences, each with a frame rate of 50 fps and a resolution of \(576\times 720\) pixels. The test set covers 5 scenes, each containing 120 frames.

4.3 Counting performance in sparse crowd scenes

We validate the performance of our method on two datasets with relatively sparse crowd scenes, ShanghaiTech Part B [32] and WorldExpo’10 [51], and compare it with existing representative methods to analyze its advantages in sparse crowd scenes.

Table 1 Performance comparison on the dataset with sparse crowd scenes

ShanghaiTech Part B [32] The comparison results are shown in Table 1. In terms of MAE, our method improves by 10% over DSSINet, the best-performing existing method. Similarly, in terms of MSE, our method improves by 6% over DSSINet.

WorldExpo’10 [51] As can be seen from Table 1, we achieve the best average MAE, which is 10% better than that of the second-best method. Furthermore, in scene 2, our method still outperforms the other methods, leading the second-best by 5% in terms of MAE. Similarly, in scene 4, our method is 11% ahead of the second best.

In summary, our method significantly outperforms existing methods in crowd scenes with relatively low crowd density, which is attributed to our Region-Aware Module (RAM). The RAM adaptively captures the internal differences within different regions, so our method captures local contextual information to the maximum extent and thus performs crowd counting more accurately.

4.4 Counting performance in dense crowd scenes

We validate the performance of our method on three datasets with dense crowd scenes, ShanghaiTech Part A [32], UCF_CC_50 [47], and UCF-QNRF [50], and analyze the advantages of our method in dense crowd scenes.

Table 2 Performance comparison on the dataset with dense crowd scenes

ShanghaiTech Part A [32] The comparison results are shown in Table 2. Compared with EDENet, which has the best performance among the existing methods, our method improves the MAE by 2%, indicating higher counting accuracy. Similarly, our method improves the MSE by 3% compared with CDADNet, reflecting its higher robustness.

UCF_CC_50 [47] The comparison results are shown in Table 2. In terms of robustness, our method achieves the second-best performance, with CDADNet ahead of our method by 3%. However, in terms of counting accuracy, our method achieves the best performance, 3% ahead of CDADNet.

UCF-QNRF [50] The comparison is summarized in Table 2. Our method improves MAE by 2% compared with EDENet, which has the best performance among the existing methods. Our method also produces the best MSE, which is 8% better than that of the second-best method. Moreover, compared with the recently proposed ASANet and MTL-DP, our method is 18% and 14% ahead in terms of MSE, respectively.

These comparison results indicate that our method still has significant advantages in dense crowd regions, which reflects the effectiveness of our Region-Aware Attention Mechanism (RAAM). The RAAM further recalibrates the weights of different regions in the feature map, so our method can focus on the crowd regions in the image and avoid the influence of complex background on the counting results.

4.5 Density map quality

In this section, we evaluate the quality of the generated density maps using PSNR and SSIM as evaluation criteria. Specifically, we compare our method with some representative methods to analyze its superiority in terms of the quality of the generated density maps.

Table 3 Comparison results of density map quality on ShanghaiTech Part A dataset
Fig. 6

The training process on the ShanghaiTech part A dataset

Fig. 7

The impact of parameter \(\lambda\) on different datasets. (a). The impact of \(\lambda\) on dense crowd scenes. (b). The impact of \(\lambda\) on sparse crowd scenes

Fig. 8

Visually qualitative analysis on different crowd scenes

As can be seen from Table 3, our method obtains a PSNR of 28.31 and an SSIM of 0.89, improvements of 9% and 7% over the second-best method. These results show that our method is significantly better than the other methods, indicating that it can generate higher-quality crowd density maps.

4.6 Ablation study

To validate the effectiveness of each component of our network, in this section we perform several sets of ablation experiments on the ShanghaiTech Part A&B datasets [32] to analyze the influence of different settings on the counting performance.

We use the proposed encoder and decoder as the backbone and MSE loss as the basic loss function. We then gradually add the corresponding modules to verify their effectiveness on the ShanghaiTech dataset. Furthermore, we visualize some density maps generated by the corresponding components to visually compare their performance, as shown in Fig. 8.

Table 4 The results produced by the different components of our method

Firstly, we verify the effectiveness of RAM; the results are shown in Table 4. Compared with the Backbone alone, Backbone+RAM improves MAE by 3% and MSE by 4% on the ShanghaiTech Part A dataset, and improves MAE by 17% and MSE by 8% on the ShanghaiTech Part B dataset. This significant improvement shows that our RAM can effectively capture the internal variation of crowd regions to extract wider contextual information. As shown in Fig. 8, compared with the Backbone, Backbone+RAM avoids the influence of background regions to some extent, which further demonstrates the excellent performance of our RAM in extracting local region features.

Secondly, we verify the effectiveness of RRM; the results are shown in Table 4. Compared with the above two versions, Backbone+RAM+RRM improves MAE by 10% and 21% respectively on the ShanghaiTech Part A&B datasets. This substantial performance improvement fully validates the effectiveness of the RAAM. As seen in Fig. 8, the RAAM further recalibrates the weights of local regions, which allows our method to focus better on the crowd regions and avoid the influence of background regions on the final counting results.

Fig. 9

Density maps generated by different methods. From top to bottom: sample, ground-truth, and the results of MCNN, SCAR, CAN, and HRANet

Thirdly, we verify the effectiveness of the RAL. Compared with using MSE as the loss function alone, using RAL to learn local correlations within different regions improves the MAE of our model by 10% and the MSE by 8% on ShanghaiTech Part A. Therefore, we conclude that using RAL as the loss function improves the counting performance more effectively. Figure 6 visualizes the training process of our method with different loss functions, from which we can see that training with RAL converges faster and more smoothly.

Finally, we verify the impact of the parameter \(\lambda\) in the loss function on the counting performance. As shown in Fig. 7, our method achieves the optimal performance on different datasets when \(\lambda\) is set to 0.01. Therefore, we choose \(\lambda\)=0.01 as the scale-specific weight to balance the impact of RAL at each level.

Fig. 10

The crowd localization performance of the proposed method. The two rows from top to bottom are the crowd images and the localization maps produced by the proposed method

5 Analysis

In this section, we further analyze the differences between our method and existing methods. In addition, we also verify the generalization ability of our method in crowd localization.

5.1 Analysis of counting performance

We visualize some density maps generated by the proposed method and some representative methods in different crowd scenes, as shown in Fig. 9. Based on this figure, we further analyze the reasons why our method outperforms existing methods.

As shown in the yellow-marked regions in the figure, when dealing with crowd scenes with large scale variation, the crowd distribution in the density map generated by our method is closer to the ground-truth, reflecting that our method responds more effectively to rapid scale variation. Since the features at different levels are complementary, combining them improves the representation ability of multi-scale features; our method combines these features hierarchically and is therefore more robust to scale variation.

As shown in the red-marked regions in the figure, when dealing with crowd scenes with a complex background, our method easily adapts to cluttered crowd scenes, reflecting that it can better focus on crowd regions. Owing to the region-aware attention mechanism, our method further recalibrates the weights of different regions, thus suppressing the influence of background regions on the counting results.

Similarly, as shown in the last two columns of the figure, when dealing with crowd scenes of different densities, our method estimates a number of pedestrians closer to the ground-truth, which is attributed to our Region-Aware Module (RAM). The RAM effectively extracts adaptive features within different regions, so our method can capture wider contextual information than existing methods.

5.2 Analysis of generalization ability

In this section, we verify the generalization ability of our method for performing crowd localization on the NWPU dataset [8].

In our implementation, we deploy our method in the IIM framework [62] proposed by Gao et al. We then follow existing crowd localization methods and apply Recall (Rec.), instance-level Precision (Pre.), and the F1-measure to evaluate our method. All of the above evaluation criteria are defined in [8].

As can be seen from Fig. 11, our method significantly outperforms the average level. This is attributed to the fact that our method generates high-resolution feature maps, which provide more spatial information for crowd localization. In addition, our region-aware attention mechanism further recalibrates the weights of different regions, so our method can effectively avoid the interference of noisy information and accurately locate pedestrians' heads.

Additionally, to illustrate the performance of the proposed method in performing crowd localization, we visualize the crowd localization maps generated by our method, as shown in Fig. 10.

Fig. 11

Performance comparison of different methods for performing crowd localization

6 Conclusion

In this paper, we devise a Hierarchical Region-Aware Network (HRANet) for crowd counting, which addresses the crowd scale variation problem and the complex background problem from two aspects: focusing more accurately on crowd regions and adequately considering local correlations in different regions. For the former, we propose a Region-Aware Module (RAM) and a Region Recalibration Module (RRM). By integrating these two modules, our method can capture the influence of local regions on different image locations and further recalibrate the weights of different regions, thus suppressing the representation ability of background information. For the latter, we design a Region Awareness Loss (RAL), which guides our model to learn global and local correlations in the crowd density map, thus producing locally consistent density maps. Several sets of experiments on five crowd counting datasets show that the proposed method achieves more accurate counting results than existing advanced methods. In addition, multiple sets of comparative experiments fully demonstrate the strong generalization ability of our method.