1 Introduction

Crowd counting aims to estimate the number and density distribution of people in an image or video frame. It is particularly prominent because of its significance for public safety and management [6, 52, 53]; during the COVID-19 pandemic in particular, accurate crowd counting helps avoid dangerous gatherings. The task has therefore attracted widespread attention from many scholars [5, 12, 23, 36, 38, 49, 58]. However, crowd counting remains very challenging due to the large scale variation of crowd heads and complex backgrounds.

In recent years, with the renaissance of deep learning [28, 50, 51], convolutional neural network (CNN) based methods [8, 15] have achieved significant progress on the crowd counting task [13, 18, 27, 55, 63]. They formulate the task as a regression problem [10, 22, 32, 39, 64], designing sophisticated networks to establish the nonlinear mapping between an input crowd image and its corresponding crowd density map. Among the open problems, efficiently modeling the scale variations of a crowd is a classical and active research topic, and many researchers have proposed methods to handle it. For example, multi-column networks [37, 44, 66] are designed to model different scales of crowd heads. However, they usually have complicated structures, take a long time to optimize, and require large computational resources, which makes them inappropriate for real-world applications. Encoder-decoder frameworks have recently become very popular in crowd counting, with much effort devoted to sophisticated decoders. For instance, Zhao et al. [68] design a decoder with different auxiliary task branches to obtain robust representations from the auxiliary tasks. Xie et al. [60] extract multi-scale features via a decoder built from stacked dilated convolutional layers and recurrent modules. Although these methods achieve promising performance, they may fail in scenes with extreme crowd scale variations and complicated background stuff due to their limited scale representations and representational ability. Thus, modeling a crowd's scale variations across different scenes is still a challenging and unsolved problem for crowd counting.

To solve the challenges mentioned above, we aim to extract efficient multi-scale feature representations for crowd counting from three aspects: (1) extract high-level semantic features of the crowd to enhance crowd-aware representations; (2) model continuous scale variations of the crowd for multi-scale crowd counting; and (3) extract long-range pixel dependencies to obtain context information. To this end, we propose a compositional multi-scale feature enhanced learning approach (COMAL). Specifically, for semantic feature enhancement, the semantic enhanced module (SEM) embeds semantic information from high-level features into the multi-scale crowd features. For scale diversity enhancement, the diversity enhanced module (DEM) enriches the variety of feature representations via three diversity enhanced blocks in a cascade manner. For context enhancement, the context enhanced module (CEM) extracts context information along the spatial and channel dimensions via neural attention mechanisms. With the help of COMAL, the multi-scale features gain strong representational ability and abundant feature representations, which can handle the scale variation challenges of different crowd scenes. Based on the proposed COMAL, we design a counting network under the encoder-decoder framework, where COMAL serves as the decoder for final crowd density estimation. Extensive experiments on commonly-used crowd counting benchmarks show that our network outperforms other state-of-the-art methods, and the visualization results further prove the effectiveness of the proposed COMAL.

To summarize, the main contributions of our paper are fourfold:

  • We propose a semantic enhanced module (SEM) to embed high-level semantic information into the multi-scale features, which improves crowd recognition performance in complex crowd scenes.

  • We develop a diversity enhanced module (DEM) to enrich the scale representations. It helps the counting network handle cases of extreme scale variation better.

  • We design a context enhanced module (CEM) to strengthen the extracted multi-scale features with more context information. CEM helps the counting network distinguish the foreground crowd from background stuff in complex crowd scenes.

  • We combine the above three modules into a compositional learning approach, COMAL, and build an encoder-decoder network based on it for crowd counting. With the assistance of COMAL, the counting network outperforms other state-of-the-art methods on commonly-used crowd counting benchmarks.

The rest of this paper is organized as follows. Section 2 reviews related work on CNN-based crowd counting and multi-scale feature learning. In Section 3, we introduce COMAL and its components in detail. We present the experimental details and model analysis in Section 4 and conclude in Section 5.

2 Related works

In this section, we review the CNN-based crowd counting methods and multi-scale feature representation learning methods.

2.1 CNN-based crowd counting

We first review CNN-based crowd counting methods [7, 29, 34, 45, 46, 57, 62] and summarize them in Table 1. For example, Zhang et al. [66] proposed a Multi-column Convolutional Neural Network (MCNN) with different convolutional structures to handle the scale variations of crowd heads. Sam et al. [44] designed Switch-CNN, which trained a switch classifier to select the optimal CNN regressor for a specific scale of density estimation. A limitation of Switch-CNN, however, is that it chooses one of the sub-network results rather than fusing them. Deb et al. [11] proposed an aggregated multi-column dilated convolution network for perspective-free counting. Although the above multi-column networks achieved significant progress, they only consider a limited set of crowd scales and do not perform well in scenes with continuous scale variation. To reduce computational cost, Li et al. [30] proposed CSRNet, which adopted dilated convolutional layers to enlarge the receptive field of the network. However, the six successive dilated convolutional layers of CSRNet cause a serious gridding effect [54] and cannot efficiently extract crowd features. To solve this problem, our SEM adopts multiple parallel filters with different dilation rates to exploit multi-scale features. Cao et al. [3] proposed a scale aggregation network (SANet), which applied a scale aggregation module to extract multi-scale features and a transposed convolutional layer to regress the final crowd density map. Besides, some neural attention based methods have also been applied to the crowd counting task [16, 19]. Guo et al. [19] explored a scale-aware attention fusion method with different dilation rates to obtain different visual granularities of the crowd's regions of interest. Gao et al. [16] proposed a space-/channel-wise attention regression network to exploit the context information of crowd scenes for accurate crowd counting. Such well-designed attention models effectively encode long-range contextual information. In this paper, we propose a compositional learning approach to enhance the multi-scale features, which guides the counting network to learn robust representations for different crowd scenes.

Table 1 Summarization of crowd counting methods

2.2 Multi-scale feature representation learning

Scale variation is a common problem in different computer vision tasks [4, 9, 20, 31, 67], and many multi-scale feature representation learning methods have been proposed to solve it. Lin et al. [31] proposed the feature pyramid network (FPN), which fuses high-level and low-level features by element-wise summation for small object detection. Zhao et al. [67] proposed the pyramid scene parsing network (PSPNet) to aggregate context information at different scales. Inspired by spatial pyramid pooling (SPP) [21], Chen et al. [9] proposed the Atrous Spatial Pyramid Pooling (ASPP) module, which applies four convolutions with different dilation rates. ASPP effectively enlarges the network's receptive field and obtains multi-scale information, which helped the network achieve superior results on the semantic segmentation task. He et al. [20] proposed the Adaptive Pyramid Context Network (APCNet), which uses Adaptive Context Modules to leverage local and global representations and estimate affinity weights for local regions. To capture larger-scale information, Cao et al. [4] proposed the global context network (GCNet), which focuses on the connection between different image positions by establishing long-range relationships between pixels. In this paper, we propose the DEM to enrich the multi-scale feature representations, and apply the proposed SEM and CEM to further strengthen them.

3 Proposed method

In this section, we first give an overview of the counting network with the proposed COMAL. Then, SEM, DEM, and CEM are elaborated. Finally, we describe the loss function and evaluation metrics we use.

3.1 Overview

The overview of the counting network used in this paper is shown in Fig. 1. Following [2, 16, 30], we choose VGG-16 [48] as the feature encoder. However, in order to obtain high-level semantic features, we use the first thirteen convolutional layers instead of only the first ten. The encoder features are then fed to SEM, DEM, and CEM sequentially to obtain the enhanced multi-scale crowd features. Finally, the extracted multi-scale features are processed by a single 1 × 1 convolutional layer and a bilinear interpolation operation to regress the final crowd density map. Each component of the counting network is detailed below, and a sketch of the overall pipeline follows Fig. 1.

Fig. 1
figure 1

Overview of the proposed counting network. Each input image is fed to the first 13 convolutional layers of VGG-16 to extract crowd features. Then, the output of the first 10 layers (low-level features) and of the first 13 layers (high-level features) are sent to SEM, DEM, and CEM to generate the enhanced multi-scale crowd features. Finally, the extracted multi-scale context features are processed by a 1 × 1 convolutional layer and a bilinear interpolation operation for final crowd density estimation
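To make the pipeline concrete, the following is a minimal PyTorch-style sketch of the counting network, composing the SEM, DEM, and CEM sketches given in Sections 3.2-3.4 below. The VGG-16 layer split follows the text; the channel widths and module interfaces are our illustrative assumptions, not the authors' released code.

```python
# A minimal sketch of the counting network, assuming the SEM, DEM, and CEM
# sketches from the following subsections. The VGG-16 split follows the text
# (first 10 / first 13 conv layers); everything else is an assumption.
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16

class CountingNet(nn.Module):
    def __init__(self):
        super().__init__()
        features = vgg16(pretrained=True).features
        # First 10 conv layers of VGG-16 -> low-level features (stride 8).
        self.low_encoder = nn.Sequential(*list(features)[:23])
        # Conv layers 11-13 plus pooling -> high-level features (stride 16).
        self.high_encoder = nn.Sequential(*list(features)[23:30])
        self.sem = SEM(low_ch=512, high_ch=512, out_ch=512)  # Section 3.2
        self.dem = DEM(channels=512, num_blocks=3)           # Section 3.3
        self.cem = CEM(channels=512)                         # Section 3.4
        self.head = nn.Conv2d(512, 1, kernel_size=1)         # density regression

    def forward(self, x):
        low = self.low_encoder(x)      # low-level features
        high = self.high_encoder(low)  # high-level semantic features
        feat = self.sem(low, high)     # semantic enhancement
        feat = self.dem(feat)          # scale-diversity enhancement
        feat = self.cem(feat)          # context enhancement
        density = self.head(feat)
        # Bilinear upsampling back to the input resolution.
        return F.interpolate(density, size=x.shape[2:], mode='bilinear',
                             align_corners=False)
```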

3.2 Semantic enhanced module

We propose SEM to generate multi-scale crowd features with abundant semantic information for final crowd density estimation. The detailed structure of SEM is shown in Fig. 2. It has two paths: the low-level feature processing path (LFP) and the high-level feature processing path (HFP). The LFP is designed to extract multi-scale features, and the HFP enhances the extracted features with high-level semantic information. Specifically, the LFP uses four parallel convolutional layers with different dilation rates to extract features at various scales. The different scale features are then combined by a concatenation operation, and a 1 × 1 convolutional layer is applied to reduce the feature dimension. For the HFP, the high-level features from VGG-16 are fed to a 1 × 1 convolutional layer to reduce the feature dimension and upsampled by bilinear interpolation to the same spatial size as the low-level features. Different from the previous approach [31], which directly uses element-wise summation to fuse the upsampled high-level features and low-level features, we follow the design of ExFuse [65]: the output of the HFP is multiplied with the output of the LFP via an element-wise multiplication operation to generate the initial multi-scale features, which provides the network with more discriminative features. More analysis can be found in Section 4.3.

Fig. 2
figure 2

Illustration of SEM. The low-level features are fed to different dilated convolutional layers to generate the initial multi-scale crowd features. The high-level features from the encoder are multiplied with the output of the dilated convolutional layers to modulate the extracted multi-scale crowd features with more semantic information
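A minimal sketch of SEM under stated assumptions: the text specifies four parallel dilated convolutions, 1 × 1 reductions, bilinear upsampling, and multiplicative fusion, while the dilation rates (1, 2, 4, 8) and channel widths here are our illustrative choices.

```python
# A sketch of SEM: a low-level path (LFP) of parallel dilated 3x3 convolutions
# and a high-level path (HFP) of a 1x1 bottleneck plus bilinear upsampling,
# fused by element-wise multiplication. Dilation rates are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SEM(nn.Module):
    def __init__(self, low_ch=512, high_ch=512, out_ch=512,
                 dilations=(1, 2, 4, 8)):
        super().__init__()
        # LFP: parallel dilated convolutions capture different scales.
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(low_ch, out_ch, 3, padding=d, dilation=d),
                nn.ReLU(inplace=True))
            for d in dilations])
        # 1x1 convolution reduces the concatenated feature dimension.
        self.reduce = nn.Conv2d(out_ch * len(dilations), out_ch, 1)
        # HFP: 1x1 convolution reduces the high-level feature dimension.
        self.high_proj = nn.Conv2d(high_ch, out_ch, 1)

    def forward(self, low, high):
        multi_scale = torch.cat([b(low) for b in self.branches], dim=1)
        multi_scale = self.reduce(multi_scale)
        # Upsample high-level features to the low-level spatial size.
        high = F.interpolate(self.high_proj(high), size=low.shape[2:],
                             mode='bilinear', align_corners=False)
        # ExFuse-style fusion: multiplication instead of summation.
        return multi_scale * high
```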

3.3 Diversity enhanced module

Although SEM generates multi-scale crowd features, their representations are still limited, which hinders the performance of the counting network in complex scenes. To increase the diversity of crowd features, we design the DEM, which consists of three diversity enhanced blocks (DEBs). The design philosophy of DEM comes from [56]. As shown in Fig. 3, each DEB has two branches: one branch with a single 3 × 3 convolutional layer, and another with two stacked 3 × 3 convolutional layers. All 3 × 3 convolutional layers have half the channel number of the input features, and the outputs of the two branches are fused by element-wise summation. We place three DEBs in a cascade manner after SEM, as shown in Fig. 3(b), which is equivalent to eight parallel branches with different receptive fields, as shown in Fig. 3(c). Thus, DEM can generate abundant crowd features for modeling continuous scale variations. The performance with different numbers of DEBs is reported in Section 4.3.3.

Fig. 3
figure 3

Illustration of DEM. From left to right: (a) the structure of DEB, (b) the structure of three DEBs in a cascade manner (DEM), (c) the equivalent structure of (b)
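The following is a sketch of DEB and DEM under our assumptions. The text specifies two branches with half-width 3 × 3 convolutions fused by summation; the trailing 1 × 1 projection that restores the input width (so that DEBs can be cascaded) is our assumption.

```python
# A sketch of DEB/DEM. Three cascaded two-branch blocks unroll into
# 2^3 = 8 parallel paths with different receptive fields (Fig. 3(c)).
import torch.nn as nn

class DEB(nn.Module):
    def __init__(self, channels):
        super().__init__()
        half = channels // 2
        self.branch1 = nn.Sequential(  # single 3x3 conv
            nn.Conv2d(channels, half, 3, padding=1), nn.ReLU(inplace=True))
        self.branch2 = nn.Sequential(  # two stacked 3x3 convs
            nn.Conv2d(channels, half, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(half, half, 3, padding=1), nn.ReLU(inplace=True))
        # Project back to the input width so DEBs can cascade (assumption).
        self.fuse = nn.Conv2d(half, channels, 1)

    def forward(self, x):
        # Element-wise summation of the two branch outputs.
        return self.fuse(self.branch1(x) + self.branch2(x))

class DEM(nn.Module):
    def __init__(self, channels, num_blocks=3):
        super().__init__()
        self.blocks = nn.Sequential(*[DEB(channels) for _ in range(num_blocks)])

    def forward(self, x):
        return self.blocks(x)
```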

3.4 Context enhanced module

To increase the discriminability of the proposed COMAL, we propose the CEM to exploit context information from the multi-scale crowd features. The detailed architecture of CEM is shown in Fig. 4. CEM includes two branches: a position attention module (PAM) and a channel attention module (CAM), detailed as follows.

Fig. 4
figure 4

Illustration of CEM. The PAM of CEM is designed to exploit context information along the spatial dimension, and the CAM of CEM is developed to acquire context information along the channel dimension

3.4.1 Position attention module

The PAM encodes context information by modeling long-range pixel relationships. Its detailed structure is shown in Fig. 4. The input features are first processed by a 3 × 3 convolutional layer. After that, the processed features are fed into a 1 × 1 convolutional layer and a Softmax layer to get the position attention weight \(P_{i}^{att}\), which can be formulated as follows:

$$ P_{i}^{att}=\frac{\exp \left( P_{i} \cdot P_{j}\right)}{{\sum}_{j=1}^{N} \exp \left( P_{i} \cdot P_{j}\right)}P_{i}, $$
(1)

where {Pi | i ∈ {1,⋯,N}} denotes the feature at the i-th position of the input feature map, and N is the number of positions in the feature map, equal to H × W.

The position attention weight \(P_{i}^{att}\) is then fed to a bottleneck constructed from two 1 × 1 convolutional layers. Specifically, we place a layer normalization (LN) between the two 1 × 1 convolutional layers for better weight optimization. The output of the bottleneck is fused with the input of PAM via residual learning, and the final position attention feature can be formulated as follows:

$$ P^{\text{final}} = P + W_{p2}\, \text{ReLU}\left( \text{LN}\left( W_{p1} {\sum}_{i=1}^{N} P_{i}^{att} \right) \right), $$
(2)

where P denotes the input feature of PAM, ReLU(⋅) and LN(⋅) denote the ReLU and LN layers, respectively, and Wp1 and Wp2 represent the weights of the two 1 × 1 convolutional layers.
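A sketch of PAM in the spirit of Eqs. (1)-(2), i.e., a GCNet-style global context block: a 1 × 1 convolution plus Softmax yields per-position attention weights, the attention-pooled context passes through a two-layer 1 × 1 bottleneck with LN, and the result is added back residually. The bottleneck ratio is an assumption.

```python
# A sketch of PAM under the above assumptions.
import torch
import torch.nn as nn

class PAM(nn.Module):
    def __init__(self, channels, ratio=4):
        super().__init__()
        self.pre = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                 nn.ReLU(inplace=True))
        self.attn = nn.Conv2d(channels, 1, 1)  # 1x1 conv -> attention logits
        hidden = channels // ratio             # bottleneck ratio: assumption
        self.bottleneck = nn.Sequential(
            nn.Conv2d(channels, hidden, 1),
            nn.LayerNorm([hidden, 1, 1]),      # LN between the two 1x1 convs
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, 1))

    def forward(self, x):
        b, c, h, w = x.shape
        p = self.pre(x)
        # Softmax over all N = H*W positions (Eq. 1).
        weights = torch.softmax(self.attn(p).view(b, 1, h * w), dim=-1)
        # Attention-pooled global context: sum_i P_i^att (Eq. 2).
        context = torch.bmm(p.view(b, c, h * w),
                            weights.transpose(1, 2)).view(b, c, 1, 1)
        return x + self.bottleneck(context)    # residual fusion
```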

3.4.2 Channel attention module

The structure of CAM is similar to that of PAM, as shown in Fig. 4. Different from PAM, we apply a global average pooling layer to acquire global context information, and the final channel attention feature \(C^{\text{final}}\) can be defined as follows:

$$ C^{\text{final}}=C+W_{c2}\, \text{ReLU}\left( \text{LN}\left( W_{c1} C_{m}\right)\right), $$
(3)

where C denotes the input feature of CAM, Cm represents the globally average-pooled feature, and Wc1 and Wc2 denote the weights of the two 1 × 1 convolutional layers.
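A matching sketch of CAM following Eq. (3), together with a CEM wrapper that combines the two branches; the paper does not specify how the PAM and CAM outputs are fused, so the summation fusion here is our assumption.

```python
# A sketch of CAM (Eq. 3) and a two-branch CEM wrapper; fusion by summation
# is an assumption. The PAM class is the sketch from Section 3.4.1.
import torch.nn as nn

class CAM(nn.Module):
    def __init__(self, channels, ratio=4):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)  # C_m: global average pooling
        hidden = channels // ratio          # bottleneck ratio: assumption
        self.bottleneck = nn.Sequential(
            nn.Conv2d(channels, hidden, 1),
            nn.LayerNorm([hidden, 1, 1]),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, 1))

    def forward(self, x):
        return x + self.bottleneck(self.gap(x))  # Eq. (3)

class CEM(nn.Module):
    """Two-branch context enhancement; summation fusion is our assumption."""
    def __init__(self, channels):
        super().__init__()
        self.pam, self.cam = PAM(channels), CAM(channels)

    def forward(self, x):
        return self.pam(x) + self.cam(x)
```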

3.5 Ground-truth density map generation

Following [66], we convolve the head annotation points with a Gaussian kernel to generate the ground-truth crowd density map F(x), which is defined as follows:

$$ \mathrm{F}(x)={\sum}_{i=1}^{N} \delta\left( x-x_{i}\right) * G_{\sigma}(x) $$
(4)

where Gσ(x) is a Gaussian kernel with parameter σ, xi is a ground-truth head location, and x is a pixel position in the input image. For different datasets, σ is set to different values: for ShanghaiTech Part_B, UCF_CC_50, and UCF-QNRF, σ is set to 15; for ShanghaiTech Part_A, σ is equal to \(\beta \bar{d}_{i}\), where \(\bar{d}_{i}\) represents the average distance from the i-th head to its k nearest neighbors and β is set to 0.3.
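The following is a sketch of the fixed-kernel variant of Eq. (4); the geometry-adaptive variant used for ShanghaiTech Part_A would instead set σ per head from the average k-nearest-neighbor distance.

```python
# A sketch of ground-truth generation following Eq. (4): each head annotation
# becomes a delta that is blurred with a Gaussian kernel (sigma = 15 shown).
import numpy as np
from scipy.ndimage import gaussian_filter

def density_map(points, height, width, sigma=15.0):
    """points: iterable of (x, y) head coordinates."""
    delta = np.zeros((height, width), dtype=np.float32)
    for x, y in points:
        xi, yi = int(round(x)), int(round(y))
        if 0 <= yi < height and 0 <= xi < width:
            delta[yi, xi] += 1.0  # one delta per annotated head
    # Convolving the deltas with G_sigma preserves the total count.
    return gaussian_filter(delta, sigma=sigma)
```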

3.6 Loss function and evaluation metrics

We use the L2 loss to optimize the proposed COMAL. The loss function is defined as follows:

$$ L({\varTheta})=\frac{1}{2 N} {\sum}_{i=1}^{N}\left\|F\left( X_{i},{\varTheta}\right)-F_{i}\right\|_{2}^{2}, $$
(5)

where N is the total number of training images, \(F\left (X_{i},{\varTheta }\right )\) is the estimated density map generated by COMAL with parameters Θ, Xi represents the input image, and Fi is the ground-truth density map of Xi.

The mean absolute error (MAE) and the mean square error (MSE) are chosen to evaluate the effectiveness of our method. The formulations are as follows:

$$ MAE=\frac{1}{N} {\sum}_{i=1}^{N}\left|C_{i}^{pred}-C_{i}^{gt}\right|, $$
(6)
$$ MSE=\sqrt{\frac{1}{N} {\sum}_{i=1}^{N}\left|C_{i}^{pred}-C_{i}^{gt}\right|^{2}}, $$
(7)

where N stands for the total number of test images, and \(C_{i}^{gt}\) and \(C_{i}^{pred}\) denote the ground-truth count and the predicted count of the i-th image, respectively.
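For completeness, a direct transcription of Eqs. (5)-(7); counts are obtained by summing each density map, and, as is conventional in crowd counting, the "MSE" metric is the root of the mean squared counting error.

```python
# Sketches of the L2 loss (Eq. 5) and the MAE/MSE metrics (Eqs. 6-7).
import torch

def l2_loss(pred_maps, gt_maps):
    # Eq. (5): squared L2 distance between density maps, averaged over 2N.
    n = pred_maps.size(0)
    return ((pred_maps - gt_maps) ** 2).sum() / (2 * n)

def counts(density_maps):
    # A predicted count is the integral (sum) of its density map.
    return density_maps.sum(dim=(1, 2, 3))

def mae_mse(pred_counts, gt_counts):
    # Eqs. (6)-(7): mean absolute error and root-mean-squared count error.
    err = torch.abs(pred_counts - gt_counts)
    return err.mean().item(), torch.sqrt((err ** 2).mean()).item()
```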

4 Experiments

In this section, we first describe the implementation details and experimental setup. Then, we introduce the commonly-used crowd counting datasets and compare our method with other state-of-the-art methods. Finally, we conduct ablation experiments to evaluate the effectiveness of each component of our method.

4.1 Implementation details

We apply the Adam optimizer to train our network. Following [16, 19, 40], the initial learning rate is set to 1 × 10−5 and decayed by a factor of 0.99 every two epochs. The weight decay is set to 1 × 10−4. To optimize the network better, we use a magnification factor to enlarge the values of the ground-truth density maps: 100 for ShanghaiTech Part_A and UCF-QNRF, 200 for ShanghaiTech Part_B, and 10 for UCF_CC_50. All training images are cropped and resized to 576 × 768. The experiments are conducted with the PyTorch framework on a single NVIDIA RTX 2080 Ti GPU.
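A sketch of this training setup, reusing the CountingNet and l2_loss sketches from Section 3; train_loader and num_epochs are assumed placeholders, and the ×0.99-every-two-epochs decay is expressed with StepLR.

```python
# A sketch of the training configuration described above; data loading and
# epoch count are assumptions, not the authors' exact script.
from torch.optim import Adam
from torch.optim.lr_scheduler import StepLR

model = CountingNet().cuda()
optimizer = Adam(model.parameters(), lr=1e-5, weight_decay=1e-4)
scheduler = StepLR(optimizer, step_size=2, gamma=0.99)  # x0.99 every 2 epochs

MAGNIFICATION = 100  # ShanghaiTech Part_A / UCF-QNRF; 200 for Part_B, 10 for UCF_CC_50

for epoch in range(num_epochs):
    for images, gt_density in train_loader:  # crops resized to 576x768
        images = images.cuda()
        gt_density = gt_density.cuda() * MAGNIFICATION  # enlarge GT values
        pred = model(images)
        loss = l2_loss(pred, gt_density)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```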

4.2 Datasets and comparisons

4.2.1 Datasets

We evaluate our method on three commonly-used crowd counting datasets. The details of each dataset are shown in Table 2.

Table 2 Summarization of ShanghaiTech Part_A, ShanghaiTech Part_B, UCF_CC_50, and UCF-QNRF

ShanghaiTech [66] includes 1,198 images with 330,165 annotated people. It is divided into two parts: Part_A and Part_B. Part_A contains 482 highly crowded images randomly collected from the Internet. Part_B contains 716 images taken on the bustling streets of downtown Shanghai.

The UCF_CC_50 [24] dataset includes 50 images with 63,974 annotated heads. It is a very challenging dataset because the number of people varies greatly across images.

The UCF-QNRF [25] dataset contains 1,535 images with 1,251,642 annotated heads. It includes diverse congested crowd scenes and large variations in crowd distribution, which remain challenging for current crowd counting methods.

4.2.2 ShanghaiTech

The comparison results on the ShanghaiTech dataset are presented in Table 3. The proposed COMAL outperforms the other state-of-the-art methods in terms of the MSE metric. Specifically, compared with CSRNet, our COMAL reduces MAE and MSE by 8.6 and 17.9, respectively, which benefits from the proposed SEM avoiding the serious gridding effect [54]. Compared with SCAR, our COMAL also achieves better counting accuracy, which benefits from the proposed SEM and DEM. The qualitative results in Fig. 5 further prove the effectiveness of our method; we observe from the fourth column that the proposed DEM can capture continuous scale changes of the crowd.

Table 3 Comparison results of different methods on ShanghaiTech dataset
Fig. 5
figure 5

Visualization results of different counting methods. From top to bottom, they are input images, ground truth, the results of COMAL, SCAR, and CSRNet, respectively

Besides, we conduct a further statistical analysis of the performance of the proposed COMAL on the ShanghaiTech Part_A dataset. Specifically, as shown in Table 4, the ShanghaiTech Part_A dataset is divided into five crowd density levels. We compare the performance of COMAL, SCAR, and CSRNet at the five crowd density levels; the comparison details are shown in Fig. 6. COMAL performs better than the other counting networks at all five crowd density levels, which demonstrates the effectiveness of the proposed method.

Table 4 Summarization of five crowd density levels on ShanghaiTech Part_A
Fig. 6
figure 6

Statistical analysis of SCAR [16], CSRNet [30], and our COMAL at different crowd density levels of the ShanghaiTech Part_A dataset

4.2.3 UCF_CC_50

Following previous works, we perform five-fold cross-validation to evaluate the performance of the proposed COMAL. The quantitative results on UCF_CC_50 are presented in Table 5. Among the compared state-of-the-art methods, SFCN‡ with Pre-GCC [40] uses synthetic data to expand the limited training images of UCF_CC_50 and achieves better counting performance. However, our COMAL achieves state-of-the-art results among methods without synthetic-data pretraining, which further proves the superiority of our method. Although the crowd distributions in this dataset vary hugely, COMAL outperforms TEDNet by 17.5 and 20.8 in terms of the MAE and MSE metrics, which is significant progress for the crowd counting task.

Table 5 Comparison results of different methods on the UCF_CC_50 dataset. "-" denotes results not provided by the original paper

4.2.4 UCF-QNRF

The performance of the proposed COMAL on UCF-QNRF is presented in Table 6. COMAL outperforms the other methods without synthetic-data pretraining, which further proves the superiority of our method. Compared with TEDNet, COMAL achieves a 10.9 lower MAE, which further proves the effectiveness of our method. Without the help of synthetic data, our method still achieves similar MAE performance to Pre-GCC [40].

Table 6 Comparison results of different methods on the UCF-QNRF dataset

4.3 Ablation study

4.3.1 The effectiveness of different structures of COMAL

To evaluate the effectiveness of different structures, we design four variants of COMAL and conduct extensive experiments on the ShanghaiTech Part_A dataset. The details of the four variants are as follows.

The first model consists of the first 10 layers of VGG-16 and is denoted as VGG-10. The second model adds the proposed DEM to the first model and is denoted as VGG-10 + DEM. Based on the second model, the third model replaces the first 10 layers of VGG-16 with the first 13 layers and is denoted as VGG-13 + DEM. The fourth model adds SEM to the third model and is denoted as VGG-13 + SEM + DEM.

Qualitative and quantitative results are displayed in Fig. 7 and Table 7. The counting performance improves continually as the proposed components are injected into the counting model, and the best results are achieved with all of the proposed components, which proves the effectiveness of our method. Specifically, compared with the fourth model, COMAL improves MAE and MSE by 6.6 and 8.6, respectively, which demonstrates the importance of the context information generated by CEM for final crowd counting.

Fig. 7
figure 7

Visualization results of COMAL with different components. From top to bottom, they are input images, ground truth, VGG-10, VGG-10 w/ DEM, VGG-13 w/DEM, VGG-13 w/ SEM and DEM, and COMAL, respectively

Table 7 Comparison results of COMAL with different structures on ShanghaiTech Part_A dataset

4.3.2 The effectiveness of the components of COMAL

We design three different structures to verify the effectiveness of each component in COMAL, as shown in Table 8, where C(Nc) represents a convolutional layer with Nc filters. From the first and last rows of Table 8, we can see that the counting accuracy drops when the SEM is replaced with a plain convolutional layer, which demonstrates that high-level semantic features are important for final crowd counting. Besides, comparing the second and third rows, we find that the method with DEM performs better than the method without it, which is attributed to the multi-scale features generated by DEM. From the last two rows of Table 8, we can see that CEM substantially outperforms CBAM [59], which further proves the effectiveness of the proposed CEM.

Table 8 Comparison results of the components of COMAL on ShanghaiTech Part_A dataset

4.3.3 The number of DEB

We explore the effect of the number of DEBs on the final counting accuracy. The comparison results are displayed in Table 9. As the number NDEB of DEBs increases, the counting performance of COMAL improves, and COMAL achieves the best results when NDEB is equal to 3, benefiting from the scale diversity provided by the DEBs. However, when NDEB is larger than 3, the counting performance drops. The reason is that more DEBs increase the complexity of the network and hinder the optimization of the counting network.

Table 9 Comparison results of COMAL with different numbers of DEBs on the ShanghaiTech Part_A dataset

4.3.4 The design of CEM

To evaluate the rationality of CEM, we explore the performance of COMAL with only PAM (COMAL w/ PAM) or only CAM (COMAL w/ CAM) on the ShanghaiTech Part_A dataset. The quantitative results are shown in Table 10. The counting accuracy improves continually with the addition of CAM and PAM, and the model achieves the best results with the full CEM, which demonstrates the effectiveness of our method. The qualitative results in Fig. 8 further prove the importance of CEM to the final counting accuracy.

Table 10 Comparison results of different designs of CEM on ShanghaiTech Part_A dataset
Fig. 8
figure 8

Visualization results of COMAL with different attention modules. From top to bottom: input images, ground truth density maps, COMAL w/ CAM, COMAL w/ PAM, and COMAL, respectively

5 Conclusions

In this paper, we propose COMAL for multi-scale crowd counting. We use the first 13 layers of VGG-16 as the encoder to extract features and adopt the proposed decoder to process the extracted features for final density estimation. COMAL is evaluated on three challenging crowd counting datasets and achieves superior results compared with other state-of-the-art methods. However, COMAL has a large number of parameters, which makes it unsuitable for devices with limited computational resources. Besides, we only model the spatial context information of images and do not extract the temporal information of videos. In future work, we plan to extend COMAL to the video crowd counting task with a lightweight design.