1 Introduction

Crowd counting is an active research area at the intersection of computer vision and deep learning that aims to estimate the number of people in images or video frames. It has recently gained significant attention in the computer vision community due to the importance of the problem. Crowd counting is generally implemented in two ways: (i) object counting, where the input is an image and the output is a number, i.e., the total head count in the image, and (ii) density map estimation, where the input is an image and the output is a density map of the crowd, which is then integrated to obtain the total head count. Traditional methods for crowd counting relied on detecting hand-crafted local features such as the full body [1, 2], body parts [3,4,5], and shapes [6], or global features such as foreground [7], edge [8], texture [9], and gradient features [10, 11], and then applied machine learning models such as linear regression, ridge regression, Gaussian processes, support vector machines (SVMs), random forests, gradient boosting, and neural networks to produce the total count or a density map of the image. However, the accuracy of these methods degrades significantly on images with dense crowds due to challenges such as occlusion, low resolution, foreshortening, and perspective distortion.

Recent research on crowd counting shows the efficacy of deep learning methods for this task [12]. Convolutional neural networks (CNNs) are particularly well suited due to their strong capability for automatic feature extraction [13, 14]. Although even small CNN models [15] outperform traditional counting methods, their accuracy degrades in high-density scenes.

To achieve higher accuracy in dense scenes, deeper models with large numbers of parameters have been developed [16,17,18,19]. Although these deep models achieve good accuracy, they create performance bottlenecks in real-time applications due to large memory requirements, higher training complexity, and long inference delays. In contrast, small CNN models offer several benefits in real-time video surveillance: they incur low inference delay, require little memory for deployment on embedded devices, can be quickly updated over-the-air, and can be trained, fine-tuned, and run in a distributed manner [20]. However, lightweight and shallow CNN models are often disregarded because of their limited accuracy. Contrary to that view, we believe that leveraging best practices, such as carefully designing the model architecture, using accurately annotated training data, and employing efficient learning strategies, can jointly improve the accuracy of shallow models to a great extent [21].

1.1 Model design strategy

Generally, a good choice of convolution filters plays a crucial role in feature learning and contributes to reducing the size of the model. In essence, larger filters (\(5\times 5\)) are more expensive and should be replaced by smaller ones (\(3\times 3\)). For example, a stack of three (\(3\times 3\)) convolution layers is preferred over a single convolution layer with a larger receptive field such as \(7\times 7\) because of the non-linear activations between the stacked layers. The stack also has fewer parameters (\(3\times 3^2\times C^2 = 27C^2\) vs. \(7^2\times C^2 = 49C^2\), i.e., roughly 45% fewer) and is computed faster. Further dimension reduction can be applied using \(1\times 1\) convolutions before expensive convolutions (e.g., \(5\times 5\) or \(3\times 3\)) [22]. Furthermore, spatially separable convolutions are preferred, i.e., a \(3\times 3\) convolution can be decomposed into two sequential \(1\times 3\) and \(3\times 1\) convolutions, which reduces the number of parameters (6 vs. 9 weights per filter) while preserving the receptive field and can even achieve better learning.
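To make the parameter savings concrete, the following PyTorch sketch (the framework used later for training) compares a standard \(3\times 3\) convolution with its spatially separable \(1\times 3\)/\(3\times 1\) counterpart; the channel count is illustrative only.

```python
import torch.nn as nn

C = 32  # illustrative channel count

# Standard 3x3 convolution: 3*3*C*C weights (plus C biases).
square = nn.Conv2d(C, C, kernel_size=3, padding=1)

# Spatially separable version: 1x3 followed by 3x1, i.e. (1*3 + 3*1)*C*C weights,
# roughly one third fewer than the square filter, with the same receptive field.
separable = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=(1, 3), padding=(0, 1)),
    nn.Conv2d(C, C, kernel_size=(3, 1), padding=(1, 0)),
)

n_square = sum(p.numel() for p in square.parameters())
n_separable = sum(p.numel() for p in separable.parameters())
print(n_square, n_separable)  # 9248 vs 6208 for C = 32
```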

1.2 Data annotation strategy

In crowd density estimation, the point annotations placed on top of the heads in a crowd image form a sparse “dot-map” or “localization-map”, which is converted into a density map by convolving each head position with a Gaussian kernel. The scale parameter of the Gaussian kernel visually creates a blob around the annotated pixel. A density map is more accurate if each blob covers the entire head without overlapping neighboring heads. However, due to the camera perspective effect, head sizes vary within the same image. Adaptive kernels, which select different values of the scale parameter, solve this problem to some extent. To cope with the perspective distortion and scale variations in crowd images, recent works often propose very deep models with complex architectures. However, unlike images captured with CCTV cameras, aerial images captured from drones exhibit minimal perspective distortion, and the scale variation is directly related to the drone altitude. If the drone altitude is known, it can be used to generate accurate density maps.

1.3 Model learning strategy

The idea of curriculum learning (CL) in neural networks was first presented in [23]. CL is inspired by the natural learning process in humans and animals: humans learn better when concepts are presented in a specific order of complexity, i.e., from simpler to more difficult tasks. Unlike traditional training, in curriculum learning the training samples are sorted in order of (typically increasing) complexity. CL has been demonstrated to be an effective strategy for improving learning capability and convergence speed in various tasks, e.g., computer vision [24,25,26], natural language processing (NLP) [27, 28], and reinforcement learning [29,30,31]. Recent studies [32, 33] inspired us to adopt CL in our research.

This work leverages the aforementioned three strategies to generate ground truth density maps and to design and train an extremely lightweight crowd density estimation model. In the first step, we designed the model (LCDnet) by carefully choosing convolution filters of different sizes in different layers. To keep the model compact, we used fewer filters in the initial layers so that, even with larger input feature maps, the computational load remains controlled. To further reduce the computational complexity, we used rectangular filters of size (\(1\times 3\)) and (\(3\times 1\)) instead of square filters, which also improves learning performance as indicated in [22]. The resulting CNN model is a shallow network with only 0.05 million parameters. Next, to train the shallow model with drone-captured aerial images, we generated high-quality (accurate) density maps. We tested different scale values of the Gaussian function and empirically found the most accurate values for each image. Lastly, curriculum learning is employed to improve the learning performance of the model.

The contributions of our work are as follows:

  • A lightweight CNN model (LCDnet) with fewer parameters, low memory requirement, and faster run-time than existing models.

  • Generation of density maps from sparse localization maps, using the drone altitude to set adaptive Gaussian kernels and improve learning.

  • An efficient strategy based on curriculum learning to further improve model training and convergence.

  • Experimental demonstration of reasonably good performance of LCDnet on benchmark datasets, with accuracy comparable to existing models more than twice its size.

2 Related work

A number of crowd counting datasets exist, and each can be assigned to one of three categories: surveillance-view datasets containing indoor or outdoor images collected by surveillance cameras (e.g., Mall [34], UCSD [35], WorldExpo’10 [15], ShanghaiTech Part B [36]), free-view datasets containing images from different sources including the Internet (e.g., UCF-CC-50 [37], UCF-QNRF [38], ShanghaiTech Part A [36]), and drone-view datasets containing images collected by drones (e.g., DroneRGBT [39], CARPK [40]). Most of the earlier research on crowd counting used datasets from the first two categories; drone-view datasets have become available only recently.

While there have been different approaches to crowd counting, density estimation using CNNs is the most widely used. The first known CNN model for density estimation and counting is CrowdCNN [15], which consists of three convolution layers followed by three fully connected layers. Following this, numerous works proposed different CNN-based models for crowd counting.

A multi-column CNN (MCNN) is proposed in [36], which consists of three CNN columns, each containing filters with receptive fields of different sizes (small, medium, large). The outputs of the three columns are combined to predict the final density map. MCNN adapts well to the scale variations in images caused by perspective effects or different image resolutions. A two-column CNN model (CrowdNet) is proposed in [41]. The model consists of two CNN columns, a deep network (five CNN layers) and a shallow network (three CNN layers), whose outputs are combined to predict the final density map. A Switched CNN (SCNN) is proposed in [42], which consists of two parts, a switch network and a CNN regressor. The CNN regressor consists of three independent columns, each with receptive fields of a different size. Patches from the input image are first fed to the switch network, which relays each patch to one of the regressor columns to predict its density map. The intuition behind SCNN is to build a CNN model that adapts to large scale variations without increasing computational complexity, since a patch is passed through only one column of the regressor. This reduces the computational complexity compared to other multi-column CNN models such as MCNN and CrowdNet.

Unlike multi-column CNN models, the authors of [43] propose a single-column network called multi-scale CNN (MSCNN) to learn the scale variations. MSCNN uses three Inception-like modules [22] called multi-scale blobs (MSBs). Each MSB consists of multiple filters with different kernel sizes and is able to extract scale-relevant features. The aforementioned CNN models can adapt to the scale variations present in the training data but may fail to generalize well [18]. A cascaded multi-task learning (CMTL) model [18] is proposed to adapt to wide variations of density levels in images. CMTL is also a two-column network. The first column is a high-level prior that classifies an input image into groups based on the total count in the image. The features learned by the high-level prior are shared with the second column, which estimates the corresponding density map.

The models above mostly use multi-column architectures to learn scale-relevant features in crowd images and achieve good results. To further improve counting accuracy in highly congested scenes, the authors of [16] propose a deeper architecture that utilizes transfer learning. The Congested Scene Recognition network (CSRNet) [16] uses the first 10 layers of VGG16 [44] as a front-end to extract features and a back-end network with dilated convolutions that substitutes for pooling layers, thus avoiding the loss of spatial information. Transfer learning can improve the feature extraction capability of a crowd counting model and has recently been adopted in CANNet [45], GSP [46], TEDnet [47], Deepcount [48], SASNet [49], M-SFANet [19], and SGANet [50].

The aforementioned models often produce output density maps of lower resolution than the input image and use patch-based training. The Trellis Encoder–Decoder Network (TEDnet) proposed in [47] uses the whole image as the model input and preserves the density map at the original resolution. Another encoder–decoder model with special Inception-like [22] modules is the scale aggregation network (SANet) proposed in [17]. Like TEDnet, SANet produces a high-resolution density map but has a simpler model structure.

Our proposed model, LCDnet, offers two benefits over the aforementioned models. First, it is extremely lightweight and can run faster even on edge devices with limited compute resources (a comparison is shown later in the paper). Second, it produces better-quality density maps at (\(\frac{1}{2}\)) of the input resolution, compared with (\(\frac{1}{8}\)) for CSRNet and (\(\frac{1}{4}\)) for MCNN. It also requires the least memory and can fit in on-chip caches.

3 Proposed method

The aforementioned CNN-based crowd models are designed to improve counting accuracy. However, in a typical crowd monitoring system, the user may be interested in a rough estimate of the crowd density (e.g., low, medium, high, very high) in a geographical area rather than an exact count. Our goal in this paper is to develop a very lightweight model that can detect crowd scenes reasonably well, to an extent useful for analyzing crowds, while running fast on edge devices with limited computing resources. To this end, we propose a crowd density estimation model (LCDnet).

3.1 Network architecture

The architecture of LCDnet is shown in Fig. 1. It consists of six convolution (Conv) layers. Conv1 consists of 64 (\(5\times 5\)) filters. The output of Conv1 is fed to Conv2 and Conv3, each having 32 (\(3\times 3\)) filters. The outputs of Conv2 and Conv3 are fed to Conv4 and Conv5, respectively, each having 64 (\(3\times 3\)) filters. The outputs of Conv4 and Conv5 are concatenated and fed to Conv6, which consists of 128 (\(1\times 1\)) filters, to predict the final density map. It is worth noting that LCDnet returns a density map rather than a crowd density class. There are two benefits to this. First, the density map preserves the location of the crowd, which can be used to localize the crowd in the real world. Second, it is easy to determine the total count (whole scene) or a local count (specific part of the scene) from the predicted density map, and even to use the count information to configure user-defined crowd density levels based on the application requirements.

In the proposed LCDnet architecture, the first convolution layer detects low-level features such as edges. These features are then processed by two columns. Each column contains three layers: two layers of 32 filters of size \(1\times 3\) and \(3\times 1\) (in reverse order in column 2), followed by a layer of 64 filters of size \(3\times 3\). The outputs of both columns are concatenated and fed to a \(1\times 1\) Conv layer with 128 filters, which generates the density map. The output density map is half the size of the input image.
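The following is a minimal PyTorch sketch of an LCDnet-like network assembled from this description. Details the text does not specify, such as the activation functions, the exact placement of the pooling that halves the output resolution, and the single-channel regression head, are assumptions, so the sketch illustrates the two-column layout rather than reproducing the reference implementation.

```python
import torch
import torch.nn as nn

class LCDnetSketch(nn.Module):
    """Illustrative two-column network following the textual description.

    Assumptions (not specified in the text): ReLU activations, a single 2x2
    max-pool after the first layer to halve the output resolution, and a final
    1x1 convolution mapping the 128 concatenated channels to a 1-channel map.
    """

    def __init__(self):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=5, padding=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),  # assumed source of the half-resolution output
        )
        # Column 1: 1x3 then 3x1 (32 filters each), followed by a 3x3 layer (64 filters).
        self.col1 = nn.Sequential(
            nn.Conv2d(64, 32, kernel_size=(1, 3), padding=(0, 1)), nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, kernel_size=(3, 1), padding=(1, 0)), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        )
        # Column 2: the same layers with the rectangular filters in reverse order.
        self.col2 = nn.Sequential(
            nn.Conv2d(64, 32, kernel_size=(3, 1), padding=(1, 0)), nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, kernel_size=(1, 3), padding=(0, 1)), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        )
        self.head = nn.Conv2d(128, 1, kernel_size=1)  # density map regression

    def forward(self, x):
        x = self.stem(x)
        x = torch.cat([self.col1(x), self.col2(x)], dim=1)
        return self.head(x)

# Example: a 512x640 RGB image yields a 256x320 single-channel density map.
# out = LCDnetSketch()(torch.randn(1, 3, 512, 640))
```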

Fig. 1
figure 1

LCDnet with curriculum learning

3.2 Ground truth generation

If \(x_i\) is a pixel containing the head position, it can be represented by a delta function \(\delta (x - x_i)\). The density map is generated by convolving the delta function with a Gaussian kernel \(G_\sigma\).

$$\begin{aligned} Y = \sum _{i=1}^{N}{ \delta (x-x_i) * G_\sigma }, \end{aligned}$$
(1)

where N is the total number of annotated points (i.e., the total head count) in the image. The integral of the density map Y equals the total head count in the image. Visually, this operation blurs each head annotation according to the scale parameter \(\sigma\). Various kernel settings can be used to generate different density maps. The most basic approach is to keep \(\sigma\) fixed, which means the same kernel is applied to all head positions irrespective of their scale in the image [17]. As head sizes in an image can vary due to camera perspective, a single value of \(\sigma\) may not be a good choice. Hence, some recent works use adaptive Gaussian kernels to create density maps, where the value of \(\sigma\) is calculated as the average distance to the k nearest neighboring head annotations. Visually, this produces a lower degree of Gaussian blur in dense regions and a higher degree in sparse regions of the crowd scene. Typical settings include \(k=1\) [38] and \(k=10\) [17, 43]. Although adaptive Gaussian kernels may produce better results on images with large scale variations, our intuition is that drone images typically have less scale variation than surveillance images, e.g., from CCTV cameras. The scale variations in drone images result from the drone flying altitude, which does not vary much due to regulatory measures. Thus, the scale variations are limited, and a single value of \(\sigma\) can be determined experimentally to produce density maps. In our experiments, we empirically determined the value of \(\sigma\) based on the drone altitude. The datasets used in this study contain images captured from different altitudes, so we first segregated the images into groups by drone altitude and empirically found a separate value of \(\sigma\) for each group.
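As a minimal sketch of this ground-truth generation, assume head annotations are given as (x, y) pixel coordinates and that a scale value has already been determined empirically for each altitude group; the \(\sigma\) values below are placeholders, not the values used in our experiments.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

# Placeholder mapping from altitude group to the empirically chosen kernel scale.
SIGMA_BY_ALTITUDE = {"low": 6.0, "medium": 4.0, "high": 3.0}  # illustrative values

def density_map(points, height, width, altitude_group):
    """Build a density map from head annotations as in Eq. (1).

    Each head is placed as a delta impulse and blurred with a Gaussian kernel,
    so the map sums (integrates) to the total head count.
    """
    dots = np.zeros((height, width), dtype=np.float32)
    for x, y in points:  # annotations assumed to lie inside the image
        dots[int(round(y)), int(round(x))] += 1.0
    return gaussian_filter(dots, sigma=SIGMA_BY_ALTITUDE[altitude_group])
```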

3.3 Training

The image resolutions in the datasets used in this study are not very high, so we train on whole images without extracting patches or downsampling the training images. However, to avoid overfitting, data augmentation techniques such as horizontal flipping and random brightness and contrast changes are applied. The kernels in all Conv layers are randomly initialized from a Gaussian distribution with standard deviation 0.01. We use the Adam optimizer with a base learning rate of 0.0001. The loss function is the pixel-wise Euclidean distance between the target and predicted density maps, defined in Eq. (2).

$$\begin{aligned} L(\Theta ) = \frac{1}{N} \sum _{1}^{N}{ ||D(X_i;\Theta ) - D_i^{gt}||_2^2}, \end{aligned}$$
(2)

where N is the number of samples in training data, \(D(X_i;\Theta )\) is the predicted density map with parameters \(\Theta\) for the input image \(X_i\), and \(D_i^{\text {gt}}\) is the ground truth density map.
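A minimal sketch of this loss, assuming the predicted and ground-truth density maps are batched tensors of shape (N, 1, H, W); the commented optimizer line mirrors the Adam setting described above, with `model` standing in for the network instance.

```python
import torch

def density_loss(pred, gt):
    # Eq. (2): squared Euclidean distance between each pair of density maps,
    # summed over pixels and averaged over the batch.
    return ((pred - gt) ** 2).flatten(start_dim=1).sum(dim=1).mean()

# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # base learning rate 0.0001
```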

We further applied curriculum learning to improve the learning performance of our model. In our CL setting, we use a pretrained CSRNet to estimate the difficulty level of each training image from its counting error. The images are then sorted by this error in ascending order before being packed into mini-batches, so that the mini-batches are created in order of increasing cumulative complexity.
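A sketch of the curriculum ordering step follows; `train_set`, `difficulty_score`, and `batch_size` are assumed names, with `difficulty_score[i]` standing in for the counting error of the pretrained CSRNet on the i-th training image.

```python
# Sort training indices from easiest to hardest (ascending CSRNet counting error),
# then pack them into mini-batches in that order.
order = sorted(range(len(train_set)), key=lambda i: difficulty_score[i])
mini_batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
```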

4 Experiments and results

The proposed model (LCDnet) was trained on a single GPU (Nvidia Quadro RTX-8000) using the PyTorch deep learning framework. For a fair comparison, we also implemented and trained the other models used in this study from scratch.

4.1 Datasets

We evaluate the proposed model on two benchmark datasets i.e., DroneRGBT and CARPK. The DroneRGBT dataset contains images of people whereas the CARPK dataset contains images of cars, both captured from drones.

4.1.1 DroneRGBT

The dataset contains 3600 RGB and thermal image pairs with a spatial resolution of \(512\times 640\) pixels. The images cover a wide range of scenes, e.g., campuses, streets, parks, parking lots, playgrounds, and plazas. The dataset is divided into a training set (1807 samples) and a test set (912 samples) such that both sets include diverse images (i.e., different scenes, crowd densities, illumination, and scales) to avoid overfitting. The dataset provides head annotations of people. The count distribution and sample images from the dataset are presented in Figs. 3 and 2, respectively.

Fig. 2
figure 2

Sample images (top) and their corresponding density maps (bottom) from DroneRGBT dataset

Fig. 3
figure 3

Count distribution in DroneRGBT dataset

4.1.2 CARPK

The dataset contains 1448 images of cars from 4 different parking lots captured with a drone. The dataset is divided into a training set of 989 images and a test set of 459 images and contains a total of 89,777 cars. The original dataset provides bounding box annotations; we converted them to dot annotations by taking the center of each bounding box. The count distribution and sample images from the dataset are presented in Figs. 5 and 4, respectively.
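A small sketch of this conversion, assuming each bounding box is given as an (x1, y1, x2, y2) tuple in pixel coordinates:

```python
def boxes_to_points(boxes):
    # Dot annotation for counting: the center of each bounding box.
    return [((x1 + x2) / 2.0, (y1 + y2) / 2.0) for (x1, y1, x2, y2) in boxes]
```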

Fig. 4
figure 4

Sample images (top) and their corresponding density maps (bottom) from CARPK dataset

Fig. 5
figure 5

Count distribution in CARPK dataset

4.2 Evaluation metrics

Most existing works on crowd counting use the mean absolute error (MAE) and the grid average mean absolute error (GAME) to evaluate model accuracy. MAE is the average absolute error between the predicted and actual counts over all images. It is calculated using Eq. (3):

$$\begin{aligned} \text {MAE} = \frac{1}{N} \sum _{n=1}^{N}{|e_n - g_n|}, \end{aligned}$$
(3)

where N is the total number of images in the dataset, \(g_n\) is the ground truth (actual) count, and \(e_n\) is the predicted (estimated) count for the nth image. While MAE is the most widely used metric in crowd counting research and is often used to compare models, it measures only the image-wide count and does not indicate where in the image the estimation errors occur. To address this limitation, the authors of [51] proposed the grid average mean absolute error (GAME). In GAME, an image is divided into \(4^L\) non-overlapping patches and the absolute counting error is computed separately for each patch, which makes GAME a more robust and accurate measure for crowd counting applications. It is defined in Eq. (4):

$$\begin{aligned} \text {GAME} = \frac{1}{N} \sum _{n=1}^{N}{\sum _{l=1}^{4^L}{|e_n^l - g_n^l|}} \end{aligned}$$
(4)

The GAME metric is more robust to localization errors in density estimation because it computes localized errors between the target and predicted density maps. We divide each density map into a \(4\times 4\) grid, creating 16 non-overlapping patches (i.e., \(L=2\) in Eq. (4)). The absolute difference in head count is measured for each patch, summed over all patches of the density map, and then averaged over the whole dataset. We compare LCDnet against existing models on both metrics to provide a fair evaluation of model accuracy; however, the true benefit of LCDnet is its lower model complexity at the cost of a tolerable counting error. In addition, performance is measured with two further metrics, the structural similarity index (SSIM) and the peak signal-to-noise ratio (PSNR). Both SSIM and PSNR evaluate the quality of the predicted density maps and are defined in Eqs. (5) and (6) as follows:

$$\begin{aligned} \text {SSIM} (x,y) = \frac{(2\mu _x \mu _y + C_1) (2\sigma _{xy} + C_2)}{(\mu _x^2 + \mu _y^2 + C_1) (\sigma _x^2 + \sigma _y^2 + C_2)}, \end{aligned}$$
(5)

where \(\mu _x, \mu _y, \sigma _x, \sigma _y\) are the means and standard deviations of the actual and predicted density maps, \(\sigma _{xy}\) is their cross-covariance, and \(C_1, C_2\) are small constants that stabilize the division.

$$\begin{aligned} \text {PSNR} = 10 \log _{10}\left( \frac{\text {MAX}_I^2}{\text {MSE}} \right) , \end{aligned}$$
(6)

where \(\text {MAX}_I\) is the maximum possible pixel value of the image; for 8-bit unsigned integer data, \(\text {MAX}_I=255\).
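A minimal sketch of the two counting metrics computed directly from density maps, assuming each map is a 2-D NumPy array whose sum gives the count; the default grid below matches the \(4\times 4\) division used in this paper.

```python
import numpy as np

def mae(preds, gts):
    # Eq. (3): mean absolute difference between estimated and ground-truth counts.
    return float(np.mean([abs(p.sum() - g.sum()) for p, g in zip(preds, gts)]))

def game(pred, gt, L=2):
    # Eq. (4) for a single image: split both maps into a 2^L x 2^L grid
    # (4^L patches) and sum the absolute count error per patch.
    k = 2 ** L
    h, w = pred.shape
    err = 0.0
    for i in range(k):
        for j in range(k):
            rows = slice(i * h // k, (i + 1) * h // k)
            cols = slice(j * w // k, (j + 1) * w // k)
            err += abs(pred[rows, cols].sum() - gt[rows, cols].sum())
    return float(err)
```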

4.3 Evaluation results

We compared the proposed model (LCDnet) against two mainstream crowd density estimation models, MCNN [36] and CSRNet [16]. MCNN is a relatively small CNN model that achieves good counting accuracy on several benchmark datasets compared to other models of similar size. CSRNet, on the other hand, is a deep CNN model that uses VGG-16 [44] as a front-end and has shown high accuracy in dense crowd scenes. Both models have been used for comparison in many crowd counting studies, so we chose them as representative small and large models for this study.

We compare LCDnet against MCNN and CSRNet on the DroneRGBT and CARPK datasets in terms of counting accuracy relative to model size and complexity. While the primary comparison is against MCNN and CSRNet, we additionally compare complexity against some other well-known counting models to highlight the benefit of the proposed model (LCDnet). The model complexity comparison is shown in Table 1, and the accuracy comparison is given in Table 2. The inference time is computed on a GPU server (Nvidia Quadro RTX-8000) and two edge devices (Nvidia Jetson Xavier NX and Jetson Nano). The system details of these devices are as follows:

  • Server: GPU (Nvidia Quadro RTX-8000).

  • Jetson Xavier NX: 64-bit system with a 6-core NVIDIA Carmel ARM processor, 8 GB memory, and an NVIDIA Volta GPU with 384 CUDA cores and 48 Tensor cores.

  • Jetson Nano: 64-bit system with a quad-core Arm Cortex-A57 MPCore processor, 4 GB memory, and a 128-core NVIDIA Maxwell GPU.

Table 1 Comparison of proposed scheme (LCDnet trained with curriculum learning) against SOTA models for number of parameters (in Million), GMACs, size (in MB), and inference time (in milliseconds) for fixed input size
Table 2 Accuracy comparison of the proposed scheme (LCDnet trained with curriculum learning) against SOTA models over DroneRGBT dataset [39] and CARPK dataset [40]

On the DroneRGBT dataset, LCDnet achieves an MAE of 21.4, which is comparable to that of MCNN (17.9). In terms of complexity, LCDnet has roughly half the parameters and half the multiply–accumulate operations (GMACs) of MCNN, and it incurs about half (\(\frac{1}{2}\times\)) the inference delay. Although CSRNet is considerably more accurate than both LCDnet (\(3\times\)) and MCNN (\(2.2\times\)), it is a very large model that requires substantial memory and incurs roughly \(20\times\) the inference delay of LCDnet.

On the CARPK dataset, LCDnet achieves better results: an MAE of 13.1, which is close to MCNN (10.1), though still behind CSRNet (6.12). Sample predictions of MCNN, CSRNet, and the proposed LCDnet on the DroneRGBT and CARPK datasets are shown in Figs. 6 and 7, respectively. LCDnet shows good detection capability and produces better-quality density maps than MCNN on the DroneRGBT dataset, which we attribute to the use of small rectangular filters (\(1\times 3\) and \(3\times 1\)). The better density map quality is expected and is evident from the higher SSIM and PSNR values.

Fig. 6
figure 6

Comparison of predictions on the DroneRGBT dataset. The first two columns show crowd images and their corresponding ground truth. Columns 3–4 show predictions by MCNN [36] and CSRNet [16] without curriculum learning. Column 5 shows predictions by LCDnet (ours)

Fig. 7
figure 7

Comparison of predictions on the CARPK dataset. The first two columns show crowd images and their corresponding ground truth. Columns 3–4 show predictions by MCNN [36] and CSRNet [16] without curriculum learning. Column 5 shows predictions by LCDnet (ours)

5 Conclusion

This paper proposes a lightweight crowd density estimation model (LCDnet) for deployment on resource-constrained embedded devices (e.g., drones) in real-time applications such as surveillance. The paper outlines various design principles and best practices for developing efficient CNN architectures. LCDnet is designed by adopting three strategies: (i) a compact CNN model, (ii) improved ground truth generation from head annotations and drone altitudes, and (iii) an improved training mechanism using curriculum learning. LCDnet is evaluated on two datasets of drone-captured images, DroneRGBT and CARPK. Our experimental analysis shows that LCDnet achieves reasonably good accuracy at a much lower computational cost. Its small memory footprint and low inference time make LCDnet a good fit for drone-based video surveillance.