A Deep-Fusion Network for Crowd Counting in High-Density Crowded Scenes

People counting has been investigated extensively as a tool to increase individual safety and to avoid crowd hazards in public places. It is a challenging task, especially in high-density environments such as Hajj and Umrah, where millions of people gather in a constrained environment to perform rituals. The difficulty stems from large variations in the scale of people across different scenes. A simple and effective solution to the scale problem is to use an image pyramid. However, processing multiple levels of the pyramid incurs a heavy computational cost. To overcome this issue, we propose a deep-fusion model that efficiently and effectively leverages the hierarchical features that exist in the various convolutional layers of a deep neural network. Specifically, we propose a network that combines multiscale features from the shallow to the deep layers of the network and maps the input image to a density map. The summation of peaks in the density map provides the final crowd count. To assess the effectiveness of the proposed deep network, we perform experiments on three benchmark datasets, namely, UCF_CC_50, ShanghaiTech, and UCF-QNRF. The experimental results show that the proposed framework outperforms other state-of-the-art methods by achieving lower Mean Absolute Error (MAE) and Mean Square Error (MSE) values.


Introduction
Automated crowd analysis is crucial for efficient crowd management. Crowd analysis has numerous applications, such as panic detection [64], crowd behavior understanding [44,57], crowd tracking [2], crowd flow segmentation [1], crowd congestion detection [33], and crowd counting [46,62]. Among these applications, crowd counting has received tremendous attention from researchers because of its potential applications in crowd surveillance and scene understanding. For effective crowd surveillance, it is imperative to predict the actual count and location of individuals in the scene. Crowd counting supports the management of massive crowds, for example, during Hajj and Umrah, where millions of Muslims from all over the world gather in the Holy city of Makkah to perform obligatory rituals [18]. It is a priority of the Saudi government to ensure the smooth conduct of Hajj and Umrah. Therefore, researchers have proposed different automated methods [32] for efficient crowd management. One of the priorities of efficient crowd management is to estimate the crowd count and the distribution of people in the environment, which is the prime goal of this paper.
The goal of a crowd counting system is to count pedestrians in given images/videos irrespective of the scene and density. However, crowd counting is difficult, as it poses many challenges, such as variations in scale, cluttered backgrounds, perspective distortions, and low-resolution images [56], as shown in Fig. 1. Among these challenges, scale variation is particularly hard [65] and has not been effectively addressed so far for crowd counting. Scale variation refers to variation in the size of objects (in our case, heads). These variations are due to the perspective distortions caused by the location of the camera relative to the scene.
Several attempts have been made to solve the scale problem. A simple and straightforward solution is to resize the image to different scales and learn multiple object detectors, where each detector detects objects that fall in its scale range. Another simple way is to generate a hand-crafted feature pyramid to detect objects of different scales [16]. However, processing all scales of the pyramid incurs a high computational cost.
Due to the success of Deep Convolutional Neural Networks (DCNNs) in various computer vision tasks, for example, object detection, classification, and segmentation, researchers employ various CNN architectures to learn a non-linear function from images to crowd count. However, a CNN cannot implicitly handle scale variations [25]. To make a CNN adaptive to scale variations, it must be trained to capture them to a certain extent. For example, the scale-aware network proposed in [37] resizes the input image so as to bring all objects to a similar scale and then trains a single-scale detector. Other recent methods [23,26,48] utilize the feature maps of the top layers to detect objects of different sizes. Generally, the receptive field of the topmost layers is large, and these layers contain little or no detail about small objects. This compromises the performance of a detector on small objects, especially people in high-density crowds. Furthermore, these networks require more parameters and have complex architectures to obtain desirable performance.
To handle the scale problem, we propose a framework that fuses feature maps from different layers for crowd counting. Specifically, the proposed framework consists of two parts, i.e., an encoder and a decoder. In the encoder, the framework adopts a fusion strategy that combines features from multiple layers to capture the details of heads at multiple scales. The decoder estimates the crowd count and generates a density map, where high peaks indicate the presence of human heads at that particular location.
The contributions of the proposed framework can be summarized as follows:
• To capture the information of human heads at multiple scales, we propose a fusion strategy that combines the information from different layers of the network.
• The framework estimates the crowd density and the crowd count simultaneously, in both low- and high-density crowded scenes, irrespective of the scene.
• The proposed method has been tested on different public benchmark datasets. The experimental results demonstrate that the proposed framework achieves superior performance compared to other reference methods.

Related Works
Crowd counting or density estimation methods can be broadly categorized into two main groups: holistic approaches and local approaches. In a holistic approach, global features of the image, i.e., textures, edges, and foreground pixels, are extracted and a classifier or regression-based model is trained to learn the mapping between the features and the actual crowd size. In contrast, a local approach utilizes local image features that are specific to individuals or groups of people. We provide details of these approaches in the following subsections.

Holistic Approaches
Holistic approaches estimate the crowd size by utilizing global image features. Features utilized by these methods include textures [42], foreground pixels [17], and edge features [34]. The methods proposed in [40,42] utilize the gray level co-occurrence matrix (GLCM) for estimating the crowd density. The Minkowski fractal dimension method is proposed in [41] for extracting texture features. Xiaohua et al. [61] achieve a classification accuracy of 95% by employing wavelet descriptors and an SVM to classify the crowd density into four classes. This method performs well in low-density crowds but faces challenges when applied to high-density crowded scenes. Rahmalan et al. [27] proposed a method that achieved superior performance on an afternoon scene, due to smaller variations in illumination compared to the morning dataset. This highlights the drawback of using texture features in real-world situations, since texture features are not robust to illumination changes. Other approaches utilize foreground pixels and edges to estimate the crowd size. For example, Regazzoni et al. [47] combined multiple edge detectors, such as vertical edges for detecting the legs and arms of individuals. Davies et al. [17] studied the correlation between the crowd size and the foreground pixels and established the principle that the number of people is linearly proportional to the number of foreground pixels and edges. Such features are also used in [11][12][13], where the crowd size is estimated by employing a feedforward neural network. The above-mentioned approaches rely on a static background, where the scenes are captured at a relatively high camera angle.
However, foreground pixels alone cannot provide sufficient information about the number of people in the scene, since distant objects are smaller and are represented by fewer foreground pixels. As a solution, Ma et al. [38] propose a crowd estimation framework that incorporates perspective distortions. However, this method cannot handle partial or full occlusions. Similarly, Roqueiro et al. [50] applied the median background computing technique to define the foreground pixels. A threshold is applied to the pixels, followed by morphological operations to smooth the results. A classifier was then trained to categorize the images as containing either zero persons or one or more persons. Similarly, [6][7][8][9] adopt holistic approaches to count the number of people in a scene and account for occlusion and other non-linearities.
To summarize, holistic approaches estimate the crowd size by exploiting global features of the image. However, due to high variations in crowd dynamics, distribution, and density, the crowd size is difficult to estimate from global features alone. Local approaches are therefore proposed to overcome the limitations of holistic approaches.

Local Approaches
These methods use local features that are associated with pedestrians or groups of pedestrians within an image. These approaches can be further sub-categorized into two groups: (i) detection-based approaches, which use the head or face to localize the individuals in an image, where the total number of detections represents the crowd count; (ii) localization-based methods, which divide the image into overlapping [24,29,62] or non-overlapping patches [10,51,63], compute features from each patch, and predict the crowd size by applying a regression model.
Detection-based approaches are suitable for scenes where the crowd is sparse, i.e., the people in the scene are well separated and their bodies are fully visible. Therefore, pedestrian detectors/head detectors [16,19,21] are employed to obtain the crowd count. These methods work well in low-density crowds, where pedestrians are not occluded; however, in real-world environments, pedestrians are frequently occluded and their bodies are not visible enough to be detected by pedestrian detection methods. As an alternative, localization-based methods are proposed, which divide the input image into a number of overlapping or non-overlapping sub-regions and count the people in each region by employing a regression model. For example, localization is performed by a key point clustering method in [14,15]. In these methods, SURF features are extracted from an input image. Stationary points are removed by masking the feature points with optical flow. The remaining features are clustered into different groups by the K-means algorithm. The group size is then estimated by a regression model. The shortcoming of these approaches is that they are restricted to moving objects and cannot count people who are stationary in the scene.

Convolution Neural Networks (CNN)
Deep Convolutional Neural Networks have achieved remarkable success in various fields of computer vision such as detection, classification, and semantic segmentation. Researchers have proposed different deep learning frameworks for crowd counting in the recent two years. For crowd counting, Wang et al. [60] proposed the first regression-based CNN model. Fu et al. [22], on the other hand, proposed a deep convolutional network that classifies the input image into five classes. Shang et al. [53] leverage contextual information at both local and global levels to estimate the crowd count by employing an end-to-end CNN. Zhang et al. [63] proposed an architecture that consists of multiple columns, where each column implements a CNN with a different receptive field to capture scale variations caused by perspective distortions. The network takes an input image of arbitrary size and predicts the corresponding density map. Onoro-Rubio et al. [45] proposed a scale-aware crowd density estimation model, Hydra CNN, that estimates crowd densities in complex crowded scenes without the need for geometric information about the scene. Sang et al. [52] propose a method, namely, SaCNN, that estimates high-quality density maps, where the crowd count is obtained by integrating these density maps. Sindagi et al. [55] proposed an end-to-end cascaded CNN that simultaneously estimates the crowd count and density maps. Liu et al. [36] proposed DecideNet, which separately generates two different density maps. An attention module is used to obtain the final crowd count from these two density maps. Zhang et al. [63] estimated the number of people in a single image using a CNN regression model with two configurations. One is a network that estimates the head count from a given image, while the other constructs the density map of the crowd. The final count is obtained by integrating both outputs. Kang et al. [31] proposed a crowd segmentation approach by constructing a fully convolutional neural network (FCNN) based on both appearance features and motion features. They used one layer of convolution kernels instead of the fully connected layers in the original CNNs to define the labels at each pixel in the segmentation map. The output is a segmentation map with a crowd probability at each pixel. Boominathan et al. [4] use a fully convolutional network and combine deep and shallow networks to estimate the crowd density from crowded images. The shallow network was designed with three convolutional layers to detect the small head blobs arising from people far from the camera. They concatenated the predictions from both networks to predict the crowd density. The crowd count was then obtained by a linear summation of the peaks of the predicted density map.

Proposed Methodology
In this section, we provide the details of the proposed framework for estimating the crowd count in complex scenes. The proposed crowd counting architecture, shown in Fig. 2, uses feature maps from different levels, which represent features at different scales and are fused together for the crowd counting task. Generally, the proposed architecture follows the pipeline of popular crowd density estimation methods [35,59,60] that comprise two networks. The first network is an encoder that takes the input image and extracts multilevel features from it, and the second network is a decoder that generates the density map. The final density map represents the per-pixel count of people in the input image (Fig. 3).
The goal of the counting framework is to estimate the distribution of people in the input image by optimizing a defined loss function. Similar to other crowd counting frameworks, ours also follows the architecture of VGG-16 [54]. VGG-16 is a popular state-of-the-art CNN that has achieved tremendous success in numerous image classification tasks. The architecture of VGG-16 is divided into five convolutional blocks, where each convolutional block is followed by a max-pooling layer. The filter size of all convolutional layers is set to the smallest size of 3 × 3 pixels with a stride of 1. The size of each max-pooling layer is 2 × 2 with a stride of 2.
The first convolutional block is represented by conv1_x and comprises two convolutional layers with a filter size of 3 × 3 and 64 channels each. The first convolutional block is succeeded by a max-pooling layer. The second convolutional block is represented by conv2_x and also consists of two convolutional layers with the same filter size (3 × 3); the number of channels in each convolutional layer of block conv2_x is 128. The second convolutional block is followed by another max-pooling layer of size 2 × 2 and stride of 2. The third convolutional block conv3_x comprises three convolutional layers, each with a filter size of 3 × 3 and 256 channels. A max-pooling layer is applied after the fourth convolutional block (conv4_x), which consists of three convolutional layers with a filter size of 3 × 3 and 512 channels each. The fifth convolutional block (conv5_x) is similar to conv4_x and is followed by the fifth max-pooling layer. The stack of five convolutional blocks is followed by three fully connected layers. The spatial resolution is halved by the max-pooling layer that follows each convolutional block; inside a block, however, the spatial resolution of the feature map remains intact.

Fig. 2 Architecture of proposed framework for crowd counting and density estimation. The input is an image of arbitrary size and the output is a density map. The summation of the density map provides the final crowd count in the given image
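The block layout described above can be traced with a few lines of Python. This is only an illustrative sketch: the helper name `trace_vgg16_shapes` is hypothetical, and it assumes the standard VGG-16 padding of 1 (so that 3 × 3 convolutions preserve the spatial size) and takes each block's output before its max-pooling layer.

```python
# (block name, number of conv layers, output channels) as described in the text.
VGG16_BLOCKS = [
    ("conv1_x", 2, 64),
    ("conv2_x", 2, 128),
    ("conv3_x", 3, 256),
    ("conv4_x", 3, 512),
    ("conv5_x", 3, 512),
]


def trace_vgg16_shapes(height, width):
    """Return {block: (channels, H, W)} for each block output, before pooling.

    Assumption: 3x3 convolutions use stride 1 and padding 1, so they keep
    the spatial size; only the 2x2 / stride-2 max pooling changes it.
    """
    shapes = {}
    h, w = height, width
    for name, _num_layers, channels in VGG16_BLOCKS:
        shapes[name] = (channels, h, w)
        # Max pooling after the block halves both spatial dimensions.
        h, w = h // 2, w // 2
    return shapes


shapes = trace_vgg16_shapes(224, 224)
for name, shape in shapes.items():
    print(name, shape)
```

For a 224 × 224 input, conv3_x, conv4_x, and conv5_x come out at 56 × 56, 28 × 28, and 14 × 14, which is the factor-of-two relationship the fusion step relies on.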
Crowd counting methods [59,60] utilize the features from the last (fifth) convolutional block. The last layers of deep neural networks contain rich contextual information but few details about small objects, due to their large receptive fields. These higher order layers can be used to capture the global context of the scene. Furthermore, the resolution of the feature maps is reduced after successive convolutional and pooling layers, which results in poor localization. On the other hand, the shallow layers contain rich information about small objects due to their small receptive fields. The resolution of the feature maps of these layers is large; however, the feature maps are noisy and require further processing to make them suitable for feature extraction. Since existing crowd counting methods use the last convolutional layer for feature extraction, they are unable to capture the details of small objects in high-density crowds.
Unlike other methods that use the feature map of the last convolutional layer for crowd counting, we use multiple feature maps from different layers to capture the details of small, medium, and large objects. For this purpose, we use a fusion strategy that fuses the feature maps from the shallow and top layers. We also incorporate the feature maps of mid-level convolutional layers. We assume that utilizing feature maps from different convolutional layers helps the crowd counting task to achieve higher accuracy.
We follow the fusion strategy of [39] and fuse the feature maps of conv3_x, conv4_x, and conv5_x. The sizes and numbers of channels of these feature maps are different.
More precisely, the size of the feature map of conv5_x is 1/2 of that of conv4_x. Similarly, the size of the feature map of conv4_x is 1/2 of that of conv3_x. To effectively fuse feature maps of different resolutions, we perform two steps. First, we up-sample the higher order layers, i.e., conv4_x and conv5_x, by employing a transposed convolutional layer. Second, to make the number of channels of the different feature maps equal, we apply a 1 × 1 convolutional layer after the transposed convolutional layer.
To combine conv4_x and conv5_x, we apply a transposed convolutional layer of size 2 × 2 to conv5_x to make its size equal to that of conv4_x, and then fuse the two maps. Let F_1 denote the fused feature map. We then apply a 2 × 2 transposed convolutional layer to F_1 to make its size equal to that of the feature map of conv3_x. To make the number of channels equal, we apply a 1 × 1 convolutional layer with 512 channels to conv3_x. We then fuse the feature maps of conv3_x and F_1. Let F_2 denote the resultant fused feature map. To further suppress the aliasing effect, we apply a 1 × 1 convolutional layer to the fused feature map. The feature map now combines rich semantics from the deeper layers with fine-grained information about small objects from the shallow layers. The final feature map is provided as input to the prediction layer, which employs a 1 × 1 convolution and generates a density value for each pixel of the feature map. We then up-sample the final feature map by bilinear interpolation to make its size equal to the size of the input image.
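The fusion steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the text does not say how two maps are "fused", so element-wise addition is assumed here; the weights are random rather than learned; the channel counts are scaled down from 256/512 to 4/8 to keep the demo light; and the final bilinear up-sampling to the input size is omitted. All function and parameter names are hypothetical.

```python
import numpy as np


def conv1x1(x, weight):
    """1x1 convolution. x: (C_in, H, W), weight: (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)


def transposed_conv2x2(x, weight):
    """2x2 transposed convolution with stride 2 (no overlap between outputs).

    x: (C_in, H, W), weight: (C_in, C_out, 2, 2) -> output (C_out, 2H, 2W).
    Each input pixel (i, j) writes a 2x2 block at rows 2i..2i+1, cols 2j..2j+1.
    """
    c_out = weight.shape[1]
    _, h, w = x.shape
    blocks = np.einsum("chw,coab->ohawb", x, weight)
    return blocks.reshape(c_out, 2 * h, 2 * w)


def init_params(c3, c, seed=0):
    """Random stand-ins for learned weights (c3: conv3 channels, c: fused channels)."""
    rng = np.random.default_rng(seed)

    def rand(*shape):
        return rng.standard_normal(shape) * 0.1

    return {
        "up5": rand(c, c, 2, 2), "proj5": rand(c, c),
        "up_f1": rand(c, c, 2, 2), "proj_f1": rand(c, c),
        "lat3": rand(c, c3), "smooth": rand(c, c), "predict": rand(1, c),
    }


def fuse(conv3, conv4, conv5, p):
    # Up-sample conv5_x to the size of conv4_x, then match channels with 1x1.
    up5 = conv1x1(transposed_conv2x2(conv5, p["up5"]), p["proj5"])
    f1 = conv4 + up5  # assumption: fusion by element-wise addition
    # Up-sample F_1 to the size of conv3_x.
    up_f1 = conv1x1(transposed_conv2x2(f1, p["up_f1"]), p["proj_f1"])
    # 1x1 convolution lifts conv3_x to the fused channel count.
    f2 = conv1x1(conv3, p["lat3"]) + up_f1
    f2 = conv1x1(f2, p["smooth"])  # 1x1 convolution against aliasing
    return conv1x1(f2, p["predict"])  # one density value per pixel


c3, c = 4, 8  # scaled-down stand-ins for 256 and 512 channels
params = init_params(c3, c)
rng = np.random.default_rng(1)
conv3 = rng.standard_normal((c3, 16, 16))
conv4 = rng.standard_normal((c, 8, 8))
conv5 = rng.standard_normal((c, 4, 4))
density = fuse(conv3, conv4, conv5, params)
print(density.shape)  # (1, 16, 16)
```

The predicted map has the spatial size of conv3_x; in a full model it would then be resized to the input resolution, and concatenation could replace addition if that is what the fusion strategy of [39] prescribes.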

Training and Implementation Details
We now discuss the implementation and training details of the framework. Let S = {s_1, s_2, … , s_n} represent the n training images. We divide each image s_i into patches, each of size l × m. Let P = {p_1, p_2, … , p_m} represent the m patches involved in training the network. With each patch p_i, we associate a density level d_i that represents the total number of people in patch p_i. We randomly choose patches for training and provide them as input to the proposed CNN. We employ a regression method to learn the features that represent the crowd count in a patch. However, during the training phase, we observed a data imbalance problem. The reason is that, in crowd counting datasets, the ground truth is always provided as point annotations, where each point corresponds to the location of a pedestrian in the scene. High-density crowds usually contain a few thousand people. This means that we can generate a few thousand positive samples, while most of the pixels belong to the background. As a result, the number of negative samples is about a thousand times greater than the number of positive samples. This creates a data imbalance problem, which leads to poor generalization of the crowd counting model.
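The patch-level density levels d_i can be illustrated with a small pure-Python sketch. It assumes non-overlapping patches and (row, column) point annotations, neither of which is specified in the text; the function name `patch_counts` is hypothetical.

```python
import random


def patch_counts(points, height, width, patch_h, patch_w):
    """Count annotated heads in each non-overlapping patch.

    points: iterable of (row, col) head annotations.
    Returns {(patch_row, patch_col): count}. Points that fall in the ragged
    border (when the image size is not a multiple of the patch size) are ignored.
    """
    rows, cols = height // patch_h, width // patch_w
    counts = {(r, c): 0 for r in range(rows) for c in range(cols)}
    for y, x in points:
        key = (y // patch_h, x // patch_w)
        if key in counts:
            counts[key] += 1
    return counts


annotations = [(5, 5), (6, 7), (50, 50)]
levels = patch_counts(annotations, height=64, width=64, patch_h=32, patch_w=32)
print(levels)

# Randomly choose patches for a training batch (seeded for repeatability).
random.seed(0)
batch = random.sample(sorted(levels), k=2)
```

Most patches in such a grid have a count of zero, which is a small-scale view of the imbalance between background and head pixels discussed above.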
To address this problem, we follow the methods in [5,55,63] and convert the point annotations into a density map. Let H(x) = \sum_{i=1}^{M} \delta(x - x_i) denote the annotation map, which places a unit impulse at each of the M annotated head positions x_i. The density function is

$$D(x) = \sum_{i=1}^{M} \delta(x - x_i) * G_{\sigma}(x) \tag{1}$$

where $\sigma$ is the variance of the kernel G. The above density function is suitable for scenes captured from an orthogonal view. Such scenes do not suffer from perspective distortions, so the size of the analyzed objects is constant. Pedestrian crowd scenes, however, do not satisfy this assumption, since the camera is usually installed at a tilted position. Such a camera configuration causes perspective distortions, due to which the same objects appear with different sizes at different locations of the scene. To address this problem, we use a Gaussian kernel that compensates for perspective distortions to generate the density map [63]. To obtain a continuous density function, we convolve H with G as in Eq. 2:

$$D(x) = H(x) * G(x) \tag{2}$$

The sum of the peaks of the density map represents the crowd count in the given image. The figure shows the original images and their corresponding ground-truth crowd density maps. We then define the training loss function L_E through which the network learns its set of parameters $\Theta$. The loss function L_E is the Euclidean loss that measures the distance between the true and predicted density maps and is formulated as follows:

$$L_E(\Theta) = \frac{1}{2N} \sum_{j=1}^{N} \left\| F(S_j; \Theta) - D_j \right\|_2^2 \tag{3}$$

where $\Theta$ represents the parameters of the network learned during the training process, N represents the number of images used for training, S_j is the current image, F(S_j; \Theta) is the predicted density map, and D_j is the ground-truth density map of image j. The optimization of Eq. 3 provides a high-quality density map that yields an accurate crowd count.
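The density map construction and the Euclidean loss can be sketched in NumPy as below. This is a simplified illustration: it uses a single fixed sigma rather than the perspective-compensating kernel of [63], it renormalizes kernels clipped at the image border so that the map still sums to the head count, and it assumes the common 1/(2N) normalization of the loss. The function names and the default sigma are assumptions.

```python
import numpy as np


def gaussian_kernel(size, sigma):
    """Normalized 2D Gaussian kernel of odd side length `size`."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    kernel = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    return kernel / kernel.sum()


def density_map(points, height, width, sigma=4.0):
    """Place one unit-mass Gaussian at every (row, col) head annotation.

    Assumes integer coordinates inside the image. Kernels that cross the
    border are clipped and renormalized, so the map sums to len(points).
    """
    size = int(6 * sigma) | 1  # odd kernel size covering about +/- 3 sigma
    r = size // 2
    kernel = gaussian_kernel(size, sigma)
    dmap = np.zeros((height, width))
    for row, col in points:
        r0, r1 = max(row - r, 0), min(row + r + 1, height)
        c0, c1 = max(col - r, 0), min(col + r + 1, width)
        patch = kernel[r0 - row + r:r1 - row + r, c0 - col + r:c1 - col + r]
        dmap[r0:r1, c0:c1] += patch / patch.sum()
    return dmap


def euclidean_loss(predicted, target):
    """Euclidean loss over a batch of density maps of shape (N, H, W),
    with the assumed 1/(2N) normalization."""
    n = predicted.shape[0]
    return float(np.sum((predicted - target) ** 2) / (2.0 * n))


heads = [(10, 10), (20, 30), (0, 0)]
gt = density_map(heads, height=40, width=50)
print(round(gt.sum(), 6))  # sums to the number of annotated heads
print(euclidean_loss(gt[None], np.zeros((1, 40, 50))))
```

Summing the ground-truth map recovers the annotated count, which is why the predicted map can be summed in the same way to obtain the crowd count.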

Experiment Results
In this section, we evaluate and compare the performance of the proposed framework with other existing methods on three publicly available datasets, namely, UCF_CROWD_50 [29], UCF_QNRF [30], and ShanghaiTech [63].

Datasets
We provide the details of each dataset as follows: UCF_CROWD_50 is the first high-density crowd dataset, proposed by Idrees et al. [29] for evaluating crowd counting models. The ShanghaiTech dataset was first introduced by Zhang et al. [63]. The dataset contains 1198 annotated images with 330,165 point annotations in total. At that time, the dataset was considered one of the largest due to its large number of annotations. The dataset is divided into two parts, i.e., Part A and Part B. There are 482 images in Part A, which are collected from different sources over the Internet. There are 716 images in Part B, which are collected from the busy metropolitan areas of Shanghai. There is significant variation between the densities of the two parts. Generally, Part A contains images with higher densities than Part B. This significant variation in crowd densities makes it challenging for crowd counting models to accurately estimate the count in images of varying densities. For training and testing, we follow the convention adopted in [63]. The authors divided Part A into 300 training images, and the remaining 182 images are used for testing. The training set of Part B contains 400 images, while 316 images are reserved for testing. Figure 4 shows sample images from each dataset.

Evaluation Metrics
To quantitatively evaluate the performance of crowd counting models, we use the mean absolute error (MAE) and mean square error (MSE), following the convention adopted in [30,63]. MAE and MSE are formulated in Eqs. 4 and 5, respectively, as follows:

$$\mathrm{MAE} = \frac{1}{N} \sum_{n=1}^{N} \left| P_n - G_n \right| \tag{4}$$

$$\mathrm{MSE} = \sqrt{\frac{1}{N} \sum_{n=1}^{N} \left( P_n - G_n \right)^2} \tag{5}$$

where N represents the number of test images, P_n is the predicted crowd count, and G_n is the ground-truth crowd count of pedestrians in image n. We use the above evaluation metrics to compare the proposed approach with other reference methods on all three benchmark datasets.
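The two metrics can be computed as in the short sketch below. It assumes the convention common in the crowd counting literature, where the quantity reported as MSE includes a square root; the function names and the example counts are purely illustrative and are not results from the paper.

```python
import math


def mae(predicted, ground_truth):
    """Mean absolute error between predicted and ground-truth counts."""
    n = len(predicted)
    return sum(abs(p - g) for p, g in zip(predicted, ground_truth)) / n


def mse(predicted, ground_truth):
    """Root of the mean squared count error (assumed convention for 'MSE')."""
    n = len(predicted)
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(predicted, ground_truth)) / n)


# Illustrative counts only.
predicted_counts = [100, 250, 40]
true_counts = [110, 240, 45]
print(round(mae(predicted_counts, true_counts), 3))  # 8.333
print(round(mse(predicted_counts, true_counts), 3))  # 8.66
```

MAE reflects the average accuracy of the counts, while the squared term in MSE penalizes large errors on individual images more heavily.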

Comparison with State-of-the-Art Methods
We compare the proposed framework with the four most related methods, i.e., Rodriguez et al. [49], Idrees et al. [28], Lempitsky et al. [35], and Zhang et al. [62], on the UCF_CROWD_50 dataset. We report the quantitative results of each method in Table 1. Rodriguez et al. [49] perform worse than the other methods. This is because the method is detection based and relies on the performance of the detector [20] used in the framework. We observe that detection is not a viable solution for crowd counting in high-density crowds, since it is challenging to detect heads in such crowds due to intraclass variations in the scale, appearance, and pose of human heads. Lempitsky et al. [35] learn a density regression model using SIFT features and optimize the MESA distance between the predicted density map and the ground truth. On the other hand, Zhang et al. [62] achieve comparable values of MAE and MSE, since the authors propose a CNN-based crowd counting model that relies on hierarchical features instead of hand-crafted features. However, their model is patch-based and incurs considerable computational cost during training and testing.
In Table 2, we compare the proposed framework with other reference methods on the ShanghaiTech dataset. The table shows that the proposed framework outperforms the other reference methods by achieving lower values of MAE and MSE. Chen et al. [10] achieve relatively higher values of MAE and MSE. The method uses traditional Local … Sindagi et al. [55] propose a cascaded convolutional network to learn two tasks, i.e., crowd count and density map estimation. The network takes an image of arbitrary size and outputs a density map. Tang et al. [58] propose a fusion CNN that has two key stages. The first stage adopts a deep-fusion network to estimate the crowd density, and the second stage employs regression to estimate the count. Han et al. [24] trail the proposed framework by a slight margin. The authors adopt a divide-and-conquer strategy: instead of estimating the count from the whole image, they divide the image into multiple overlapping patches. From each patch, hierarchical features are extracted by a CNN, followed by a fully connected network that regresses the count in the patch. A Markov random field is then applied to smooth the counting results in adjacent patches. From Table 2, we further observe that deep learning methods produce better results than the traditional statistical models. Among the deep learning models, our proposed framework achieves the best results on both Part_A and Part_B, which indicates that multilayer fusion is effective for accurately estimating the crowd count, since it combines high-level semantic information from the higher layers with information about small objects from the lower layers. From the table, we further observe that Part_A is more challenging than Part_B, as most of the methods produce higher MAE and MSE values on Part_A than on Part_B. Table 3 shows the comparison results of different methods on the UCF_QNRF dataset.
The table shows that the proposed method outperforms other state-of-the-art methods by producing lower values of MAE and MSE. Switching CNN [51] produces comparable results. The method handles multiscale variations by leveraging the variation of crowd densities in different locations of the input image. It uses multiple regressors, each with a different receptive field, to capture scale variations in the image. A switch classifier routes each patch to the best CNN regressor based on its density level. We also visualize the predicted results and the corresponding ground truth on the three datasets in Figs. 5, 6, and 7.

Discussion
From the experimental results, we observe that the distance of the camera from the scene and the camera viewpoint are the main causes of the scale problem. Due to the scale problem, humans near the camera appear larger than humans far from it. To capture such variations in the sizes of human heads, it is important to design a network that can handle scale variations in the image. Zhang et al. [63] propose a CNN that addresses the scale problem using a three-column CNN structure. The model produces good results; however, training the three columns of the network independently is computationally costly. Other reference methods use a limited and fixed scale range and, therefore, lose the ability to learn a generalized model. Furthermore, these methods employ regression techniques to regress the crowd count or the crowd density map directly from the image. The performance of these approaches is limited for two main reasons: (1) These approaches utilize the feature map of the last convolutional layer, which contains rich contextual information about the scene but little information about small objects. (2) These approaches use CNNs with successive pooling layers that reduce the resolution of the final crowd density map, which leads to the loss of crucial information, especially in images of high-density crowds with large variations in scale. By contrast, the proposed framework addresses the shortcomings of previous models by adopting an affordable and effective way of dealing with scale variations. We assume that the higher layers contain rich information about people near the camera, while the lower layers contain information about people far away from the camera. We fuse the feature maps of these layers (of different depths) to adapt to the scale variations in the input image. We use the final fused map to learn a mapping function between the heads in the image and the crowd count. From the empirical evidence, we observe that the proposed fusion strategy learns multiscale discriminative features and achieves better results than other state-of-the-art methods.

Method | MAE | MSE
Idrees et al. [28] | 315.0 | 508.0
Sindagi et al. [55] | 252.0 | 514.0
Switching CNN [51] | 228.0 | 445.0
Encoder-Decoder [3] | 270.0 | 478.0
MCNN [63] | 277 |

Figure: Visualization of ground truth and predicted density maps of sample frames selected from the UCF_CC_50 dataset

Conclusion
We presented a deep convolutional neural network that overcomes the problem of scale variations by fusing information from shallow to deep layers. The framework estimates a density map and obtains the final crowd count by integrating the peaks in the density map. We evaluated the proposed framework on different publicly available benchmark datasets. The experimental results demonstrate the effectiveness of the proposed approach. However, we observed that the predicted counts of the proposed framework are still far from the ground truth. This is because, in the analyzed high-density crowd datasets, humans are hard to recognize in some images due to the low resolution and the extremely small size of human heads. In the future, we plan to propose a network that handles these challenges to achieve higher accuracy.