Humans can perceive objects only in the visible spectrum, which limits perception in poor weather or low-illumination conditions. These limitations are usually addressed through advances in thermographic imaging. However, thermal cameras have poor spatial resolution compared to RGB cameras. Super-resolution (SR) techniques are commonly used to improve the overall quality of low-resolution images, and Computer Vision research has shifted markedly towards SR techniques aimed specifically at thermal images. This paper analyzes the performance of three deep learning-based state-of-the-art SR algorithms, namely the Enhanced Deep Super-Resolution network (EDSR), the Residual Channel Attention Network (RCAN) and the Residual Dense Network (RDN), on thermal images. The algorithms were trained from scratch for upscaling factors of ×2 and ×4. The dataset was generated from two thermal imaging sequences of the BU-TIV benchmark; the sequences contain both sparse and highly dense crowds with a far-field camera view. The trained models were then used to super-resolve unseen test images. Quantitative analysis of the test images was performed using the common image quality metrics PSNR, SSIM and LPIPS, while qualitative analysis was provided by evaluating the effectiveness of the algorithms for a crowd counting application. After only 54 epochs for RCAN and 51 for RDN, the approaches were able to achieve average scores of 37.878, 0.986 and 0.0098, and 30.175, 0.945 and 0.0636, for PSNR, SSIM and LPIPS respectively. The EDSR algorithm took the least computation time during both training and testing because of its simple architecture. This research shows that reasonable accuracy can be achieved with fewer training epochs when an application-specific dataset is carefully selected.
- Crowd counting
- Super resolution
- Residual networks
- Thermal imagery
Normal human perception is limited to the visible light spectrum, which spans wavelengths of roughly 380 nm to 700 nm on the electromagnetic spectrum. Visible cameras are designed around the same wavelength range and replicate human vision by capturing RGB wavelengths for color representation. However, like the human eye, these systems are affected by poor weather (such as fog, smoke, haze or storms) and low-illumination conditions, which limits their use to daytime scenarios. Beyond the visible spectrum lies the infrared region, which humans cannot see. With advancements in infrared/thermal imaging technologies, humans have extended their range of vision: these technologies enable sight in the most challenging situations, such as low visibility due to extreme weather or low illumination. This is possible because thermal cameras are essentially heat sensors that capture heat signatures from different objects. However, the spatial resolution of these imaging technologies is low compared to that of visible cameras. Higher resolution is particularly important because it allows small details in an image to be captured. Spatial resolution can be increased by using a high-end camera, but that makes for a costly proposition. Many researchers therefore use image super-resolution (SR) techniques to reconstruct high-resolution images from low-resolution inputs. SR predicts and fills in details in a low-resolution image such that the output is an image of higher resolution; the resolution of the input image is increased according to the scaling factor used for super-resolving the image.
SR is widely used in many computer vision applications, from surveillance and security and small-object detection and tracking to medical imaging. Most research on image SR has focused on images captured with visible cameras. However, surveillance environments are now commonly monitored using infrared/thermal cameras, and Computer Vision researchers are increasingly interested in the use of thermal images for a variety of applications [3,4,5]. The same trend can be observed in SR applications using thermal images [6,7,8,9]. Rivadeneira et al. proposed a Convolutional Neural Network (CNN) based approach to compare the performance of image SR on both thermal and visible images. Their experiments showed that the network trained with thermal images outperformed the one trained with visible images.
This research work focuses on the use of deep Residual Networks (ResNets) to perform SR of thermal images. ResNets have contributed significantly to solving various image SR problems: a simple modification to traditional Convolutional Neural Networks (CNNs) allows much larger networks to be trained with increased accuracy. This paper provides a performance evaluation of some of the most popular ResNet architectures on the SR of thermal images. Chudasama et al. provided a detailed comparison using the Thermal Image Super Resolution (TISR) challenge dataset, presenting results of different state-of-the-art algorithms trained on this challenging dataset. Detailed analysis of large image datasets with state-of-the-art algorithms is, however, computationally expensive, as it requires a high number of training epochs.
This research work specifically targets the application of crowd counting based on thermal images super-resolved with the Enhanced Deep Super-Resolution network (EDSR), the Residual Channel Attention Network (RCAN) and the Residual Dense Network (RDN). The algorithms were trained from scratch on fixed camera views obtained from two video sequences of the BU-TIV benchmark dataset that are suitable for the crowd counting application. The super-resolved images generated by these algorithms were then used to count the number of persons using pretrained model weights. To obtain a ground-truth person count, the count was also predicted on the original ground-truth images using the same weights. The main contributions of this paper are as follows:
- A detailed comparative analysis of three of the most popular ResNet-based architectures, i.e., EDSR, RCAN and RDN, for thermal image SR with inexpensive training dynamics, applied to crowd analysis. Crowd counting is performed on both sparse and highly dense crowds of static and dynamic nature, with added complexities due to far-field camera viewing angles.
- Selection of a suitable application-specific dataset with fixed camera views for thermal image SR analysis in crowd counting applications. The sub-dataset is carefully selected to include both near- and far-field sparse and highly dense crowds for better visualization and understanding.
The rest of the paper is organized as follows. Section 2 provides a brief overview of related research. Section 3 presents the working methodology of the compared algorithms and gives details of the implementation. Section 4 discusses the experimental results. Finally, Sect. 5 concludes the paper and outlines future research directions.
2 Related Work
SR of thermal images has attracted considerable interest among researchers working in this area. One of the earliest approaches employed the Huber norm with bilateral Total Variation (TV) regularization, known as the Huber Total Variation (HTV) approach. Chen et al. proposed using a visible camera to guide the SR of thermal images; tested on their own dataset, the approach showed reasonable performance while avoiding the traditional over-texturing problem. Hans et al. proposed an SR algorithm for thermal images based on sparse representation, whose results showed good reconstruction performance without introducing major counterfeit artifacts. Cascarano et al. proposed an SR algorithm that can handle both single and multiple images; it was tested on aerial and terrestrial thermal images and showed good performance.
The first deep learning implementation for image SR was presented by Dong et al. in 2014 with the introduction of SRCNN. In 2016, residual networks were introduced by He et al. for image recognition. Architectures based on residual networks were then explored for single-image SR, enabling significant advancements in the area. SRResNet, a 16-block-deep residual network, was introduced next, and Lim et al. improved on its architecture with EDSR. These networks became the backbone of much subsequent research on single-image SR using residual networks. A recent approach built on residual blocks as base units generated super-resolved images at ×2, ×3 and ×4 scales and showed good generalization capacity.
As the SR of thermal images is a relatively new research area, it needs to be explored in detail to build motivation for future research through specific applications. This research therefore gives an overview of application-specific SR algorithms implemented on thermal images.
3 Working Methodology of the Proposed Approach
3.1 Selected Algorithms
Three deep learning-based SR algorithms were compared in this study, all built on ResNet-based architectures. As discussed in Sect. 1, ResNets are a special case of CNNs in which small modifications to the traditional network, such as the addition of skip connections, enable the training of much larger networks. Notably, the resulting larger networks do not suffer the performance degradation seen in earlier deep CNNs and achieve even better accuracy, which makes them strong candidates for thermal image SR.
The SR algorithms used in this research are EDSR, RCAN and RDN. The EDSR network is inspired by SRResNet, the first ResNet used for image SR; it removes the batch normalization layers of SRResNet, which improves results. RCAN uses a residual-in-residual structure that allows very deep CNNs to be trained for image SR with major performance improvements. RDN introduced Residual Dense Blocks (RDBs), which extract local features using densely connected convolutional layers.
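As a concrete illustration of the common building block, the skip connection and batch-normalization-free design can be sketched in PyTorch as a minimal EDSR-style residual block. This is a sketch, not the exact implementation used here: the feature width of 64 matches the setup in Sect. 3.3, while the residual scaling factor is an optional stabilization trick from the EDSR paper, included for illustration.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """EDSR-style residual block: conv-ReLU-conv with an identity
    skip connection and no batch normalization (illustrative sketch)."""
    def __init__(self, n_feats=64, res_scale=0.1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(n_feats, n_feats, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(n_feats, n_feats, 3, padding=1),
        )
        self.res_scale = res_scale  # scales the residual branch only

    def forward(self, x):
        # identity shortcut: output = input + scaled residual features
        return x + self.body(x) * self.res_scale

x = torch.randn(1, 64, 32, 32)
y = ResBlock()(x)  # same shape as the input feature map
```

Stacking such blocks (16 in the experiments below) is what allows these networks to grow deep without the degradation problem of plain CNNs.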
Focal Inverse Distance Transform (FIDT) maps were used for the localization and counting of crowds. These maps accurately localize individuals without head overlaps, even in highly dense environments.
3.2 Dataset
The thermal images used in this research were extracted from video sequences of the BU-TIV dataset. Only two video sequences, Marathon-2 and Marathon-4, were suitable for the crowd counting application; both were captured by cameras fixed on an elevated platform, providing a good view for crowd counting. Figure 1 shows example frames from the Marathon-2 and Marathon-4 videos. Both sequences have a far-field view, which adds complexity to the crowd estimation process. They were selected to include both sparse and highly dense crowd environments, with both static and dynamic motion in multiple directions. The skewed camera angles and the small number of pixels per head make these sequences further challenging for crowd analysis, while providing an opportunity to explore the performance of SR under these challenging attributes.
The data was recorded with FLIR SC8000 cameras. A total of 3555 frames were extracted from the two video sequences and reshaped to 512 × 256 resolution; these served as ground-truth images. To construct low-resolution inputs for the ×2 and ×4 factors, the ground-truth images were downsampled by factors of two and four to 256 × 128 and 128 × 64 resolution, respectively. All images from both sequences were randomly shuffled to improve the generalizability of the models, and the dataset was then split into train, validation and test sets in an 80:10:10 ratio.
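The shuffle-and-split step can be sketched in plain Python. Since 3555 frames do not divide evenly into 80:10:10, the assignment of the remainder frame is an assumption; it is chosen here so that the test set holds the 355 images reported in the evaluation.

```python
import random

def split_dataset(frames, seed=0):
    """Randomly shuffle frames from both sequences, then split 80:10:10.
    The one remainder frame is assigned to the validation set (assumption)."""
    frames = list(frames)
    random.Random(seed).shuffle(frames)  # mix Marathon-2 and Marathon-4 frames
    n = len(frames)
    n_train = int(0.8 * n)   # 2844 of 3555
    n_test = int(0.1 * n)    # 355 test images, as reported
    train = frames[:n_train]
    val = frames[n_train:n - n_test]
    test = frames[n - n_test:]
    return train, val, test

frames = [f"frame_{i:04d}.png" for i in range(3555)]  # hypothetical file names
train, val, test = split_dataset(frames)
print(len(train), len(val), len(test))  # 2844 356 355
```

Shuffling before splitting ensures frames from both sequences, and from all portions of each sequence, appear in every split.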
3.3 Evaluation Setup
Training was done on an NVIDIA GeForce GTX 1080 Ti using the PyTorch deep learning framework. The training parameters were kept the same for all SR algorithms: the learning rate was fixed at 0.0001, and ADAM was used as the optimizer with β1, β2 and ε set to 0.9, 0.999 and 10⁻⁸, respectively. The networks were trained with the L1 loss function, and all used 16 residual blocks and 64 feature maps to keep the comparison fair. Training continued until the L1 loss reached 1.5 for ×2 upscaling and 3.0 for ×4 upscaling; these thresholds were selected because the loss curves started to plateau around these values. The weights were saved as soon as the L1 loss reached the threshold and were then used to super-resolve the 355 test images by factors of ×2 and ×4. Average Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index Measure (SSIM) and Learned Perceptual Image Patch Similarity (LPIPS) values were calculated on the test images. PSNR is the ratio between the maximum possible power of an image and the power of the corrupting noise that affects its quality; SSIM measures how similar two images are by comparing their luminance, contrast and structure; LPIPS evaluates a learned perceptual distance between two image patches.
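The shared optimizer and loss configuration, together with the threshold-based stopping rule, can be sketched in PyTorch as follows. This is a minimal sketch: the single-convolution model and the random tensors are placeholders for the real networks and dataloaders, not part of the actual experiment.

```python
import torch
import torch.nn as nn

# Placeholder model standing in for EDSR/RCAN/RDN (which use 16 residual
# blocks and 64 feature maps in the actual experiments).
model = nn.Conv2d(1, 1, 3, padding=1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4,
                             betas=(0.9, 0.999), eps=1e-8)
criterion = nn.L1Loss()  # all networks are trained with the L1 loss

loss_threshold = 1.5  # 1.5 for the x2 models, 3.0 for the x4 models
for epoch in range(200):
    lr_batch = torch.randn(4, 1, 32, 32)  # stand-in low-resolution batch
    hr_batch = torch.randn(4, 1, 32, 32)  # stand-in ground-truth batch
    loss = criterion(model(lr_batch), hr_batch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if loss.item() <= loss_threshold:
        # weights would be saved here and training stopped
        break
```

In the real pipeline the per-epoch training loss, rather than a single batch loss, is compared against the threshold.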
Crowd counting was performed on the super-resolved test images using FIDT maps, with the code run on the PyTorch backend. Pretrained weights from the University of Central Florida - Qatar National Research Fund (UCF-QNRF) dataset were used for the crowd count estimation.
4 Results and Discussion
The training results obtained for the ×2 upscaling factor are shown in Fig. 2. The L1 loss of RCAN reached the threshold of 1.5 in the fewest epochs, i.e., 72; RDN took 94 epochs, whereas EDSR training was stopped early at 199 epochs because its loss curve had plateaued. For the ×4 upscaling factor, training continued until the L1 loss crossed 3.0; the results are displayed in Fig. 3. In this case, RDN reached the threshold in only 51 epochs and RCAN in 54. The EDSR loss curve again began to plateau, so training was stopped early at 300 epochs.
Weights obtained at the reported epochs were used in the testing phase on 355 unseen images. The test results are displayed in Table 1. All algorithms achieve reasonable scores even though the maximum number of training epochs was only 300. An SSIM score close to 1.0, an LPIPS score close to 0.0 and a high PSNR value indicate images with high structural similarity that are nearly identical to the ground truth. Visual comparisons of the test results for the ×2 and ×4 upscaling factors are displayed in Fig. 4 and Fig. 5, respectively, with the frames evaluated using the PSNR, SSIM and LPIPS metrics. The algorithms accurately predict static features in the video frames, e.g., the parked cars and the road; the slight differences in scores stem from dynamic features, e.g., moving cars and pedestrians. All algorithms nevertheless improve on the bicubic-interpolated versions of the same images. Table 1 also reports the runtime of each algorithm on the 355 test images. EDSR generally did not perform as well as RCAN and RDN, but its simpler architecture gives it a considerably shorter execution time per step. For RCAN and RDN, the ×2 upscaling factor took more time than ×4, because the input images for ×2 upscaling have a higher resolution than those for ×4, as discussed in Sect. 3. It was also observed that application-specific datasets with fixed camera views are computationally efficient and yield robust results with relatively few epochs.
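Of the three metrics, PSNR can be written directly from its definition; SSIM and LPIPS require library implementations (e.g., `skimage.metrics.structural_similarity` and the `lpips` package), so only PSNR is sketched below, with hypothetical toy inputs.

```python
import numpy as np

def psnr(ref, test, data_range=255.0):
    """Peak Signal-to-Noise Ratio in dB: the ratio of the maximum possible
    signal power to the power of the corrupting noise (the MSE)."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(data_range ** 2 / mse)

ref = np.full((4, 4), 128.0)
noisy = ref + 10.0  # constant offset of 10 gray levels -> MSE = 100
print(round(psnr(ref, noisy), 2))  # 10*log10(255^2 / 100) ≈ 28.13
```

Higher PSNR means the super-resolved image is closer to the ground truth, which is why values around 30-38 dB in Table 1 indicate faithful reconstructions.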
Crowd counting using FIDT maps was performed on all super-resolved images obtained in the testing phase by each method. To establish the ground truth, the pretrained weights were used for crowd counting on the ground-truth images. The Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) scores obtained by counting on bicubic-upsampled images and on the outputs of the selected SR methods are presented in Table 2. RCAN performed better than all other selected SR algorithms at both ×2 and ×4 upscaling factors. A visual comparison of the crowd counts on the ground truth and the obtained results is provided in Fig. 6 for both upscaling factors.
5 Conclusion and Future Work
This paper investigated the performance of the state-of-the-art ResNet-based image SR algorithms EDSR, RCAN and RDN. Images from video sequences of the BU-TIV dataset were super-resolved by factors of ×2 and ×4, and PSNR, SSIM and LPIPS scores were used as evaluation metrics to compare the performance of each algorithm. Compared to bicubic-interpolated versions, all selected SR algorithms generated good results thanks to their ResNet-based architectures, which are proven to maintain accuracy at greater depths. With careful selection of a dataset containing a sufficient number of training images, the models performed well even with few epochs, the maximum being the 300 epochs used by EDSR for the ×4 upscaling factor. The paper also provided a qualitative analysis by observing performance on a crowd counting application in both sparse and highly dense crowd environments, where RCAN outperformed the other SR algorithms with the lowest MAE and RMSE values.
As future work, a similar analysis can be performed with image SR algorithms based on Generative Adversarial Networks (GANs). A completely new architecture could also be designed, tailored specifically for crowd counting using SR on thermal images. Similarly, multi-image SR methods could be explored and compared against single-image SR algorithms.
Data Availability Statement
All related data including dataset, trained model weights and high-resolution results are placed on following Google Drive link: https://drive.google.com/drive/folders/1LNLVVNCRDRIP__lN4DJSjxmws15BWcpw?usp=sharing (Last accessed on 15 Nov 21).
NASA. Visible Light | Science Mission Directorate. https://science.nasa.gov/ems/09_visiblelight. Accessed 15 Nov 2021
Kristoffersen, M., Dueholm, J., Gade, R., Moeslund, T.: Pedestrian counting with occlusion handling using stereo thermal cameras. Sensors 16(1), 62 (2016). https://doi.org/10.3390/s16010062
Fernandes, S.L., Rajinikanth, V., Kadry, S.: A hybrid framework to evaluate breast abnormality using infrared thermal images. IEEE Consum. Electron. Mag 8(5), 31–36 (2019). https://doi.org/10.1109/mce.2019.2923926
Ghose, D., et al.: Pedestrian detection in thermal images using saliency maps. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (2019)
Zeng, X., Miao, Y., Ubaid, S., Gao, X., Zhuang, S.: Detection and classification of bruises of pears based on thermal images. Postharv. Biol. Technol. 161, 111090 (2020). https://doi.org/10.1016/j.postharvbio.2019.111090
Patel, H., et al.: ThermISRnet: an efficient thermal image super-resolution network. Opt. Eng. 60(07) (2020). https://doi.org/10.1117/1.oe.60.7.073101
Ahmadi, S., et al.: Laser excited super resolution thermal imaging for nondestructive inspection of internal defects. Sci. Rep. 10(1) (2020). https://doi.org/10.1038/s41598-020-77979-y
Kuni Zoetgnande, Y.W., Dillenseger, J.-L., Alirezaie, J.: Edge focused super-resolution of thermal images. In: 2019 International Joint Conference on Neural Networks (IJCNN) (2019). https://doi.org/10.1109/ijcnn.2019.8852320
Raimundo, J., Lopez-Cuervo Medina, S., Prieto, J.F., Aguirre de Mata, J.: Super resolution infrared thermal imaging using Pansharpening algorithms: quantitative assessment and application to UAV thermal imaging. Sensors 21(4), 1265 (2020). https://doi.org/10.3390/s21041265
Rivadeneira, R.E., Suárez, P.L., Sappa, A.D., Vintimilla, B.X.: Thermal image SuperResolution through deep convolutional neural network. In: Karray, F., Campilho, A., Yu, A. (eds.) ICIAR 2019. LNCS, vol. 11663, pp. 417–426. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-27272-2_37
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016). https://doi.org/10.1109/cvpr.2016.90
Chudasama, V., et al.: TherISuRNet: a computationally efficient thermal image super-resolution network. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (2020)
Panagiotopoulou, A., Anastassopoulos, A.: Super-resolution reconstruction of thermal infrared images. In: Proceedings of the 4th WSEAS International Conference on Remote Sensing (2008)
Chen, X., Zhai, G., Wang, J., Hu, C., Chen, Y.: Color guided thermal image super resolution. In: 2016 Visual Communications and Image Processing (VCIP) (2016). https://doi.org/10.1109/vcip.2016.7805509
Jino Hans, W., Venkateswaran, N.: An efficient super-resolution algorithm for IR thermal images based on sparse representation. In: Proceedings of the 2015 Asia International Conference on Quantitative InfraRed Thermography (2015). https://doi.org/10.21611/qirt.2015.0092
Cascarano, P., et al.: Super-resolution of thermal images using an automatic total variation based method. Remote Sens. 12(10), 1642 (2020). https://doi.org/10.3390/rs12101642
Dong, C., Loy, C.C., He, K., Tang, X.: Learning a deep convolutional network for image super-resolution. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8692, pp. 184–199. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10593-2_13
Ledig, C., et al.: Photo-realistic single image super-resolution using a generative adversarial network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017)
Lim, B., et al.: Enhanced deep residual networks for single image super-resolution. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (2017)
Kansal, P., Nathan, S.: A multi-level supervision model: a novel approach for thermal image super resolution. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (2020)
Zhang, Y., Li, K., Li, K., Wang, L., Zhong, B., Fu, Y.: Image super-resolution using very deep residual channel attention networks. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11211, pp. 294–310. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01234-2_18
Zhang, Y., Tian, Y., Kong, Y., Zhong, B., Fu, Y.: Residual dense network for image super-resolution. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (2018). https://doi.org/10.1109/cvpr.2018.00262
Liang, D., Xu, W., Zhu, Y., Zhou, Y.: Focal inverse distance transform maps for crowd localization and counting in dense crowd. arXiv:2102.07925 [cs] (2021)
Wu, Z., Fuller, N., Theriault, D., Betke, M.: A thermal infrared video benchmark for visual analysis. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops (2014). https://doi.org/10.1109/cvprw.2014.39
Idrees, H., et al.: Composition loss for counting, density map estimation and localization in dense crowds. arXiv:1808.01050 [cs] (2018)
We acknowledge support from National Center of Big Data and Cloud Computing (NCBC) and Higher Education Commission (HEC) of Pakistan for conducting this research.
© 2022 The Author(s)
Rizvi, S.Z., Farooq, M.U., Raza, R.H. (2022). Performance Comparison of Deep Residual Networks-Based Super Resolution Algorithms Using Thermal Images: Case Study of Crowd Counting. In: Biele, C., Kacprzyk, J., Kopeć, W., Owsiński, J.W., Romanowski, A., Sikorski, M. (eds) Digital Interaction and Machine Intelligence. MIDI 2021. Lecture Notes in Networks and Systems, vol 440. Springer, Cham. https://doi.org/10.1007/978-3-031-11432-8_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-11431-1
Online ISBN: 978-3-031-11432-8