Humans can perceive objects only in the visible spectrum, which limits perception in poor weather or low-illumination conditions. These limitations are usually addressed through advances in thermographic imaging. However, thermal cameras have poor spatial resolution compared to RGB cameras. Super-resolution (SR) techniques are commonly used to improve the overall quality of low-resolution images, and Computer Vision research has shifted markedly towards SR techniques aimed specifically at thermal images. This paper analyzes the performance of three deep learning-based state-of-the-art SR algorithms, namely the Enhanced Deep Super-Resolution network (EDSR), the Residual Channel Attention Network (RCAN) and the Residual Dense Network (RDN), on thermal images. The algorithms were trained from scratch for upscaling factors of ×2 and ×4. The dataset was generated from two thermal imaging sequences of the BU-TIV benchmark; the sequences contain both sparse and highly dense crowds with a far-field camera view. The trained models were then used to super-resolve unseen test images. Quantitative analysis of the test images was performed using the common image quality metrics PSNR, SSIM and LPIPS, while qualitative analysis was provided by evaluating the effectiveness of the algorithms for a crowd counting application. After only 54 epochs for RCAN and 51 for RDN, the approaches were able to achieve average scores of 37.878, 0.986 and 0.0098, and 30.175, 0.945 and 0.0636, for PSNR, SSIM and LPIPS respectively. The EDSR algorithm took the least computation time during both training and testing because of its simple architecture. This research shows that reasonable accuracy can be achieved with fewer training epochs when an application-specific dataset is carefully selected.
- Crowd counting
- Super resolution
- Residual networks
- Thermal imagery
Normal human perception is limited to the visible light spectrum, which spans wavelengths of roughly 380 nm to 700 nm on the electromagnetic spectrum. Visible cameras are designed around the same wavelength range and replicate human vision by capturing RGB wavelengths for color representation. However, like the human eye, these systems are affected by poor weather (such as fog, smoke, haze or storms) and low-illumination conditions, which limits their use to daytime scenarios. Beyond the visible spectrum lies the infrared region, which humans cannot see. With advancements in infrared/thermal imaging technologies, humans have extended their range of vision: these technologies enable sight in the most challenging situations, such as low visibility due to extreme weather or low illumination. This is possible because thermal cameras are essentially heat sensors that capture heat signatures from different objects. However, the spatial resolution of these imaging technologies is low compared to that of visible cameras. Higher resolution is particularly important because it allows small details in an image to be captured. Spatial resolution can be increased by using a high-end camera, but that makes for a costly proposition. Many researchers therefore use image super-resolution (SR) techniques to reconstruct high-resolution images from low-resolution inputs. SR predicts and fills in details in a low-resolution image such that the output is an image of higher resolution; the resolution of the input image is increased according to the scaling factor used for super-resolving the image.
SR is widely used in many computer vision applications, from surveillance and security and small-object detection and tracking to medical imaging. Most research on image SR has focused on images captured with visible cameras. However, surveillance environments are now commonly monitored using infrared/thermal cameras, and Computer Vision researchers are increasingly interested in the use of thermal images for a variety of applications [3,4,5]. The same trend can be observed in SR applications using thermal images [6,7,8,9]. Rivadeneira et al. proposed a Convolutional Neural Network (CNN) based approach to compare the performance of image SR on both thermal and visible images. Their experiments showed that the network trained with thermal images outperformed the one trained with visible images.
This research work focuses on the use of deep Residual Networks (ResNets) to perform SR of thermal images. ResNets have contributed significantly to solving various image SR problems: a simple modification to traditional Convolutional Neural Networks (CNNs) allows much larger networks to be trained with increased accuracy. This paper provides a performance evaluation of some of the most popular ResNet architectures on the SR of thermal images. Chudasama et al. provided a detailed comparison using the Thermal Image Super Resolution (TISR) challenge dataset, presenting results of different state-of-the-art algorithms trained on this challenging dataset. Detailed analysis of large image datasets with state-of-the-art algorithms is, however, computationally expensive, as it requires a high number of training epochs.
This research work specifically targets the application of crowd counting based on thermal images super-resolved with the Enhanced Deep Super-Resolution network (EDSR), the Residual Channel Attention Network (RCAN) and the Residual Dense Network (RDN). The algorithms were trained from scratch on fixed camera views obtained from two video sequences of the BU-TIV benchmark dataset that are suitable for the crowd counting application. The super-resolved images generated by these algorithms were then used to count the number of persons using pretrained model weights. To obtain a ground-truth person count, the count was also predicted on the original ground-truth images using the same weights. The main contributions of this paper are as follows:
- A detailed comparative analysis of three of the most popular ResNet-based architectures, i.e., EDSR, RCAN and RDN, for thermal image SR with inexpensive training dynamics, applied to crowd analysis. Crowd counting is performed on both sparse and highly dense crowds of static and dynamic nature, with added complexities due to far-field camera viewing angles.
- Selection of a suitable application-specific dataset with fixed camera views for thermal image SR analysis in crowd counting applications. The sub-dataset is carefully selected to include both near- and far-field sparse and highly dense crowds for better visualization and understanding.
The rest of the paper is organized as follows. Section 2 provides a brief overview of related research. Section 3 presents the working methodology of the compared algorithms and gives details of the implementation. Section 4 discusses the experimental results. Finally, Sect. 5 concludes the paper and outlines future research directions.
2 Related Work
SR of thermal images has attracted considerable interest among researchers working in this area. One of the earliest approaches employed the Huber norm with bilateral Total Variation (TV) regularization, known as the Huber Total Variation (HTV) approach. Chen et al. proposed using a visible camera to guide the SR of thermal images; tested on their own dataset, the approach showed reasonable performance while avoiding the traditional over-texturing problem. Hans et al. proposed an SR algorithm for thermal images based on sparse representation, whose results showed good reconstruction performance without introducing major counterfeit artifacts. Cascarano et al. proposed an SR algorithm that can handle both single and multiple images; it was tested on aerial and terrestrial thermal images and showed good performance.
The first deep learning implementation for image SR was presented by Dong et al. in 2014 with the introduction of SRCNN. In 2016, residual networks were introduced by He et al. for image recognition. Architectures based on residual networks were then explored for single-image SR, enabling significant advancements in the area. SRResNet, a 16-block-deep residual network, was introduced next, and Lim et al. improved on its architecture with EDSR. These networks became the backbone of much subsequent research on single-image SR using residual networks. A recent approach built on residual blocks as base units generated super-resolved images at ×2, ×3 and ×4 scales and showed good generalization capacity.
As the SR of thermal images is a relatively new research area, it needs to be explored in detail to build motivation for future research through specific applications. This research therefore gives an overview of application-specific SR algorithms implemented on thermal images.
3 Working Methodology of the Proposed Approach
3.1 Selected Algorithms
Three deep learning-based SR algorithms were compared in this study, all built on ResNet-based architectures. As discussed in Sect. 1, ResNets are a special case of CNNs in which small modifications to the traditional network, such as the addition of skip connections, enable the training of much larger networks. Notably, the resulting larger networks do not suffer the performance degradation seen in earlier deep CNNs and achieve even better accuracy, which makes them strong candidates for thermal image SR.
The SR algorithms used in this research are EDSR, RCAN and RDN. The EDSR network is inspired by SRResNet, the first ResNet used for image SR; it removes the batch normalization layers of SRResNet, which improves results. RCAN uses a residual-in-residual structure that allows very deep CNNs to be trained for image SR with major performance improvements. RDN introduced Residual Dense Blocks (RDBs), which extract local features using densely connected convolutional layers.
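As a concrete illustration of the common building block, the skip connection and batch-normalization-free design can be sketched in PyTorch as a minimal EDSR-style residual block. This is a sketch, not the exact implementation used here: the feature width of 64 matches the setup in Sect. 3.3, while the residual scaling factor is an optional stabilization trick from the EDSR paper, included for illustration.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """EDSR-style residual block: conv-ReLU-conv with an identity
    skip connection and no batch normalization (illustrative sketch)."""
    def __init__(self, n_feats=64, res_scale=0.1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(n_feats, n_feats, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(n_feats, n_feats, 3, padding=1),
        )
        self.res_scale = res_scale  # scales the residual branch only

    def forward(self, x):
        # identity shortcut: output = input + scaled residual features
        return x + self.body(x) * self.res_scale

x = torch.randn(1, 64, 32, 32)
y = ResBlock()(x)  # same shape as the input feature map
```

Stacking such blocks (16 in the experiments below) is what allows these networks to grow deep without the degradation problem of plain CNNs.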
Focal Inverse Distance Transform (FIDT) maps were used for the localization and counting of crowds. These maps accurately localize individuals without head overlaps, even in highly dense environments.
3.2 Dataset
The thermal images used in this research were extracted from video sequences of the BU-TIV dataset. Only two video sequences, Marathon-2 and Marathon-4, were suitable for the crowd counting application; both were captured by cameras fixed on an elevated platform, providing a good view for crowd counting. Figure 1 shows example frames from the Marathon-2 and Marathon-4 videos. Both sequences have a far-field view, which adds complexity to the crowd estimation process. They were selected to include both sparse and highly dense crowd environments, with both static and dynamic motion in multiple directions. The skewed camera angles and the small number of pixels per head make these sequences further challenging for crowd analysis, while providing an opportunity to explore the performance of SR under these challenging attributes.
The data was recorded with FLIR SC8000 cameras. A total of 3555 frames were extracted from the two video sequences and reshaped to 512 × 256 resolution; these served as ground-truth images. To construct low-resolution inputs for the ×2 and ×4 factors, the ground-truth images were downsampled by factors of two and four to 256 × 128 and 128 × 64 resolution, respectively. All images from both sequences were randomly shuffled to improve the generalizability of the models, and the dataset was then split into train, validation and test sets in an 80:10:10 ratio.
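The shuffle-and-split step can be sketched in plain Python. Since 3555 frames do not divide evenly into 80:10:10, the assignment of the remainder frame is an assumption; it is chosen here so that the test set holds the 355 images reported in the evaluation.

```python
import random

def split_dataset(frames, seed=0):
    """Randomly shuffle frames from both sequences, then split 80:10:10.
    The one remainder frame is assigned to the validation set (assumption)."""
    frames = list(frames)
    random.Random(seed).shuffle(frames)  # mix Marathon-2 and Marathon-4 frames
    n = len(frames)
    n_train = int(0.8 * n)   # 2844 of 3555
    n_test = int(0.1 * n)    # 355 test images, as reported
    train = frames[:n_train]
    val = frames[n_train:n - n_test]
    test = frames[n - n_test:]
    return train, val, test

frames = [f"frame_{i:04d}.png" for i in range(3555)]  # hypothetical file names
train, val, test = split_dataset(frames)
print(len(train), len(val), len(test))  # 2844 356 355
```

Shuffling before splitting ensures frames from both sequences, and from all portions of each sequence, appear in every split.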
3.3 Evaluation Setup
Training was done on an NVIDIA GeForce GTX 1080 Ti using the PyTorch deep learning framework. The training parameters were kept the same for all SR algorithms: the learning rate was fixed at 0.0001, and ADAM was used as the optimizer with β1, β2 and ε set to 0.9, 0.999 and 10⁻⁸, respectively. The networks were trained with the L1 loss function, and all used 16 residual blocks and 64 feature maps to keep the comparison fair. Training continued until the L1 loss reached 1.5 for ×2 upscaling and 3.0 for ×4 upscaling; these thresholds were selected because the loss curves started to plateau around these values. The weights were saved as soon as the L1 loss reached the threshold and were then used to super-resolve the 355 test images by factors of ×2 and ×4. Average Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index Measure (SSIM) and Learned Perceptual Image Patch Similarity (LPIPS) values were calculated on the test images. PSNR is the ratio between the maximum possible power of an image and the power of the corrupting noise that affects its quality; SSIM measures how similar two images are by comparing their luminance, contrast and structure; LPIPS evaluates a learned perceptual distance between two image patches.
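The shared optimizer and loss configuration, together with the threshold-based stopping rule, can be sketched in PyTorch as follows. This is a minimal sketch: the single-convolution model and the random tensors are placeholders for the real networks and dataloaders, not part of the actual experiment.

```python
import torch
import torch.nn as nn

# Placeholder model standing in for EDSR/RCAN/RDN (which use 16 residual
# blocks and 64 feature maps in the actual experiments).
model = nn.Conv2d(1, 1, 3, padding=1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4,
                             betas=(0.9, 0.999), eps=1e-8)
criterion = nn.L1Loss()  # all networks are trained with the L1 loss

loss_threshold = 1.5  # 1.5 for the x2 models, 3.0 for the x4 models
for epoch in range(200):
    lr_batch = torch.randn(4, 1, 32, 32)  # stand-in low-resolution batch
    hr_batch = torch.randn(4, 1, 32, 32)  # stand-in ground-truth batch
    loss = criterion(model(lr_batch), hr_batch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if loss.item() <= loss_threshold:
        # weights would be saved here and training stopped
        break
```

In the real pipeline the per-epoch training loss, rather than a single batch loss, is compared against the threshold.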
Crowd counting was performed on the super-resolved test images using FIDT maps, with the code run on the PyTorch backend. Pretrained weights from the University of Central Florida - Qatar National Research Fund (UCF-QNRF) dataset were used for the crowd count estimation.
4 Results and Discussion
The training results obtained for the ×2 upscaling factor are shown in Fig. 2. The L1 loss of RCAN reached the threshold of 1.5 in the fewest epochs, i.e., 72; RDN took 94 epochs, whereas EDSR training was stopped early at 199 epochs because its loss curve had plateaued. For the ×4 upscaling factor, training continued until the L1 loss crossed 3.0; the results are displayed in Fig. 3. In this case, RDN reached the threshold in only 51 epochs and RCAN in 54. The EDSR loss curve again began to plateau, so training was stopped early at 300 epochs.
Weights obtained at the reported epochs were used in the testing phase on 355 unseen images. The test results are displayed in Table 1. All algorithms achieve reasonable scores even though the maximum number of training epochs was only 300. An SSIM score close to 1.0, an LPIPS score close to 0.0 and a high PSNR value indicate images with high structural similarity that are nearly identical to the ground truth. Visual comparisons of the test results for the ×2 and ×4 upscaling factors are displayed in Fig. 4 and Fig. 5, respectively, with the frames evaluated using the PSNR, SSIM and LPIPS metrics. The algorithms accurately predict static features in the video frames, e.g., the parked cars and the road; the slight differences in scores stem from dynamic features, e.g., moving cars and pedestrians. All algorithms nevertheless improve on the bicubic-interpolated versions of the same images. Table 1 also reports the runtime of each algorithm on the 355 test images. EDSR generally did not perform as well as RCAN and RDN, but its simpler architecture gives it a considerably shorter execution time per step. For RCAN and RDN, the ×2 upscaling factor took more time than ×4, because the input images for ×2 upscaling have a higher resolution than those for ×4, as discussed in Sect. 3. It was also observed that application-specific datasets with fixed camera views are computationally efficient and yield robust results with relatively few epochs.
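Of the three metrics, PSNR can be written directly from its definition; SSIM and LPIPS require library implementations (e.g., `skimage.metrics.structural_similarity` and the `lpips` package), so only PSNR is sketched below, with hypothetical toy inputs.

```python
import numpy as np

def psnr(ref, test, data_range=255.0):
    """Peak Signal-to-Noise Ratio in dB: the ratio of the maximum possible
    signal power to the power of the corrupting noise (the MSE)."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(data_range ** 2 / mse)

ref = np.full((4, 4), 128.0)
noisy = ref + 10.0  # constant offset of 10 gray levels -> MSE = 100
print(round(psnr(ref, noisy), 2))  # 10*log10(255^2 / 100) ≈ 28.13
```

Higher PSNR means the super-resolved image is closer to the ground truth, which is why values around 30-38 dB in Table 1 indicate faithful reconstructions.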
Crowd counting using FIDT maps was performed on all super-resolved images obtained in the testing phase by each method. To establish the ground truth, the pretrained weights were used for crowd counting on the ground-truth images. The Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) scores obtained by counting on bicubic-upsampled images and on the outputs of the selected SR methods are presented in Table 2. RCAN performed better than all other selected SR algorithms at both ×2 and ×4 upscaling factors. A visual comparison of the crowd counts on the ground truth and the obtained results is provided in Fig. 6 for both upscaling factors.
5 Conclusion and Future Work
This paper investigated the performance of the state-of-the-art ResNet-based image SR algorithms EDSR, RCAN and RDN. Images from video sequences of the BU-TIV dataset were super-resolved by factors of ×2 and ×4, and PSNR, SSIM and LPIPS scores were used as evaluation metrics to compare the performance of each algorithm. Compared to bicubic-interpolated versions, all selected SR algorithms generated good results thanks to their ResNet-based architectures, which are proven to maintain accuracy at greater depths. With careful selection of a dataset containing a sufficient number of training images, the models performed well even with few epochs, the maximum being the 300 epochs used by EDSR for the ×4 upscaling factor. The paper also provided a qualitative analysis by observing performance on a crowd counting application in both sparse and highly dense crowd environments, where RCAN outperformed the other SR algorithms with the lowest MAE and RMSE values.
As future work, a similar analysis can be performed with image SR algorithms based on Generative Adversarial Networks (GANs). A completely new architecture could also be designed, tailored specifically for crowd counting using SR on thermal images. Similarly, multi-image SR methods could be explored and compared against single-image SR algorithms.
Data Availability Statement
All related data including dataset, trained model weights and high-resolution results are placed on following Google Drive link: https://drive.google.com/drive/folders/1LNLVVNCRDRIP__lN4DJSjxmws15BWcpw?usp=sharing (Last accessed on 15 Nov 21).
NASA. Visible Light | Science Mission Directorate. https://science.nasa.gov/ems/09_visiblelight. Accessed 15 Nov 2021
Kristoffersen, M., Dueholm, J., Gade, R., Moeslund, T.: Pedestrian counting with occlusion handling using stereo thermal cameras. Sensors 16(1), 62 (2016). https://doi.org/10.3390/s16010062
Fernandes, S.L., Rajinikanth, V., Kadry, S.: A hybrid framework to evaluate breast abnormality using infrared thermal images. IEEE Consum. Electron. Mag 8(5), 31–36 (2019). https://doi.org/10.1109/mce.2019.2923926
Ghose, D., et al.: Pedestrian detection in thermal images using saliency maps. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (2019)
Zeng, X., Miao, Y., Ubaid, S., Gao, X., Zhuang, S.: Detection and classification of bruises of pears based on thermal images. Postharv. Biol. Technol. 161, 111090 (2020). https://doi.org/10.1016/j.postharvbio.2019.111090
Patel, H., et al.: ThermISRnet: an efficient thermal image super-resolution network. Opt. Eng. 60(07) (2020). https://doi.org/10.1117/1.oe.60.7.073101
Ahmadi, S., et al.: Laser excited super resolution thermal imaging for nondestructive inspection of internal defects. Sci. Rep. 10(1) (2020). https://doi.org/10.1038/s41598-020-77979-y
Kuni Zoetgnande, Y.W., Dillenseger, J.-L., Alirezaie, J.: Edge focused super-resolution of thermal images. In: 2019 International Joint Conference on Neural Networks (IJCNN) (2019). https://doi.org/10.1109/ijcnn.2019.8852320
Raimundo, J., Lopez-Cuervo Medina, S., Prieto, J.F., Aguirre de Mata, J.: Super resolution infrared thermal imaging using Pansharpening algorithms: quantitative assessment and application to UAV thermal imaging. Sensors 21(4), 1265 (2020). https://doi.org/10.3390/s21041265
Rivadeneira, R.E., Suárez, P.L., Sappa, A.D., Vintimilla, B.X.: Thermal image SuperResolution through deep convolutional neural network. In: Karray, F., Campilho, A., Yu, A. (eds.) ICIAR 2019. LNCS, vol. 11663, pp. 417–426. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-27272-2_37
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016). https://doi.org/10.1109/cvpr.2016.90
Chudasama, V., et al.: TherISuRNet: a computationally efficient thermal image super-resolution network. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (2020)
Panagiotopoulou, A., Anastassopoulos, A.: Super-resolution reconstruction of thermal infrared images. In: Proceedings of the 4th WSEAS International Conference on Remote Sensing (2008)
Chen, X., Zhai, G., Wang, J., Hu, C., Chen, Y.: Color guided thermal image super resolution. In: 2016 Visual Communications and Image Processing (VCIP) (2016). https://doi.org/10.1109/vcip.2016.7805509
Jino Hans, W., Venkateswaran, N.: An efficient super-resolution algorithm for IR thermal images based on sparse representation. In: Proceedings of the 2015 Asia International Conference on Quantitative InfraRed Thermography (2015). https://doi.org/10.21611/qirt.2015.0092
Cascarano, P., et al.: Super-resolution of thermal images using an automatic total variation based method. Remote Sens. 12(10), 1642 (2020). https://doi.org/10.3390/rs12101642
Dong, C., Loy, C.C., He, K., Tang, X.: Learning a deep convolutional network for image super-resolution. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8692, pp. 184–199. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10593-2_13
Ledig, C., et al.: Photo-realistic single image super-resolution using a generative adversarial network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017)
Lim, B., et al.: Enhanced deep residual networks for single image super-resolution. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (2017)
Kansal, P., Nathan, S.: A multi-level supervision model: a novel approach for thermal image super resolution. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (2020)
Zhang, Y., Li, K., Li, K., Wang, L., Zhong, B., Fu, Y.: Image super-resolution using very deep residual channel attention networks. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11211, pp. 294–310. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01234-2_18
Zhang, Y., Tian, Y., Kong, Y., Zhong, B., Fu, Y.: Residual dense network for image super-resolution. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (2018). https://doi.org/10.1109/cvpr.2018.00262
Liang, D., Xu, W., Zhu, Y., Zhou, Y.: Focal inverse distance transform maps for crowd localization and counting in dense crowd. arXiv:2102.07925 [cs] (2021)
Wu, Z., Fuller, N., Theriault, D., Betke, M.: A thermal infrared video benchmark for visual analysis. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops (2014). https://doi.org/10.1109/cvprw.2014.39
Idrees, H., et al.: Composition loss for counting, density map estimation and localization in dense crowds. arXiv:1808.01050 [cs] (2018)
We acknowledge support from National Center of Big Data and Cloud Computing (NCBC) and Higher Education Commission (HEC) of Pakistan for conducting this research.
© 2022 The Author(s)
Rizvi, S.Z., Farooq, M.U., Raza, R.H. (2022). Performance Comparison of Deep Residual Networks-Based Super Resolution Algorithms Using Thermal Images: Case Study of Crowd Counting. In: Biele, C., Kacprzyk, J., Kopeć, W., Owsiński, J.W., Romanowski, A., Sikorski, M. (eds) Digital Interaction and Machine Intelligence. MIDI 2021. Lecture Notes in Networks and Systems, vol 440. Springer, Cham. https://doi.org/10.1007/978-3-031-11432-8_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-11431-1
Online ISBN: 978-3-031-11432-8