
Transformer fusion and histogram layer multispectral pedestrian detection network

  • Original Paper
  • Published in Signal, Image and Video Processing

Abstract

Owing to the complementarity of multispectral data, pedestrian detection performance can be significantly improved, so multispectral pedestrian detection has received considerable attention from the research community. However, existing algorithms still suffer from problems such as insufficient information exchange between the two streams and a lack of network designs tailored to the characteristics of each image source. In practical applications, different models are typically deployed for day and night, and the day and night models can simply be switched at inference time. We therefore propose two subnetworks, FTHd (Fusion Transformer Histogram day) and FTn (Fusion Transformer night), tailored to the characteristics of daytime and nighttime images. Daytime RGB images have pronounced texture features, so we first add a histogram layer to the input branch of the detection network and then insert a CFT (Cross-Modal Fusion Transformer) module to fuse features interactively; by leveraging the Transformer's self-attention, the network naturally performs both intra-modal and inter-modal fusion. At night, visible light is very weak and thermal images play the key role; since texture information is scarce, a complex network structure is unnecessary, so we merge the two streams into one to reduce computation and again add a CFT module for feature fusion. Compared with baseline methods, the proposed FTHd and FTn achieve improved pedestrian detection accuracy.
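As an illustration of the daytime design, below is a minimal PyTorch sketch of a soft, differentiable histogram layer of the kind attached to the RGB input branch: each bin is a radial-basis function with a learnable center and width, and bin memberships are aggregated by average pooling so the layer can be trained end to end with the detector. The bin count, pooling window, and initialization here are illustrative assumptions, not the exact FTHd configuration.

```python
import torch
import torch.nn as nn


class HistogramLayer(nn.Module):
    """Soft, differentiable local histogram over feature maps (a sketch;
    hyperparameters are assumptions, not the paper's configuration)."""

    def __init__(self, in_channels: int, num_bins: int = 4, pool: int = 2):
        super().__init__()
        # One learnable (center, width) pair per channel-bin combination.
        self.centers = nn.Parameter(torch.linspace(-1.0, 1.0, num_bins).repeat(in_channels))
        self.widths = nn.Parameter(torch.ones(in_channels * num_bins))
        self.num_bins = num_bins
        self.pool = nn.AvgPool2d(pool)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) -> replicate each channel once per bin.
        x = x.repeat_interleave(self.num_bins, dim=1)
        mu = self.centers.view(1, -1, 1, 1)
        gamma = self.widths.view(1, -1, 1, 1)
        # A radial-basis "vote" gives each pixel a soft bin membership in (0, 1];
        # average pooling then acts as a normalized local bin count.
        votes = torch.exp(-(gamma ** 2) * (x - mu) ** 2)
        return self.pool(votes)  # (B, C * num_bins, H / pool, W / pool)
```

Because every operation is differentiable, the bin centers and widths can adapt to the texture statistics of daytime RGB features during training.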
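The cross-modal fusion step can likewise be sketched as joint self-attention over the concatenated token sequences of the two modalities, which is the core idea the abstract attributes to the CFT module: a single attention pass covers intra-modal and inter-modal interactions simultaneously. The head count, normalization placement, and MLP sizing below are assumptions for illustration, not the published architecture.

```python
import torch
import torch.nn as nn


class CrossModalFusionBlock(nn.Module):
    """Joint self-attention over RGB and thermal tokens (a sketch of the
    CFT idea; architectural details are illustrative assumptions)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, rgb: torch.Tensor, thermal: torch.Tensor):
        # Both inputs are (B, C, H, W) feature maps of identical shape.
        b, c, h, w = rgb.shape
        tokens = torch.cat(
            [rgb.flatten(2).transpose(1, 2), thermal.flatten(2).transpose(1, 2)], dim=1
        )  # (B, 2*H*W, C): RGB tokens followed by thermal tokens.
        x = self.norm1(tokens)
        # Self-attention over the joint sequence mixes features both within
        # and across modalities in one pass.
        tokens = tokens + self.attn(x, x, x, need_weights=False)[0]
        tokens = tokens + self.mlp(self.norm2(tokens))
        # Split the fused sequence back into per-modality feature maps.
        rgb_out, thermal_out = tokens.chunk(2, dim=1)

        def to_map(t: torch.Tensor) -> torch.Tensor:
            return t.transpose(1, 2).reshape(b, c, h, w)

        return to_map(rgb_out), to_map(thermal_out)
```

In a dual-stream daytime detector such a block would sit between corresponding backbone stages so each stream receives features enriched by the other; the exact placement in FTHd and FTn follows the designs described above.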


Data availability

The KAIST dataset can be accessed via https://github.com/SoonminHwang/rgbt-ped-detection, and the CVC-14 dataset via http://adas.cvc.uab.es/elektra/enigma-portfolio/cvc-14-visible-fir-day-night-pedestrian-sequence-dataset/.


Funding

Not applicable.

Author information

Contributions

DY and YZ conceived the idea. YZ and CF implemented the idea and wrote the main manuscript text. CF, HL, and CD prepared all figures and tables. DY and QL provided supervision. All authors reviewed the manuscript.

Corresponding author

Correspondence to Dongsheng Yang.

Ethics declarations

Conflict of interest

The authors declare no conflicts of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Zang, Y., Fu, C., Yang, D. et al. Transformer fusion and histogram layer multispectral pedestrian detection network. SIViP 17, 3545–3553 (2023). https://doi.org/10.1007/s11760-023-02579-y

