Learning to Exploit Multiple Vision Modalities by Using Grafted Networks

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12361)


Novel vision sensors such as thermal, hyperspectral, polarization, and event cameras provide information that is not available from conventional intensity cameras. An obstacle to using these sensors with current powerful deep neural networks is the lack of large labeled training datasets. This paper proposes a Network Grafting Algorithm (NGA), in which a new front end network driven by unconventional visual inputs replaces the front end network of a pretrained deep network that processes intensity frames. The self-supervised training uses only synchronously recorded intensity frames and novel sensor data to maximize feature similarity between the pretrained network and the grafted network. We show that the enhanced grafted network reaches average precision (AP\(_{50}\)) scores competitive with those of the pretrained network on an object detection task using thermal and event camera datasets, with no increase in inference costs. In particular, the grafted network driven by thermal frames showed a relative improvement of 49.11% over the use of intensity frames. The grafted front end has only 5–8% of the total parameters and can be trained in a few hours on a single GPU, equivalent to 5% of the time that would be needed to train the entire object detector from labeled data. NGA allows new vision sensors to capitalize on previously pretrained powerful deep models, saving on training cost and widening the range of applications for novel sensors.
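
The grafting idea described in the abstract can be illustrated with a toy sketch. The abstract does not specify the similarity measure or the network architecture, so everything below is an assumption made for illustration: the front ends are single linear layers, the feature similarity is measured by mean squared error, and the paired sensor data is simulated. All names (`W_pre`, `W_graft`, `feature_loss`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions, not taken from the paper.
n_pairs, d_intensity, d_novel, d_feat = 256, 16, 12, 8

# Synchronously recorded pairs. The novel-sensor data is simulated here as a
# fixed linear view of the same scenes as the intensity frames (assumption).
intensity = rng.normal(size=(n_pairs, d_intensity))
novel = intensity @ rng.normal(size=(d_intensity, d_novel))

# Frozen pretrained front end (stand-in: one linear layer, never updated).
W_pre = rng.normal(size=(d_intensity, d_feat))
target_feat = intensity @ W_pre  # features the grafted front end should match

# Trainable grafted front end driven by the novel sensor input.
W_graft = np.zeros((d_novel, d_feat))


def feature_loss(W):
    """Mean squared error between grafted and pretrained features."""
    diff = novel @ W - target_feat
    return float(np.mean(diff ** 2))


# Step size 1/L, where L is the largest curvature of the quadratic loss,
# so that plain gradient descent cannot overshoot.
lipschitz = 2.0 * np.linalg.norm(novel, 2) ** 2 / target_feat.size
lr = 1.0 / lipschitz

initial_loss = feature_loss(W_graft)
for _ in range(500):
    diff = novel @ W_graft - target_feat
    grad = 2.0 * novel.T @ diff / diff.size  # gradient of the MSE w.r.t. W_graft
    W_graft -= lr * grad  # only the grafted front end is updated
final_loss = feature_loss(W_graft)

print(f"feature loss: {initial_loss:.3f} -> {final_loss:.3f}")
```

No labels appear anywhere in this loop: the only supervision signal is the feature output of the frozen front end on the paired intensity frames. After training, the remaining layers of the pretrained network would consume `novel @ W_graft` in place of the intensity features, so the size of the network run at inference is unchanged.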


Network Grafting Algorithm · Self-supervised learning · Thermal camera · Event-based vision · Object detection



This work was funded by the Swiss National Competence Center in Robotics (NCCR Robotics).




Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Institute of Neuroinformatics, University of Zürich and ETH Zürich, Zürich, Switzerland
