
Perceiver Hopfield Pooling for Dynamic Multi-modal and Multi-instance Fusion


Part of the Lecture Notes in Computer Science book series (LNCS, volume 13529)


Abstract

Deep network architectures are usually built on domain-specific assumptions and are specialized to the modalities under consideration. The same holds for multi-modal networks, which typically rely on modality-specific subnetworks. In this paper, we introduce a novel dynamic multi-modal and multi-instance (MM-MI) network based on the Perceiver and Hopfield pooling that learns to fuse its inputs intrinsically. We further introduce a novel composite dataset for evaluating MM-MI problems. We show that our proposed architecture outperforms the late-fusion baseline in all multi-modal setups by more than 40% accuracy on noisy data. Our simple, generally applicable, yet efficient architecture is a generalized approach to data fusion with high potential for future applications.


Keywords

  • Perceiver
  • Hopfield pooling
  • Attention
  • Data fusion
  • Multi-modal
  • Multi-instance
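The fusion idea named in the abstract and keywords — Perceiver-style cross-attention from a small latent array onto the concatenated tokens of all modalities, followed by Hopfield-style attention pooling — can be sketched as follows. This is a minimal, hypothetical illustration with toy dimensions and random weights; the paper's actual architecture, weight sharing, and hyperparameters are not specified on this page, so every dimension and function name below is an assumption.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(latents, tokens, rng):
    # Perceiver-style cross-attention: a small, fixed-size latent array
    # attends to the concatenated multi-modal token set (toy projections).
    d = latents.shape[-1]
    Wq = rng.standard_normal((d, d)) / np.sqrt(d)
    Wk = rng.standard_normal((tokens.shape[-1], d)) / np.sqrt(d)
    Wv = rng.standard_normal((tokens.shape[-1], d)) / np.sqrt(d)
    q, k, v = latents @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(q @ k.T / np.sqrt(d))     # (n_latents, n_tokens)
    return attn @ v                          # (n_latents, d)

def hopfield_pooling(states, rng, beta=2.0):
    # Hopfield-pooling-style readout: a query (learned in practice,
    # random here) retrieves a fixed-size summary from the latent states.
    d = states.shape[-1]
    query = rng.standard_normal((1, d)) / np.sqrt(d)
    attn = softmax(beta * (query @ states.T))
    return attn @ states                     # (1, d) pooled representation

rng = np.random.default_rng(0)
# Two modalities with different token counts, shared embedding dim of 32
image_tokens = rng.standard_normal((49, 32))  # e.g. 7x7 patch features
audio_tokens = rng.standard_normal((20, 32))  # e.g. spectrogram frames
tokens = np.concatenate([image_tokens, audio_tokens], axis=0)

latents = rng.standard_normal((8, 32))        # small latent array
latents = cross_attention(latents, tokens, rng)
pooled = hopfield_pooling(latents, rng)
print(pooled.shape)  # (1, 32)
```

Because the latent array and the pooled output have fixed sizes regardless of how many tokens each modality contributes, a scheme like this can in principle absorb a varying number of modalities and instances, which is the "dynamic MM-MI" property the abstract claims.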





Corresponding author: Dominik Rößle.


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Rößle, D., Cremers, D., Schön, T. (2022). Perceiver Hopfield Pooling for Dynamic Multi-modal and Multi-instance Fusion. In: Pimenidis, E., Angelov, P., Jayne, C., Papaleonidas, A., Aydin, M. (eds) Artificial Neural Networks and Machine Learning – ICANN 2022. ICANN 2022. Lecture Notes in Computer Science, vol 13529. Springer, Cham.


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-15918-3

  • Online ISBN: 978-3-031-15919-0

  • eBook Packages: Computer Science (R0)