Abstract
Deep network architectures are usually built on domain-specific assumptions and are specialized to the modalities under consideration. This design convention carries over to multi-modal networks, which typically consist of modality-specific subnetworks. In this paper, we introduce a novel dynamic multi-modal and multi-instance (MM-MI) network based on the Perceiver and Hopfield pooling that learns the data fusion intrinsically. We further introduce a novel composite dataset for evaluating MM-MI problems. We show that our proposed architecture outperforms the late-fusion baseline in all multi-modal setups by more than 40% accuracy on noisy data. Our simple, generally applicable, yet efficient architecture is a novel, generalized approach to data fusion with high potential for future applications.
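To make the abstract's key idea concrete (shared, modality-agnostic processing instead of modality-specific subnetworks), the PyTorch sketch below illustrates the two named building blocks. It is not the authors' implementation: it is a minimal illustration, under assumptions, of a Perceiver-style cross-attention step that maps a flat token sequence from all modalities and instances into a small latent array, followed by a Hopfield-pooling step (attention with a learned query, in the spirit of Ramsauer et al., "Hopfield Networks is All You Need") that condenses the latents into one vector for classification. All class names, layer sizes, and the single-layer depth are illustrative choices, not details from the paper.

```python
# Minimal sketch (not the authors' code) of Perceiver-style cross-attention
# followed by Hopfield pooling for multi-modal, multi-instance fusion.
import torch
import torch.nn as nn


class HopfieldPooling(nn.Module):
    """Pools a set of latents into one vector via a learned query."""

    def __init__(self, dim: int, beta: float = 1.0):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.beta = beta  # inverse temperature of the attention softmax

    def forward(self, latents: torch.Tensor) -> torch.Tensor:
        # latents: (batch, num_latents, dim)
        attn = torch.softmax(
            self.beta * self.query @ latents.transpose(1, 2), dim=-1)
        return (attn @ latents).squeeze(1)  # (batch, dim)


class PerceiverHopfield(nn.Module):
    """Illustrative fusion model: cross-attention into latents, then pooling."""

    def __init__(self, input_dim: int, dim: int = 128,
                 num_latents: int = 32, num_classes: int = 10):
        super().__init__()
        self.embed = nn.Linear(input_dim, dim)  # shared token embedding
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=4,
                                                batch_first=True)
        self.pool = HopfieldPooling(dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, input_dim); all modalities and
        # instances are flattened into one sequence, so no fusion is wired in.
        x = self.embed(tokens)
        q = self.latents.expand(x.size(0), -1, -1)
        latents, _ = self.cross_attn(q, x, x)  # Perceiver cross-attention
        return self.head(self.pool(latents))


# Usage: two "modalities" of 8 tokens each, concatenated along the token axis.
model = PerceiverHopfield(input_dim=64)
tokens = torch.cat([torch.randn(2, 8, 64), torch.randn(2, 8, 64)], dim=1)
logits = model(tokens)  # (2, 10)
```

Because the token sequence carries no modality-specific wiring, adding a modality or instance only lengthens the sequence; which tokens contribute to the fused representation is left entirely to the learned attention, which is the property the abstract describes as intrinsic data fusion.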
Keywords
- Perceiver
- Hopfield pooling
- Attention
- Data fusion
- Multi-modal
- Multi-instance
Cite this paper
Rößle, D., Cremers, D., Schön, T. (2022). Perceiver Hopfield Pooling for Dynamic Multi-modal and Multi-instance Fusion. In: Pimenidis, E., Angelov, P., Jayne, C., Papaleonidas, A., Aydin, M. (eds) Artificial Neural Networks and Machine Learning – ICANN 2022. ICANN 2022. Lecture Notes in Computer Science, vol 13529. Springer, Cham. https://doi.org/10.1007/978-3-031-15919-0_50