Abstract
In many real-world settings, the external environment is perceived through multimodal information such as visual, radar, and lidar data. This naturally motivates us to exploit interactions across modalities and to integrate information from multiple sources using the limited labels of a multimodal dataset, i.e., as a semi-supervised task. A challenging issue in multimodal semi-supervised learning is the complicated correlations among pairwise modalities. In this paper, we propose a hypergraph variational autoencoder (HVAE) that mines high-order interactions in multimodal data and introduces extra prior knowledge for inferring a fused multimodal representation. On the one hand, the hypergraph structure can represent high-order data correlations in multimodal scenes; on the other hand, a prior distribution is introduced through mask-based variational inference to enhance the multimodal characterization. Moreover, the variational lower bound is leveraged to support semi-supervised learning. We conduct experiments on the semi-supervised visual object recognition task, and extensive results on two datasets demonstrate the superiority of our method over existing baselines.
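To make the notion of "high-order correlation" concrete: unlike an ordinary graph edge, a hyperedge can join any number of nodes, and propagation over the incidence matrix mixes all members of a hyperedge at once. The sketch below shows a single hypergraph convolution layer in the common HGNN form, X' = Dv^{-1/2} H W De^{-1} H^T Dv^{-1/2} X Θ. It is only an illustration of the hypergraph machinery the abstract refers to, not the authors' HVAE model; all variable names are our own.

```python
import numpy as np

def hypergraph_conv(X, H, Theta, edge_w=None):
    """One hypergraph convolution layer (HGNN-style propagation).

    X      : (n_nodes, d_in) node features
    H      : (n_nodes, n_edges) incidence matrix, H[v, e] = 1 if node v
             belongs to hyperedge e
    Theta  : (d_in, d_out) learnable weight matrix
    edge_w : optional per-hyperedge weights (defaults to all ones)
    """
    w = edge_w if edge_w is not None else np.ones(H.shape[1])
    dv = H @ w              # weighted node degrees
    de = H.sum(axis=0)      # hyperedge degrees (nodes per hyperedge)
    # H W De^{-1} H^T : nodes sharing a hyperedge exchange information
    A = (H * w / de) @ H.T
    # symmetric node-degree normalization Dv^{-1/2} ... Dv^{-1/2}
    A = A / np.sqrt(np.outer(dv, dv))
    return A @ X @ Theta

# Toy example: 4 nodes, 2 hyperedges of 3 nodes each ({0,1,2} and {1,2,3}).
H = np.array([[1, 0],
              [1, 1],
              [1, 1],
              [0, 1]], dtype=float)
X = np.eye(4)       # one-hot features so the propagation matrix is visible
Theta = np.eye(4)   # identity weights, for inspection only
out = hypergraph_conv(X, H, Theta)
```

With identity features and weights, `out` is just the normalized propagation matrix: nodes 0 and 3 never share a hyperedge, so `out[0, 3]` is zero, while nodes within a hyperedge receive nonzero mass from each other in one step.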
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Liu, J., Du, X., Li, Y., Hu, W. (2022). Hypergraph Variational Autoencoder for Multimodal Semi-supervised Representation Learning. In: Pimenidis, E., Angelov, P., Jayne, C., Papaleonidas, A., Aydin, M. (eds) Artificial Neural Networks and Machine Learning – ICANN 2022. ICANN 2022. Lecture Notes in Computer Science, vol 13532. Springer, Cham. https://doi.org/10.1007/978-3-031-15937-4_33
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-15936-7
Online ISBN: 978-3-031-15937-4
eBook Packages: Computer Science (R0)