Abstract
Multimodal sentiment analysis (MSA) has been widely investigated in both computer vision and natural language processing. However, learning from imperfect data, especially data with missing values, remains challenging and far from solved, even though such data are ubiquitous in the real world. Although previous works achieve promising performance by exploiting low-rank structures of the fused features, they consider only the first-order statistics of the temporal dynamics. To this end, we propose a novel network architecture, termed Time Product Fusion Network (TPFN), which takes into account high-order statistics over both the modalities and the temporal dynamics. We construct the fused features by taking outer products along adjacent time steps, so that richer cross-modal and temporal interactions are captured. In addition, we claim that the low-rank structures can be obtained by regularizing the Frobenius norm of the latent factors instead of the fused features themselves. Experiments on the CMU-MOSI and CMU-MOSEI datasets show that TPFN competes with state-of-the-art approaches to multimodal sentiment analysis under both random and structured missing values.
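The two core ideas of the abstract can be sketched in a few lines of NumPy. This is a rough illustration under our own assumptions, not the authors' implementation: `outer_product_fusion` and the toy shapes are hypothetical, and the Frobenius-norm term is the standard surrogate used in matrix factorization rather than TPFN's exact loss.

```python
import numpy as np

def outer_product_fusion(x_t, x_t1):
    """Fuse features of two adjacent time steps via their outer product.

    Appending a constant 1 to each vector (a common trick in tensor
    fusion) keeps the unimodal terms alongside the pairwise
    (second-order) interaction terms.
    """
    xa = np.concatenate([x_t, [1.0]])
    xb = np.concatenate([x_t1, [1.0]])
    return np.outer(xa, xb)  # shape (d+1, d+1): second-order statistics

# Toy sequence: 4 time steps of 3-dim per-step features (hypothetical sizes)
rng = np.random.default_rng(0)
seq = rng.standard_normal((4, 3))

# Fuse every pair of adjacent time steps
fused = [outer_product_fusion(seq[t], seq[t + 1]) for t in range(len(seq) - 1)]
print(len(fused), fused[0].shape)  # 3 pairs, each a (4, 4) interaction matrix

# Low-rank structure via latent factors: rather than penalizing the
# nuclear norm of a fused matrix M directly, factor M ~ U @ V.T and
# penalize ||U||_F^2 + ||V||_F^2, a standard upper-bound surrogate
# for the nuclear norm of U @ V.T.
U = rng.standard_normal((4, 2))
V = rng.standard_normal((4, 2))
reg = 0.5 * (np.sum(U ** 2) + np.sum(V ** 2))
```

The latent-factor penalty avoids ever forming or decomposing the full fused tensor, which is what makes the low-rank regularization cheap.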
B. Li and C. Li—Equal Contribution.
Notes
1. Without ambiguity, we also use the notion of CP-rank to denote the number of rank-1 factors used in a matrix/tensor approximation.
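As a concrete illustration of this usage (a NumPy sketch of ours, not from the paper): a matrix assembled from two rank-1 outer products has CP-rank 2, which for matrices coincides with the ordinary matrix rank.

```python
import numpy as np

# M = a1 (x) b1 + a2 (x) b2: a sum of two rank-1 factors
a1, b1 = np.array([1.0, 2.0, 3.0]), np.array([1.0, 0.0, -1.0])
a2, b2 = np.array([0.0, 1.0, 1.0]), np.array([2.0, 1.0, 0.0])
M = np.outer(a1, b1) + np.outer(a2, b2)

# Two rank-1 factors suffice, so the (CP-)rank is 2
print(np.linalg.matrix_rank(M))  # 2
```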
Acknowledgment
Binghua and Chao contributed equally. We thank our colleagues Dr. Ming Hou and Zihao Huang for discussions that greatly improved the manuscript. This work was partially supported by the National Key R&D Program of China (No. 2017YFE0129700), the National Natural Science Foundation of China (No. 61673224), and the Tianjin Natural Science Foundation for Distinguished Young Scholars (No. 18JCJQJC46100). This work was also supported by JSPS KAKENHI (Grant Nos. 20H04249, 20H04208, 20K19875).
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Li, B., Li, C., Duan, F., Zheng, N., Zhao, Q. (2020). TPFN: Applying Outer Product Along Time to Multimodal Sentiment Analysis Fusion on Incomplete Data. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, JM. (eds) Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science(), vol 12369. Springer, Cham. https://doi.org/10.1007/978-3-030-58586-0_26
Print ISBN: 978-3-030-58585-3
Online ISBN: 978-3-030-58586-0