We propose a novel hybrid model that combines the strengths of discriminative classifiers with the representational power of generative models. Our focus is on detecting multimodal events in time-varying sequences and on generating missing data in any of the modalities. Discriminative classifiers have been shown to achieve higher performance than the corresponding generative likelihood-based classifiers. Generative models, on the other hand, learn a rich, informative space that supports data generation and joint feature representation, capabilities that discriminative models lack. We propose a new model that jointly optimizes the representation space using a hybrid energy function. We employ a Restricted Boltzmann Machine (RBM)-based model to learn a shared representation across multiple modalities with time-varying data. The Conditional RBM (CRBM) is an extension of the RBM that captures short-term temporal dependencies. Our hybrid model augments CRBMs with a discriminative component for classification. To this end, we propose a novel Multimodal Discriminative CRBM (MMDCRBM) model. First, we train the MMDCRBM on labeled data by training each modality separately, followed by a fusion layer. Second, we exploit the generative capability of the MMDCRBM by activating the trained model so as to generate the lower-level data corresponding to a specific label that closely matches the actual input. We evaluate our approach on the ChaLearn dataset (audio-mocap), the Tower Game dataset (mocap-mocap), and three multimodal toy datasets. We report classification accuracy, generation accuracy, and localization accuracy, and demonstrate superiority over state-of-the-art methods.
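The hybrid energy described above can be illustrated with a minimal sketch of a discriminative CRBM: a label layer couples to the hidden units alongside the visible frame, the recent history drives dynamic biases, and the label posterior follows from a softmax over negative free energies. All dimensions, parameter names, and the random initialization below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: visible frame, hidden units, labels,
# and a flattened window of past frames (the CRBM "history").
n_vis, n_hid, n_lab, n_hist = 5, 8, 3, 10

# Parameters (small random init for the sketch)
W = rng.normal(0, 0.1, (n_vis, n_hid))    # visible-hidden weights
U = rng.normal(0, 0.1, (n_lab, n_hid))    # label-hidden weights
A = rng.normal(0, 0.1, (n_hist, n_vis))   # autoregressive history -> visible
B = rng.normal(0, 0.1, (n_hist, n_hid))   # history -> hidden
b_v = np.zeros(n_vis)
b_h = np.zeros(n_hid)
b_y = np.zeros(n_lab)

def free_energy(v, y_onehot, hist):
    """Free energy of a (visible, label) pair conditioned on history.

    Hidden units are summed out analytically, which is what makes the
    discriminative term below tractable.
    """
    dyn_v = b_v + hist @ A                 # dynamic visible bias
    dyn_h = b_h + hist @ B                 # dynamic hidden bias
    pre = dyn_h + v @ W + y_onehot @ U     # hidden pre-activations
    return -(v @ dyn_v) - (y_onehot @ b_y) - np.sum(np.logaddexp(0.0, pre))

def label_posterior(v, hist):
    """p(y | v, history): softmax over negative free energies per label."""
    fes = np.array(
        [free_energy(v, np.eye(n_lab)[y], hist) for y in range(n_lab)]
    )
    logits = -fes
    logits -= logits.max()                 # numerical stability
    p = np.exp(logits)
    return p / p.sum()

v = rng.normal(size=n_vis)
hist = rng.normal(size=n_hist)
posterior = label_posterior(v, hist)
```

A hybrid objective would then weight the generative likelihood (trained, e.g., with contrastive divergence) against the discriminative cross-entropy of this posterior; generation for a given label corresponds to clamping the label units and sampling the visible layer.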
We would like to thank Dr. Natalia Neverova for providing the feature preprocessing code for the ChaLearn dataset, and Dr. Graham Taylor for his insightful feedback and discussions. This work is supported by DARPA W911NF-12-C-0001 and the Air Force Research Laboratory (AFRL). The views, opinions, and/or findings expressed are those of the authors and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.
Mohamed R. Amer and Timothy Shields have contributed equally to this work.
Communicated by Cordelia Schmid and V. Lepetit.
Amer, M.R., Shields, T., Siddiquie, B. et al. Deep Multimodal Fusion: A Hybrid Approach. Int J Comput Vis 126, 440–456 (2018). https://doi.org/10.1007/s11263-017-0997-7
- Deep learning
- Conditional Restricted Boltzmann Machines
- Multimodal fusion
- Gesture recognition
- Social interaction modeling