Deep Multimodal Fusion: A Hybrid Approach


We propose a novel hybrid model that exploits the strength of discriminative classifiers along with the representation power of generative models. Our focus is on detecting multimodal events in time varying sequences as well as generating missing data in any of the modalities. Discriminative classifiers have been shown to achieve higher performances than the corresponding generative likelihood-based classifiers. On the other hand, generative models learn a rich informative space which allows for data generation and joint feature representation that discriminative models lack. We propose a new model that jointly optimizes the representation space using a hybrid energy function. We employ a Restricted Boltzmann Machines (RBMs) based model to learn a shared representation across multiple modalities with time varying data. The Conditional RBMs (CRBMs) is an extension of the RBM model that takes into account short term temporal phenomena. The hybrid model involves augmenting CRBMs with a discriminative component for classification. For these purposes we propose a novel Multimodal Discriminative CRBMs (MMDCRBMs) model. First, we train the MMDCRBMs model using labeled data by training each modality, followed by training a fusion layer. Second, we exploit the generative capability of MMDCRBMs to activate the trained model so as to generate the lower-level data corresponding to the specific label that closely matches the actual input data. We evaluate our approach on ChaLearn dataset, audio-mocap, as well as the Tower Game dataset, mocap-mocap as well as three multimodal toy datasets. We report classification accuracy, generation accuracy, and localization accuracy and demonstrate its superiority compared to the state-of-the-art methods.

This is a preview of subscription content, log in to check access.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6


  1. Amer, M., Siddiquie, B., Khan, S., Divakaran, A., & Sawhney, H. (2014). Multimodal fusion using dynamic hybrid models. In WACV.

  2. Bengio, Y. (2009). Learning deep architectures for ai. In FTML.

  3. Camgoz, N., Kindiroglu, A., & Akarun, L. (2014). Gesture recognition using templatebased random forest classifiers. In ECCV.

  4. Chang, J. (2014). Nonparametric gesture labeling from multi-modal data. In ECCV-W.

  5. Chen, G., Clarke, D., Giuliani, M., Weikersdorfer, D., & Knoll, A. (2014). Multi-modality gesture detection and recognition with un-supervision, randomization and discrimination. In ECCV-W.

  6. Cox, S., Harvey, R., Lan, Y., & Newman, J. (2008). The challenge of multispeaker lip-reading. In AVSP.

  7. Druck, G., & McCallum, A. (2010). High-performance semi-supervised learning using discriminatively constrained generative models. In ICML.

  8. Escalera, S., Baro, X., Gonzalez, J., Bautista, M., Madadi, M., Reyes, M., Ponce, V., Escalante, H., Shotton, J., & Guyon, I. (2014). Chalearn looking at people challenge 2014: Dataset and results. In ECCV-W.

  9. Evangelidis, G., Singh, G., & Horaud, R. (2014). Continuous gesture recognition from articulated poses. In ECCV-W.

  10. Fujino, A., Ueda, N., & Saito, K. (2008). Semi-supervised learning for a hybrid generative/discriminative classifier based on the maximum entropy principle. In TPAMI.

  11. Garg, N., & Henderson, J. (2011). Temporal restricted Boltzmann machines for dependency parsing. In ACL.

  12. Glodek, M., et al. (2011). Multiple classifier systems for the classification of audio-visual emotional states. In ACII.

  13. Gurban, M., & Thiran, J. P. (2009). Information theoretic feature extraction for audio-visual speech recognition. IEEE Transactions on Signal Processing, 57, 4765–4776.

    MathSciNet  Article  Google Scholar 

  14. Hausler, C., & Susemihl, A. (2012). Temporal autoencoding restricted Boltzmann machine. In CoRR.

  15. Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. In NC.

  16. Hinton, G. E., Osindero, S., & Teh, Y. W. (2006). A fast learning algorithm for deep belief nets. In NC.

  17. Larochelle, H., & Bengio, Y. (2008). Classification using discriminative restricted Boltzmann machines. In ICML.

  18. Lewandowski, N. B., Bengio, Y., & Vincent, P. (2012). Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. In ICML.

  19. Li, X., Lee, T., & Liu, Y. (2011). Hybrid generative-discriminative classification using posterior divergence. In CVPR.

  20. Lucey, P., & Sridharan, S. (2006). Patch based representation of visual speech. In HCSnet workshop on the use of vision in human-computer interaction.

  21. Matthews, I., et al. (2002). Extraction of visual features for lipreading. In: TPAMI.

  22. Memisevic, R. & Hinton, G. E. (2007). Unsupervised learning of image transformations. In CVPR.

  23. Mohamed, A. R., & Hinton, G. E. (2009). Phone recognition using restricted Boltzmann machines. In ICASSP.

  24. Monnier, C., German, S., & Ost, A. (2014). A multi-scale boosted detector for efficient and robust gesture recognition. In ECCV-W.

  25. Neverova, N., Wolf, C., Taylor, G., & Nebout, F. (2014). Moddrop: Adaptive multi-modal gesture recognition. In PAMI.

  26. Neverova, N., Wolf, C., Taylor, G. W., & Nebout, F. (2014). Multi-scale deep learning for gesture detection and localization. In ECCV-W.

  27. Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., & Ng, A. (2011). Multimodal deep learning. In ICML.

  28. Niebles, J., Wang, H., & Fei-Fei, L. (2008). Unsupervised learning of human action categories using spatial-temporal words. IJCV, 79(3), 299–318.

    Article  Google Scholar 

  29. Papandreou, G., Katsamanis, A., Pitsikalis, V., & Maragos, P. (2009). Adaptive multimodal fusion by uncertainty compensation with application to audiovisual speech recognition. In TASLP.

  30. Patterson, E., et al. (2002). Cuave: A new audio-visual database for multimodal human-computer interface research. In ICASSP.

  31. Peng, X., Wang, L., & Cai, Z. (2014). Action and gesture temporal spotting with super vector representation. In ECCV-W.

  32. Perina, A., et al. (2012). Free energy score spaces: Using generative information in discriminative classifiers. In TPAMI.

  33. Pigou, L., Dieleman, S., & Kindermans, P. J. (2014). Sign language recognition using convolutional neural networks. In ECCV-W.

  34. Ramirez, G., Baltrusaitis, T., & Morency, L. P. (2011). Modeling latent discriminative dynamic of multi-dimensional affective signals. In ACII.

  35. Ranzato, M. A., et al. (2011). On deep generative models with applications to recognition. In CVPR.

  36. Rehg, J. M., et al. (2013). Decoding children’s social behavior. In CVPR.

  37. Salakhutdinov, R., & Hinton, G. E. (2006). Reducing the dimensionality of data with neural networks. In Science.

  38. Salter, D. A., Tamrakar, A., Behjat Siddiquie, M. R. A., Divakaran, A., Lande, B., & Mehri, D. (2015). The tower game dataset: A multimodal dataset for analyzing social interaction predicates. In ACII.

  39. Schuller, B., et al. (2011). Avec 2011—the first international audio visual emotion challenge. In ACII.

  40. Siddiquie, B., Khan, S., Divakaran, A., & Sawhney, H. (2013). Affect analysis in natural human interactions using joint hidden conditional random fields. In ICME.

  41. Sminchisescu, C., Kanaujia, A., & Metaxas, D. (2006). Learning joint top-down and bottom-up processes for 3d visual inference. In CVPR.

  42. Srivastava, N., & Salakhutdinov, R. (2012). Multimodal learning with deep Boltzmann machines. In NIPS.

  43. Sun, X., Lichtenauer, J., Valstar, M. F., Nijholt, A., & Pantic., M. (2011). A multimodal database for mimicry analysis. In ACII.

  44. Sutskever, I., & Hinton, G. E. (2007). Learning multilevel distributed representations for high-dimensional sequences. In AISTATS.

  45. Sutskever, I., Hinton, G., & Taylor, G. (2008). The recurrent temporal restricted Boltzmann machine. In NIPS.

  46. Taylor, G. W., et al. (2010). Dynamical binary latent variable models for 3d human pose tracking. In CVPR.

  47. Taylor, G. W., Hinton, G. E., & Roweis, S. T. (2011). Two distributed-state models for generating high-dimensional time series. Journal of Machine Learning Research, 12, 1025–1068.

    MathSciNet  MATH  Google Scholar 

  48. Wu, D. (2014). Deep dynamic neural networks for gesture segmentation and recognition. In ECCV-W.

  49. Zanfir, M., Leordeanu, M., & Sminchisescu, C. (2013). The moving pose: An efficient 3d kinematics descriptor for low-latency action recognition and detection. In ICCV.

  50. Zeiler, M. D., & Fergus, R. (2014). A multimodal database for mimicry analysis. In ECCV.

  51. Zeiler, M. D., Taylor, G. W., Sigal, L., Matthews, I., & Fergus, R. (2011). Facial expression transfer with input–output temporal restricted Boltzmann machines. In NIPS.

  52. Zhao, G., & Barnard, M. (2009). Lipreading with local spatiotemporal descriptors. Transactions of Multimedia, 11, 1254–1265.

    Article  Google Scholar 

Download references


We would like to thank Dr. Natalia Nevrova for providing the features preprocessing code for the ChaLearn dataset, and Dr. Graham Taylor for his insightful feedback and discussions. This work is supported by DARPA W911NF-12-C-0001 and the Air Force Research Laboratory (AFRL). The views, opinions and/or findings expressed are those of the authors and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.

Author information



Corresponding author

Correspondence to Mohamed R. Amer.

Additional information

Mohamed R. Amer and Timothy Shields have contributed equally to this work.

Communicated by Cordelia Schmid ,V. Lepetit.

Rights and permissions

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Amer, M.R., Shields, T., Siddiquie, B. et al. Deep Multimodal Fusion: A Hybrid Approach. Int J Comput Vis 126, 440–456 (2018).

Download citation


  • Deep learning
  • Conditional Restricted Boltzmann Machines
  • Hybrid
  • Generative
  • Discriminative
  • Multimodal fusion
  • Gesture recognition
  • Social interaction modeling