International Journal of Computer Vision, Volume 126, Issue 2–4, pp 440–456

Deep Multimodal Fusion: A Hybrid Approach

  • Mohamed R. Amer (Email author)
  • Timothy Shields
  • Behjat Siddiquie
  • Amir Tamrakar
  • Ajay Divakaran
  • Sek Chai


We propose a novel hybrid model that exploits the strength of discriminative classifiers along with the representational power of generative models. Our focus is on detecting multimodal events in time-varying sequences as well as generating missing data in any of the modalities. Discriminative classifiers have been shown to achieve higher performance than the corresponding generative likelihood-based classifiers. Generative models, on the other hand, learn a rich informative space that allows for data generation and joint feature representation, which discriminative models lack. We propose a new model that jointly optimizes the representation space using a hybrid energy function. We employ a model based on Restricted Boltzmann Machines (RBMs) to learn a shared representation across multiple modalities with time-varying data. The Conditional RBM (CRBM) is an extension of the RBM that accounts for short-term temporal phenomena. Our hybrid model augments CRBMs with a discriminative component for classification; to this end we propose a novel Multimodal Discriminative CRBM (MMDCRBM) model. First, we train the MMDCRBM on labeled data by training each modality separately, followed by a fusion layer. Second, we exploit the generative capability of the MMDCRBM by activating the trained model with a specific label so that it generates the lower-level data that most closely matches the actual input. We evaluate our approach on the ChaLearn dataset (audio-mocap), the Tower Game dataset (mocap-mocap), and three multimodal toy datasets. We report classification accuracy, generation accuracy, and localization accuracy, and demonstrate superiority over state-of-the-art methods.
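The discriminative-CRBM idea described above — hidden units conditioned on the current frame, a short history of past frames, and a label unit, with classification by comparing free energies across labels — can be sketched very loosely as follows. This is a minimal illustrative toy, not the paper's implementation; all class names, variable names, and shapes (`DiscriminativeCRBM`, `n_history`, the weight matrices `W`, `U`, `A`, `B`) are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)


class DiscriminativeCRBM:
    """Toy sketch of a discriminative conditional RBM (single modality).

    Hidden units see the current visible frame v_t, a flattened history
    of past frames (via conditional biases), and a one-hot label.
    Classification picks the label with the lowest free energy.
    """

    def __init__(self, n_vis, n_hid, n_labels, n_history):
        s = 0.01
        self.W = s * rng.standard_normal((n_vis, n_hid))             # visible-hidden weights
        self.U = s * rng.standard_normal((n_labels, n_hid))          # label-hidden weights
        self.A = s * rng.standard_normal((n_vis * n_history, n_vis)) # history -> visible bias
        self.B = s * rng.standard_normal((n_vis * n_history, n_hid)) # history -> hidden bias
        self.b_v = np.zeros(n_vis)
        self.b_h = np.zeros(n_hid)
        self.n_labels = n_labels

    def free_energy(self, v, hist, y):
        """Free energy of (v, label y) given the flattened history vector."""
        onehot = np.eye(self.n_labels)[y]
        b_h = self.b_h + hist @ self.B          # history-conditioned hidden bias
        b_v = self.b_v + hist @ self.A          # history-conditioned visible bias
        pre = v @ self.W + onehot @ self.U + b_h
        # softplus term: sum over hidden units of log(1 + exp(pre))
        return -(v @ b_v) - np.sum(np.logaddexp(0.0, pre))

    def classify(self, v, hist):
        """Return the label whose free energy is lowest."""
        energies = [self.free_energy(v, hist, y) for y in range(self.n_labels)]
        return int(np.argmin(energies))


# Usage with arbitrary data: 4-dim frames, a 2-frame history, 3 labels.
model = DiscriminativeCRBM(n_vis=4, n_hid=8, n_labels=3, n_history=2)
v = rng.standard_normal(4)
hist = rng.standard_normal(8)  # two past frames, flattened
print(model.classify(v, hist))
```

In the paper's full model each modality has its own such component, and a fusion layer on top learns the shared representation; training the energy function with contrastive divergence plus the discriminative term is omitted here for brevity.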


Keywords: Deep learning · Conditional Restricted Boltzmann Machines · Hybrid generative/discriminative models · Multimodal fusion · Gesture recognition · Social interaction modeling



We would like to thank Dr. Natalia Neverova for providing the feature-preprocessing code for the ChaLearn dataset, and Dr. Graham Taylor for his insightful feedback and discussions. This work is supported by DARPA W911NF-12-C-0001 and the Air Force Research Laboratory (AFRL). The views, opinions and/or findings expressed are those of the authors and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.



Copyright information

© Springer Science+Business Media New York 2017

Authors and Affiliations

  • Mohamed R. Amer (1) (Email author)
  • Timothy Shields (1)
  • Behjat Siddiquie (1)
  • Amir Tamrakar (1)
  • Ajay Divakaran (1)
  • Sek Chai (1)

  1. SRI International, Princeton, USA
