Abstract
We present a triple-stream DBN model (T_AsyDBN) for audio-visual emotion recognition, in which the two audio feature streams are state-synchronous with each other, while they are asynchronous with the visual feature stream within controllable constraints. MFCC features and the principal component analysis (PCA) coefficients of local prosodic features are used for the audio streams. For the visual stream, 2D facial features as well as 3D facial animation unit features are defined and concatenated, and the feature dimensions are reduced by PCA. Emotion recognition experiments on the eNTERFACE'05 database show that, by adjusting the asynchrony constraint, the proposed T_AsyDBN model obtains an 18.73% higher correct recognition rate than the traditional multi-stream state-synchronous HMM (MSHMM), and 10.21% higher than the two-stream asynchronous DBN model (Asy_DBN).
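As a minimal illustration of the visual-stream preparation described above, the per-frame 2D facial features and 3D facial animation unit features would be concatenated and then projected to a lower dimension with PCA. The sketch below is not the authors' code; it assumes scikit-learn's PCA and arbitrary feature dimensions purely for demonstration.

```python
# Illustrative sketch only: concatenate 2D and 3D facial features per frame
# and reduce the dimensionality with PCA. Feature sizes and the number of
# retained components are assumptions, not values from the paper.
import numpy as np
from sklearn.decomposition import PCA

def build_visual_stream(feat_2d, feat_3d, n_components=12):
    """feat_2d: (T, D2) per-frame 2D facial features,
    feat_3d: (T, D3) per-frame 3D facial animation unit features.
    Returns a (T, n_components) visual observation sequence."""
    concatenated = np.hstack([feat_2d, feat_3d])   # (T, D2 + D3)
    pca = PCA(n_components=n_components)
    return pca.fit_transform(concatenated)

# Example with stand-in data: 100 frames, 20-dim 2D + 16-dim 3D features.
visual_obs = build_visual_stream(np.random.randn(100, 20),
                                 np.random.randn(100, 16))
print(visual_obs.shape)  # (100, 12)
```

The resulting low-dimensional sequence would serve as the observation stream that is allowed to drift asynchronously, within the controllable constraint, against the two synchronous audio streams.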