Investigating Effectiveness of Linguistic Features Based on Speech Recognition for Storytelling Skill Assessment
This paper investigates the effectiveness of linguistic features based on speech recognition for storytelling skill assessment in group conversations. A multimodal data corpus, including the skill scores of storytellers, is used for this study. Three kinds of automatic speech recognition (ASR) results are compared in terms of their contribution to the skill assessment task. A regression model to predict the skill is trained by fusing the linguistic features with nonverbal features, including utterance length, prosody, gaze, and head and hand gestures. Experimental results show that the mean regression accuracy for storytelling skills with linguistic features based on ASR output at a 49% recognition rate \((R^2 = 0.24)\) improves on the model using only nonverbal features \((R^2 = 0.17)\) by 0.07 points. We conclude that features extracted from text contribute to the skill assessment task even though the ASR results contained a considerable number of errors.
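The fusion approach summarized above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes scikit-learn, uses ridge regression with cross-validated \(R^2\) scoring as a stand-in for the paper's regression model, and the feature matrices and dimensionalities are hypothetical placeholders.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Hypothetical feature matrices per storyteller: nonverbal features
# (utterance length, prosody, gaze, head/hand gestures) and linguistic
# features extracted from ASR transcripts. Random data for illustration.
rng = np.random.default_rng(0)
n_storytellers = 30
X_nonverbal = rng.normal(size=(n_storytellers, 8))
X_linguistic = rng.normal(size=(n_storytellers, 12))
y_skill = rng.normal(size=n_storytellers)  # annotated skill scores

# Early (feature-level) fusion: concatenate the two modalities,
# then train a single regression model on the fused vector.
X_fused = np.hstack([X_nonverbal, X_linguistic])
model = Ridge(alpha=1.0)
r2_scores = cross_val_score(model, X_fused, y_skill, cv=5, scoring="r2")
print("mean cross-validated R^2:", r2_scores.mean())
```

With real features, comparing this fused model's mean \(R^2\) against a model trained on `X_nonverbal` alone reproduces the kind of ablation reported in the abstract.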
Keywords: Storytelling skill assessment · Multimodal interaction · Automatic linguistic analysis
This work was performed under the Research Program of “Dynamic Alliance for Open Innovation Bridging Human, Environment and Materials” in “Network Joint Research Center for Materials and Devices” and Japan Society for the Promotion of Science (JSPS) KAKENHI (15K00300, 15H02746).