Combining User Modeling and Machine Learning to Predict Users’ Multimodal Integration Patterns

  • Xiao Huang
  • Sharon Oviatt
  • Rebecca Lunsford
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4299)


Temporal as well as semantic constraints on fusion are at the heart of multimodal system processing. The goal of the present work is to develop user-adaptive temporal thresholds with improved performance over state-of-the-art fixed ones, by leveraging both empirical user modeling and machine learning techniques to handle the large individual differences in users’ multimodal integration patterns. Using simple Naive Bayes learning methods and a leave-one-out training strategy, our model correctly predicted 88% of users’ mixed speech and pen signal input as either unimodal or multimodal, and 91% of their multimodal input as either sequentially or simultaneously integrated. In addition to predicting a user’s multimodal pattern in advance of receiving input, predictive accuracies were also evaluated after end-point detection of the first signal, which is the earliest time at which a speech/pen multimodal system must make a fusion decision. This system-centered metric yielded accuracies of 90% and 92%, respectively, for classification of unimodal/multimodal and sequential/simultaneous input patterns. In addition, empirical modeling revealed a .92 correlation between users’ multimodal integration pattern and their likelihood of interacting multimodally, which may account for the superior learning obtained when training over heterogeneous user data rather than data partitioned by user subtype. Finally, in large part due to guidance from user modeling, the techniques reported here required as few as 15 samples to predict a “surprise” user’s input patterns.
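The paper itself provides no code; as a rough illustration of the approach described above, the sketch below shows a categorical Naive Bayes classifier with Laplace smoothing, evaluated leave-one-out, predicting whether a user's next construction is sequentially (SEQ) or simultaneously (SIM) integrated. The feature set, the toy data, and all function names are assumptions for illustration only, and for brevity this sketch holds out one sample at a time, whereas the paper's leave-one-out strategy holds out an entire "surprise" user.

# Minimal sketch (not the authors' implementation) of categorical Naive
# Bayes with Laplace smoothing, evaluated leave-one-out. Features, labels,
# and the toy data are illustrative assumptions only.
from collections import defaultdict

def train_nb(samples, labels, alpha=1.0):
    """Fit class priors and per-feature value counts with smoothing alpha."""
    classes = sorted(set(labels))
    prior = {c: labels.count(c) / len(labels) for c in classes}
    # counts[c][i][v] = number of training samples of class c whose
    # i-th feature takes the value v
    counts = {c: defaultdict(lambda: defaultdict(float)) for c in classes}
    for x, y in zip(samples, labels):
        for i, v in enumerate(x):
            counts[y][i][v] += 1.0
    return prior, counts, alpha

def predict_nb(model, x):
    """Return the class c maximizing P(c) * prod_i P(x_i | c)."""
    prior, counts, alpha = model
    best_c, best_p = None, -1.0
    for c, p in prior.items():
        for i, v in enumerate(x):
            n_c = sum(counts[c][i].values())    # class-c samples for feature i
            n_vals = len(counts[c][i]) or 1     # distinct values seen
            p *= (counts[c][i][v] + alpha) / (n_c + alpha * n_vals)
        if p > best_p:
            best_c, best_p = c, p
    return best_c

# Hypothetical per-command features: (user's dominant pattern so far,
# inter-signal lag bucket observable after the first signal's end-point).
samples = [("SIM", "short"), ("SIM", "short"), ("SIM", "short"),
           ("SEQ", "long"), ("SEQ", "long"), ("SEQ", "short")]
labels = ["SIM", "SIM", "SIM", "SEQ", "SEQ", "SEQ"]

# Leave-one-out: hold out each sample in turn, train on the rest.
correct = 0
for k in range(len(samples)):
    model = train_nb(samples[:k] + samples[k + 1:], labels[:k] + labels[k + 1:])
    correct += predict_nb(model, samples[k]) == labels[k]
print(f"leave-one-out accuracy: {correct / len(samples):.2f}")

A classifier of this shape can be queried at both decision points the abstract mentions: before any input arrives (using only history-based features) and again after the first signal's end-point detection, once lag-dependent features become observable.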


Keywords: Training Sample, User Modeling, Input Pattern, Machine Learning Technique, Integration Pattern





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Xiao Huang (1)
  • Sharon Oviatt (1, 2)
  • Rebecca Lunsford (1, 2)
  1. Natural Interaction Systems, Portland
  2. Center for Human-Computer Communication, Computer Science Department, Oregon Health and Science University, Beaverton
