Multimodal Language Acquisition Based on Motor Learning and Interaction

  • Jonas Hörnstein
  • Lisa Gustavsson
  • José Santos-Victor
  • Francisco Lacerda
Part of the Studies in Computational Intelligence book series (SCI, volume 264)


This work presents a developmental and ecological approach to language acquisition in robots, rooted in the interaction between infants and their caregivers. We show that the signal caregivers direct to infants includes several cues that can facilitate language acquisition and reduce the need for preprogrammed linguistic structure. Moreover, infants also produce sounds, which enables richer types of interaction, such as imitation games, and the use of motor learning. By using a humanoid robot with embodied models of the infant’s ears, eyes, vocal tract, and memory functions, we can mimic the adult-infant interaction and take advantage of the inherent structure in the signal. Two experiments are presented, in which the robot learns a number of word-object associations and the articulatory target positions for a number of vowels.


Target Word, Recognition Rate, Motor Learning, Visual Object, Humanoid Robot
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Jonas Hörnstein (1)
  • Lisa Gustavsson (2)
  • José Santos-Victor (1)
  • Francisco Lacerda (2)

  1. Institute for System and Robotics (ISR), Instituto Superior Técnico, Lisbon, Portugal
  2. Department of Linguistics, Stockholm University, Stockholm, Sweden
