Research on Language and Computation, Volume 8, Issue 2–3, pp 133–168

A Computational Model of Unsupervised Speech Segmentation for Correspondence Learning

  • Daniel Duran
  • Hinrich Schütze
  • Bernd Möbius
  • Michael Walsh


In this paper, we develop a new conceptual framework for an important problem in language acquisition, the correspondence problem: the fact that a given utterance has different manifestations in the speech and articulation of different speakers and that the correspondence of these manifestations is difficult to learn. We put forward the Correspondence-by-Segmentation Hypothesis, which states that correspondence is primarily learned by first segmenting speech in an unsupervised manner and then mapping the acoustics of different speakers onto each other. We show that a rudimentary segmentation of speech can be learned in an unsupervised fashion. We then demonstrate that, using the previously learned segmentation, different instances of a word can be mapped onto each other with high accuracy when trained on utterance-label pairs for a small set of words.


Keywords: Language acquisition, Speech, Perception, Production, Correspondence learning, Segmentation





Copyright information

© Springer Science+Business Media B.V. 2011

Authors and Affiliations

  • Daniel Duran (1)
  • Hinrich Schütze (1)
  • Bernd Möbius (1)
  • Michael Walsh (1)

  1. Institute for Natural Language Processing, University of Stuttgart, Stuttgart, Germany
