To our knowledge, prior to the current study, Kanzi had never been experimentally exposed to degraded forms of familiar words. Nonetheless, he correctly recognised them in a match-to-sample paradigm, selecting the corresponding lexigram at a rate significantly above chance. Furthermore, we expanded upon Heimbauer et al.'s (2011) work by demonstrating for the first time that a nonhuman can recognise not only degraded speech but also fully computer-generated speech. Such speech lacks naturalness and many of the alternative cues to word meaning, yet Kanzi continued to recognise it even after some forms of degradation. Although his performance with computer-generated speech was lower than with natural speech (both unmanipulated and degraded), he still chose the correct lexigram at a rate significantly above chance for the unmanipulated computer-generated stimuli and for their sinusoidally degraded versions. It is worth noting that the synthesis mode used to create the computer-generated stimuli (formant synthesis) was overtly crude and robotic-sounding, and previous work has shown that humans can struggle with such computer-generated stimuli, particularly when presented with novel phrases (Pisoni 1997). That Kanzi could still understand even some degraded versions of formant-synthesised stimuli demonstrates that bonobos possess perceptual mechanisms that are remarkably resilient to highly deviant, non-natural speech.
Our findings support, but also subtly contrast with, previous work on the processing of degraded speech by a language-competent chimpanzee, Panzee (Heimbauer et al. 2011). Like Panzee, Kanzi performed best with natural, unmanipulated stimuli, achieving a success rate comparable to his historical performance with the same words (91%). Also like Panzee, Kanzi scored significantly above chance on both noise-vocoded and sine-wave versions of natural-voice stimuli. Interestingly, however, Kanzi's performance differed from Panzee's with respect to which form of degradation he coped with best. Specifically, Kanzi's success rate was significantly above chance for both natural-voice and computer-generated stimuli that had been sinusoidally manipulated, whereas with noise vocoding he scored below chance on the computer-generated variants. This finding runs counter to our initial prediction; one potential explanation for the discrepancy is the more tonal nature of the bonobo vocal communication system. Compared with both chimpanzees and humans, bonobo vocalisations are generally up to an octave higher in pitch (Grawunder et al. 2018, though see also Garcia and Dunn 2019; Grawunder et al. 2019). It may therefore be that, for bonobos, the underlying information content of sinusoidally manipulated words is inherently easier to extract than that of noise-based degradations. However, given that these results derive from single individuals (in both chimpanzees and bonobos), follow-up work replicating this effect in additional subjects is essential to confirming this hypothesis.
We also ran a virtually identical experiment with humans to allow direct comparison between human and nonhuman subjects. As predicted, all participants performed at ceiling, confirming that the task was trivial for human listeners. Typically, in human degraded-speech experiments, subjects are trained on degraded stimulus sets and then asked to accurately decipher novel words or phrases (e.g. Heimbauer et al. 2011). In our experiment, subjects heard single words and, in addition, made their selection from only three choices, which undoubtedly made the task considerably easier. Nevertheless, one advantage of this setup is that it highlights that, despite Kanzi's success at parsing degraded speech, a gap remains between him and the humans we tested when dealing with such hyper-variable stimuli.
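As a minimal sketch of the kind of analysis implied by "significantly above chance" in a three-alternative task (not the actual analysis reported here; the trial counts below are hypothetical), above-chance performance can be assessed with an exact one-tailed binomial test against a chance rate of 1/3:

```python
from math import comb

def p_above_chance(successes: int, trials: int, chance: float) -> float:
    """Exact one-tailed binomial test: P(X >= successes) when each
    trial succeeds with probability `chance` under the null hypothesis."""
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(successes, trials + 1)
    )

# Hypothetical numbers: 40 correct responses out of 48 trials,
# three lexigram choices per trial, so chance = 1/3.
p = p_above_chance(40, 48, 1 / 3)
print(f"p = {p:.2e}")  # far below 0.05: performance exceeds chance
```

Subjects performing at chance (around 16/48 correct in this hypothetical setup) would yield a large p-value, so the test separates genuine word recognition from guessing.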
From a more general perspective, these results have interesting implications for our understanding of the evolutionary origins of speech. The fact that humans can process speech even when it has been substantially perturbed (degraded) has, in the past, been taken to indicate that the perceptual mechanisms necessary for speech comprehension are unique to humans and distinct from general auditory processing (Remez et al. 1994, cf. Fitch 2011). Although comparative work demonstrating degraded-speech processing in chimpanzees suggests that these perceptual mechanisms may be rooted more deeply within the primate lineage (Heimbauer et al. 2011), additional data from other closely related ape species are important for making a stronger case for homology, and for more confidently ruling out convergence, i.e. the same ability evolving independently in different species (Fitch 2017). Our work contributes to this debate by revealing very similar abilities in another species closely related to humans.
It should also be pointed out that, like Panzee, Kanzi has had extensive experience with human speech over his lifetime, which undoubtedly contributed to his ability to parse degraded speech. However, whether experience alone is sufficient to explain the data remains unresolved. The next important step in understanding the relative contributions of phylogeny and experience to the processing of highly variable speech is to run similar experiments with species more distantly related to humans that also have experience with human speech. If the capacity for processing degraded speech stems chiefly from experience and ontogenetic factors, we would expect species familiar with human speech (e.g. domesticated dogs) to share this ability with Panzee and Kanzi. An alternative approach to disentangling developmental and evolutionary influences would be to probe whether competence on such tasks varies as a function of age: if phylogeny plays a role in speech processing, then enculturated individuals, once initially trained to use lexigrams, should succeed at recognising degraded forms; if experience also plays a role, performance should additionally improve with age.
In conclusion, we show that, like humans and chimpanzees, bonobos are capable of processing both degraded and computer-generated speech. These data therefore provide critical further support for the theory that the perceptual mechanisms necessary for dealing with speech are not unique to humans, but are rooted more deeply within the primate lineage, likely evolving before language or speech emerged.