Ten pet dogs, all owned by members of the public, were recruited for the experiment (see Table 1 for details). The dogs were aged between 1 and 15 years (M = 6.7; SD = 5.2) and six were male. One of the dogs (Skye) was familiar with the experimenter. Prior to the start of the experiment, each dog’s owner confirmed that it had been trained with each of three commands (‘Come here’, ‘Lay/lie down’, and ‘Sit’), had good hearing, and was sufficiently mobile to respond to the commands. Before the first session of testing, the experimenter issued the commands herself to confirm informally that the dog would respond appropriately to her voice. All of the dogs were food motivated and were given occasional food rewards over the course of the experiment to help maintain their engagement with the task. In each case, the treats we used formed part of the dog’s normal diet and were supplied by the dog’s owner.
Audio recordings were made of a 21-year-old female’s voice (SKS) speaking the commands ‘Come here’, ‘Lay down’, and ‘Sit’ using an Audio-Technica AT2020 cardioid condenser microphone (Audio-Technica Ltd., Leeds, UK) connected to a PC running the Windows 7 operating system (Microsoft Corp., Redmond, WA) via a Yamaha AUDIOGRAM6 USB audio interface (Yamaha Corp., Hamamatsu, Japan). Audio signals were sampled at a frequency of 44.1 kHz in a 32-bit floating point format using the Audacity 2.2.2 recording and editing software (Audacity Team) and saved using the waveform audio file format (wav).
There are three ways of describing or viewing our manipulations: anatomical, acoustic, and perceptual. The anatomical processes or physical structures underlying GPR and VTL give rise to the acoustic variables f0 and formant dispersion, respectively. These are, in turn, heard perceptually as pitch and timbre. To change the perceived pitch of a recorded speech sound, we may adjust its f0, which simulates a particular GPR. To change a sound’s timbre, we may adjust its degree of formant dispersion, which simulates a particular VTL. It would only be possible to manipulate GPR or VTL directly if we had a physical model of the speech production mechanism.
The mean f0 of the spoken sections of ‘Come here’ was 208 Hz. A representative vowel /e/ in ‘Come h/e/re’ had formants of 447 Hz (F1), 2664 Hz (F2) and 3033 Hz (F3). The original sound files were manipulated in one of three ways: to decrease f0 by a factor of two (simulating a reduction in GPR) whilst leaving formant dispersion untouched, to increase formant dispersion by 30% (simulating an increase in VTL) whilst leaving f0 untouched, or to apply both manipulations together. These values were chosen to match the average difference in f0 and formant dispersion between adult male and female voices (Huber et al. 1999). The most natural-sounding manipulated commands were those in which both f0 and formant dispersion had been changed. Thus, the command ‘Come here’ converted to a male speaker by manipulating both simulated GPR and VTL had a mean f0 of 99 Hz and /e/ vowel formants of 348 Hz (F1), 2119 Hz (F2) and 2424 Hz (F3). Figure 2 shows spectrograms for the four versions of the ‘Come here’ command.
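As a quick arithmetic check on the values reported above for ‘Come here’, the ratio of each converted value to its original can be computed directly. The short Python sketch below uses only the figures given in the text; the function name `manipulation_ratios` is hypothetical and is not part of the Praat workflow used in the study.

```python
# Values reported in the text for the 'Come here' command (Hz).
ORIGINAL = {"f0": 208, "F1": 447, "F2": 2664, "F3": 3033}
CONVERTED = {"f0": 99, "F1": 348, "F2": 2119, "F3": 2424}


def manipulation_ratios(original, converted):
    """Return converted/original for every measure in `original`."""
    return {name: converted[name] / original[name] for name in original}


ratios = manipulation_ratios(ORIGINAL, CONVERTED)
for name, ratio in ratios.items():
    print(f"{name}: {ORIGINAL[name]} Hz -> {CONVERTED[name]} Hz (ratio {ratio:.2f})")
```

The f0 ratio comes out close to one half, in line with the halving described above, and the three formant ratios are close to one another, as would be expected when the formants are shifted together.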
Similar adjustments (halving of f0; 30% increase in formant dispersion) were made to recordings of the commands ‘Lay down’ and ‘Sit’ to produce simulated reduced GPR only, simulated increased VTL only, and both simulated reduced GPR and increased VTL recordings. The mean f0 of the spoken sections of the original recording of ‘Lay down’ was 186 Hz. A representative long vowel /a/ in ‘L/a/y down’ had formants of 610 Hz (F1), 2174 Hz (F2) and 2743 Hz (F3). The mean f0 of the spoken sections of the original recording of ‘Sit’ was 222 Hz. A representative short vowel /I/ in ‘S/I/t’ had formants of 527 Hz (F1), 2169 Hz (F2) and 2828 Hz (F3). All values were calculated, and adjustments made, using the Praat programme (version 5.1.26; Boersma 2001). Sound recordings were played back using an iPhone X (Apple Inc., Cupertino, CA) connected to a Bluetooth speaker (Anker Soundcore, Anker Innovations Ltd., Hong Kong). Rewards, where given, were the standard training treats used by each dog’s owner.
Huber et al. (1999) reported that the mean fundamental frequency recorded from women (aged between 20 and 30 years with a mean age of 23.5 years) speaking at a comfortable effort level is 218 Hz with a standard deviation of 24 Hz (for the open back unrounded vowel /ɑ/ sustained over 2–3 s). This is consistent with our adult female speaker’s mean fundamental frequency. F1 and F2 frequencies of the vowels (for instance /I/ in ‘Sit’) lie comfortably in the vowel ellipse shown in the classic Peterson and Barney (1952) study of vowels, and F1–F3 frequencies are all within the range of normal adult female speech reported by Kent and Vorperian (2018). The height of our speaker was 5′6″, which is close to the mean height for adult women born in the UK in 1996 (5′5″; NCD Risk Factor Collaboration 2016). Hence, we can assume that her VTL was of a typical length for an adult woman.
Each dog was tested in its own home and in the presence of its owner over 2 or 3 consecutive days. Testing lasted for approximately 30 min each day, not including occasional play breaks. Dogs tested over 2 days received 60 trials on each day, and those tested over 3 days received 40 trials on each day. On each trial, the experimenter stood approximately 1.5 m in front of, and facing, the dog with the Bluetooth speaker attached to a lanyard around her neck. The dog’s owner was positioned immediately behind the dog and held onto a lead attached to the dog’s collar. The owner ensured that the dog was standing at the start of each trial and then relaxed their grip on the lead to allow the dog to approach the experimenter (so that it might respond to ‘Come here’) when a command was played.
To orient the dog’s attention towards her at the beginning of each block of test trials, the experimenter gave the dog a treat. She then showed the dog a second treat in her left hand before closing her hand and raising it near to her mouth, obscuring her mouth from the dog’s view. On each trial, a recording of a command was played. If the dog’s response matched the command and was made within 5 s of the end of playback of the command, the trial was marked as correct. Otherwise, the trial was marked either as no response or, if the dog performed a response that did not match the command, as incorrect. To maintain the dog’s motivation to engage with the task, it was periodically rewarded with the treat held in the experimenter’s hand. A reward was given at the end of every fifth trial if the response on that trial was correct. If a correct response was not made on the fifth trial, then the next correct response was rewarded. In either case, the treat was then replaced, and counting was restarted. The longest run of trials between rewards for any dog was eight. At the end of each trial, the dog was returned to its standing starting position by its owner before the next trial commenced. This resulted in an interval of approximately 30 s from the start of one trial to the start of the next.
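The reward rule described above can be expressed as a simple counter. The Python sketch below is one reading of that rule, not code used in the study: the function name `rewarded_trials` and the representation of outcomes as a list of booleans (True for a correct response) are assumptions made for illustration.

```python
def rewarded_trials(outcomes, interval=5):
    """Return the 1-based trial numbers on which a reward is given.

    A reward is due once `interval` trials have elapsed since the last
    reward; it is delivered on the first correct trial from that point
    on, after which counting restarts.
    """
    rewarded = []
    since_last_reward = 0
    for trial_number, correct in enumerate(outcomes, start=1):
        since_last_reward += 1
        if since_last_reward >= interval and correct:
            rewarded.append(trial_number)
            since_last_reward = 0  # treat replaced, counting restarts
    return rewarded


# Hypothetical run: trials 5 and 6 are not correct, so the reward
# slips to trial 7; the next reward then falls five trials later.
example = [True] * 4 + [False, False, True] + [True] * 5
print(rewarded_trials(example))  # [7, 12]
```

Under this reading, the gap between rewards is five trials when the fifth response is correct and grows by one for every response after that which is not correct.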
There were 12 types of trials generated by the combination of the three commands (‘Come here’, ‘Lay down’, and ‘Sit’) and the four different voices (original, simulated GPR reduced, simulated VTL increased, both simulated GPR reduced and VTL increased). Over the course of the experiment, each dog experienced each of the 12 trial types ten times, giving a total of 120 trials. The sequence in which trials were presented was randomized with the constraint that no command was given, or voice used, more than four times in succession within a session, and a combination of command and voice was not presented more than three times in succession. Before the beginning of the first day of testing, there were ten randomly selected practice trials to familiarize both the dog and its owner with the testing procedure. During these practice trials, no data were recorded.
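The text does not state how the constrained random sequences were generated. One simple way to obtain a sequence that satisfies the stated constraints is rejection sampling: shuffle the 120 trials and keep the first shuffle that passes all checks. The Python sketch below illustrates that approach under this assumption; the function names, voice labels, and the choice to check command-voice repetitions across the whole sequence (which is at least as strict as checking within sessions) are all hypothetical.

```python
import random
from itertools import product

COMMANDS = ("Come here", "Lay down", "Sit")
VOICES = ("original", "GPR reduced", "VTL increased", "GPR reduced + VTL increased")


def longest_run(items):
    """Length of the longest run of identical consecutive items."""
    best = run = 0
    previous = object()  # sentinel that equals no real item
    for item in items:
        run = run + 1 if item == previous else 1
        previous = item
        best = max(best, run)
    return best


def is_valid(sequence, session_length):
    """Check the run-length constraints described in the text."""
    if longest_run(sequence) > 3:  # same command-voice combination
        return False
    for start in range(0, len(sequence), session_length):
        session = sequence[start:start + session_length]
        if longest_run([command for command, _ in session]) > 4:
            return False
        if longest_run([voice for _, voice in session]) > 4:
            return False
    return True


def make_sequence(session_length=60, repeats=10, seed=None, max_attempts=10_000):
    """Shuffle until a sequence satisfies the constraints (rejection sampling)."""
    rng = random.Random(seed)
    trials = list(product(COMMANDS, VOICES)) * repeats
    for _ in range(max_attempts):
        rng.shuffle(trials)
        if is_valid(trials, session_length):
            return list(trials)
    raise RuntimeError("no valid sequence found within max_attempts")


sequence = make_sequence(session_length=40, seed=1)  # a dog tested over 3 days
```

Here `session_length` would be 60 for dogs tested over 2 days and 40 for dogs tested over 3 days.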
For each of the 12 combinations of command and voice condition, we collected data on ten trials. To assess the stability of performance across testing, data were partitioned into two blocks of 60 trials—five from each condition. These data were then used to calculate the proportion of trials on which each dog made the correct response in each condition across each block. Data were analysed using a four-way repeated measures analysis of variance (ANOVA) with the factors of Trial Block (first vs. second), Command (Come here; Lay down; Sit), GPR Condition (normal vs. reduced), and VTL Condition (normal vs. increased). If the dogs were sensitive to the normal correlation between GPR and VTL, we would expect performance to be worse when either GPR or VTL was manipulated alone when compared to the original (female) voice or to the synthesized male voice (where both GPR and VTL were altered). We therefore predicted an interaction between the factors of GPR condition and VTL condition. Dependent samples Student’s t tests were used to make pairwise comparisons between conditions where appropriate, and Šidák correction for multiple comparisons was applied. All analyses were conducted using IBM SPSS Statistics version 27 (IBM Corp., Armonk, NY).
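The partitioning of trials into blocks and the calculation of the proportion of correct responses can be sketched as follows. This is an illustration only: the analyses reported here were run in SPSS, and the sketch assumes that the first five presentations of each command-voice combination form the first block and the next five the second. The function name `proportion_correct` and the tuple layout of each trial record are hypothetical.

```python
from collections import defaultdict


def proportion_correct(trials, per_block=5):
    """Proportion correct per (block, command, voice) cell.

    `trials` is an iterable of (command, voice, correct) tuples in
    presentation order; `correct` is True for a correct response.
    """
    presentations = defaultdict(int)      # times each combination has been seen
    counts = defaultdict(lambda: [0, 0])  # cell -> [number correct, number of trials]
    for command, voice, correct in trials:
        block = presentations[(command, voice)] // per_block + 1
        presentations[(command, voice)] += 1
        cell = counts[(block, command, voice)]
        cell[0] += int(correct)
        cell[1] += 1
    return {key: n_correct / n_trials for key, (n_correct, n_trials) in counts.items()}


# Hypothetical data for one condition: five correct responses, then
# three that were not correct, then two correct.
demo = (
    [("Sit", "original", True)] * 5
    + [("Sit", "original", False)] * 3
    + [("Sit", "original", True)] * 2
)
result = proportion_correct(demo)
print(result[(1, "Sit", "original")], result[(2, "Sit", "original")])  # 1.0 0.4
```

Each dog would contribute one such proportion per cell of the Trial Block × Command × GPR Condition × VTL Condition design, once the voice label is split into its GPR and VTL components.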