Voice attractiveness: Influence of stimulus duration and type
Voice attractiveness is a relatively new area of research. Some aspects of the methodology used in this domain deserve particular attention. Notably, the duration of voice samples is often neglected as a factor and is sometimes manipulated without the perceptual consequences of these manipulations being known. Moreover, the type of voice stimulus varies from a single vowel to complex sentences. The aim of this experiment was to investigate the extent to which stimulus duration (nonmanipulated vs. normalized) and type (vowel vs. word) influence perceived voice attractiveness. Twenty-seven male and female raters made attractiveness judgments of 30 male and female voice samples. Voice samples included a single vowel /a/, a three-vowel series /i a o/, and the French word “bonjour” (i.e., “hello”). These samples were presented in three conditions: nonmanipulated, shortened, and lengthened duration. Duration manipulation was performed using the pitch synchronous overlap and add (PSOLA) algorithm implemented in Praat. Results for the effect of stimulus type showed that word-length samples were more attractive to the opposite sex than vowels. Results for the effect of duration showed that the nonmanipulated sound sample duration was not predictive of perceived attractiveness. Duration manipulation, on the other hand, altered perceived attractiveness in the lengthening condition. In particular, there was a linear decrease in attractiveness as a function of modification percentage (especially for the word, as compared with the vowels). Recommendations for voice sample normalization with the PSOLA algorithm are thus to prefer shortening over lengthening and, if not possible, to limit the extent of duration manipulation, for example by normalizing to the mean sample duration.
Keywords: Voice perception; Attractiveness; PSOLA algorithm; Duration normalization
There is growing evidence that the human voice conveys important, socially relevant information about the speaker, independently of any linguistic and emotional content (Latinus & Belin, 2011). In particular, over the last 10 years, a number of studies have focused on the link between the acoustic features of the voice and its perceived attractiveness (e.g., Bruckert, Lienard, Lacroix, Kreutzer, & Leboucher, 2006; Collins, 2000; Feinberg, Jones, Little, Burt, & Perrett, 2005; Hodges-Simeon, Gaulin, & Puts, 2011). Some of these studies revealed, for example, that lower-pitched male voices and higher-pitched female voices are generally more attractive to opposite-sex listeners (Collins, 2000; Collins & Missing, 2003; Vukovic et al., 2011) when pitch values are not extreme (Borkowska & Pawlowski, 2011). Formant characteristics (frequency, dispersion) also play a role in voice attractiveness (Collins & Missing, 2003; Feinberg et al., 2011, 2006; Puts, Barndt, Welling, Dawood, & Burriss, 2011). Around the fertile phase of their menstrual cycle, women display a stronger preference for masculine, low-pitch voices (Feinberg et al., 2006; Puts, 2005) and themselves have a higher-pitched voice (at least for some types of speech; Bryant & Haselton, 2009) and a more perceptually attractive voice (Pipitone & Gallup, 2008). This suggests that voice perception has a role in guiding human mate choice. This idea is supported further by the fact that in some studies, voices are rated more attractive in individuals bearing signals of mate quality, such as body symmetry (Hughes, Harrison, & Gallup, 2002; Hughes, Pastizzo, & Gallup, 2008), reproductive/mating success (Hughes, Dispenza, & Gallup, 2004), and face attractiveness (Lander, 2008; Saxton, Caryl, & Roberts, 2006).
This is a relatively new area of research. As such, the methods used are variable, and some methodological questions remain unanswered. The main question we chose to address in this study concerns the duration of the sound excerpts. Many studies do not mention the range of their sample durations, and the studies that mention them reveal substantial variations between experiments. For example, the vowel duration was 250 – 380 ms (mean = 290 ms) in Collins and Missing (2003), 640 ms on average in Feinberg, Jones, Little, et al. (2005), and 201 – 477 ms in Bruckert et al. (2010). Studies using sentences also sometimes have mentioned the average duration, but only for information purposes (e.g., Lander, 2008; Puts, Apicella, & Cárdenas, 2012). The only study investigating the effect of voice sample duration on listeners’ ratings found no relationship between the duration of 1-to-10 counting sequences and attractiveness of male and female voices (Hughes et al., 2008). The authors found only a significant negative relationship between duration and estimated intelligence in male voices, but this could as well have been due to speech rate (Feldstein, Dohm, & Crown, 2001) rather than to sound duration (only the total duration of the sequence was taken into account). Since previous research in speech has shown that stimulus length influences speech perception (e.g., Diehl, Lotto, & Holt, 2004) and emotion recognition (e.g., Pell & Kotz, 2011), it remains to be seen whether sample duration affects perceived attractiveness.
Furthermore, some studies have chosen to standardize sound duration, using an algorithm developed initially by Moulines and Charpentier (1990): the pitch synchronous overlap and add (PSOLA) algorithm. To obtain an expanded voice sample, for example, the algorithm first analyzes and segments the sound signal. Then it synthesizes a new time-stretched version by overlapping and adding time segments extracted from the input sound. Using the Praat implementation of this algorithm (Boersma & Weenink, 2011), authors have normalized the duration of individual vowels (e.g., to 500 ms, Feinberg, DeBruine, Jones, & Perrett, 2008, and Feinberg et al., 2006; to 350 ms, Saxton, Debruine, Jones, Little, & Roberts, 2009). Their aim was “to control for variation in spoken vowel duration between individuals” (Feinberg et al., 2006, p. 217), and they did not investigate the possible impact of this manipulation on subsequent attractiveness ratings, presumably because this would not affect their results given the design they used (comparison of two versions, masculinized and feminized, of the same length-modified voices; Feinberg et al., 2006; Saxton et al., 2009). However, it is possible that such a manipulation makes the output voices sound less natural and, consequently, less attractive than the original. It is also likely that such perceptual consequences would be more pronounced for samples with durations that are the most distant from the target duration chosen for normalization. There are some designs where this could be detrimental, for instance when the attractiveness of an individual’s voice is put in relation to other characteristics of that individual or when brain correlates of attractiveness are studied. Consequently, knowing whether (and to what extent) duration contraction and expansion change voice attractiveness would be valuable for future voice perception studies.
Additionally, in voice research, samples vary considerably according to their content. Some authors use short sounds with neutral content, such as numbers (e.g., from 1 to 10; Hughes et al., 2002, and subsequent work), while others use neutral sentences such as the Rainbow passage (see Puts, Gaulin, & Verdolini, 2006, and subsequent work) or the time of day (e.g., “it’s fifteen minutes to three,” used by Lander, 2008). Connoted sentences have also been chosen, such as the equivalent of “hello” (Apicella & Feinberg, 2009), “I really like you/I really don’t like you” (Jones, Feinberg, Debruine, Little, & Vukovic, 2008; Vukovic et al., 2008), and even free speech sentences (Fischer et al., 2011; Hodges-Simeon et al., 2011; Puts, 2005). Still others use monophthong vowel sounds (e.g., /a/ and /i/ in English). These stimuli are most common in studies on voice preference (Bruckert et al., 2010; Collins, 2000; Feinberg, Jones, DeBruine, et al., 2005; Ferdenzi, Lemaître, Leongómez, & Roberts, 2011). The use of vowels is beneficial in two regards. First, these samples enable perceptual judgments of pitch and voice quality without being colored by contextual factors (co-articulation, emphasis, and semantic meaning). Second, vowel stimuli are often preferred for acoustics measures, such as for voice quality and formant analysis (Patel, Scherer, Björkner, & Sundberg, 2011). To date, little is known about differences in the perceived attractiveness among several voice sample types (e.g., word vs. vowel) recorded from the same individual. Most important with regard to our main question, it is not known whether some voice sample types are more sensitive to duration and duration manipulation effects on attractiveness (providing there are any).
In this experiment, we addressed two questions: (1) Do speech type and speech segment length influence attractiveness judgments of voices, and (2) does duration manipulation of a voice stimulus affect its perceived attractiveness? To answer these questions, we used different types of stimuli commonly used in attractiveness studies, namely a single vowel, a three-vowel sequence, and a word, with varied sample durations. We manipulated duration to compare attractiveness ratings of different types of original stimuli of varied durations versus the same stimuli with normalized short and long durations.
Twenty-seven Caucasian participants (15 men, 12 women) 22.1 ± 4.5 years of age (range, 17 – 34) were recruited from students and members of the staff of the University of Geneva, Switzerland. These participants served as raters in the main experiment. To avoid unwanted variability in attractiveness ratings due to language (Bresnahan, Ohashi, Nebashi, Liu, & Morinaga Shearman, 2002) and, possibly, sexual orientation, participants were required to be native speakers of French and to report being heterosexual. Participation was voluntary, and participants gave their informed written consent before starting the experiment.
Voice recordings of 30 Caucasian participants were used. These participants were distinct from the raters. Half of the speakers were males, and half were females (age: 22.9 ± 3.8 years; range 18 – 34 years). These individuals were also recruited mostly from students and members of the staff of the University of Geneva. All were French native speakers, heterosexual, and nonsmokers and declared not having a cold or illness on the day of voice recording or any speech impediment that would affect the way they naturally spoke. The voice recordings were a part of a larger experiment in which participants were also videotaped to create a database of voices and faces, the Geneva Attractiveness Database (GEAD; currently under development and soon to be released for academic research). Therefore, participants received compensation, either financially or by credits for the psychology course at the University, for their participation in the study.
The voices were recorded with a BCM 104 condenser studio microphone with cardioid directional characteristics (Neumann, Berlin; www.neumann.com) in a quiet room at a constant distance from the microphone. The recording sessions were led by one of three female experimenters,¹ but the participants were alone in the experimental room during the recording and were in contact with the experimenter through a speaker-microphone device. The voices were recorded onto a computer hard disk using Cubase v.5.5.0 (www.steinberg.fr) at a sampling rate of 44.1 kHz with 24-bit quantization and were saved as uncompressed wav files. The amplitude of the recorded signal was adjusted for each participant with a mixing table. Participants were required to pronounce the sentence “Bonjour. Il est deux heures moins dix” (Hi. It’s ten to two) and a series of six monophthong vowels /ε/, /i/, /a/, /o/, /u/, and /y/ (International Phonetic Alphabet). The sentence and the series of vowels were each pronounced twice. The audio samples used for the ratings were extracted from the second repetition (when participants are expected to be more relaxed). We only used “bonjour” and three of the vowels in the middle of the series, /i/, /a/, and /o/, as a three-vowel series (using vowels in the middle of the sequence limits intonation variations; see Collins, 2000). The vowel /a/ was also used independently as a single vowel. The audio samples were isolated using Praat v.5.2 (Boersma & Weenink, 2011), and normalization of sound intensity was performed by matching the average absolute amplitude of all recordings using MATLAB v.7.12.
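The intensity normalization step can be illustrated with a minimal sketch. It is written in Python rather than MATLAB and is not the code used in the study; the function names are hypothetical, and using the grand mean of the recordings' levels as the common reference is an assumption, since the reference level is not reported.

```python
from statistics import fmean


def mean_abs_amplitude(samples):
    """Average absolute amplitude of one recording (a list of floats)."""
    return fmean([abs(x) for x in samples])


def match_mean_abs_amplitude(recordings, target=None):
    """Scale each recording so that all share one average absolute amplitude.

    `target` defaults to the grand mean of the per-recording levels
    (an assumption; the source does not state the reference level).
    Silent recordings (level 0) are not handled in this sketch.
    """
    levels = [mean_abs_amplitude(rec) for rec in recordings]
    if target is None:
        target = fmean(levels)
    return [[x * target / level for x in rec]
            for rec, level in zip(recordings, levels)]


# Toy example: one quiet and one loud "recording" (made-up values)
quiet = [0.1, -0.1, 0.2, -0.2]
loud = [0.4, -0.4, 0.8, -0.8]
normalized = match_mean_abs_amplitude([quiet, loud])
```

After scaling, both toy recordings have the same average absolute amplitude, while the shape of each waveform is unchanged.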
Each audio sample (/i/, /a/, /o/ within the three-vowel sequence and “bonjour”) was used in three versions: nonmanipulated duration (ranging from 209 to 400 ms for the vowels and from 436 to 791 ms for the word), duration shortened (200 ms for the vowels and 420 ms for the word), and duration lengthened (400 ms for the vowels and 820 ms for the word). Reduction of duration included a manipulation ranging from − 4 % to − 50 % of the original length for the vowels and from − 4 % to − 47 % for the word. Extension of duration represented a manipulation ranging from + 0 % to + 92 % for the vowels and from + 4 % to + 88 % for the word. Duration lengthening and shortening were performed using the PSOLA algorithm in Praat (Boersma & Weenink, 2011). From these nonmanipulated and manipulated samples, a total of 270 samples were presented to the raters: samples from 30 participants × 3 types of stimuli (single vowel /a/, three-vowel sequence /i a o/, word “bonjour”) × 3 duration (nonmanipulated, short, long) conditions. Examples of stimuli are provided in the supplementary materials, and all stimuli are available from the authors upon request for nonprofit research.
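The modification percentages follow directly from the original and target durations. The sketch below is illustrative only (the function names are hypothetical); it applies the calculation to the two extremes of the reported range for “bonjour” and recomputes the stimulus count. Because the reported range endpoints are themselves rounded, values recomputed this way may differ from the reported ones by about one percentage point.

```python
def percent_change(original_ms, target_ms):
    """Duration change relative to the original duration, in percent."""
    return 100.0 * (target_ms - original_ms) / original_ms


def stretch_factor(original_ms, target_ms):
    """Multiplicative factor that maps the original duration onto the target."""
    return target_ms / original_ms


# Extremes of the reported nonmanipulated range for "bonjour" (436-791 ms),
# shortened to 420 ms and lengthened to 820 ms
word_range_ms = (436, 791)
changes = {
    original: (round(percent_change(original, 420)),
               round(percent_change(original, 820)))
    for original in word_range_ms
}
# changes == {436: (-4, 88), 791: (-47, 4)}

# 30 speakers x 3 stimulus types x 3 duration conditions
n_stimuli = 30 * 3 * 3
```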
Participants were instructed to rate attractiveness on a scale of 1 – 7 as accurately and as quickly as possible. The response options were evenly placed in a semicircle in the center of the screen. Both attractiveness ratings and response times (time between the appearance of the rating scale and the mouse click on the chosen response option) were recorded. The maximum allowed response time was 7 s. A null rating was given if no response was provided within this time frame, and the test automatically stopped if more than 10 % of the trials were missed. All participants complied with the instructions, and none had to restart the experiment. The maximum number of answers that were missed per participant was 2 % of 270 trials. To minimize possible biases in the response time measure, the mouse cursor was automatically repositioned at a central point equidistant from the answer options (squares 1 – 7) after each trial (see Fig. 2 for a summary of the procedure and a view of the user interface). Before performing the analyses, response time outliers were removed. Outliers were defined as values more than three standard deviations above or below the participant’s mean (2 % of the trials). Since the distribution was skewed, the remaining response time values were then log-transformed (cf. Whelan, 2008).
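The response time preprocessing can be sketched as follows. This is a minimal illustration, not the analysis code of the study: the function name is hypothetical, and the use of the sample standard deviation and of the natural logarithm are assumptions, since neither is specified.

```python
import math
from statistics import mean, stdev


def clean_response_times(rts_ms, n_sd=3.0):
    """Drop RTs beyond n_sd SDs of the participant's mean, then log-transform.

    `rts_ms` holds one participant's response times in milliseconds;
    missed trials are coded as None and skipped.
    """
    valid = [rt for rt in rts_ms if rt is not None]
    m, sd = mean(valid), stdev(valid)  # sample SD: an assumption
    kept = [rt for rt in valid if abs(rt - m) <= n_sd * sd]
    return [math.log(rt) for rt in kept]  # natural log: an assumption


# Made-up data for one participant: 19 typical trials, one very slow
# trial, and one missed trial
rts = [900, 1000, 1100] * 6 + [1000, 6000, None]
cleaned = clean_response_times(rts)
```

In this toy example the 6,000-ms trial lies more than three standard deviations above the participant's mean and is therefore dropped before the log transform.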
The analyses described below were performed on attractiveness and response time scores averaged by stimulus (i.e., voice; sample size, N = 30). All post hoc analyses were Tukey HSD tests (α = .05), unless otherwise noted.
Effect of stimulus type and duration
Effect of percentage of duration manipulation
Effect of nonmanipulated duration
The original nonmanipulated duration did not significantly predict attractiveness (linear regressions between /a/ and “bonjour” original durations and attractiveness: rs = −.16, ps > .406). It predicted response time, but for the single vowel /a/ only [r = −.37, F(1, 28) = 4.40, p < .05; longer durations triggered shorter response times].
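For a simple linear regression with one predictor and N observations, the correlation and the F statistic are linked by F(1, N − 2) = r²(N − 2)/(1 − r²). The short sketch below (with a hypothetical function name) applies this identity to the rounded correlation reported for the vowel /a/ with N = 30 voices; the small gap between the recomputed value and the reported one is of the size expected from rounding r to two decimals.

```python
def f_from_r(r, n):
    """F statistic (df = 1, n - 2) of a simple linear regression with correlation r."""
    return r ** 2 * (n - 2) / (1 - r ** 2)


# Rounded correlation reported for /a/ duration and response time, N = 30
f_vowel = f_from_r(-0.37, 30)  # about 4.44, versus the reported 4.40
```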
Cronbach’s alphas were computed on raw (nonaveraged) data, allowing us to quantify the level of agreement between participants for the attractiveness ratings. Agreement was high for all conditions, since alphas were >.70 (Kline, 1993; ranging from .84 to .93). To compare the different conditions (type, duration), we used a bootstrapping procedure performed with MATLAB software. Instead of using a single alpha by condition (e.g., shortened “bonjour”), this method is based on a repeated resampling procedure: It uses a randomly chosen N-size subsample of the initial participant sample to compute Cronbach’s alpha. Resampling was performed 1,000 times, thus providing a distribution of 1,000 alphas per condition. To test the difference between the alphas of two conditions (e.g., shortened vs. nonmanipulated “bonjour”), we computed the differences for all iterations of the two distributions, which provided a distribution of the differences. Alphas of the two conditions were considered as significantly different when the distribution of the differences, based on the confidence interval (determined with a chance level of p < .001), did not include the zero value (Davison & Hinkley, 1997). Two-by-two comparisons showed no significant difference as a function of stimulus type (for a given duration condition) or as a function of duration (for a given stimulus type).
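The resampling comparison of Cronbach's alphas can be sketched as follows. This is a Python illustration under stated assumptions, not the MATLAB procedure used here: raters are treated as items and voices as cases, raters are subsampled without replacement, the subsample size `n_sub` and all function names are hypothetical, and the interval is a simple percentile interval whose default level (.95) is illustrative (a stricter level such as .999 can be passed instead).

```python
import random
from statistics import pvariance


def cronbach_alpha(ratings):
    """Cronbach's alpha; ratings[r][s] is rater r's score for stimulus s."""
    k = len(ratings)
    item_var = sum(pvariance(r) for r in ratings)
    totals = [sum(col) for col in zip(*ratings)]
    return k / (k - 1) * (1 - item_var / pvariance(totals))


def bootstrap_alphas(ratings, n_sub, n_iter=1000, seed=0):
    """Alpha for n_iter random subsamples of n_sub raters (no replacement)."""
    rng = random.Random(seed)
    return [cronbach_alpha(rng.sample(ratings, n_sub)) for _ in range(n_iter)]


def alphas_differ(alphas_a, alphas_b, level=0.95):
    """True if the percentile interval of the paired differences excludes zero."""
    diffs = sorted(a - b for a, b in zip(alphas_a, alphas_b))
    lo = diffs[int((1 - level) / 2 * len(diffs))]
    hi = diffs[int((1 + level) / 2 * len(diffs)) - 1]
    return not (lo <= 0 <= hi)


# Toy data: 4 raters x 4 stimuli (made-up scores on the 1-7 scale)
toy = [[1, 3, 5, 7], [2, 3, 5, 6], [1, 4, 4, 7], [2, 2, 6, 6]]
alphas = bootstrap_alphas(toy, n_sub=3, n_iter=200)
```

Two conditions would be compared by calling `bootstrap_alphas` once per condition and passing both distributions to `alphas_differ`.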
The main aim of this experiment was to investigate whether duration of a voice sample would have an effect on its perceived attractiveness and, more specifically, whether duration normalization would affect attractiveness judgments. We showed in this experiment that the nonmanipulated sound sample duration was not predictive of perceived attractiveness. Between-rater agreement for attractiveness rating was high, regardless of the duration condition (nonmanipulated, shortened, lengthened). Shortening the voice samples in Praat (by up to almost 50 % of the original duration) did not significantly modify their perceived attractiveness. In contrast, lengthening had detrimental effects on attractiveness ratings. Furthermore, lengthening was linearly related to a decrease in attractiveness. This was also true when we limited the analyses to samples stretched no more than 50 % of their original duration. The deleterious effect of lengthening could be due to the algorithm procedure itself, which, although a very powerful tool, might alter the signal integrity, making the sample sound more unnatural than the nonmanipulated one.
Some experimental designs might require equalization of the voice samples’ duration, and manipulation with the PSOLA algorithm might be useful in these cases (note that taking only a fixed-length portion of the signal is not recommended for vowels, because of sharp cuts, and is even impossible for words). Reaction time or brain activation studies are good examples of such designs, where stimulus duration may by itself affect the outcome variables. If, in addition, the individual level of attractiveness is important (for example, if brain responses to a given voice are meant to be linked with other characteristics of that voice or of the person producing that voice), then it is imperative to be aware that voice duration manipulation may introduce some unwanted noise into the results by altering attractiveness. Some recommendations to researchers in voice attractiveness can thus be formulated from our results. First, duration normalization of male and female voice samples does not have to be systematic, since there is no relationship between natural duration and perceived attractiveness (at least for the duration ranges used in the present study, namely 209 – 400 ms for vowels and 436 – 791 ms for the word “bonjour”). Second, for designs where normalization is required, as was mentioned above, our results suggest that duration manipulation with PSOLA should be performed preferably in the direction of shortening, rather than lengthening. If the samples are long enough, normalization to the shortest duration should be applied. If not, we recommend limiting the amount of manipulation (i.e., the percentage of duration change, as compared with the natural duration). One way to operationalize this would be to normalize to the mean duration of the samples. It must be kept in mind that we chose the duration normalization parameters as a function of the naturally occurring durations of the voice samples we collected in our given settings. This might vary from one laboratory to another. For example, in another study, vowel durations were normalized to 500 ms, which would be too long for our samples (Feinberg et al., 2008, 2006). Differences in the natural duration of produced sounds may be due to the instructions given to the participants, for example whether they are required or not to sustain the sound and whether/how the experimenter pronounces the sound to the participant (which we believe should be avoided). Finally, adding stimulus type as a variable in our design revealed that the duration lengthening procedure was more deleterious for the word “bonjour” than for the vowels. The above-described recommendations are thus even more relevant for researchers willing to normalize voice samples that are more complex than plain vowels.
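These recommendations can be turned into a simple rule for choosing a normalization target, sketched below. The function names, the threshold `min_usable_ms` (what counts as "long enough" is left to the researcher), and the example durations are all hypothetical; only the 436-ms and 791-ms endpoints and the 820-ms target come from the present study.

```python
from statistics import fmean


def max_lengthening_percent(durations_ms, target_ms):
    """Largest lengthening (in percent) any sample needs to reach the target."""
    return max(0.0, max(100.0 * (target_ms - d) / d for d in durations_ms))


def choose_target(durations_ms, min_usable_ms):
    """Use the shortest duration if it is long enough, else the mean duration."""
    shortest = min(durations_ms)
    if shortest >= min_usable_ms:
        return shortest  # every sample is shortened or left untouched
    return fmean(durations_ms)  # limits the extent of lengthening


# Made-up word durations spanning the reported range for "bonjour"
durations = [436, 520, 610, 700, 791]
to_shortest = choose_target(durations, min_usable_ms=400)
to_mean = choose_target(durations, min_usable_ms=500)
worst_case_mean = max_lengthening_percent(durations, to_mean)
worst_case_long = max_lengthening_percent(durations, 820)
```

With these example durations, normalizing to the mean requires far less lengthening of the shortest sample than normalizing to a fixed long target such as 820 ms, and normalizing to the shortest duration requires none.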
As a secondary aim of our experiment, we were also able to determine whether the voice sample type influenced attractiveness ratings. Using three different types of stimuli commonly used in voice attractiveness studies (single vowel /a/, three-vowel sequence /i a o/, and the word “bonjour”), this experiment provided evidence that when pronounced by the opposite sex, the word is rated as more attractive than one or several vowels. One reason for this may be the inclusion of prosodic and potentially emotional information in word length samples, but not vowel or vowel series samples; however, we cannot conclude this with certainty in the present study. Indeed, it has been shown before that cues of social interest displayed by the speaker—which might be better indicated in a word than in a vowel—positively affect the listener’s evaluation of the voice (Jones et al., 2008). Response time was inversely proportional to sample durations, suggesting that the greater the amount of information (in terms of sound duration), the easier the judgments. It might also be that attractiveness judgments require only a small amount of information and that raters, therefore, have more time to prepare their motor response during the longer samples. Finally, between-rater agreement for attractiveness ratings was high and did not differ according to stimulus type, suggesting a good reliability of participants’ judgments for vowels as well as for words. Consequently, all stimulus types investigated can be confidently used, and the choice of a given stimulus type (when several are available, as in our GEAD) should depend on whether priority is given to acoustics measurements (then vowels are good candidates; e.g., Patel et al., 2011; Shrivastav, Camacho, Patel, & Eddins, 2011) or to prosody and content (then a word such as “bonjour” may be more appropriate).
This experiment provides elements to answer methodological questions that many researchers in voice attractiveness might have raised at some point when designing their experiments. We chose a particular experimental design to answer those questions, but some limitations to our approach should be mentioned. Additional parameters possibly influencing attractiveness should be investigated—for example, speech rate. In the three-vowel sequences, we used vowel presentation every 600 ms, but rate of speech might be influential and should therefore be studied. Several studies report standardizing the rate of sound excerpts presentation (e.g., one numeral per second, Hughes et al., 2004, and Saxton et al., 2006; one vowel per 0.5 s, Feinberg, Jones, Little, et al., 2005), but because it was not the aim of the studies, potential effects of rate on perceptual variables were not described. Additionally, although we measured between-subjects consistency (Cronbach’s alpha), within-subjects reliability of attractiveness ratings was not investigated, due to time constraints of the test. Recent evidence suggests that several repetitions are needed to obtain reliable responses on rating scales (Shrivastav, Sapienza, & Nandur, 2005). Repeated ratings are rarely done in voice attractiveness experiments, and this question would be worth being explored further.
This experiment investigated some methodological questions related to the manipulation of voice samples’ duration and to the choice of a stimulus type in voice attractiveness studies. Although more of these aspects still need to be studied (e.g., presentation rate of multiple sounds, within-rater consistency, etc.), our study provides evidence for formulating recommendations to voice attractiveness researchers. No effect of duration on attractiveness perception was shown for the range of our samples. Therefore, if a similar range of durations is obtained, no normalization is necessary. Nevertheless, if duration normalization with the PSOLA algorithm is applied, one must be cautious of the perceptual consequences. Our results showed that lengthening the samples affected attractiveness perception. Therefore, shortening the sample duration is preferred to lengthening. A more practical implementation may be to normalize the duration to the mean of sample duration to limit manipulation range. Although the different stimulus types investigated triggered reliable attractiveness judgments, it must be kept in mind when designing an experiment that words such as “bonjour” are more sensitive to the deleterious effects of duration lengthening than are more simple sounds like vowels.
¹ The three experimenters did the recording for 1, 9, and 20 of the 30 voices used in this experiment. Mean attractiveness of the voices (nonmanipulated duration) did not differ between the two latter experimenters [one-way ANOVA with experimenter as between-subjects factor: F(2,27) = 1.16, p = .330].
The authors wish to thank Benoît Bediou, Sophie Jarlier, Christophe Mermoud, and Lucas Tamarit for technical help, Mylena Da Paz Cabral and Olga Vorontsova for their help in stimulus collection, and Juan David Leongómez for useful discussions on the design. This research was financed by a grant from the Swiss National Science Foundation (100014_130036) and was supported by the National Center of Competence in Research “Affective Sciences” (51NF40-104897), hosted by the University of Geneva.
- Boersma, P., & Weenink, D. (2011). Praat: doing phonetics by computer (Version 5.2.46). [Computer program]. Retrieved from http://www.praat.org/
- Davison, A. C., & Hinkley, D. V. (1997). Bootstrap methods and their application. Cambridge: Cambridge University Press.
- Kline, P. (1993). Handbook of psychological testing. London: Routledge.
- Vukovic, J., Feinberg, D. R., Jones, B. C., DeBruine, L. M., Welling, L. L. M., Little, A. C., & Smith, F. G. (2008). Self-rated attractiveness predicts individual differences in women’s preferences for masculine men’s voices. Personality and Individual Differences, 45(6), 451–456.
- Vukovic, J., Jones, B. C., Feinberg, D. R., DeBruine, L. M., Smith, F. G., Welling, L. L. M., & Little, A. C. (2011). Variation in perceptions of physical dominance and trustworthiness predicts individual differences in the effect of relationship context on women’s preferences for masculine pitch in men’s voices. British Journal of Psychology, 102, 37–48.
- Whelan, R. (2008). Effective analysis of reaction time data. Psychological Record, 58(3), 475–482.