Acoustic voice characteristics with and without wearing a facemask

Nguyen, Duy Duong; McCabe, Patricia; Thomas, Donna; Purcell, Alison; Doble, Maree; Novakovic, Daniel; Chacon, Antonia; Madill, Catherine

doi:10.1038/s41598-021-85130-8

Article
Open access
Published: 11 March 2021

Volume 11, article number 5651, (2021)
Scientific Reports

Abstract

Facemasks are essential for healthcare workers but characteristics of the voice whilst wearing this personal protective equipment are not well understood. In the present study, we compared acoustic voice measures in recordings of sixteen adults producing standardised vocal tasks with and without wearing either a surgical mask or a KN95 mask. Data were analysed for mean spectral levels in the 0–1 kHz and 1–8 kHz regions, an energy ratio between 0–1 and 1–8 kHz (LH1000), harmonics-to-noise ratio (HNR), smoothed cepstral peak prominence (CPPS), and vocal intensity. In connected speech there was significant attenuation of mean spectral level in the 1–8 kHz region and no significant change in this measure at 0–1 kHz. Mean spectral levels of the vowel did not change significantly in mask-wearing conditions. LH1000 for connected speech significantly increased whilst wearing either a surgical mask or a KN95 mask but no significant change in this measure was found for the vowel. HNR was higher in the mask-wearing conditions than in the no-mask condition. CPPS and vocal intensity did not change in mask-wearing conditions. These findings imply an attenuation effect of wearing these types of masks on the voice spectra, with the surgical mask showing less impact than the KN95.

Introduction

Facemasks are an essential piece of personal protective equipment (PPE) and can be broadly categorized into respirators, medical masks (including surgical masks and procedure masks), and woven fabric (cloth) masks¹. Respirators and surgical masks provide different levels of barrier to prevent infectious transmission via aerosols and droplets². Masks with higher barrier levels (e.g. N95) are used in aerosol generating procedures (AGPs) and other high risk activities¹. During non-aerosol generating protocols, surgical masks offer a similar degree of protection to N95 masks against viral respiratory infections including coronaviruses in health care workers (HCWs)³. Although surgical masks do not provide the same level of protection as N95 masks, they prevent some aerosols and droplets from being released from phonation and respiratory activities, contributing to reducing the risk of transmission⁴. In the SARS-CoV2 pandemic (COVID-19), such masks have been recommended for use by not only HCWs but also the general public in areas with known or suspected widespread transmission, high population density, or settings where physical distancing cannot be effectively achieved⁵. Although masks are effective PPE⁴, wearing a mask negatively affects the physiological and psychological performance of HCWs⁶.

Masks also interfere with effective verbal communication. Certain masks, particularly N95 respirators, can impair speech understanding by listeners⁷. Word intelligibility dropped by between 1 and 17% while wearing respirators commonly used by HCWs; the N95 mask resulted in a mean (standard deviation, SD) modified rhyme test (MRT) score of 83 (16.2)% compared to 92 (5.8)% in non-mask controls⁸. The use of an N95 mask in background noise resulted in a significant decrease in speech perception accuracy⁹. Speaking while wearing a mask at longer distances decreases speech perception accuracy by an even greater magnitude than not wearing a mask¹⁰. A mask also creates a physical visual barrier that precludes lip reading¹¹, removing communication cues for people with hearing loss and communication disabilities such as aphasia¹². From a user’s perspective, wearing masks increased the perception of vocal effort, reduced auditory feedback, and made it difficult to coordinate speech and breathing¹³. Understanding how the voice changes whilst wearing a mask is important so that clinical decision-making and choice of mask appropriately balance infection control and optimal verbal communication.

Although it is believed that facemasks attenuate sound transmission like a low-pass filter¹⁰,¹⁴, little information is available on voice characteristics whilst wearing a facemask. The scarce literature on the topic suggests possible changes in the speech spectrum. Mendel et al.¹⁵ compared speech spectral levels calculated as total root mean square (RMS) power from the Connected Speech Test (CST) stimuli produced by one speaker with and without wearing a surgical mask. They found a significant difference in the total RMS power between the two conditions. However, the affected frequency band was not reported. Atcherson et al.¹¹ found that the total RMS values of speech signals from the CST stimuli were significantly higher when not wearing a mask compared to the conditions with a mask. They also did not mention which frequency range was affected by the mask. Goldin et al.¹⁴ utilised a head and torso simulator to play white noise via the model’s mouth without a mask, with a surgical mask, and with an N95 respirator. They found that facemasks attenuated the sound levels at frequency regions between 2 and 7 kHz by 3–4 dB for the surgical mask and nearly 12 dB for the N95 mask compared with the non-mask condition. Their model lacked natural speech features, and its face contour and surface were not similar to human face contour and skin, which affected how well the masks fitted. However, based on their findings it seems reasonable to hypothesize that mask wearing would attenuate speech spectra at similar frequency bands.

Clear and natural speech production is necessary for accurate speech understanding and requires less listening effort than degraded speech¹⁶. Given the widespread use of facemasks in the COVID-19 pandemic, it seems reasonable to further clarify the characteristics of the voice signal in speech of vocally healthy speakers who are wearing a mask. Given the above-mentioned findings of the modification of the speech spectra by the mask, the present study quantified the low- and high-frequency energy regions. Spectral analyses not only give information about the overall spectral shape that may be meaningful in speech perception¹⁷ but also provide important acoustic correlates of voice quality¹⁸. These spectral measures were selected as both the low and high frequency regions also contribute to speech recognition¹⁹,²⁰. Low-frequency spectral bands are important in recognizing vowels²¹,²² and voiced fricative consonants²³. High frequency spectral energy makes a significant contribution to speech recognition²⁴,²⁵ including the recognition of vowels²⁶, voiceless and voiced fricative consonants²⁷, spoken and sung text²⁸, and speech recognition in noise²⁹. It has also been shown that the high-frequency region provides perceptual cues for speaker identity³⁰ and gender identification³¹. In addition, the high frequency region plays an important role in the perception of clear speech: a shift of energy concentration toward higher frequency regions contributes to the clear speech effect for normal-hearing listeners³².

Presumably, the quality and audibility of the voice might also change whilst wearing a facemask as previous studies have observed voice changes in phonation with the mouth covered³³. This change may interfere with auditory-perceptual voice judgment by speech language pathologists (SLPs) and ear nose and throat specialists (ENTs). Dysphonic voice quality has also been proven to result in reduced comprehension of speech content by listeners³⁴,³⁵. Wearing a mask may add to this effect by increasing the difficulty of understanding speech of an individual with dysphonic voice. Both voice quality and audibility can be effectively examined using acoustic analysis, which is a non-invasive objective assessment. Traditional acoustic measures of voice quality are based on frequency-based measurements³⁶ and include fundamental frequency (F0)³⁷ and noise (harmonic-to-noise ratio, HNR)³⁸,³⁹,⁴⁰,⁴¹. Amongst these, HNR has been used as a measure of vocal clarity⁴². The vocal signal can also be analysed based on spectral-based measurement of vowel and connected speech, which does not depend upon reliable tracking of vocal F0⁴³. The cepstral peak prominence (CPP) has been shown to have stronger weighted correlations with overall voice quality than other acoustic measures⁴⁴. Given that it is a measure of periodicity and harmonics strength, a signal with a strong harmonic structure would have a higher CPP than aperiodic signals⁴⁵. It has been considered a significant predictor of dysphonic severity⁴⁶. However, there are inherent limitations of cepstral analysis, that is, it is affected by vowel types and vocal intensity⁴⁷, vocal tasks⁴⁸, vocal tract⁴⁹, and the algorithm of software packages⁴⁸,⁵⁰. Vocal audibility can be examined both by spectral energy at different frequency bands and sound intensity, which can also be estimated from the acoustic signals.

During the COVID-19 pandemic two types of masks were commercially available in Australia: the standard surgical mask and the KN95 mask (China GB2626-2006)⁵¹. The KN95 mask provides similar protection characteristics to the N95 mask⁵². The major filtering and fitting characteristics of the KN95 mask as provided by 3M⁵² were as follows: Filter performance ≥ 95%; Flow rate = 85 L/min; Inhalation resistance ≤ 350 Pa; Exhalation resistance ≤ 250 Pa; and Total inward leakage < 8%. The total inward leakage indicates the amount of an aerosol that enters the mask via both filter penetration and face-seal leakage⁵². Presumably, the higher the barrier level a mask provides, the greater the impact it would have on the voice signals. The aims of the present study were to (1) examine the acoustic characteristics of voice and speech whilst wearing either a surgical mask or a KN95 mask; and (2) compare the acoustic measures between the standard surgical mask and the KN95 mask. We hypothesized that: (1) Low- and high-frequency spectral levels, HNR, CPPS, and vocal intensity would change whilst wearing these facemasks; and (2) Changes in these acoustic measures would be more pronounced with the KN95 mask than with the standard surgical mask.

Methods

Ethical approval

The voice and speech data analysed in this study were part of a larger project which was approved by the Human Research Ethics Committee of The University of Sydney (protocol number: 2020/399). Informed consent to participate in this study was obtained from all participants. Informed consent to publish was also obtained from a participant for publication of identifying information/image (Fig. 1) in an online open-access publication. The present study was implemented in accordance with relevant ethical guidelines and regulations. The measurement procedures used in this study conformed to the standards set by the latest revision of the Declaration of Helsinki.

Participants

Sixteen participants took part in this study (12 females, 4 males) with mean age = 43 years (range = 24–61). All were English speakers and non-smokers, and none reported any voice or hearing problems at the time of the study. Participants were otolaryngologists (n = 2), practicing speech language pathologists (n = 13), and a registered nurse working in an Ear Nose and Throat clinic (n = 1).

Voice recordings

Due to social distancing measures during the COVID-19 pandemic, it was impossible for participants to have their voices recorded in the same recording environment. Therefore, voice recordings took place in a room at the practicing clinic of the participants with ambient noise ranging from 33.3 decibels (dBA) to 58.0 dBA. Participants were required to use their habitual voice to produce the following standardised tasks: three repetitions of the sustained vowel /a/ for at least 10 s, the Consensus Auditory-Perceptual Evaluation of Voice (CAPE-V) phrases⁵³, and the Rainbow Passage⁵⁴. These tasks were produced in three conditions with the speaker (1) not wearing a mask, (2) wearing a surgical mask, and (3) wearing a KN95 mask (Fig. 1). The order of conditions was randomised across speakers to minimize biases related to intra-speaker variability in phonation and potential compensation whilst wearing a mask. When wearing these masks, participants were required to use the highest level of fitting to ensure maximal barrier level. They were required to press the metal nose bar so that it fit tightly to the nose contour. The straps of the mask were securely placed behind the auricles and the lower side of the mask was pulled fully downward so that it covered the chin completely (Fig. 1). It is known that in unfavourable/challenging speaking conditions, speakers may adopt a phonation style that helps improve clear phonation⁵⁵,⁵⁶. Therefore, we required participants to maintain a similar habitual voice in terms of pitch, loudness, and phonation type throughout recording sessions both with and without a mask to minimise intra-speaker variability in voice production.

All voice signals were captured using an AKG C520 ear-mounted microphone⁵⁷ placed at a constant distance of 6 cm, 45° off the mouth axis and were analog-to-digital converted using a professional external sound card (Roland Quadcapture⁵⁸) at 44.1 kHz and 16-bit resolution. The signals were processed and saved to a laptop computer using the Audacity sound editing software in *.wav format⁵⁹. Calibration of sound level in the voice signals was deemed unnecessary given that the data were used to test within-subject effects of mask and non-mask conditions.

Acoustic analysis

Voice samples were edited in Audacity to extract the middle 3 s of the sustained /a/ vowel, the 3rd CAPE-V phrase (CAPEV-3), and the 2nd and 3rd sentences of the Rainbow Passage (RP23). All acoustic data were measured using Praat version 6.0.39⁶⁰.

Mean spectral level in low (0–1 kHz) and high (1–8 kHz) frequency ranges

Spectral levels in the 0–1000 Hz and 1000–8000 Hz bands were measured in Praat for the /a/ vowel (averaged from three repeats) and RP23. 1000 Hz was the cut-off between the low- and high-frequency regions in this study as the spectral region above 1000 Hz has been frequently used in investigating the role of different spectral regions in speech perception²⁵. Consonant noise is mainly concentrated at frequency regions above this frequency⁶¹. Further, the 1000 Hz cut-off has been used in studies involving spectral characteristics of voice quality⁶²,⁶³,⁶⁴. The upper limit of 8 kHz was used as extended high frequency ranges above this frequency have minor value in speech perception⁶⁵. The protocols in Praat were as follows: From Analyse spectrum => To LTAS, set bandwidth = 100 Hz and click OK. From Query => Get mean, then set the frequency band with the averaging method being “dB”. The output was then copied to an Excel spreadsheet for analysis.
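The band-level measurement described above can be illustrated with a minimal sketch. It assumes NumPy is available, and it is a simplified stand-in rather than Praat's LTAS algorithm: the function names `ltas_db` and `band_mean_db` and the synthetic test signal are hypothetical, and the levels are uncalibrated. It groups the power spectrum into 100 Hz bins, converts each bin to dB, and averages the dB values in each band.

```python
import numpy as np

def ltas_db(signal, fs, bandwidth=100.0):
    """Return bin centres (Hz) and uncalibrated levels (dB) of a simple long-term spectrum."""
    power = np.abs(np.fft.rfft(signal)) ** 2              # power per FFT bin
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    n_bins = int(np.ceil((fs / 2) / bandwidth))
    # Index of the bandwidth-wide bin each FFT frequency falls into.
    idx = np.minimum((freqs // bandwidth).astype(int), n_bins - 1)
    sums = np.bincount(idx, weights=power, minlength=n_bins)
    counts = np.bincount(idx, minlength=n_bins)
    mean_power = sums / np.maximum(counts, 1)
    centres = (np.arange(n_bins) + 0.5) * bandwidth
    return centres, 10 * np.log10(mean_power + 1e-20)

def band_mean_db(centres, levels_db, lo, hi):
    """Average the dB values of all bins whose centre lies in [lo, hi)."""
    mask = (centres >= lo) & (centres < hi)
    return float(levels_db[mask].mean())

# Synthetic example (not study data): strong low tone, weak high tone, faint noise.
fs = 16000
t = np.arange(fs) / fs
rng = np.random.default_rng(0)
signal = (np.sin(2 * np.pi * 450 * t)
          + 0.1 * np.sin(2 * np.pi * 3050 * t)
          + 0.01 * rng.standard_normal(fs))

centres, levels = ltas_db(signal, fs)
low = band_mean_db(centres, levels, 0, 1000)      # 0-1 kHz mean level
high = band_mean_db(centres, levels, 1000, 8000)  # 1-8 kHz mean level
print(round(low, 1), round(high, 1))
```

Because the dB values are averaged, as with the “dB” averaging method named above, a single strong component raises the band mean less than it would under energy averaging.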

Low/high spectral energy ratio between 0–1 and 1–8 kHz (LH1000)

We also evaluated the low/high energy ratio (reflecting spectral slope) which is a ratio of spectral energy levels between the low and high frequency ranges to investigate how this would be affected given the impact of mask-wearing on the speech spectrum. The low/high ratio using a 1000 Hz cut-off value (LH1000) has been used frequently in voice and speech research and has been shown to reflect voice quality^62,63, vocal load⁶⁴, sentence prominence in speech⁶⁶, and the effects of language⁶⁷.

The low/high energy ratio between the spectral areas below 1 kHz and between 1 and 8 kHz was measured for the /a/ vowel (averaged from three repeats) and RP23 using the long-term average spectra (LTAS) function in Praat. The command to obtain this measure in Praat was as follows: From Analyse spectrum => To LTAS, set bandwidth = 100 Hz and click OK. From Query => Get slope, set averaging method = dB, low band = 0–1000 Hz, high band = 1000–8000 Hz and click OK. The value provided by Praat was expressed in dB.
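Because a ratio of energies expressed in dB is a difference of levels, attenuating only the high band by a given number of dB should raise a low/high ratio by the same amount. The sketch below illustrates this under stated assumptions: it uses NumPy, it sums energy per band rather than reproducing Praat's Get slope computation, and the names `lh_ratio_db` and `attenuate_high_band` as well as the white-noise input are hypothetical.

```python
import numpy as np

def lh_ratio_db(signal, fs, cutoff=1000.0, upper=8000.0):
    """Low/high energy ratio in dB: energy below `cutoff` over energy in [cutoff, upper]."""
    power = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    low = power[freqs < cutoff].sum()
    high = power[(freqs >= cutoff) & (freqs <= upper)].sum()
    return 10 * np.log10(low / high)

def attenuate_high_band(signal, fs, atten_db, cutoff=1000.0):
    """Idealised 'mask': scale every component at or above `cutoff` down by `atten_db`."""
    spec = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    spec[freqs >= cutoff] *= 10 ** (-atten_db / 20)   # amplitude scaling
    return np.fft.irfft(spec, n=len(signal))

fs = 44100
rng = np.random.default_rng(0)
noise = rng.standard_normal(fs)                        # 1 s of white noise, not speech

unmasked = lh_ratio_db(noise, fs)
masked = lh_ratio_db(attenuate_high_band(noise, fs, 5.0), fs)
print(round(masked - unmasked, 2))                     # change in the ratio, in dB
```

A real mask does not attenuate uniformly across 1–8 kHz, so this idealisation only shows the direction and rough size of the expected change.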

Harmonics-to-noise ratio

Praat (version 6.0.39) was also used to measure harmonics-to-noise ratio (HNR) from the sustained /a/ vowel. The 3-s vowel sample was opened and selected in the Praat editing window, from which HNR was obtained using the command Voice report within the Pulses menu. Data were averaged from three repeats. Prior to measurement of HNR, all edited vowel samples were signal-typed by the first author (D.D.N.) and a research assistant using criteria recommended by Titze⁶⁸ and Sprecher et al.⁶⁹. This was conducted using narrow-band spectrograms generated in Praat using settings described in Sprecher et al.⁶⁹. Signal typing was performed visually by comparing each spectrogram picture with the exemplar signal types. Signal typing was deemed necessary because the measurement of HNR relies on reliable estimation of F0, which is only feasible in type 1 and type 2 signals⁶⁹.
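The dependence of HNR on periodicity can be illustrated with a simplified autocorrelation estimate: if r is the normalised autocorrelation at the pitch period, the ratio of harmonic to noise energy is approximately r / (1 − r). The sketch below is not Praat's implementation (which applies windowing and interpolation corrections). The function names, the 60–330 Hz search range, and the synthetic tone-plus-noise frames are assumptions for illustration.

```python
import math
import random

def autocorr(x, lag):
    """Unbiased autocorrelation estimate of list x at the given lag."""
    n = len(x) - lag
    return sum(x[i] * x[i + lag] for i in range(n)) / n

def hnr_db(frame, fs, f0_min=60.0, f0_max=330.0):
    """Simplified HNR (dB) from the highest normalised autocorrelation in the pitch range."""
    mean = sum(frame) / len(frame)
    x = [v - mean for v in frame]
    r0 = autocorr(x, 0)
    lags = range(int(fs / f0_max), int(fs / f0_min) + 1)
    r_max = max(autocorr(x, lag) for lag in lags) / r0
    r_max = min(max(r_max, 1e-9), 1 - 1e-9)            # keep the logarithm defined
    return 10 * math.log10(r_max / (1 - r_max))

def noisy_tone(fs, f0, noise_sd, n_samples, seed):
    """Synthetic frame: unit-amplitude sine at f0 plus Gaussian noise (not study data)."""
    rng = random.Random(seed)
    return [math.sin(2 * math.pi * f0 * i / fs) + rng.gauss(0.0, noise_sd)
            for i in range(n_samples)]

fs = 16000
hnr_clear = hnr_db(noisy_tone(fs, 200.0, 0.1, 1600, seed=1), fs)
hnr_noisy = hnr_db(noisy_tone(fs, 200.0, 0.5, 1600, seed=1), fs)
print(round(hnr_clear, 1), round(hnr_noisy, 1))
```

The estimate is only meaningful when a pitch period can be found, which is why the signal typing described above precedes the measurement.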

Cepstral peak prominence smoothed

The voice cepstrum is obtained by a Fourier transform of the logarithm power spectrum⁷⁰. A cepstral peak is identified within the dominant ‘rahmonic’ corresponding to the fundamental period from which CPP is calculated as the amplitude between the peak and the regression line directly below it⁴⁵. Smoothing the individual cepstra before extracting the cepstral peak and calculating CPP can improve prediction accuracy¹⁸. CPP-smoothed (CPPS) was measured in Praat using settings as follows⁷¹,⁷²: Pitch floor (Hz) = 60, Time steps (s) = 0.002, Maximum frequency (Hz) = 5000, Pre-emphasis from (Hz) = 50, Time averaging window (s) = 0.01, Quefrency averaging window (s) = 0.001, Peak search pitch range (Hz) = 60–330, Tolerance (0–1) = 0.05, Interpolation = Parabolic, Subtract tilt before smoothing = No, Tilt line quefrency range (s) = 0.001–0.0 (= end), Line type = Straight, Fit method = Robust.
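The definition of CPP given above (peak height above a regression line through the cepstrum) can be sketched for a single frame as follows. This is a simplified illustration assuming NumPy: it omits the time and quefrency smoothing that distinguishes CPPS, uses an ordinary least-squares line instead of Praat's robust fit, and the function name `cpp_db` and the synthetic signals are hypothetical. Only the 60–330 Hz peak search range and the 0.001 s start of the tilt line are taken from the settings listed above.

```python
import numpy as np

def cpp_db(frame, fs, f0_min=60.0, f0_max=330.0, fit_from=0.001):
    """Single-frame cepstral peak prominence in dB (no smoothing, least-squares line)."""
    windowed = frame * np.hanning(len(frame))
    log_spec = 10 * np.log10(np.abs(np.fft.fft(windowed)) ** 2 + 1e-12)
    cepstrum = np.abs(np.fft.ifft(log_spec)) ** 2        # power cepstrum
    cep_db = 10 * np.log10(cepstrum + 1e-12)
    half = len(frame) // 2                               # keep the non-mirrored half
    quef = np.arange(half) / fs                          # quefrency in seconds
    cep_db = cep_db[:half]
    # Search for the peak at quefrencies matching the allowed pitch range.
    search = np.flatnonzero((quef >= 1 / f0_max) & (quef <= 1 / f0_min))
    peak = search[np.argmax(cep_db[search])]
    # Regression (tilt) line fitted from `fit_from` seconds to the end.
    fit = quef >= fit_from
    slope, intercept = np.polyfit(quef[fit], cep_db[fit], 1)
    return float(cep_db[peak] - (slope * quef[peak] + intercept))

# Synthetic frames (not study data): a harmonic-rich tone and white noise.
fs, n = 16000, 2048
t = np.arange(n) / fs
rng = np.random.default_rng(0)
harmonic = sum(np.sin(2 * np.pi * 150 * k * t) / k for k in range(1, 21))
harmonic = harmonic + 0.01 * rng.standard_normal(n)
noise = rng.standard_normal(n)

cpp_harmonic = cpp_db(harmonic, fs)
cpp_noise = cpp_db(noise, fs)
print(round(cpp_harmonic, 1), round(cpp_noise, 1))
```

The harmonic signal yields the larger prominence, consistent with the statement that strongly periodic signals have higher CPP than aperiodic ones.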

Vocal intensity

Vocal intensity (dB) was also measured from the vowel, the 3rd CAPEV phrase, and the 2nd and 3rd sentences of the Rainbow Passage using Praat with default settings. Intensity values were not calibrated to real sound pressure level as the purpose of the study was to examine within-speaker effects.
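The reasoning behind omitting calibration can be shown with a small worked example: an unknown but constant recording gain shifts every level by the same number of dB, so the difference between two conditions recorded through the same chain is unaffected. The sketch below uses hypothetical synthetic samples and a hypothetical helper `level_db`; it is not the Praat intensity algorithm.

```python
import math

def level_db(samples):
    """Root-mean-square level in dB relative to an amplitude of 1.0 (uncalibrated)."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(rms)

# Hypothetical recordings: the 'mask' condition is simply a scaled copy here.
no_mask = [0.20 * math.sin(2 * math.pi * 150 * n / 8000) for n in range(8000)]
with_mask = [0.8 * s for s in no_mask]

def condition_difference(gain):
    """Level difference (mask minus no mask) when both pass through the same gain."""
    return level_db([gain * s for s in with_mask]) - level_db([gain * s for s in no_mask])

diff_a = condition_difference(1.0)   # one recording chain
diff_b = condition_difference(3.7)   # a different, unknown gain
print(round(diff_a, 3), round(diff_b, 3))
```

This holds only if the gain and microphone position stay fixed within a speaker across conditions, as the recording protocol above requires.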

Quality check of voice recordings and reliability analysis

Because voice recordings took place in different clinic rooms with different levels of background noise, audio files were examined for signal-to-noise ratio (SNR) using a Praat script called Speech-to-noise ratio /Voice-to-noise ratio v.01.01⁷³. Only samples with an SNR greater than 30 dB were used for acoustic analyses⁷⁴.

The sound files of four participants in all conditions [n = 4 × 3 conditions (no-mask, surgical mask, KN95) = 12] were randomly selected and analysed a second time by a research assistant for HNR and LH1000 to calculate inter-rater reliability using the Intraclass Correlation Coefficient (ICC, two-way mixed, consistency type). The results are shown in Table 1 and indicate excellent reliability. ICC was 1.00 for LH1000 as the measurement of this was fully automated using edited voice samples. The slightly lower ICC values for HNR resulted from possible differences between the raters in selecting (highlighting) the vowel segment for HNR measurement in the Praat editing window.

Table 1 Inter-rater reliability of acoustic analyses.
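For readers unfamiliar with the consistency-type ICC, a minimal sketch of the single-measures form, ICC = (MS_rows − MS_error) / (MS_rows + (k − 1) MS_error), is given below. Whether the reported values were single or average measures is not stated, so the single-measures form is an assumption, and the rating values are hypothetical rather than study data.

```python
def icc_consistency(ratings):
    """Two-way, consistency-type, single-measures ICC for ratings[target][rater]."""
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between targets
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between raters
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)

# Hypothetical HNR values (dB) for four files measured by two raters.
rater_a = [20.1, 22.4, 18.9, 25.3]
offset_b = [a + 0.5 for a in rater_a]          # constant offset between raters
noisy_b = [20.3, 22.1, 19.2, 25.0]             # small unsystematic differences

icc_offset = icc_consistency([list(p) for p in zip(rater_a, offset_b)])
icc_noisy = icc_consistency([list(p) for p in zip(rater_a, noisy_b)])
print(round(icc_offset, 3), round(icc_noisy, 3))
```

A constant offset between raters leaves the consistency-type ICC at 1, whereas unsystematic differences, such as those arising from manual segment selection, lower it slightly.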

Statistical analyses

Data were managed in Microsoft Excel 365⁷⁵ and analysed using IBM SPSS Statistics v.25.0⁷⁶ and Prism v8.1.2⁷⁷ for Windows. One-way repeated-measures analysis of variance (ANOVA) was used to examine the effects across three conditions (no-mask, surgical mask, and KN95 mask) on acoustic measures. Significant main effects were evaluated with Bonferroni-adjusted tests. Prior to analyses, normal distribution of the data was examined using Kolmogorov–Smirnov tests⁷⁸. Mauchly’s test of sphericity was performed before ANOVA and, if sphericity assumptions were not met, a Greenhouse–Geisser adjustment was used. Effect size was calculated using partial Eta squared (η²). Effect sizes of 0.01, 0.1, and 0.25 indicated small, medium, and large effects, respectively⁷⁹. Where normality assumption was not met, the Friedman test was used to compare data across non-mask, surgical mask, and KN95 conditions. A significance level of 0.05 was used.
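The core of the one-way repeated-measures ANOVA and its partial η² can be sketched as below. This is an illustration only: the analyses in this study were run in SPSS and Prism, the data values are hypothetical, and the sketch omits the p-value, Mauchly's test, and the Greenhouse–Geisser correction.

```python
def rm_anova(data):
    """One-way repeated-measures ANOVA for data[participant][condition].

    Returns (F, df_condition, df_error, partial eta squared).
    """
    n, k = len(data), len(data[0])
    grand = sum(map(sum, data)) / (n * k)
    cond_means = [sum(row[j] for row in data) / n for j in range(k)]
    subj_means = [sum(row) / k for row in data]
    ss_cond = n * sum((m - grand) ** 2 for m in cond_means)
    ss_subj = k * sum((m - grand) ** 2 for m in subj_means)
    ss_total = sum((v - grand) ** 2 for row in data for v in row)
    # Between-participant variance is removed before forming the error term.
    ss_error = ss_total - ss_cond - ss_subj
    df_cond, df_error = k - 1, (k - 1) * (n - 1)
    f_value = (ss_cond / df_cond) / (ss_error / df_error)
    partial_eta_sq = ss_cond / (ss_cond + ss_error)
    return f_value, df_cond, df_error, partial_eta_sq

# Hypothetical LH1000-like values (dB): columns are no mask, surgical mask, KN95.
example = [
    [23.1, 25.0, 28.4],
    [22.0, 25.9, 27.5],
    [24.3, 26.1, 29.0],
    [21.8, 24.2, 27.9],
]
f_value, df_cond, df_error, eta_sq = rm_anova(example)
print(round(f_value, 2), df_cond, df_error, round(eta_sq, 3))
```

Removing the between-participant sum of squares from the error term is what makes the design sensitive to within-speaker effects even when speakers differ widely in overall level.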

Results

Mean spectral levels at low and high frequency regions

Mean spectral levels in 0–1 kHz region

Figure 2 shows mean spectral levels in both frequency bands (0–1 kHz and 1–8 kHz). In the 0–1 kHz band, this spectral measure did not change across conditions for either vowel or connected speech. For sustained /a/ vowel phonation, no significant main effect of mask-wearing was found: F(2, 22) = 0.396, p = 0.678, partial η² = 0.035. For RP23, there was also no significant main effect of masks in the 0–1 kHz range: F(1.235, 13.588) = 0.808, p = 0.410, partial η² = 0.068.

Mean spectral levels in 1–8 kHz region

Figure 2 shows mean spectral levels for both vowel and connected speech in the 1–8 kHz region. For vowel production, no significant main effect of wearing a facemask was observed: F(2, 22) = 0.024, p = 0.963, partial η² = 0.002. However, for RP23, there was a significant main effect of mask-wearing on mean spectral levels in the 1–8 kHz region: F(1.173, 11.735) = 16.951, p = 0.001, partial η² = 0.629. Post-hoc tests showed that, compared with the non-mask condition, the KN95 mask attenuated the spectral levels in the 1–8 kHz region by 5.2 dB (p = 0.005) while the surgical mask attenuated the spectral levels in this region by 2.0 dB (p = 0.014).

Low/high spectral ratio (LH1000)

LH1000 was calculated for the /a/ vowel and RP23. Figure 3 shows the mean and SD of LH1000 for both tasks. One-way repeated-measures ANOVA was used to compare data across the non-mask, surgical mask, and KN95 mask conditions. For the sustained vowel, no significant main effect was observed: F(2, 22) = 0.949, p = 0.402, partial η² = 0.079. As Fig. 3 shows, LH1000 for the vowel did not change significantly across non-mask and mask conditions.

For RP23, a significant main effect was present: F(1.279, 14.073) = 84.346, p < 0.001, partial η² = 0.885. Figure 3 shows that LH1000 of RP23 was the lowest for the non-mask condition [mean (SD) = 23.0 (1.7) dB], higher for the surgical mask [mean (SD) = 25.5 (2.2) dB], and highest for the KN95 mask condition [mean (SD) = 28.2 (1.7) dB]. Pairwise Bonferroni-adjusted comparisons showed that wearing a KN95 mask increased the LH1000 of RP23 by 5.2 dB (p < 0.001) and wearing a surgical mask increased the LH1000 of RP23 by almost 2.5 dB (p < 0.001).
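As a quick consistency check that is not part of the original analysis, partial η² can be recovered from each reported F statistic and its degrees of freedom using the standard identity partial η² = (F × df_effect) / (F × df_effect + df_error). The function name and the labels in the sketch below are hypothetical; the numbers are those reported in the Results above.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared implied by an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)

# (label, F, df_effect, df_error, reported partial eta squared)
reported = [
    ("vowel, 0-1 kHz level", 0.396, 2, 22, 0.035),
    ("RP23, 0-1 kHz level", 0.808, 1.235, 13.588, 0.068),
    ("vowel, 1-8 kHz level", 0.024, 2, 22, 0.002),
    ("RP23, 1-8 kHz level", 16.951, 1.173, 11.735, 0.629),
    ("vowel, LH1000", 0.949, 2, 22, 0.079),
    ("RP23, LH1000", 84.346, 1.279, 14.073, 0.885),
]

for label, f_value, df_effect, df_error, eta_reported in reported:
    eta = partial_eta_squared(f_value, df_effect, df_error)
    print(f"{label}: computed {eta:.3f}, reported {eta_reported:.3f}")
```

The computed values agree with the reported effect sizes to three decimal places, including the tests where Greenhouse–Geisser-adjusted degrees of freedom were used.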