1 Introduction

In public places, the sound environment strongly affects how well speech is transmitted, and noise makes speech communication correspondingly more difficult. The audibility of announcements is greatly affected by the background noise (BGN) in the space [1] and by speech-specific characteristics such as speech rate [2]. The ability to accurately grasp spoken information in public places can have a significant impact on the convenience of the space. Therefore, improving the quality of information transmission by speech is important for reasons such as increased safety and convenience, diversification of the information provided to users, and the need to create an environment that takes into account various types of disabilities. In particular, as shown in these references, information on the use of public transport, such as railways, is often broadcast over loudspeakers, and it is important to ensure the intelligibility of this information. Currently, such announcements are still widely broadcast by a human voice or by synthesized speech based on a waveform connection method [3], which relies on recorded human speech. The broadcast volume and the speech rate depend on the speaker, and there are no clear rules for the playback conditions of announcement broadcasts. For this reason, there are situations where announcements are perceived as too loud or where almost nothing can be heard. Because waveform-connection speech is synthesized from recordings of a natural voice, it is difficult to create multilingual broadcasts, and in the event of a disaster, station staff have to deal with the situation by broadcasting in their own voices.

Artificial synthesized speech has recently been used in various places, with Apple's Siri, Amazon's Alexa, Microsoft's Cortana, and the Google Assistant being representative examples [4]. Although some algorithmic improvements have been studied to improve the intelligibility of broadcasts made by natural or synthetic voices in noisy environments, synthetic voices, which tend to have mechanical intonation different from that of natural voices, can be particularly difficult to understand [5,6,7,8,9]. Furthermore, Sato et al. [10] experimentally investigated the relationship between the improvement of intelligibility and noisiness by increasing the volume and pointed out that higher-volume loudspeaker broadcasts may cause discomfort, and that the increase in noisiness may be more pronounced than the improvement in intelligibility.

In recent years, the population of Japan has been aging [11], and it has been increasingly noted that people have trouble obtaining information on how to use public transportation and where to board. Currently, various efforts are being made to strengthen the communication of information concerning public transportation [12,13,14,15,16,17,18]. For example, in 2020, JR East adopted Toshiba's ToSpeak G3 text-to-speech software on tablet terminals carried by train crews and station employees [19] and installed HOYA's ReadSpeaker text-to-speech software (hereinafter referred to as HOYA broadcasts) in the concourses and on the platforms of the Shinkansen [20]. Conventionally, ATOS (Autonomous decentralized Transport Operation control System) broadcasts and COSMOS (Computerized Safety, Maintenance and Operation Systems of Shinkansen) broadcasts using the waveform connection method have been used, with ATOS for conventional lines and COSMOS for the Shinkansen [21]. Recently, synthetic speech using the deep neural network (DNN) method [22], as typified by HOYA broadcasts, has been introduced. Promoting the introduction of DNN-based announcements can reduce the time and cost associated with recording and re-recording by a human announcer. In addition, according to the guidelines of the Japan Tourism Agency [23], besides multilingual broadcasts in four languages (Japanese, English, Chinese, and Korean), which are in high demand in Japan, synthetic voice can be used not only for emergency announcements but also, thanks to instantaneous speech synthesis, for train announcements when the timetable differs from the original plan and for announcements that apply only for short periods of time. Thus, if DNN synthetic speech can be used effectively in stations, it will be possible to create announcements freely, regardless of the language or situation.
However, in practice, the effectiveness of synthetic voice for use in stations has not been verified. For example, Tachibana [24] emphasized the importance of speech intelligibility in noisy environments, and there have been many studies on the optimal volume and speech rate of broadcasts in public spaces such as airports and train stations [1, 2, 24] and under high noise levels [25,26,27,28,29,30,31,32,33]. However, it is difficult to immediately apply these findings to the environments in train stations because the acoustic characteristics of the sound environment can differ depending on the location. Moreover, as non-combustibility, durability, and water resistance are emphasized when designing railway stations, building materials are often acoustically reflective, which results in a worse sound environment. Indeed, many measurements of noise levels in railway stations have already been carried out [34, 35], and Bandyopadhyay et al. [36] suggested that the sound pressure levels (SPLs) of BGN and loudspeaker sound on the platform can cause significant discomfort to users, as they are largely above acceptable daytime noise levels. Furthermore, although the Architectural Institute of Japan Environmental Standard [37] sets a speech rate of 5.5 mora/s, which is derived from the results of a study using broadcasts at this rate, there are many other studies that mention the possibility that the comprehensibility of information broadcasts may change depending on the speech rate [2, 32, 33]. Under these circumstances, the development of clear voice transmission that is efficient and does not cause discomfort in the sound environment of station premises is essential.

In this study, the influence of two factors, the signal-to-noise ratio (SNR) and the speech rate, on the auditory impression of announcements was examined. Then, the appropriate conditions for synthetic voice broadcasts were examined. Section 2 describes the outline and procedure for the experiments, while Sect. 3 presents the results of each experiment. From the results, the appropriate conditions for the broadcast of synthetic voice announcements at each location in a railway station are discussed in Sect. 4.

2 Methods

This study investigates the appropriate SNR and speech rate for synthetic voice announcements at railway stations for normal-hearing people, using announcements made by DNN synthesized speech. BGN from two locations, the ticket gate and the platform of the station, was considered. Both are listed in the Barrier-Free Improvement Guidelines of the Ministry of Land, Infrastructure and Transport [38] as places where acoustic facilities should be provided at railway stations, and noise level surveys have been conducted for both locations [39]. Previous studies [2, 32, 33] have also reported appropriate guidelines for the speech rate of announcements in public spaces in Japan. In order to investigate the effects of SNR and speech rate on auditory impression separately, the appropriate SNR range was first found with the speech rate fixed at a value considered appropriate beforehand, that is, without considering the acoustic characteristics of the space. Then the optimal speech rate at each location in the railway station was determined experimentally. More specifically, referring to previous studies [1, 2] conducted using natural voice announcements, Experiment 1 was first conducted to examine the appropriate SNR. The resulting SNR values were then applied in Experiment 2, which examined the appropriate speech rate.

Twelve students in their 20s participated in the experiment (6 male, 6 female). Participants were recruited by email from among students at the Faculty of Science and Technology, Tokyo University of Science, and received no remuneration for their participation. The study complied with EN 50332-2, proposed by CENELEC as a sound pressure regulation for portable audio players, and was therefore not invasive; written consent for the experiment was obtained from all participants. Participants were briefed on the purpose of the study, the experimental methods, and the anonymization and use of the data. Furthermore, prior to the study, participants were asked about their hearing, and it was confirmed that both ears were normal.

2.1 Experiment 1

Participants were instructed to listen to an information broadcast superimposed on BGN through headphones and to evaluate the announcement based on their listening impressions. Two evaluation items, “listening difficulty” and “noisiness”, were selected with reference to previous studies [1, 40]. A five-point Likert scale was used, with “1 = not at all (全く ~ ない)”, “2 = not very much (それほど ~ ない)”, “3 = slightly (多少 ~)”, “4 = much (だいぶ ~)”, and “5 = very much (非常に ~)”, for each of “listening difficulty” and “noisiness”. The participants were also asked to assume a situation in which they were trying to navigate by relying on audio information at a station they were using for the first time. The announcement used in this experiment was created using the WaveNet voice of Google text-to-speech [41], a speech synthesis application programming interface (API) provided by Google. The various conditions of the information broadcasts created are shown in Table 1. The speech rate was set to 7 mora/s with reference to previous studies [2, 32, 33] and HOYA broadcasts. Additionally, the words (station name, destination name, line number, train type, number of cars, etc.) included in each announcement were changed so that the participants could not predict the next sentence when listening to the announcement in the experiment. The BGN for this experiment was recorded using an earphone-type binaural microphone (Adphox BME-200) at several railway stations in the Tokyo metropolitan area. Sound sources in the range of noise levels likely to occur at each location were selected from the sound source data recorded for 15 s each time. Because of this, the frequency and time-varying characteristics of the BGN at each noise level were different from each other. The details of the experimental BGN and their frequency characteristics are shown in Table 2 and Fig. 1, respectively. The 3 kHz peak seen in the BGN at the ticket gate is the sound of the ticket machine. 
A precision sound level meter (Rion NL-52) was used to measure the BGN level during the recording. For the experiment, headphones (Sony MDR-M1ST), an audio interface (RME Fireface 802), and a laptop computer were used.

Table 1 Conditions of the generated announcement using WaveNet in Google text-to-speech
Table 2 Setting conditions and detailed content of the BGN
Fig. 1 Frequency characteristics of the BGN at (a) the ticket gate and (b) the platform

Table 3 shows the auditory conditions presented to the participants in the experiment with BGN at the ticket gate. Each combination of sound sources was presented once in random order. Participants evaluated a total of 72 conditions at the ticket gate: six levels of SNR for each of three levels of BGN, for each of four kinds of voice (low male, low female, high male, and high female). In the experiment using the BGN of the platform, only a single female voice was used, as in the HOYA broadcasts, because no significant differences for the gender of the voice were found in the ticket gate experiment (see below for details). The auditory conditions presented to the participants are shown in Table 4. Note that the upper limit of the SNR was set to +9 dB for the 80 dB BGN of the platform so that the information broadcast was kept below 90 dB. Thus, participants evaluated a total of 28 conditions on the platform: six levels of SNR for each of four levels of BGN, and four levels of SNR for the 80 dB BGN.
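The condition counts and the level cap described above can be checked with a short sketch. The ticket gate BGN levels and the voice types are taken from the text; the platform BGN levels (60 to 80 dB in 5 dB steps) are an assumption inferred from the levels mentioned elsewhere in the paper, the SNR levels are represented only by index because their values are listed in Tables 3 and 4, and all names are illustrative.

```python
from itertools import product

TICKET_GATE_BGN_DB = [60, 65, 70]
VOICES = ["low male", "low female", "high male", "high female"]
# Assumption: five platform BGN levels in 5 dB steps (not stated as a list in the text).
PLATFORM_BGN_DB = [60, 65, 70, 75, 80]
N_SNR_LEVELS = 6           # SNR levels per BGN level
N_SNR_LEVELS_AT_80_DB = 4  # fewer levels at 80 dB because of the +9 dB upper limit


def announcement_level_db(bgn_db, snr_db):
    """Announcement level implied by a BGN level and a target SNR (both in dB)."""
    return bgn_db + snr_db


# Ticket gate: every BGN level x SNR level x voice type.
ticket_gate_conditions = list(
    product(TICKET_GATE_BGN_DB, range(N_SNR_LEVELS), VOICES)
)

# Platform: one voice only, with a reduced SNR set at the 80 dB BGN.
platform_conditions = [
    (bgn, snr_index)
    for bgn in PLATFORM_BGN_DB
    for snr_index in range(N_SNR_LEVELS_AT_80_DB if bgn == 80 else N_SNR_LEVELS)
]

print(len(ticket_gate_conditions))   # 72
print(len(platform_conditions))      # 28
print(announcement_level_db(80, 9))  # 89, which stays below the 90 dB limit
```

Treating the announcement level as the sum of the BGN level and the SNR is the usual dB convention and matches the 80 dB plus 9 dB example in the text.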

Table 3 Conditions of the auditory stimuli at the ticket gate
Table 4 Conditions of the auditory stimuli on the platform

2.2 Experiment 2

The participants and equipment used in this experiment were the same as in Experiment 1. Participants were instructed to evaluate their subjective impressions of the broadcast information and BGN, which were presented to them through headphones. Three evaluation items were selected: “listening difficulty” and “noisiness”, which were the same as in Experiment 1, plus “strangeness”, for evaluating the unnaturalness of the speech rate of the generated voice, with reference to a previous study [2]. A five-point Likert scale of “1 = not at all (全く ~ ない)”, “2 = not very much (それほど ~ ない)”, “3 = slightly (多少 ~)”, “4 = much (だいぶ ~)” and “5 = very much (非常に ~)” was used for “listening difficulty” and “noisiness”. For “strangeness”, the speed felt to be appropriate for a railway station information broadcast was taken as the reference, and a five-point scale of “−2 = feels slow (遅く感じる)”, “−1 = feels slightly slow (やや遅く感じる)”, “0 = feels appropriate (ちょうどよい)”, “1 = feels slightly fast (やや速く感じる)”, and “2 = feels fast (速く感じる)” was used. As in Experiment 1, the participants were asked to assume a situation in which they were trying to move around a station they were using for the first time while relying on audio information. The creation and recording methods for the announcements and BGN used in this experiment were also the same as in Experiment 1. The SNR was set as shown in Table 5, referring to the results of Experiment 1. However, as described below, an appropriate SNR could not be found for the 80 dB BGN of the platform, so it was excluded from this experiment. The auditory conditions are shown in Table 6. The participants evaluated a total of 28 conditions, with four levels of speech rate for each of three levels of BGN at the ticket gate and four levels of BGN on the platform.

Table 5 The SNR setting conditions
Table 6 Auditory presentation conditions for conditions at the ticket gate and the platform

3 Results

3.1 Experiment 1

First, to examine differences between male and female participants, Student's t-test was conducted for the listening difficulty and noisiness items, and no significant difference was found. Therefore, Fig. 2 shows the relationship between the SNR and the listening difficulty and noisiness with the ticket gate BGN added, pooled across participant gender. Error bars indicate standard errors.
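As a minimal sketch of the two quantities mentioned here, the following computes the standard error used for error bars and the Student's t statistic (pooled variance) for two independent groups. The ratings are hypothetical and are not data from the study, the function names are illustrative, and the paper does not state which software was used. Only the statistic and its degrees of freedom are computed; obtaining a p-value would additionally require the t distribution.

```python
from math import sqrt
from statistics import mean, stdev, variance


def standard_error(ratings):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    return stdev(ratings) / sqrt(len(ratings))


def student_t(group_a, group_b):
    """Student's t statistic for two independent groups, assuming equal variances.

    Returns (t, degrees_of_freedom). Raises ZeroDivisionError if both groups
    have zero variance.
    """
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / df
    t = (mean(group_a) - mean(group_b)) / sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return t, df


# Hypothetical five-point ratings for one condition (6 male, 6 female participants).
male_ratings = [2, 3, 2, 3, 2, 3]
female_ratings = [3, 2, 3, 3, 2, 3]

t_value, dof = student_t(male_ratings, female_ratings)
print(f"t = {t_value:.3f}, df = {dof}")
print(f"SE (all participants) = {standard_error(male_ratings + female_ratings):.3f}")
```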

Fig. 2 Transition of “listening difficulty” under BGN levels of (a) 60 dB, (b) 65 dB, and (c) 70 dB, and transition of “noisiness” under BGN levels of (d) 60 dB, (e) 65 dB, and (f) 70 dB, at the ticket gate

A three-way analysis of variance (ANOVA) was conducted for each of the ratings of listening difficulty and noisiness, using gender of the voice, SNR level, and BGN level as factors and without including the gender of the participants. The main effects and interaction of the SNR and BGN levels were significant (p < 0.01). Multiple comparisons (Tukey’s honestly significant difference, HSD, test) showed significant differences between all conditions (p < 0.01). On the other hand, no main effect for the gender of the voice was found, suggesting that the influence of the gender of the voice on the evaluation of listening difficulty and noisiness is small. Therefore, the ratings were averaged over the four types of voices, and the results are shown in Fig. 3.

Fig. 3 Transition of (a) “listening difficulty” and (b) “noisiness” when the scores for the four voice types are averaged

Figure 3 shows that listening difficulty decreases and noisiness increases as the SNR increases. However, when the announcement level is over 80 dB, the impression of noisiness becomes more pronounced, so increased SNR tends to increase the listening difficulty. In addition, when comparing the values for the same SNR, the higher the BGN level, the higher the evaluated noisiness. In short, when the BGN level is high, it is difficult to reduce the listening difficulty even if the SNR is increased, and the impression of noisiness is high.

The relationship among the SNR, listening difficulty, and noisiness for the situation where BGN is added to the platform is shown in Fig. 4.

Fig. 4 Transition of (a) “listening difficulty” and (b) “noisiness” on the platform

A two-way ANOVA was conducted for the SNR and BGN levels as factors for each of the ratings of listening difficulty and noisiness. Then the main effects and interactions of the SNR and BGN levels were significant (p < 0.01). Multiple comparisons (Tukey’s HSD test) showed significant differences between all conditions (p < 0.01). The platform is a location where the BGN level is often higher than at the ticket gate, so when the BGN level is 75 dB or higher, the noisiness increases significantly before the listening difficulty is improved, even if the SNR is increased.
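To illustrate how main effects and an interaction are separated in a two-way ANOVA, the sketch below computes the F statistics for a balanced design from sums of squares. It is a simplified illustration under stated assumptions: ratings are treated as independent observations (the paper does not say whether a repeated-measures model was used), every cell has the same number of ratings, the data are hypothetical, and p-values are omitted because they require the F distribution. The function name is illustrative.

```python
from statistics import mean


def two_way_anova_f(cells):
    """F statistics for a balanced two-way ANOVA with replication.

    cells[i][j] is the list of ratings for level i of factor A (e.g. SNR)
    and level j of factor B (e.g. BGN level). Every cell must hold the same
    number of ratings (at least two), and the error variance must be nonzero.
    """
    a, b, n = len(cells), len(cells[0]), len(cells[0][0])
    all_ratings = [x for row in cells for cell in row for x in cell]
    grand = mean(all_ratings)
    cell_means = [[mean(cell) for cell in row] for row in cells]
    # In a balanced design, marginal means are the means of the cell means.
    a_means = [mean(row) for row in cell_means]
    b_means = [mean([cell_means[i][j] for i in range(a)]) for j in range(b)]

    ss_a = b * n * sum((m - grand) ** 2 for m in a_means)
    ss_b = a * n * sum((m - grand) ** 2 for m in b_means)
    ss_cells = n * sum((m - grand) ** 2 for row in cell_means for m in row)
    ss_ab = ss_cells - ss_a - ss_b  # interaction sum of squares
    ss_error = sum(
        (x - cell_means[i][j]) ** 2
        for i in range(a)
        for j in range(b)
        for x in cells[i][j]
    )
    ms_error = ss_error / (a * b * (n - 1))
    return {
        "A": (ss_a / (a - 1)) / ms_error,
        "B": (ss_b / (b - 1)) / ms_error,
        "AxB": (ss_ab / ((a - 1) * (b - 1))) / ms_error,
    }


# Hypothetical ratings: 2 SNR levels x 2 BGN levels, 2 ratings per cell.
example = [
    [[1, 3], [3, 5]],
    [[5, 7], [7, 9]],
]
print(two_way_anova_f(example))
```

In practice a statistics package would also supply p-values and the Tukey HSD comparisons reported in the text.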

3.2 Experiment 2

The relationship between speech rate and listening difficulty, noisiness, and strangeness in the situation with added BGN at the ticket gate is shown in Fig. 5, while the relationship among these with added BGN on the platform is shown in Fig. 6.

Fig. 5 Relationship between speech rate and (a) “listening difficulty”, (b) “noisiness”, and (c) “strangeness” at the ticket gate

Fig. 6 Relationship between speech rate and (a) “listening difficulty”, (b) “noisiness”, and (c) “strangeness” on the platform

A two-way ANOVA was conducted on the listening difficulty ratings for both the ticket gate and the platform, with speech rate and BGN level as factors. The results for the main effects of the speech rate and BGN level were significant (p < 0.01). Multiple comparisons (Tukey’s HSD test) for speech rate also showed significant differences between all conditions (p < 0.01). The same two-way ANOVA was conducted on the noisiness ratings as above, and the main effect of the BGN level was significant (p < 0.01). The same two-way ANOVA was conducted for the rating of strangeness, and the main effects of the speech rate and BGN level were significant (p < 0.01). The results of multiple comparisons for speech rate also showed significant differences between all conditions (p < 0.01).

4 Discussion

SNR values for which the evaluated scores for both listening difficulty and noisiness were less than 2 or 2.5 were regarded as appropriate, while values outside this range were regarded as inappropriate. In Fig. 7, the range of the SNR where both listening difficulty and noisiness scores were less than 2 is shown in red, and that where the scores were less than 2.5 is shown in blue. Note that when the platform BGN was 80 dB, no SNR gave scores below 2.0 or 2.5 for both items.
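The selection rule above can be sketched as a simple filter over mean ratings. The ratings below are hypothetical and chosen only to show the behaviour; they are not values from the study, and the function and variable names are illustrative.

```python
def appropriate_snrs(listening_difficulty, noisiness, threshold):
    """Return the SNRs (dB) whose mean ratings on both items are below threshold.

    Both arguments map an SNR in dB to the mean rating at one BGN level.
    """
    return sorted(
        snr
        for snr in listening_difficulty
        if snr in noisiness
        and listening_difficulty[snr] < threshold
        and noisiness[snr] < threshold
    )


# Hypothetical mean ratings at one BGN level (keys are SNR in dB).
difficulty = {0: 3.4, 3: 2.6, 6: 1.9, 9: 1.7, 12: 1.6}
noisy = {0: 1.3, 3: 1.6, 6: 1.9, 9: 2.3, 12: 2.9}

print(appropriate_snrs(difficulty, noisy, 2.0))  # [6]
print(appropriate_snrs(difficulty, noisy, 2.5))  # [6, 9]
```

Because listening difficulty falls and noisiness rises with SNR, the appropriate range is bounded on both sides, and it can be empty when the BGN level is high.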

Fig. 7 Appropriate SNR at (a) the ticket gate and (b) the platform

Figure 7 shows the appropriate SNR at each location. For BGN in the range of 60–70 dB at the ticket gate and on the platform, the SNR can be set higher on the platform than at the ticket gate by reproducing the speech signal at a higher sound pressure level. This is because the noisiness of announcements tends to be perceived more easily at the ticket gate than on the platform when the SNR is increased. Given that the effect of the gender of the voice on the evaluated value is small, it is thought that the frequency characteristics of the BGN at the ticket gate may affect the perceived noisiness there. It is difficult to determine a single appropriate announcement level for all BGN levels both at the ticket gate and on the platform. However, since the range of BGN likely to occur at a station can be predicted from the number of passengers per day for the station and the speed of the entering trains [42], the appropriate announcement level for a station can be considered by selecting the BGN level and using Fig. 7 as a reference. When the BGN level is over 70 dB at the ticket gate and 75 dB on the platform, the range of appropriate SNR is quite limited. It is therefore important to promote the introduction of sound-absorbing materials and other improvements that prevent the noise level at each location from becoming too high, and to give this priority over the consideration of announcement levels. In fact, various existing studies have shown how building materials and sound absorption methods can be chosen to shorten the reverberation time and reduce the ambient noise level at railway stations [43,44,45,46]. It is also effective to maintain an appropriate SNR by considering the spatial relationship between the target area and the loudspeakers [47].

Focusing on the experimental results for speech rate, Figs. 5c and 6c show that the evaluated values of unnaturalness vary almost linearly with speech rate around 6.5 mora/s. The tendency for the participants to perceive the announcements on the platform as slightly faster than those at the ticket gate at 7.5 mora/s is presumably because they wanted to listen to the announcements more carefully in the platform situation.

Previous studies [1, 2] have reported that the appropriate SNR in station concourses should be around +8 dB for absorptive ceilings and more than +13 dB for reflective ceilings. Although the acoustic characteristics of the space were not taken into account in this experiment, a comparison of the results for the concourse and a ticket gate close to that location suggests that there is no significant difference in the appropriate SNR between the synthesized and natural voices. As for the speech rate, the scores for listening difficulty and strangeness were generally similar to each other, so that similar trends were observed for the appropriate speech rate between the synthetic and the natural voices. The evaluation value for noisiness was slightly higher in this experiment, but this may be because, unlike previous studies using loudspeakers, the voice was played from headphones, and in particular, as mentioned above, the ticket gate is a place where the noisiness of an announcement tends to be easily perceived.

In this study, the threshold of the evaluation value was set at 2 or 2.5, but the threshold can be changed according to the situation. For example, for emergency announcements of important content, an SNR that is slightly noisy is acceptable, as the emphasis is on ease of hearing. Additionally, the results of this experiment provide the appropriate announcement levels for people with normal hearing, and it is also necessary to take into account, for example, the elderly population, who are more likely to suffer from hearing loss. Figs. 5b and 6b show no direct relationship between speech rate and noisiness, so the speech rate can be changed to suit the broadcast environment. It should also be noted that since this experiment was conducted with stationary listeners, the appropriate range of speech rate may change in places where people are often walking, such as at an actual ticket gate. From the above, assuming that an appropriate SNR can be set, the standard speech rate of an information broadcast at a railway station should be 7.0 mora/s at the ticket gate and 6.5 mora/s on the platform to minimize the listening difficulty when stationary and to avoid sounding unnatural.
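As a rough illustration of what the recommended rates imply for announcement length, the sketch below converts a mora count into a speaking time. The 91-mora announcement is a hypothetical example, pauses between phrases are ignored, and the names are illustrative.

```python
# Standard speech rates suggested in the text (mora per second).
RECOMMENDED_RATE_MORA_PER_S = {"ticket gate": 7.0, "platform": 6.5}


def announcement_duration_s(n_mora, location):
    """Speaking time in seconds for n_mora morae at the rate for a location.

    Pauses are ignored, so real announcements would be somewhat longer.
    """
    return n_mora / RECOMMENDED_RATE_MORA_PER_S[location]


# Hypothetical announcement of 91 morae.
for place in RECOMMENDED_RATE_MORA_PER_S:
    print(place, announcement_duration_s(91, place))
# ticket gate 13.0
# platform 14.0
```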

This study was conducted without considering the characteristics of the speech propagation environment. When applying the current experimental results to an actual station environment, the SNR and speech rate need to be set in consideration of the directivity of the loudspeakers and the reverberation time of the space. A previous study considering differences in reverberation time (with and without sound-absorbing material) in station buildings [1] reported that a reflective environment requires an SNR about 5 dB higher than a sound-absorbing environment. With regard to speech rate, the results of the present study are similar to those of a previous study [2] and are therefore considered to be independent of the presence or absence of sound-absorbing material. In this study, we also recorded BGN at the underground platforms, but the noise levels were lower than those at the above-ground platforms due to the large amount of sound-absorbing material installed on the ceilings of the underground platforms. Therefore, only the BGN of the above-ground platforms was used in this experiment.

5 Conclusion

This study attempts to determine an appropriate SNR and speech rate for synthetic voice announcements at railway stations.

The following results were obtained. First, the appropriate SNR varied depending on the broadcast location and BGN level. Second, increasing the SNR when the BGN level was high did not reduce the listening difficulty of the announcement. Finally, it was confirmed that standards for speech rate can be set depending on the broadcast location and situation, and that the appropriate SNR and speech rate show similar trends between synthesized and natural voices.

Synthetic speech can contribute to improving information transmission in announcements by setting an appropriate SNR and speech rate for each location, in the same way as conventional human voices. Future work will focus on finding a method for improving intelligibility independent of the SNR by applying voice processing to announcements, and on addressing the same questions while also considering the acoustic characteristics of the space being examined.