Introduction

Speech research often requires detailed annotation of audio files to support statistical analyses of the information conveyed by speech sounds. Annotation tasks include aligning phoneme, syllable, word, and phrase boundaries so that researchers can extract phonetic and phonological information, such as the duration, pitch, and intensity of specific regions of speech. Manual annotation of audio files is undoubtedly a tedious and resource-consuming task: researchers need to listen to the audio repeatedly to label and align the boundaries by hand, and a 1-min audio file may require hours of annotation work. Moreover, multiple annotators need to work on the same audio and compare inter-annotator agreement to guard against human annotation errors.

Fortunately, with the rapid development of speech recognition technology since the 1970s, forced alignment, a technique for aligning audio files with their transcripts and creating auto-generated annotations, has evolved significantly, making automatic annotation both possible and efficient. According to the collection compiled by Pettarin (2018), there are at least 15 open-source forced-alignment programs. Research has reported highly positive feedback on the performance of forced alignment tools, especially on phoneme boundaries (DiCanio et al., 2013; Hosom, 2009; Goldman, 2011; Gorman et al., 2011; Yuan and Liberman, 2015; Yuan et al., 2018; Yuan et al., 2014; McAuliffe et al., 2017; Sella, 2018; MacKenzie and Turton, 2020; Mahr et al., 2021; McAuliffe, 2021; Liu and Sóskuthy, 2022).

Given the easy accessibility of forced aligners in recent years, forced alignment has drawn increasing attention from speech researchers, resulting in growing applications of forced aligners in speech research. In this study, we evaluated forced alignment from the perspective of prosodic research on tonal languages, a perspective that has barely been discussed in the literature. Using a forced aligner for Mandarin prosodic analysis, we aim to provide an example of how to use forced alignment in speech research that investigates suprasegmental-level sound information, and of how to measure the performance of forced aligners when applied to a tonal language such as Mandarin Chinese.

The tool and input

The purpose of this study was to evaluate automatic alignment performance for speech prosody research. We chose the Montreal Forced Aligner as the forced alignment tool, with speech data collected from native adult speakers of Mandarin Chinese as the input.

Montreal Forced Aligner (MFA)

MFA is an alignment tool with high accessibility and compatibility (McAuliffe et al., 2017). It is open-source software with prebuilt executables for both Windows and macOS. MFA provides pre-trained models for some languages and can be trained on other languages that are not covered by these models. As an update of the Prosodylab-Aligner (Gorman et al., 2011), MFA uses more advanced acoustic models based on triphones and is built on the Kaldi toolkit, offering “advantages over the HTK toolkit underlying most existing aligners” (McAuliffe et al., 2017).

Recent studies by Mahr et al. (2021) and Liu and Sóskuthy (2022) both reported highly positive evaluation results for MFA. Mahr et al. (2021) evaluated the performance of popular forced alignment algorithms on children’s American English speech samples and reported that the aligners’ accuracy ranged from 67 to 86% relative to gold-standard human annotation. Among the five forced aligners they considered, MFA with speaker-adaptive training (McAuliffe et al., 2017) was the most accurate aligner across speech sound classes, with an average percentage accuracy of 86%. Liu and Sóskuthy (2022) evaluated the performance of MFA on four Chinese varieties (Cantonese, Shanghai, Beijing, and Tianjin) and also found strong agreement between human- and machine-aligned phone boundaries, with a median onset displacement of 17 ms and little variation across the varieties.

Based on reported evaluation results, MFA showed excellent performance in aligning word and phone boundaries in various speech datasets (McAuliffe et al., 2017; McAuliffe, 2021) and produced higher quality alignments compared to other forced aligners (Gonzalez et al., 2020; Mahr et al., 2021).

The speech dataset

The dataset for the evaluation of MFA was established in the following way. A total of 33 adult speakers of Mandarin Chinese, aged 18 years or older, were recruited through social media and contributed speech samples. Fifteen of them (eight females and seven males) recorded their speech on their phones or laptops and submitted the recordings by email; the other 18 (nine females and nine males) recorded their speech in a sound-proof recording booth with professional recording equipment. Speech samples were recorded based on given target sentences. Fifty-six target sentences containing wh-words (e.g., shénme ‘what’) were included in the stimuli, with the number of syllables per sentence ranging from 7 to 20. The target sentences varied in the position of the wh-word (i.e., subject or object) and the structure of the sentence (i.e., simple transitive sentences, simple ditransitive sentences, sentences with conditional clauses, and sentences with their subject left-dislocated), as shown in Table 1.

Table 1 Examples of given target sentences.

There were 24 unique sentences in Group 1, 8 unique sentences in Group 2, 8 unique sentences in Group 3, and 16 unique sentences in Group 4. Each target sentence was potentially ambiguous (Table 1), and participants were provided with two or four possible interpretations along with the target sentences. For each given meaning of the target sentence, participants were required to read out the target sentence if they thought it could be used to express that meaning. Otherwise, they could choose to say, “I do not think the target sentence can be used to express the given meaning.” Participants were allowed to update their recording sample of a sentence anytime by re-recording a sentence before submitting the audio files.

After data collection, to meet the audio format requirements of MFA, we converted any audio files that were in a different format into WAV files using Praat (Boersma and Weenink, 2021). Since Kaldi (the toolkit that MFA is built on) can only process 16-bit files, MFA by default converts audio files with bit depths higher than 16-bit into 16-bit. MFA supports both mono and stereo files, and by default it resamples files with a sampling rate higher or lower than 16 kHz to 16 kHz during feature generation; therefore, we did not take extra steps to normalize the sampling rates, bit depths, and channels of the audio files.
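As a sanity check before running MFA, the relevant WAV properties can be inspected with Python’s standard library. The sketch below uses function names of our own; the 16-bit and 16 kHz targets are the MFA defaults described above, and MFA performs these conversions itself.

```python
import wave

def wav_properties(path):
    """Return (channels, sample_width_bytes, sample_rate) of a WAV file."""
    with wave.open(path, "rb") as w:
        return w.getnchannels(), w.getsampwidth(), w.getframerate()

def needs_attention(path):
    """List properties that MFA will convert on the fly (not 16-bit / not 16 kHz).

    MFA handles these conversions itself; this check is only useful for
    logging what the corpus looks like before alignment.
    """
    channels, width, rate = wav_properties(path)
    notes = []
    if width != 2:  # sample width in bytes; 2 bytes = 16-bit
        notes.append(f"{8 * width}-bit (MFA converts to 16-bit)")
    if rate != 16000:
        notes.append(f"{rate} Hz (MFA resamples to 16 kHz)")
    return notes
```

Running `needs_attention` over a corpus folder gives a quick inventory of which files MFA will transparently convert.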

We then trimmed the audio files so that only the speech on the target sentences of the experiment was kept. When there were multiple recordings of a target sentence, we kept only the last recording of that sentence in the dataset. After splitting the long sound files into separate audio files by sentence, 128 recorded audio files per speaker were included in the dataset: 48 audio files from Group 1, 16 from Group 2, and 32 from each of Group 3 and Group 4. One speaker’s submission (Speaker 5) was not included in the dataset due to incomplete recordings. In total, we collected 4096 audio files from the remaining 32 speakers, with individual durations ranging from 1 s to 4 s, resulting in a 216-min-long speech dataset.

We further coded the audio files to identify those that did not match the given target sentences, for example, because of missing, repeated, paraphrased, or added words. Each audio file was coded with a label corresponding to the target sentence and the specified reading; otherwise, it was labeled “null” if the speaker said “I do not think the target sentence can be used to express the given meaning,” or “misread” if the sentence that the speaker recorded did not match the target sentence. Audio files with “null” or “misread” labels were excluded when generating and evaluating the MFA alignment results.

Although it is not spontaneous speech data collected in a natural conversation environment, this dataset has three main advantages. First, it contains minimal pairs of naturally occurring ambiguous sentences that are prosodically distinguished. Take (1), a sentence from Group 3, as an example.

(1)

Zhōng.guó-duì  shuí  yě  dǎ-bù-guò

Chinese-team  who  also  beat-not-complement

a. wh-indefinite:

 (i) ‘The Chinese team can’t beat anyone.’ or

 (ii) ‘No one can beat the Chinese team.’

b. wh-interrogative:

 (iii) ‘Who is the team that the Chinese team also can’t beat?’ or

 (iv) ‘Who is the team that also can’t beat the Chinese team?’

Sentences like (1) are often used by native speakers on social media to express (1a-i) when talking about the Chinese male soccer team losing all its games, or to express (1a-ii) when talking about the Chinese ping-pong team winning all its games. Wu and Yun (2022) found that sentences like (1) can also induce interrogative readings, as in (1b-iii) and (1b-iv). It is important to note that the various readings of (1) share the same underlying phonemic tones and syllables. However, these tones and syllables are realized with variations that distinguish the different readings. For example, the wh-word “shuí” is typically associated with a rising tone, yet it is commonly realized with a neutral tone or a tone that rises only slightly in pitch when it functions as a wh-indefinite, as in (1a-i) and (1a-ii). Similarly, the adverb “yě” typically has a tone that falls before rising in pitch, yet in wh-indefinite contexts it is often produced with a shorter duration and a tone that only partially falls before rising. The four readings of (1) can also be distinguished by the length of the pause between the noun phrase and the wh-word, and between the wh-word and the adverb. Figure 1 displays an example of how the four meanings of (1) may be phonetically and prosodically different.

Fig. 1

Example of the prosodic differences on the four meanings of (1).

Since the ambiguous sentences share the same underlying phonemic tones for different meanings (i.e., same transcript and same dictionary input to MFA) but can nevertheless be realized with different phonetics depending on the meaning, the data provide an excellent opportunity to evaluate how well MFA can handle some departure in production from the expected dictionary pronunciation of these words.

Second, the dataset contains sentence-level speech production with words that have a neutral tone and er-suffixation and thus can help evaluate whether the forced aligner can correctly align sentences with various word types. The stimuli included words with neutral tones, such as the perfective marker le, in addition to words with the canonical four tones in Mandarin. In most cases, one character corresponds to one syllable, but er-suffixation makes two characters correspond to one syllable. Words with neutral tones and er-suffixation are often less salient in prosody and may make syllable boundaries less clear, which poses challenges for forced alignment.

Third, it is a balanced dataset with audio files from both lab recordings and non-lab recordings, and a similar number of audio files from female and male speakers. Thus, it provides a way to evaluate whether audio input quality and speaker gender affect the alignment results.

Evaluations

Using the audio data introduced in section “The tool and input” as input, we conducted forced alignment using MFA at two different levels: syllable-by-syllable alignment and phrase-by-phrase alignment. Then we assessed each alignment result to evaluate the performance of MFA in terms of annotation accuracy and annotation speed.

Syllable-by-syllable alignment

Accuracy

MFA, like many other alignment tools, takes audio files (.wav), their accompanying transcript files, and a pronunciation dictionary (.txt) as input. Since there was no ready-to-use MFA pronunciation dictionary for Mandarin in the MFA version 1.0 repository, we followed the format of the Mandarin dictionary example in the MFA tutorial guide and created a pinyin pronunciation dictionary for the alignment task (Table 2). Considering that one Chinese character corresponds to one syllable, we used the Xiàndài Hànyǔ Chángyòngzì Biǎo ‘3500 commonly used Chinese characters’, published by the Ministry of Education in China in 1988, to create the dictionary. Table 2 presents examples of the entries in the dictionary. The first column is the syllable’s pinyin, and the second column contains the phonemes, with tones attached to the nuclear vowels. The tones are represented by the numbers 1, 2, 3, and 4, denoting a high-level tone (a.k.a. the first tone), a rising tone (the second tone), a fall-rise tone (the third tone), and a falling tone (the fourth tone), respectively. For the readers’ information, Table 2 also provides one of the possible meanings associated with each pinyin pronunciation.

Table 2 Sample of the syllable-level dictionary.
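Dictionary entries of this kind can also be generated programmatically. The sketch below illustrates the format only: the initials list is a partial, hypothetical sample of the pinyin inventory, the initial/final split is a simplification of how phonemes might be assigned, and the trailing digit (with 5 assumed for the neutral tone) is attached to the tone-bearing final as described above.

```python
# Minimal sketch of generating syllable-level dictionary entries:
# pinyin-with-tone on the left, phonemes on the right with the tone
# digit attached to the vowel-bearing final.
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "z", "c", "s", "r", "y", "w")

def dictionary_entry(pinyin_with_tone):
    """e.g. 'shui2' -> 'shui2\tsh ui2' (tab-separated dictionary line)."""
    syllable, tone = pinyin_with_tone[:-1], pinyin_with_tone[-1]
    assert tone in "12345", "expect a trailing tone digit (5 = neutral tone)"
    # longest-match on the initial; whatever remains is the tone-bearing final
    initial = next((i for i in sorted(INITIALS, key=len, reverse=True)
                    if syllable.startswith(i)), "")
    final = syllable[len(initial):]
    phones = (f"{initial} " if initial else "") + f"{final}{tone}"
    return f"{pinyin_with_tone}\t{phones}"
```

Iterating this function over the 3500-character list would yield a dictionary file in the two-column layout of Table 2.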

The transcripts were also presented in pinyin, with spaces between each syllable. Table 3 presents a sample of the transcript of a Group 3 type sentence that we used in this experiment. For the readers’ information, we also provided one of the possible meanings associated with the transcript in Table 3.

Table 3 Sample of the syllable-level transcript.

The output of MFA alignment is annotation files (.textgrid), where phonemes and syllables are time-aligned. We used the pinyin system in both input and output: Chinese characters were romanized and accompanied by a number to indicate their tones. Figure 2 displays an audio file with its MFA annotation. When generating MFA alignment output, audio files with “null” (i.e., when the speaker stated that “I do not think the target sentence can be used to express the given meaning”) and “misread” (i.e., when the speaker recorded a sentence that was different from the given stimuli) were not included.

Fig. 2

Example of syllable-level annotation by MFA.
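For readers unfamiliar with the output format, a one-tier .textgrid file in Praat’s long text format can be rendered as in the following sketch. The tier layout is illustrative only; real projects are better served by an established TextGrid library such as the third-party praatio or textgrid packages.

```python
def write_textgrid(intervals, xmax, tier_name="words"):
    """Render a one-tier Praat TextGrid (long text format) as a string.

    `intervals` is a list of (xmin, xmax, label) tuples covering [0, xmax].
    """
    lines = ['File type = "ooTextFile"',
             'Object class = "TextGrid"',
             '',
             'xmin = 0',
             f'xmax = {xmax}',
             'tiers? <exists>',
             'size = 1',
             'item []:',
             '    item [1]:',
             '        class = "IntervalTier"',
             f'        name = "{tier_name}"',
             '        xmin = 0',
             f'        xmax = {xmax}',
             f'        intervals: size = {len(intervals)}']
    for i, (lo, hi, label) in enumerate(intervals, 1):
        lines += [f'        intervals [{i}]:',
                  f'            xmin = {lo}',
                  f'            xmax = {hi}',
                  f'            text = "{label}"']
    return "\n".join(lines) + "\n"
```

The resulting string can be saved with a .TextGrid extension and opened directly in Praat alongside the audio.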

The human alignment output was created on top of the MFA alignment by four well-trained human annotators (also authors of this article), who reviewed the boundaries produced by MFA (i.e., the boundaries on the “words” tier in Fig. 2) and adjusted boundaries and interval labels wherever the MFA-generated boundaries did not match the intended syllables.

We randomly selected 1120 audio files, roughly 27.34% of all collected data, and evaluated the syllable alignment results that MFA generated. Before calculating the accuracy, we first compared the number and the labels of intervals in the human alignment and the MFA alignment for each audio file. If there was any mismatch in number or labels between the human alignment and the MFA alignment for the same audio file, the entire TextGrid file was excluded from the final comparison. 130 audio files were excluded by this criterion; in most cases, the MFA results had more intervals due to spurious pauses (silent intervals).
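The exclusion criterion can be expressed as a small check over interval tiers; the tuple representation of intervals here is our own convention for illustration.

```python
def comparable(human_tier, mfa_tier):
    """True if the two tiers have the same number of intervals with the
    same labels in the same order; intervals are (xmin, xmax, label)."""
    if len(human_tier) != len(mfa_tier):
        return False  # e.g. MFA inserted a spurious silent interval
    return all(h[2] == m[2] for h, m in zip(human_tier, mfa_tier))

def filter_comparable(pairs):
    """Keep only (human_tier, mfa_tier) pairs that pass the criterion."""
    return [(h, m) for h, m in pairs if comparable(h, m)]
```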

In the end, the total number of syllable-level human-aligner pairs included in the comparisons is 10,487. To compare the MFA alignment output against the human alignment output, we measured the absolute syllable boundary-time difference for every interval. The comparison results are shown in Table 4. Each cell of the table indicates the group of sentences produced by a specific speaker (F: female, M: male). Lab speaker N means that the speaker is the Nth participant who attended the lab recordings; all other recordings were made on the speakers’ laptops or phones. In the average row, the values are presented in milliseconds (ms): the value outside the parentheses is the average absolute boundary difference and the value within the parentheses is the standard deviation (SD) for that specific group of data.

Table 4 Absolute syllable boundary-time differences.

As we can see from Table 4, MFA produced quite decent syllable-aligned results for Mandarin Chinese, with an average human-MFA alignment difference of 15.59 ms (SD = 30.41). The average human-MFA syllable-level alignment difference for a group of sentences by a specific speaker ranged from 2.94 ms to 28.58 ms. Although there were many MFA-generated boundaries that annotators did not need to adjust, there were also many boundaries where annotators needed to make significant adjustments, yielding high standard deviations in the results. To investigate the causes of the high standard deviations, we further examined the 197 data points where the absolute time difference was larger than 100 ms. We found that most of these outliers involved boundaries for the wh-word (shuí, ‘who’), the adverb (yě, ‘also’), the negation marker (méi, ‘not’), and words with third tones and neutral tones (e.g., zhě, wěi, gěi, me, le, de, yǐng) in Group 1 and Group 4. We discuss in detail why these words can pose challenges for forced alignment in section “Discussion”.
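The boundary comparison behind these figures reduces to per-interval absolute differences plus summary statistics, sketched below under the assumption that intervals are (xmin, xmax, label) tuples in seconds and that mismatched files have already been excluded.

```python
import statistics

def boundary_differences_ms(human_tier, mfa_tier):
    """Absolute onset differences in ms between corresponding intervals."""
    return [abs(h[0] - m[0]) * 1000.0 for h, m in zip(human_tier, mfa_tier)]

def summarize(diffs_ms, outlier_ms=100.0):
    """Mean, standard deviation, and count of large outliers."""
    return {"mean": statistics.mean(diffs_ms),
            "sd": statistics.stdev(diffs_ms),
            "outliers": sum(d > outlier_ms for d in diffs_ms)}
```

The `outliers` count corresponds to the >100 ms data points examined above.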

Moreover, following conventional practice in forced aligner evaluation, we chose 25 ms as a threshold to transform the numerical data into binary data. This binary labeling, “YES” for human-MFA absolute time differences below 25 ms and “NO” otherwise, allowed us to calculate agreement rates; we then conducted chi-square tests on the binary data to assess statistical significance. The 25 ms threshold proposed by McAuliffe et al. (2017) is the gold standard used to evaluate phone-onset human-aligner differences and has been employed in research on forced aligner evaluation, including Mahr et al. (2021). The evaluation revealed that 73.49% of human-MFA differences at the syllable level were within 25 ms of the gold standard, indicating a high level of agreement. This rate closely approximates the phone-level agreement rate (77% within 25 ms) reported by McAuliffe et al. (2017) and is considerably higher than the phone-level agreement rate (64% within 25 ms) reported by Mahr et al. (2021). The results of the current test show that MFA can produce decent syllable-level alignment for a tonal language like Mandarin, too. Furthermore, MFA is shown to correctly align audio files of ambiguous sentences that share the same underlying phonemic tones but differ in prosodic marking.
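The binarization, agreement rate, and 2 × 2 chi-square test can be sketched as follows. The chi-square here is the plain Pearson statistic without continuity correction, which may differ slightly from the exact procedure used in the study.

```python
def agreement_rate(diffs_ms, threshold_ms=25.0):
    """Share of human-MFA boundary differences below the threshold."""
    return sum(d < threshold_ms for d in diffs_ms) / len(diffs_ms)

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for the
    2x2 table [[a, b], [c, d]], e.g. rows = two recording conditions and
    columns = within/beyond 25 ms.  Compare against 3.84 (p = .05, 1 df)."""
    n = a + b + c + d
    observed = [(a, b), (c, d)]
    expected = [((a + b) * (a + c) / n, (a + b) * (b + d) / n),
                ((c + d) * (a + c) / n, (c + d) * (b + d) / n)]
    return sum((o - e) ** 2 / e
               for row_o, row_e in zip(observed, expected)
               for o, e in zip(row_o, row_e))
```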

Additionally, the average syllable boundary-time difference between humans and MFA for audio files recorded on phones or laptops (“local recordings”) was 17.02 ms (SD = 31.41). This is numerically higher than the average human-MFA time difference for audio files recorded in a professional recording booth (“lab recordings”), which was 13.80 ms (SD = 29.02). 71.31% of the human-MFA differences for local recordings were smaller than 25 ms, compared with 76.20% for lab recordings; a chi-square test showed a significant difference between local and lab recordings at the 25 ms threshold (p < 0.001). This alignment evaluation shows that audio input from phones or laptops yields decent MFA syllable-alignment accuracy but is still not as good as lab recordings, which is consistent with the findings of Sanker et al. (2021): local recordings are feasible and reliable for obtaining segment boundaries, but lab recordings are always better.

For all the recordings evaluated in this test, the overall average human-MFA difference for audio files from female speakers was 15.74 ms (SD = 32.29), with 73.66% of the differences smaller than 25 ms, and the overall average for male speakers was 15.48 ms (SD = 29.08), with 73.38% of the differences smaller than 25 ms; no significant effect of gender was observed (p > 0.05).

Efficiency

Previous research mainly focused on evaluating the alignment accuracy of forced aligners and rarely tested their efficiency. To estimate the MFA-aided syllable-level annotation efficiency, we timed one session of the syllable-level accuracy evaluation tasks to see how much syllable-level annotation work could be done with MFA in 30 min.

The timed assessment was conducted by two of the four expert human annotators who had been working on the accuracy evaluation. Human annotator X used a Lenovo Legion 5 15IMH05 laptop running Windows, with an Intel Core i5-10300H CPU @ 2.50 GHz and 8 GB RAM, while human annotator Y used a MacBook Pro running macOS, with a 1.4 GHz quad-core Intel Core i5 and 8 GB RAM. Both laptops had robust computational and processing capabilities and were connected to high-speed internet. Before the test, MFA was pre-installed on the annotators’ laptops, and the audio files, pronunciation dictionary, and transcripts were finalized and available in a shared online folder. The annotators were instructed to randomly select a set of audio files from the same speaker, which they had not yet evaluated, and to start the timer when copying them to their local working folder.

Human annotator X chose 32 audio files from Speaker 9, Group 4, three of which were labeled “null” or “misread.” After spending 6 min copying the required files to a local folder and activating MFA, this annotator generated TextGrid files for the 29 non-null/non-misread audio files, each about three seconds long and totaling 100 s, and spent 3 min logging annotation notes. In the remaining 21 min, this annotator completed annotations for 14 of the 29 audio files, finalizing the TextGrid files after manually adjusting the results generated by MFA. Human annotator Y selected 48 audio files from Speaker 9, Group 1, none of which were labeled “null” or “misread.” After spending 4 min copying the required files to a local folder and activating MFA, this annotator generated TextGrid files for all 48 audio files, each between two and three seconds long and totaling 131 s, and spent 3 min logging annotation notes. In the remaining 23 min, this annotator finalized annotations for 14 of the 48 audio files.

It is important to note that the results of this efficiency test depict a relatively optimal scenario, given that the two expert human annotators had prior experience with Praat, MFA, and the MFA-aided annotation workflow, and that the laptops used had sufficient computational and processing capabilities. Nevertheless, as we have seen, a significant amount of time and resources is still saved for speech researchers with the help of MFA. According to the current efficiency test, MFA-aided annotation took about 30 to 40 min to complete annotation with both phone-level and syllable-level information for a 1-min-long recording. In contrast, the literature reports that manual annotation at the phone level can take up to 13 h for a 1-min recording, roughly 800 times the duration of the audio (Goldman, 2011; Schiel et al., 2012: 111).

Phrase-by-phrase alignment

Prosodic research often focuses on phrase-level intonation and pitch contours; therefore, the annotation of audio files should include phrase boundaries. With the help of MFA, the tedious manual annotation work for prosodic research can be significantly simplified. Human annotators do not need to start by creating an empty annotation file (.textgrid) and drawing boundary lines for phrases one by one. Instead, they can take a syllable-level annotation like the one in Fig. 2 as a starting point and make two small adjustments: first, change selected syllable boundaries to phrase-level boundaries, and second, change the labels on the words tier to linguistic phrase labels that fit the prosodic research purpose, such as “subject,” “pause,” and “wh.” The final product of this adjustment process is shown in Fig. 3.
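The first adjustment step, merging MFA’s syllable intervals into phrase intervals, can be sketched as below; the grouping input (phrase label plus syllable count) is our own convention for illustration, and the phrase labels follow the research-specific scheme described above.

```python
def merge_into_phrases(syllable_tier, grouping):
    """Merge syllable intervals into phrase intervals.

    syllable_tier: list of (xmin, xmax, label) tuples in order;
    grouping: list of (phrase_label, n_syllables) covering the tier.
    """
    phrases, i = [], 0
    for label, n in grouping:
        chunk = syllable_tier[i:i + n]
        # phrase spans from the first syllable's onset to the last's offset
        phrases.append((chunk[0][0], chunk[-1][1], label))
        i += n
    assert i == len(syllable_tier), "grouping must cover every syllable"
    return phrases
```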

Fig. 3

Human annotation of phrase-level boundaries.

While the existence of syllable-level segmentation already simplifies the phrase-level annotation process, can it be further simplified? Can we make MFA generate phrase-level aligned boundaries directly? These questions motivated us to conduct another test with MFA to see whether it can generate phrase-level boundaries with decent alignment accuracy and efficiency.

The definition of phrases in prosody research depends on the research question and the target of analysis. For example, we can divide the sentence in (1) into three phrases: “pre-wh region,” “wh-word,” and “post-wh region” if the primary focus of the prosodic research is to compare the acoustic properties of the pre-wh region and the post-wh region. Alternatively, (1) can be divided into four phrases: “subject,” “wh,” “adverb,” and “verb-negation-complement” if the prosodic research intends to investigate the adverb and the verb phrase separately in the post-wh region. In this phrase-level alignment test, we used the latter phrasing format. We expect that MFA can exhibit similar phrase-level performance when the first phrasing format is used.

(1)

[pre-wh region]  [wh]  [post-wh region]

[subject]  [wh]  [adverb]  [verb-negation-complement]

Zhōng.guó-duì  shuí  yě  dǎ-bù-guò

Chinese-team  who  also  beat-not-complement

a. wh-indefinite:

 (i) ‘The Chinese team can’t beat anyone.’ or

 (ii) ‘No one can beat the Chinese team.’

b. wh-interrogative:

 (iii) ‘Who is the team that the Chinese team also can’t beat?’ or

 (iv) ‘Who is the team that also can’t beat the Chinese team?’

The format of transcripts and dictionaries

The Mandarin Chinese transcripts and pronunciation dictionary used in section “Syllable-by-syllable alignment” follow a character-by-character (i.e., syllable-by-syllable) fashion instead of a word-by-word fashion. This is due to a fundamental difference between English and Mandarin: there are no noticeable word boundaries in the Chinese writing system, while one Chinese character normally corresponds to one syllable.

To get phrase-by-phrase-aligned boundaries or at least word-by-word boundaries in Mandarin, we started to experiment with MFA by varying the transcript and dictionary input, testing different combinations on the audio files that were randomly selected from the dataset. For each combination, we tested audio files from Speaker 1, Speaker 8, and Speaker 12 across different groups of stimuli.

We first tried the combination of a phrase-level transcript (Table 5) and the syllable-level dictionary used in section “Syllable-by-syllable alignment” (Table 2).

Table 5 Sample of phrase-level transcripts.

Phrase-level transcripts were created manually according to the annotation purposes of our prosody project. For example, since the project aims to analyze the prosodic features of wh-phrases, an ideal annotation would treat a complex wh-phrase like nǎ-gè-duì ‘which/any team’ as one unit. As shown in Table 5, the phrase-level transcripts have a space between every two phrases. However, this transcript and dictionary combination did not produce proper alignments. Figure 4 presents the test results of this combination on an audio file from Speaker 12, Group 2, which corresponds to the sentence meaning ‘Wangxin did not send anything to Xufang the day before yesterday’.

Fig. 4

MFA alignment results of a phrase-level transcript and the big-set syllable-level dictionary.

Note that the syllable-level dictionary used in section “Syllable-by-syllable alignment” was generated from the 3500 Commonly Used Chinese Characters list. As the MFA tutorial guide also mentions a way to generate a dictionary from a transcript, we wondered whether a dictionary generated from a small training set would produce different phrase alignment results. We therefore generated a syllable-level dictionary based on the syllable-level transcripts and refer to it as the “small-set” syllable-level dictionary, to distinguish it from the “big-set” syllable-level dictionary (i.e., the one used in section “Syllable-by-syllable alignment”). The two kinds of syllable-level dictionaries have the same format as in Table 2 and differ only in size. However, the combination of the “small-set” syllable-level dictionary and the phrase-level transcript (Table 5) did not produce satisfactory results either. Figure 5 presents the test results of this combination on the same audio file used to generate Fig. 4. The unsatisfactory alignment results shown in both Figs. 4 and 5 suggest that syllable-level pronunciation dictionaries are not suitable for phrase-level alignment. Therefore, we switched to phrase-level dictionaries in the following trials.
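Deriving a small-set dictionary from the transcripts amounts to collecting the syllables that actually occur and looking up their pronunciations. This sketch assumes the big-set dictionary is available as a mapping; it illustrates the idea rather than the exact MFA procedure.

```python
def small_set_dictionary(transcripts, big_dictionary):
    """Build a small-set dictionary covering only attested syllables.

    transcripts: iterable of space-separated pinyin strings;
    big_dictionary: {pinyin_with_tone: phoneme_string}.
    """
    used = {syll for line in transcripts for syll in line.split()}
    return {syll: big_dictionary[syll] for syll in sorted(used)}
```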

Fig. 5

MFA alignment results of a phrase-level transcript and the small-set syllable-level dictionary.

In the next trials, we tested the combinations of phrase-level dictionaries (Table 6) and phrase-level transcripts (Table 5). When creating the phrase-level dictionaries, we explored two ways: one with each phoneme separated in the description of pronunciation and another with each syllable separated, as in the first row and second row of Table 6, respectively.

Table 6 Sample of entries in the phrase-level dictionaries.
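The two dictionary styles can be built mechanically from a syllable-level dictionary, as in the sketch below; the hyphenated phrase key and the helper name are assumptions for illustration, not the exact entries used in the study.

```python
def phrase_entries(phrase_syllables, syllable_dictionary):
    """Build both phrase-dictionary styles for one phrase.

    phrase_syllables: pinyin syllables in order, e.g. ['na3', 'ge4', 'dui4'];
    syllable_dictionary: {pinyin_with_tone: phoneme_string}.
    """
    key = "-".join(phrase_syllables)
    # style 1: concatenate the syllables' phoneme strings
    phoneme_style = " ".join(syllable_dictionary[s] for s in phrase_syllables)
    # style 2: treat each whole syllable as one phone-like unit
    syllable_style = " ".join(phrase_syllables)
    return {"phoneme-by-phoneme": f"{key}\t{phoneme_style}",
            "syllable-by-syllable": f"{key}\t{syllable_style}"}
```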

Figures 6 and 7 present the test results of these two dictionaries combined with a phrase-level transcript on the same audio file that was used to generate Figs. 4 and 5. The combination of the phrase-level transcript and the phrase-level dictionary with phoneme-by-phoneme pronunciation seems to give the best alignment results among all the trials, as in Fig. 6. Therefore, we decided to further explore the potential of using this combination as the input to MFA for generating phrase-level boundaries for prosodic research.

Fig. 6

MFA alignment results of a phrase-level transcript and a phrase-level dictionary (a phoneme-by-phoneme pronunciation).

Fig. 7

MFA alignment results of a phrase-level transcript and a phrase-level dictionary (a syllable-by-syllable pronunciation).

Accuracy

As in the assessment described in section “Accuracy”, we took audio files (.wav), their accompanying transcripts, and pronunciation dictionaries (.txt) as input, and excluded audio files with “null” or “misread” labels. The only difference is that the transcripts and pronunciation dictionaries are both at the phrase level, in the same fashion as the first row of Table 6. Figure 6 is an example of MFA phrase-level annotation results shown in Praat.

We used the same 1120 audio files used in the syllable-level evaluation and evaluated the phrase alignment results that MFA generated, roughly 27.34% of all collected data. Human annotation results were created based on MFA annotation by well-trained human annotators reviewing the boundaries produced by MFA (i.e., boundaries on the “words” tier in Fig. 6) and adjusting boundaries and interval labels if the MFA-generated boundaries did not match the phrases as intended.

We applied the same data trimming process and criteria to the phrase-level annotation results as to the syllable-level annotation results. 354 audio files had at least one mismatch in the number or labels of intervals between the human alignment and the MFA alignment; in most cases, the MFA and human results differed in the number of silent intervals.

After excluding the files with mismatched sizes or labels, the total number of phrase-level human-MFA pairs included in the comparisons is 3944. The focus of comparison was the absolute phrase boundary-time difference for each phrase between the MFA alignment and the human annotation. The comparison results are shown in Table 7, which follows the same notational conventions as Table 4.

Table 7 Absolute phrase boundary-time differences.

As we can see from Table 7, MFA produced quite decent phrase-aligned results for Mandarin Chinese, with an average human-MFA alignment difference of 22.49 ms (SD = 38.39). The average human-MFA phrase-level alignment difference for a group of sentences by a specific speaker ranged from 1.33 ms to 282.11 ms. In line with our observations at the syllable level, the annotation process revealed instances where the phrase boundaries generated by MFA did not require any adjustment by the annotators, but also a portion of instances where significant adjustments were necessary, resulting in high standard deviations and variation in the average time differences. To understand the reasons for the high standard deviations and variation, we analyzed the 106 data points where the absolute phrase boundary-time difference exceeded 100 ms. Six of these, in Speaker9_Group4_M, had extremely high human-MFA time differences of up to 300–700 ms; we checked them manually and attributed them to an apparent MFA error. The remaining outliers were primarily related to phrase boundaries involving neutral tones and tone sandhi in the Group 1 and Group 4 stimuli. Such words can be challenging for forced alignment because they often show subtle variations in pronunciation and timing depending on the surrounding context. Additionally, they may carry different prosodic properties, such as pitch and intonation, which can further complicate the alignment process. We provide an in-depth discussion of this matter in section “Discussion”.

To determine the level of agreement between MFA and human annotators at the phrase level, we followed the same data processing procedure as for the syllable-level evaluation and transformed the time differences into binary data using a 25 ms threshold. We found that 65.57% of the human-MFA differences at the phrase level were within the 25 ms gold standard. It is important to note that phrase boundaries are inherently more complex than segment boundaries, as they may involve suprasegmental features such as intonation and pitch variation. As such, we cannot directly compare the overall average phrase alignment accuracy to the overall average syllable alignment accuracy reported in this study and previous studies; rather, we can reasonably expect lower human-MFA agreement at the phrase level than at the syllable or phone level. Nevertheless, the observed 65.57% agreement rate at the phrase level is respectable, especially when compared to the phone-level agreement rate (64% within 25 ms) reported by Mahr et al. (2021). Moreover, using MFA still provides great help, because creating all the phrase boundaries manually would take a great amount of time and effort. Instead of starting from scratch, having MFA create the phrase boundaries automatically and then finding and fixing any errors greatly enhances annotation efficiency, as illustrated in section “Efficiency”.
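The 25 ms binarization is a simple thresholding of the per-boundary absolute differences, with the agreement rate being the share of differences smaller than the threshold. A minimal sketch; the `diffs` values are illustrative, not corpus data:

```python
def agreement_rate(diffs_ms, threshold_ms=25.0):
    """Fraction of absolute boundary-time differences smaller than the threshold."""
    hits = sum(d < threshold_ms for d in diffs_ms)
    return hits / len(diffs_ms)

# Hypothetical absolute differences (ms); three of six fall under 25 ms.
diffs = [7.0, 14.5, 2.0, 90.0, 25.0, 31.2]
print(agreement_rate(diffs))  # prints: 0.5
```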

Additionally, we found an influence of recording quality on the accuracy of the phrase-by-phrase alignment. The overall average phrase boundary-time difference for local recordings was 25.86 ms (SD = 42.48), with 57.48% of the differences smaller than 25 ms; for lab recordings it was 16.86 ms (SD = 38.87), with 74.16% of the differences smaller than 25 ms. A chi-square test at the 25 ms threshold showed that the local recordings had significantly lower phrase alignment accuracy than the lab recordings (p < 0.001). The phrase-by-phrase alignment analysis thus presents a sharper contrast between lab and local recordings than the syllable-level alignment analysis. This finding further emphasizes that a more professional recording environment is preferable when researching prosodic phrase boundaries.
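The significance test here is a Pearson chi-square on a 2 × 2 table of counts (within vs. beyond the 25 ms threshold, crossed with recording condition). In practice one would call `scipy.stats.chi2_contingency`; the standard-library sketch below implements the same statistic without a continuity correction. The counts are hypothetical, chosen only to mirror the reported proportions, not our actual cell counts:

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-square (1 df, no continuity correction) for the 2x2 table
    [[a, b], [c, d]]; returns (statistic, p_value)."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    stat = 0.0
    for obs, row, col in ((a, row1, col1), (b, row1, col2),
                          (c, row2, col1), (d, row2, col2)):
        exp = row * col / n          # expected count from the marginals
        stat += (obs - exp) ** 2 / exp
    # Survival function of chi-square with 1 df: P(X >= stat) = erfc(sqrt(stat/2))
    return stat, math.erfc(math.sqrt(stat / 2))

# Hypothetical counts: (within 25 ms, beyond 25 ms) for local, then lab recordings.
stat, p = chi2_2x2(115, 85, 148, 52)
print(stat > 10.83, p < 0.001)  # 10.83 is the 1-df critical value for p = .001
```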

Among all audio files evaluated in this test, the overall average human-MFA difference for female speakers was 17.64 ms (SD = 32.31), with 71.95% of the differences smaller than 25 ms, and the overall average for male speakers was 24.13 ms (SD = 41.85), with 61.20% of the differences smaller than 25 ms. A significant effect of speaker gender on MFA accuracy was observed (p < 0.001). This association between speaker gender and alignment accuracy indicates that phrase-level boundary alignment is noticeably more difficult for male speech. Since prosodic boundaries are sensitive to suprasegmental features such as pitch and intensity (Wagner and Watson, 2010), we conjecture that detecting prosodic boundaries is challenging when the voice pitch is low.

Efficiency

To evaluate MFA-aided phrase-level annotation efficiency, we timed one session of the phrase-level accuracy evaluation tasks to see how much phrase-level annotation work could be completed with MFA assistance in 30 min, using phrase-level transcripts and dictionaries.

Two well-trained human annotators participated in this timed evaluation: the same annotators, using the same laptops, who took part in the time evaluation for syllable alignment in section “Efficiency”. As in that evaluation, MFA and the required files were prepared before the efficiency test. The annotators were asked to choose a set of audio files from the same speaker and to start the timer once they began copying the selected audio files to the working folder.

Human annotator X selected the recordings from Lab Speaker 6, Group 3, where 32 audio files were in the folder, 13 of them labeled “null” or “misread.” This annotator spent 8 min copying the required files to a local folder, activating MFA, and using MFA to generate TextGrid files for the 19 non-null, non-misread audio files (54 s in total, with each file about 3 s long), plus 3 min logging annotation notes. During the remaining 19 min, this annotator completed annotations for 14 of the 19 audio files; that is, the TextGrid files for these 14 files were finalized after the annotator’s manual adjustments to the MFA-generated results. Human annotator Y selected the recordings from Lab Speaker 6, Group 2, where 16 audio files were in the folder and none were labeled “null” or “misread.” This annotator spent 5 min copying the required files, activating MFA, and generating TextGrid files for all 16 audio files (50 s in total, with each file about 3 s long), plus 3 min logging annotation notes. During the remaining 22 min, this annotator completed annotations for 12 of the 16 audio files.

Although we cannot make a direct efficiency comparison between syllable alignment and phrase alignment, both human annotators reported that a phrase-level dictionary and transcript improved annotation efficiency more than a syllable-level one for a prosody project requiring phrase-by-phrase annotation. The annotators’ feedback makes us optimistic about applying MFA with a phrase-level transcript and dictionary in Mandarin prosody research.

Discussion

We manually evaluated the MFA-generated syllable-level and phrase-level results to compare the human-MFA annotation differences, as reported in section “Syllable-by-syllable alignment”. While the results suggest that using MFA makes the annotation task much more effective and efficient, MFA showed less satisfactory performance when complex or non-salient acoustic features were involved.

The most common errors in MFA’s syllable-level alignment occurred for syllables with a third tone, such as zhě (‘person’), wěi (a given name element), nǎ (‘which’), yě (‘also’), gěi (‘for’), and diǎn (‘a little’), or with a neutral tone, such as le (an aspect marker in Mandarin) and de (a complement marker). This finding is compatible with the distinctive acoustic features of third tones and neutral tones. Among the four basic tones in Mandarin, the third tone differs from the other three in that it is internally a falling-rising contour. Moreover, the third tone is often realized as a rising tone before another third-tone syllable, a tone change process called “third tone sandhi.” The acoustic characteristics of the third tone and tone sandhi are considerably complex, making alignment less accurate for these syllables than for others. The defining characteristic of a neutral tone is that it is “de-focused”: the speaker places no extra stress on the syllable, giving it no more prominence than the preceding syllable. Neutral tones are also often shorter than the other tones. These acoustic features of neutral tones (i.e., short and non-salient) are likely the primary reason why MFA often misaligns the boundaries of syllables with neutral tones.

The most common errors in MFA’s phrase-level alignment occurred when phrases contained function words, i.e., words serving a grammatical purpose, such as classifiers (e.g., wèi, gè), negation markers (e.g., méi, bù), adverbs (e.g., yě), determiners (e.g., diǎn), prepositions (e.g., gěi), aspect markers (e.g., guò), and adverbial clause markers (e.g., rúguǒ, dehuà). In Mandarin, frequently used function words with regular tones often have reduced pronunciations in speech production. Monosyllabic function words, such as yě ‘also’, consist of a single syllable that is often unstressed and phonetically reduced. Similarly, the second syllable of disyllabic function words, such as rúguǒ ‘if’, is often pronounced with reduced stress and phonetic prominence (Lai et al., 2010; Yang, 2010; Třísková, 2016). Syllables with neutral tones and phrases with function words share the same acoustic feature: being de-stressed. Therefore, MFA performs less well on these de-stressed portions in both syllable-level and phrase-level alignment. This finding calls for improvements to forced aligners, including MFA, in aligning de-stressed speech.

Pronunciation variation and er-suffixation (also known as “Erhua” in the literature) pose two further challenges for forced alignment at both the syllable and phrase levels. As an example of pronunciation variation, the wh-words shuí (‘who’) and nǎ (‘which’) are sometimes pronounced as sheí and něi in speech production due to dialectal influence. We observed that MFA shows lower accuracy for audio files in which speakers pronounced the words with such variants. Er-suffixation refers to a phonological resyllabification process in Mandarin. In the data we evaluated, the syllable er and the syllable diǎn (from the Group 2 stimuli) were often combined into one syllable in speech production. This resyllabification not only changes the syllable boundary but also influences the tone and phonetic realization of the er-suffixed forms (Huang, 2010). What makes er-suffixation more complicated is that it may be implemented differently across dialects of Mandarin (Jiang et al., 2019). This finding further underscores the need for forced aligners, including MFA, to take dialectal influences and speech variation among speakers into account. By recognizing and accounting for these variations, forced aligners can more accurately align speech signals with their corresponding text and improve the overall performance of automatic speech recognition systems.

Conclusion

This paper presents the first detailed evaluation of MFA’s alignment accuracy and efficiency at both the syllable level and the phrase level for prosody research in Mandarin Chinese, a tonal language. In both syllable-alignment and phrase-alignment tasks, the average differences between human annotators and MFA were smaller than the 25 ms gold standard. Furthermore, MFA-aided annotation by human transcribers was at least 20 times faster than previously reported manual annotation. Although MFA showed lower accuracy for non-salient acoustic features, speech with non-standard pronunciations, and audio files of lower quality, its decent alignment accuracy and high efficiency indicate that it can simplify the annotation process in prosody research, even for tonal languages like Mandarin. Recording quality significantly affected both MFA’s syllable-level and phrase-level alignment, emphasizing the need to control audio quality in prosody research when using MFA. Additionally, a gender effect was observed in MFA’s phrase-level alignment, highlighting the importance of addressing gender effects in prosody research. Our study proposes a new workflow for prosodic research in tonal languages: using MFA for phrase-level annotation, followed by the necessary adjustments made by human annotators. By adopting this workflow, researchers can save considerable time and resources that would otherwise be required for manual annotation. This will have a significant impact on the field of prosodic research, enabling researchers to investigate the role of prosody in tonal languages more efficiently and effectively.

The finding that de-stressed words and phrases, as well as pronunciation variations, pose challenges for MFA provides a reference for improving forced aligners. With continuous advancements in forced aligner technology, we expect a lower human-MFA syllable-/phrase-boundary difference and a higher percentage of the differences being smaller than the gold standard. As we were finalizing our paper, MFA released version 2.0 of its acoustic model, along with a Mandarin (Erhua) pronunciation dictionary. The new dictionary includes additional pronunciations, such as reduced variants commonly found in spontaneous speech, where segments or syllables are deleted. This update is expected to improve the performance of MFA, particularly in areas such as er-suffixation. We look forward to testing this conjecture in follow-up studies and expanding the current research into other tonal languages, thereby offering a systematic and comprehensive reference for the advancements in forced aligner technology.