1 Introduction

In recent years, artificial intelligence has come into wide use in various fields [1, 2, 7, 12, 13, 17, 19]. Artificial intelligence currently encompasses a huge variety of subfields, ranging from general learning and perception to specific tasks such as playing chess, proving mathematical theorems, writing poetry, driving a car on a crowded street, and diagnosing diseases. Artificial intelligence is relevant to any intellectual task; it is truly a universal field [15]. One actively researched area is natural language processing through speech recognition. However, it is difficult for machines to speak, hear, and read human language. Therefore, natural language processing and speech recognition are among the most difficult and important fields in artificial intelligence [6]. One of the most popular speech recognition technologies is the IBM Watson API [9]. Among captions in which speech is converted into text, captions that include timing and speaker ID information are called Informatized Captions [3,4,5], and such an Informatized Caption can be generated using the IBM Watson API [10]. However, the IBM Watson API is susceptible to incorrect recognition when there is noise in the audio signal, especially for movies where background music or special sound effects are used. Much research has been done to solve this problem. However, in previous studies [3,4,5], the speaker cannot be distinguished well when there are multiple speakers who may pronounce the same word differently [3], and it is assumed that a database with speaker pronunciation time information is available in advance [4, 5]. In this paper, a method of modifying incorrectly recognized words using the original caption is proposed to enhance the timing performance while updating the database in real time using the Informatized Caption information.

2 Background

2.1 Informatized caption

As shown in Fig. 1, the IBM Watson API [10] produces a caption that includes not only the recognized word but also its timing information (e.g., start time, end time, and speaker ID). This kind of caption is called an Informatized Caption. However, when speech recognition is performed on a noisy sound, there are many incorrectly recognized words, and therefore the timing information in the caption is not correct.

Fig. 1 The concept of Informatized Caption

2.2 Linear estimation method

The problem of modifying the incorrectly recognized words and their timing information with the accurate original caption is the problem of estimating the timing information of each incorrectly recognized word. One possible method is linear estimation based on the number of characters in each word. For example, when there are two incorrectly recognized words B′ and C′ between the correctly recognized words A and D, the words B′ and C′ can be replaced by the correct words B and C by comparing with the original caption, and the timing information for B and C can be estimated by Eq. (1), where V(X) denotes the number of characters in word X. Here, it can be assumed that tS(C) is equal to tE(B) to reduce the number of uncertain variables.

$$ {t}_E(B)={t}_S(B)+\left({t}_E(C)-{t}_S(B)\right)\times \frac{V(B)}{V(B)+V(C)} $$
(1)

where V(X) denotes the number of characters in word X of the Informatized Caption.
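As an illustration, a minimal Python sketch of the linear estimation in Eq. (1) might look as follows; the function name and the word/timing representation are hypothetical and not part of the original system.

```python
def linear_estimate_boundary(t_s_b, t_e_c, word_b, word_c):
    """Estimate the end time of word B (== start time of word C) between two
    correctly recognized words, splitting the interval [t_S(B), t_E(C)] in
    proportion to the character counts V(B) and V(C), as in Eq. (1)."""
    v_b, v_c = len(word_b), len(word_c)
    return t_s_b + (t_e_c - t_s_b) * v_b / (v_b + v_c)

# Example: B' and C' lie between A (ending at 1.20 s) and D (starting at 2.00 s),
# so t_S(B) = 1.20 and t_E(C) = 2.00 are taken from the neighboring correct words.
print(linear_estimate_boundary(1.20, 2.00, "hello", "you"))  # ~1.70 s
```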

3 Speaker pronunciation time database (S-DB)

3.1 Database design

Figure 2 shows the model of the speaker pronunciation time database, S-DB, which consists of nodes for each speaker Sp. Each node is composed of a word ID Wpx, the average pronunciation time Dpx, and its appearance frequency Upx. The nodes of each speaker are managed in ascending alphanumeric order of words and are connected to each other as a linked list terminated by a NULL node.

Fig. 2 S-DB Model
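The S-DB model described above could be sketched in Python as below; the class and field names (word, avg_time, frequency) are illustrative stand-ins for Wpx, Dpx, and Upx, and the singly linked list of each speaker is terminated by None in place of the NULL node.

```python
class WordNode:
    """One node of a speaker's linked list: word ID Wpx, average
    pronunciation time Dpx (seconds), and appearance frequency Upx."""
    def __init__(self, word, avg_time, frequency=1, next_node=None):
        self.word = word            # Wpx
        self.avg_time = avg_time    # Dpx
        self.frequency = frequency  # Upx
        self.next = next_node       # next node, or None (NULL) at the end

class SpeakerDB:
    """S-DB: maps each speaker Sp to the head of an alphanumerically
    ordered linked list of WordNode objects."""
    def __init__(self):
        self.speakers = {}          # speaker ID -> head WordNode (or None)
```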

3.2 Real-time update method of S-DB

The construction and update of S-DB are divided into three cases: the case that both speaker Sp and word Wpk exist in S-DB, the case that speaker Sp exists in S-DB but word Wpk does not, and the case that neither speaker Sp nor word Wpk exists in S-DB. The algorithm to update S-DB in real time in each case is as follows.

(Algorithm listing: real-time update of S-DB)

3.2.1 Case that both speaker Sp and word Wpk are in S-DB

In this case, the first step is to find the corresponding word Wpk in the linked list for the speaker Sp in S-DB. Then, the appearance frequency U is incremented by 1 and the average pronunciation time D is updated by Eq. (2).

$$ D=\mathrm{D}\times \frac{\mathrm{U}-1}{\mathrm{U}}+\frac{t_E\left(\mathrm{X}\right)-{t}_S\left(\mathrm{X}\right)}{\mathrm{U}} $$
(2)

where tS(X) and tE(X) are the start and end time of pronunciation of word X, respectively.
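A minimal sketch of the update rule of Eq. (2), assuming a node object with the fields introduced in Section 3.1, could be:

```python
def update_existing_word(node, t_s, t_e):
    """Case 3.2.1: the speaker and word already exist in S-DB.
    Increment the appearance frequency U and fold the newly observed
    pronunciation time (t_E(X) - t_S(X)) into the running average D."""
    node.frequency += 1
    u = node.frequency
    node.avg_time = node.avg_time * (u - 1) / u + (t_e - t_s) / u
```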

3.2.2 Case that speaker Sp is in S-DB but word Wpk is not

In this case, the first step is to create a new node with the word Wpk in the linked list for the speaker Sp in S-DB. Then, 1 is assigned to the appearance frequency U of the new node and the pronunciation time of the word Wpk is used as the initial average pronunciation time. Finally, insertion sorting is performed on this linked list so that it remains in alphanumeric order.

3.2.3 Case that neither speaker Sp nor word Wpk is in S-DB

To register a new speaker in S-DB, the first step is to increment the number of speakers p by 1 and insert the new speaker into the existing speaker list of S-DB. Then a new node is created in the linked list for the speaker Sp, 1 is assigned to its appearance frequency U, and the pronunciation time of the word Wpk is used as the initial average pronunciation time. Finally, insertion sorting is performed on the speaker list according to the alphanumeric order.
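Putting the three cases together, the real-time S-DB update could be sketched as below; `SpeakerDB`, `WordNode`, and `update_existing_word` are the hypothetical definitions sketched above, and insertion keeps each speaker's list in alphanumeric order.

```python
def update_sdb(db, speaker, word, t_s, t_e):
    """Real-time S-DB update for one recognized word (Sections 3.2.1-3.2.3)."""
    duration = t_e - t_s
    if speaker not in db.speakers:                  # case 3.2.3: new speaker
        db.speakers[speaker] = WordNode(word, duration)
        return
    prev, cur = None, db.speakers[speaker]
    while cur is not None and cur.word < word:      # lists are kept sorted
        prev, cur = cur, cur.next
    if cur is not None and cur.word == word:        # case 3.2.1: word exists
        update_existing_word(cur, t_s, t_e)
        return
    node = WordNode(word, duration, next_node=cur)  # case 3.2.2: new word
    if prev is None:
        db.speakers[speaker] = node
    else:
        prev.next = node
```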

4 Proposed algorithm

4.1 Algorithm to modify incorrectly recognized word and timing information using S-DB

Figure 3 shows the overall structure of the proposed system. When a user speaks, the speech is converted into an Informatized Caption by a Speech-to-Text engine such as IBM Watson, which can generate not only the text but also the speaker and pronunciation timing information. Then, the timing information of this Informatized Caption is modified by the proposed algorithm using the original caption. Specifically, the inputs of the proposed algorithm are the original caption, the Informatized Caption, the discrete voice signal, S-DB (i.e., speaker ID, word, pronunciation time, and appearance frequency), the threshold value used to detect the start and end time of a word in the discrete voice signal, and the search range in the Informatized Caption used to find the word matched with the original caption and to modify the incorrectly recognized word and its timing information. The output of the proposed algorithm is the modified Informatized Caption with its timing information. Here, S-DB updates occur only when there is no incorrectly recognized word. If incorrectly recognized words are found between correctly recognized words, their number is counted, the wrong words are replaced according to the original caption, and the timing information is calculated for each case as shown in Fig. 4 or Algorithm 2.

Fig. 3 The overall structure of the proposed system

Fig. 4 Flowchart of the algorithm to modify incorrectly recognized words and timing information using S-DB

(Algorithm 2: modification of incorrectly recognized words and their timing information using S-DB)
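A highly simplified sketch of the dispatch logic of Algorithm 2 is shown below. The helper `find_incorrect_runs`, the word attributes on `info_caption`, and the per-case handlers are all hypothetical names; the per-case estimation itself is sketched in the following subsections, and `update_sdb` is the update routine from Section 3.2.

```python
def modify_informatized_caption(info_caption, original_caption, signal, sdb, threshold):
    """Top-level flow of the proposed modification (cf. Fig. 4 / Algorithm 2).
    Runs of consecutive incorrectly recognized words are located by comparing
    the Informatized Caption with the original caption; each run is handled by
    one of the cases of Sections 4.3-4.5, and with no wrong words (Case 0)
    the caption is only used to update S-DB."""
    runs = find_incorrect_runs(info_caption, original_caption)  # hypothetical helper
    if not runs:                                    # Case 0: update S-DB only
        for w in info_caption.words:
            update_sdb(sdb, w.speaker, w.text, w.t_s, w.t_e)
        return info_caption
    for run in runs:                                # Cases 1-3 by run length
        if len(run) == 1:
            handle_case1(run, signal, threshold)
        elif len(run) == 2:
            handle_case2(run, signal, threshold, sdb)
        else:
            handle_case3(run, signal, threshold, sdb)
    return info_caption
```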

4.2 Case 0: No incorrectly recognized word

In this case, the final output is the same as the output from the IBM Watson API. Therefore, the Informatized Caption is used only to update S-DB.

4.3 Case 1: The number of incorrectly recognized words is only one

(Algorithm 3: estimation of timing information for a single incorrectly recognized word)

In the case of Fig. 5, there is only one incorrectly recognized word, and the required operation is simply to replace the wrong word B with the correct one and to find its timing information. If there is a minimum time tMIN between tE(A) and tS(C) at which the value of the discrete voice signal exceeds the threshold T, it can be taken as the start time of the correct word B. Otherwise, the start time of the correct word B is set to the end time of the previous word, tE(A). Likewise, if there is a maximum time tMAX between tE(A) and tS(C) at which the value of the discrete voice signal exceeds the threshold T, it can be taken as the end time of the correct word B. Otherwise, the end time of the correct word B is set to the start time of the next word, tS(C).
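A rough sketch of this Case 1 boundary search, under the assumption that the discrete voice signal is available as (time, amplitude) pairs and that `threshold` plays the role of T, might be:

```python
def estimate_single_word_timing(samples, t_e_a, t_s_c, threshold):
    """Case 1: one incorrectly recognized word B between correct words A and C.
    samples: iterable of (time, amplitude) pairs of the discrete voice signal.
    Returns (t_S(B), t_E(B)). If no sample in [t_E(A), t_S(C)] exceeds the
    threshold T, fall back to the neighboring words' boundaries."""
    active = [t for (t, amp) in samples
              if t_e_a <= t <= t_s_c and abs(amp) > threshold]
    t_start = min(active) if active else t_e_a   # t_MIN, else t_E(A)
    t_end = max(active) if active else t_s_c     # t_MAX, else t_S(C)
    return t_start, t_end
```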

Fig. 5 A segment of Informatized Caption which has only one incorrectly recognized word

4.4 Case 2: The number of incorrectly recognized words is two

(Algorithm listing: estimation of timing information for two incorrectly recognized words)

In the case of Fig. 6, there are two consecutive incorrectly recognized words B and C between the correct words A and D. Therefore, tS(B) and tE(C) can be estimated by the same algorithm as in case 1 (Algorithm 3). Here, tE(B) is assumed to be equal to tS(C) for simplicity. The timing tE(B), and thus tS(C), can then be estimated either by the linear estimation method or by using the previous pronunciation times of the same speaker, D(S, B) and D(S, C), in S-DB. If neither D(S, B) nor D(S, C) is in S-DB, i.e., the pronunciation times of words B and C are unknown, the linear estimation method of Eq. (3) is used.

Fig. 6 A segment of Informatized Caption which has two consecutive incorrectly recognized words

$$ {t}_E(B)={t}_S(B)+\left({t}_E(C)-{t}_S(B)\right)\times \frac{V(B)}{V(B)+V(C)} $$
(3)

where V(X) denotes the number of characters in word X of the Informatized Caption.

If only D(S, B) is in S-DB, tE(B) is estimated by adding the average pronunciation time D(S, B) to tS(B), as in Eq. (4).

$$ {t}_E(B)={t}_S(B)+D\left(S,B\right) $$
(4)

If only D(S, C) is in S-DB, tE(B) is estimated by subtracting the average pronunciation time D(S, C) from tE(C), as in Eq. (5).

$$ {t}_E(B)={t}_S(B)+\left({t}_E(C)-{t}_S(B)-D\left(S,C\right)\right)={t}_E(C)-D\left(S,C\right) $$
(5)

If both D(S, B) and D(S, C) are in S-DB, Eq. (6), which uses the ratio of D(S, B) to D(S, C), is applied.

$$ {t}_E(B)={t}_S(B)+\left({t}_E(C)-{t}_S(B)\right)\times \frac{D\left(S,B\right)}{D\left(S,B\right)+D\left(S,C\right)} $$
(6)

Finally, tS(C) is obtained from tE(B), since the two are assumed to be equal.
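The four branches in Eqs. (3)-(6) could be combined into one helper as below; `d_b` and `d_c` stand for D(S, B) and D(S, C) looked up in S-DB (None when absent), and the word strings are used only for their character counts.

```python
def estimate_two_word_boundary(t_s_b, t_e_c, word_b, word_c, d_b=None, d_c=None):
    """Case 2: two incorrect words B, C between correct words.
    Returns t_E(B) (== t_S(C)). d_b, d_c are D(S,B), D(S,C) from S-DB or None."""
    if d_b is None and d_c is None:                  # Eq. (3): linear estimation
        v_b, v_c = len(word_b), len(word_c)
        return t_s_b + (t_e_c - t_s_b) * v_b / (v_b + v_c)
    if d_b is not None and d_c is None:              # Eq. (4)
        return t_s_b + d_b
    if d_b is None and d_c is not None:              # Eq. (5)
        return t_e_c - d_c
    return t_s_b + (t_e_c - t_s_b) * d_b / (d_b + d_c)  # Eq. (6)
```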

4.5 Case 3: The number of incorrectly recognized words is three or more

(Algorithm listing: estimation of timing information for three or more incorrectly recognized words)

In the case of Fig. 7, there are three or more consecutive incorrectly recognized words W1, …, WL between the correct words A and B. Therefore, tS(W1) and tE(WL) are estimated by the same algorithm as in case 1 (Algorithm 3). Here, tE(Wi) is assumed to be equal to tS(Wi + 1) for i = 1, …, L − 1 for simplicity. The timing information from tE(W1) to tS(WL) can be estimated by the linear estimation method or by using the previous pronunciation times of the same speaker, D(S, W1), …, D(S, WL), in S-DB. In a first scan of the incorrectly recognized words, the total number of characters of the words not in S-DB and the total pronunciation time of the words in S-DB are calculated. Then, in a second scan, the timing information of each word is estimated by Eq. (7) or Eq. (8): if the word is not in S-DB, its timing is calculated by Eq. (7); if it is in S-DB, by Eq. (8).

$$ {t}_S\left({W}_{q+1}\right)={t}_S\left({W}_q\right)+\left(\frac{t_E\left({W}_r\right)-{t}_S\left({W}_q\right)-{\sum}_{Z\in \left\{{W}_q,\cdots, {W}_r\right\},\ Z\in \text{S-DB}}D\left(S,Z\right)}{\sum_{Z\in \left\{{W}_q,\cdots, {W}_r\right\},\ Z\notin \text{S-DB}}V(Z)}\right)\times V\left({W}_q\right) $$
(7)

where Wq, Wq + 1,…, Wr are the incorrectly recognized words.

Fig. 7 A segment of Informatized Caption which has three or more consecutive incorrectly recognized words

$$ {t}_S\left({W}_{q+1}\right)={t}_S\left({W}_q\right)+D\left(S,{W}_q\right) $$
(8)

The above equations can also be applied successively to calculate tE(W1), tS(W2), tE(W2), …, tS(WL).
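A two-pass sketch of the Case 3 estimation using Eqs. (7) and (8) might look like the following; `sdb_time(word)` is a hypothetical lookup returning D(S, W) or None, and the function fills in successive word boundaries between tS(W1) and tE(WL), which are assumed to have been obtained already by Algorithm 3.

```python
def estimate_multi_word_boundaries(words, t_start, t_end, sdb_time):
    """Case 3: three or more incorrect words W1..WL between correct words.
    words: list of the corrected word strings W1..WL.
    t_start, t_end: t_S(W1) and t_E(WL) obtained as in Case 1 (Algorithm 3).
    sdb_time(word): D(S, word) from S-DB, or None if absent.
    Returns the boundary times [t_S(W1), t_S(W2), ..., t_S(WL), t_E(WL)]."""
    # First scan: total D of words in S-DB and total characters of the others.
    known_time = sum(sdb_time(w) for w in words if sdb_time(w) is not None)
    unknown_chars = sum(len(w) for w in words if sdb_time(w) is None)
    per_char = ((t_end - t_start - known_time) / unknown_chars
                if unknown_chars > 0 else 0.0)
    # Second scan: advance one boundary per word using Eq. (7) or Eq. (8).
    boundaries = [t_start]
    t = t_start
    for w in words:
        d = sdb_time(w)
        t += d if d is not None else per_char * len(w)   # Eq. (8) / Eq. (7)
        boundaries.append(t)
    return boundaries
```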

4.6 Mathematical validation of the convergence of the average pronunciation time as S-DB is updated

The pronunciation time of a word generally varies with the speaker, the speaker's average speaking speed, and the neighboring words. For example, different people produce different voices and different pronunciation times for the same word. Even the same person may speak hurriedly or in a relaxed manner. In addition, the same word spoken by the same person may have a different pronunciation time depending on the connecting word located before or after it, for example due to soft consonant phenomena. In this paper, we make two assumptions to simplify the problem.

Assumption 1. The pronunciation time of a word is independent of the connecting words.

Assumption 2. The pronunciation time of a word is an independent and identically distributed (i.i.d.) random variable drawn from a distribution with expected value m and finite variance σ².

Theorem 1. As the number of updates in S-DB increases with more pronunciations of the same word by the same speaker, the average pronunciation time of the word by the speaker (the D value in S-DB) converges.

(Proof) Assume that there have already been U(Wij) − 1 updates in S-DB, so that the average pronunciation time Di(Wij) of a word Wij by a speaker Si is

$$ {D}_i\left({W}_{ij}\right)=\frac{\mathrm{X}\left({W}_{ij},1\right)+\mathrm{X}\left({W}_{ij},2\right)+\mathrm{X}\left({W}_{ij},3\right)+\cdots +\mathrm{X}\left({W}_{ij},\mathrm{n}\right)}{n} $$
(9)

where n = U(Wij) − 1 and X(Wij, n) is the n-th pronunciation time of the word Wij by a speaker Si.

Then, the next update with the newly observed pronunciation time of the word Wij by speaker Si is done by the following rule:

$$ {D}_i^{new}\left({W}_{ij}\right)={D}_i\left({W}_{ij}\right)\times \frac{\mathrm{U}\left({W}_{ij}\right)-1}{\mathrm{U}\left({W}_{ij}\right)}+\frac{\mathrm{End}\ \mathrm{time}-\mathrm{Start}\ \mathrm{time}}{\mathrm{U}\left({W}_{ij}\right)} $$
(10)
$$ \kern2.5em ={D}_i\left({W}_{ij}\right)\times \frac{\mathrm{U}\left({W}_{ij}\right)-1}{\mathrm{U}\left({W}_{ij}\right)}+\frac{\mathrm{X}\left({W}_{ij},\mathrm{U}\left({W}_{ij}\right)\right)}{\mathrm{U}\left({W}_{ij}\right)} $$
(11)
$$ \kern11em =\frac{\mathrm{X}\left({W}_{ij},1\right)+\mathrm{X}\left({W}_{ij},2\right)+\mathrm{X}\left({W}_{ij},3\right)+\cdots +\mathrm{X}\left({W}_{ij},\mathrm{U}\left({W}_{ij}\right)\right)}{\mathrm{U}\left({W}_{ij}\right)} $$
(12)

By the Central Limit Theorem [11, 14], the sampling distribution of the sample mean \( {D}_i^{new}\left({W}_{ij}\right) \) approaches a normal distribution with mean m and variance \( \frac{\sigma^2}{\mathrm{U}\left({W}_{ij}\right)} \) as the sample size U(Wij) grows, regardless of the original shape of the population distribution. As the number of updates U(Wij) in S-DB approaches infinity, the variance \( \frac{\sigma^2}{\mathrm{U}\left({W}_{ij}\right)} \) approaches zero. Therefore, the average pronunciation time Di(Wij) of the word Wij by the speaker Si converges to m, the expected value of the pronunciation time. (Q.E.D.)
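A small simulation can illustrate Theorem 1: repeatedly applying the incremental update of Eq. (10) to i.i.d. pronunciation times reproduces the sample mean of Eq. (12), which converges to the expected value m. The distribution parameters below are arbitrary illustrative choices, not values from the paper.

```python
import random

random.seed(0)
m, sigma = 0.45, 0.10           # assumed mean and std. dev. of pronunciation time (s)

d, u = 0.0, 0                   # D value and update count U in S-DB
for _ in range(10000):
    x = random.gauss(m, sigma)  # one newly observed pronunciation time X(W_ij, n)
    u += 1
    d = d * (u - 1) / u + x / u # incremental update, Eqs. (10)-(11)

print(round(d, 3))              # close to m = 0.45 after many updates
```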

5 Experimental results

5.1 Experiment setup

There are many factors that reduce accuracy in speech recognition, and the most critical one is noise. Seven types of noise were used in the experiments, as shown in Fig. 8. Six colors of noise [16] (gray, purple, blue, brown, pink, and white noise) were used for the first experiment, which shows how different types of noise affect the proposed method. Rain noise, captured from the sound of rain as a typical real-life example of white noise, was used for the second experiment, which shows how different noise levels (in decibels) affect the proposed method. The last experiment examines whether the proposed method works with various movie samples. Since each movie sample already contains natural noise, no additional noise was added in this experiment.

Fig. 8 Various types of noise: (a) white noise, (b) pink noise, (c) brown noise, (d) blue noise, (e) purple noise, (f) gray noise, (g) rain noise, and (h) an example of speech signal

The correct timing information of each word in each test audio file was measured by checking the timing of each word while playing the audio file on the wave display panel of Adobe CC 2019. The measured values were recorded to two decimal places in seconds. To verify the accuracy of the timing information from the Informatized Caption, three kinds of experiments were performed. The accuracy measure used in the three experiments is given by Eq. (13).

$$ Accuracy=\frac{\mid \left\{x|\ x\in CRW\ and\ \left|{t}_S(x)-{t}_S^{\ast}\right|<0.01\ and\ \left|{t}_E(x)-{t}_E^{\ast}\right|<0.01\right\}\mid }{Total\ number\ of\ test\ words}\times 100 $$
(13)

In Eq. (13), CRW denotes the set of correctly recognized words, |A| denotes the number of elements in a set A, and \( {t}_S^{\ast } \) and \( {t}_E^{\ast } \) denote the correct start and end time of the word x (in seconds). The first and second experiments use clean sound, namely English listening test audio files [8] recorded in a well-structured noise-free environment. Detailed information on each test audio file is shown in Table 1. Each test file is mixed with noise at signal-to-noise ratios [18] of 20, 15, 10, and 5 dB, as shown in Fig. 9. The first experiment is based on the colors of noise: the six noises were synthesized at the different signal-to-noise ratios to produce a total of 96 samples. The second experiment uses a total of 16 samples based on rain noise, a typical example of white noise that can be heard in real life.
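A sketch of the accuracy measure in Eq. (13) could be implemented as follows; the per-word dictionary keys are hypothetical, with estimated times paired with the manually measured ground-truth times and a flag marking membership in CRW.

```python
def timing_accuracy(words, tol=0.01):
    """Accuracy of Eq. (13). words: list of dicts with keys
    'correctly_recognized', 't_s', 't_e' (estimated times) and
    't_s_ref', 't_e_ref' (ground truth measured in seconds)."""
    hits = sum(1 for w in words
               if w['correctly_recognized']
               and abs(w['t_s'] - w['t_s_ref']) < tol
               and abs(w['t_e'] - w['t_e_ref']) < tol)
    return hits / len(words) * 100
```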

Table 1 Information on 4 test audio files used in the experiment 1 and 2
Fig. 9 Test data samples for experiment 1: (a) original speech signal, (b) speech signal with white noise (SNR = 20 dB), (c) speech signal with white noise (SNR = 15 dB), (d) speech signal with white noise (SNR = 10 dB), and (e) speech signal with white noise (SNR = 5 dB)

The last experiment evaluated performance in various real noisy environments. It was done with three genres of films and tests the accuracy of the timing information of the Informatized Caption after speech recognition with the IBM Watson API. The films consist of 5 animation movies dubbed in a well-structured noise-free environment, 5 horror and action movies with various sound effects, and 5 musical movies with a lot of background music. The sample selection conditions for each genre are as follows: for animation, noise-free samples were chosen; for horror and action, samples with staccato-like noise; for musicals, samples with constant noise such as background music. The extracted time periods were 6 to 20 s long, reflecting the characteristics of each genre. Detailed information on each movie is shown in Table 2.

Table 2 Information on movie sample data by genre for the experiment 3

5.2 Experimental results

5.2.1 Experiment 1: Samples with different colors of noise

Figure 10 shows the accuracy of the proposed algorithm for different levels of each color of noise. The vertical axis shows the signal-to-noise ratio and the horizontal axis shows the timing accuracy. The average accuracy is 73.10% for white noise, 80.80% for pink noise, 81.20% for brown noise, 79.71% for blue noise, 79.96% for purple noise, and 81.60% for gray noise. Regardless of the noise level, the accuracy for white noise is the lowest. Since the proposed algorithm relies on the results of speech recognition, its result may be less accurate when the noise level is very high and close to the voice level.

Fig. 10 Accuracy of timing information according to colors of noise: (a) experimental result for Sample A, (b) experimental result for Sample B, (c) experimental result for Sample C, and (d) experimental result for Sample D

5.2.2 Experiment 2: Samples with different intensities of noise

Figure 11 shows the results of the experiments with the 4 test audio files after applying rain noise at 4 different intensity levels to each sample. The vertical axis represents the accuracy and the horizontal axis represents the signal-to-noise ratio of the mixed audio files. The results show that the accuracy of the timing information of the Informatized Caption produced by the IBM Watson API is above 84.80% for the English listening test audio files when the noise is small, and decreases significantly as the noise increases. The timing accuracy of the proposed algorithm is above 95.70% for the same files. However, the accuracy of the proposed algorithm also drops when the noise level is very high and close to the voice level, since the proposed algorithm compensates the result of the IBM Watson API.

Fig. 11 Accuracy of timing information according to varying intensity of noise: (a) experimental result for Sample A, (b) experimental result for Sample B, (c) experimental result for Sample C, and (d) experimental result for Sample D

5.2.3 Experiment 3: Various movie samples

The accuracy of the timing information of each sample movie for each genre is shown in Table 3 (animation movies), Table 4 (horror and action movies), and Table 5 (musical movies). The average accuracy of the IBM Watson method is 42.95% over all movies and 53.65% over animation, horror, and action movies. Since most time periods in musical movies include background music and singing voices, the speech recognition performance of IBM Watson is very low for them and, consequently, the accuracy for musical movies is the lowest. The average accuracy of the proposed method is 66.35% over all movies and 81.09% over animation, horror, and action movies. The proposed method thus achieves more than 1.5 times the accuracy of the IBM Watson method.

Table 3 Timing information accuracy of Informatized Caption in animation movies
Table 4 Timing information accuracy of Informatized Caption in horror and action movies
Table 5 Timing information accuracy of Informatized Caption in musical movies

6 Concluding remarks

The IBM Watson API is one of the most popular speech recognition technologies today. It not only translates speech to text, but also provides useful data about word timing and speaker ID, which together form an Informatized Caption. However, the IBM Watson API is vulnerable to noisy audio and is therefore not well suited to movie audio, where background music or special sound effects are mixed in.

In this paper, a novel method of modifying incorrectly recognized words using the original caption was proposed to enhance the timing performance of the Informatized Caption while updating the S-DB database in real time. A mathematical validation was given for the convergence of the average pronunciation time in S-DB as S-DB is updated continually. Experimental results show that improved performance can be achieved in the presence of various types of noise. Three kinds of experiments were performed to verify the accuracy of the timing information produced by the proposed method. The average accuracy of the proposed method is 81.09% for animation, horror, and action movies and 66.35% for all movies, which is more than 1.5 times the accuracy of the original IBM Watson method.