Real-time Informatized Caption enhancement based on a speaker pronunciation time database

IBM Watson is one of the representative speech recognition systems; it can automatically generate not only speech-to-text information but also speaker ID and timing information, a result called an Informatized Caption. However, when there is noise in the voice signal sent to the IBM Watson API, the recognition performance decreases significantly. This is easily observed in movies with background music and special sound effects. This paper aims to improve the inaccuracy of current Informatized Captions in noisy environments. A method of correcting incorrectly recognized words and a method of enhancing timing accuracy while updating a database in real time are proposed, based on the original caption and the Informatized Caption information. Experimental results show that the proposed method achieves 81.09% timing accuracy for 10 representative animation, horror, and action movies.

Nomenclature
T(X)  Original caption of a word X
T_s+(X)  Informatized Caption of a word X
t_S(X)  Start time of a word X
t_E(X)  End time of a word X
S-DB  Speaker pronunciation time database
S_p  p-th speaker
W_pk  k-th word of the p-th speaker in S-DB

Introduction
In recent years, artificial intelligence has come into wide use in various fields [1,2,7,12,13,17,19]. It currently encompasses a huge variety of subfields, ranging from the general, such as learning and perception, to the specific, such as playing chess, proving mathematical theorems, writing poetry, driving a car on a crowded street, and diagnosing diseases. Artificial intelligence is relevant to any intellectual task; it is truly a universal field [15]. One of the areas being actively researched is natural language processing through speech recognition.
However, it is difficult for machines to speak, hear, and read human language. Therefore, natural language processing and speech recognition are among the most difficult and important fields in artificial intelligence [6]. One of the most popular speech recognition technologies is the IBM Watson API [9]. Among captions in which speech is converted into text, captions that include timing and speaker ID information are called Informatized Captions [3][4][5], and such an Informatized Caption can be generated using the IBM Watson API [10]. However, the IBM Watson API is susceptible to incorrect recognition when there is noise in the audio signal, especially in movies where background music or special sound effects are used. To solve this problem, several studies have been conducted. However, in previous research [3][4][5], there remain the problems that speakers cannot be distinguished well when multiple speakers pronounce the same word differently [3], and that a database with speaker pronunciation time information is assumed to be ready in advance [4,5]. In this paper, a method of correcting incorrectly recognized words using the original caption is proposed to enhance timing performance while updating the database in real time using the Informatized Caption information.

Informatized caption
As shown in Fig. 1, the IBM Watson API [10] produces a caption that includes not only each recognized word but also its timing information (start time and end time) and speaker ID. This kind of caption is called an Informatized Caption. However, when speech recognition is performed on noisy sound, there are many incorrectly recognized words, and the timing information in the caption is therefore not correct.

Linear estimation method
The problem of correcting the incorrectly recognized words and their timing information using the accurate original caption is the problem of estimating the timing information for each incorrectly recognized word. One possible method is linear estimation based on the number of characters in each word. For example, when there are two incorrectly recognized words B′ and C′ between the correctly recognized words A and D, the words B′ and C′ can be replaced by the correct words B and C by comparison with the original caption, and the timing information for B and C can be estimated by Eq. (1), where V(X) denotes the number of characters of the word X of the Informatized Caption. Here, it can be assumed that t_S(C) is equal to t_E(B) to reduce the number of uncertain variables.

t_S(B) = t_E(A),  t_E(C) = t_S(D),
t_E(B) = t_S(C) = t_E(A) + (t_S(D) − t_E(A)) × V(B) / (V(B) + V(C))     (1)
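The linear estimation method above can be written as a small Python sketch. The function name and interface are illustrative, not the paper's actual implementation; the gap between the surrounding correct words is split across the unrecognized words in proportion to their character counts, as in Eq. (1).

```python
def linear_estimate(t_end_prev, t_start_next, words):
    """Split the gap between two correctly recognized words across the
    unrecognized words in proportion to their character counts V(X)."""
    total_chars = sum(len(w) for w in words)
    timings, t = [], t_end_prev
    for w in words:
        duration = (t_start_next - t_end_prev) * len(w) / total_chars
        timings.append((w, t, t + duration))
        t += duration
    return timings
```

For example, with a 1-second gap and words "be" (2 characters) and "cat" (3 characters), "be" receives 0.4 s and "cat" receives 0.6 s.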

Speaker pronunciation time database (S-DB)
3.1 Database design
Figure 2 shows the model of the speaker pronunciation time database, S-DB, which consists of nodes for each speaker S_p. Each node is composed of a word ID W_pk, its average pronunciation time D_pk, and its appearance frequency U_pk. The nodes for each speaker are managed in ascending alphanumeric order of words and are connected to each other in a linked list data structure ending with a NULL node.

Real-time update method of S-DB
The construction and update of S-DB are divided into three cases: the case where both speaker S_p and word W_pk exist in S-DB, the case where speaker S_p exists in S-DB but word W_pk does not, and the case where neither speaker S_p nor word W_pk exists in S-DB. The algorithm to update S-DB in real time in each case is as follows.

3.2.1
Case that there are both speaker S_p and word W_pk in S-DB
In this case, the first step is to find the corresponding word W_pk in the linked list for the speaker S_p in S-DB. Then, the appearance frequency U is incremented by 1 and the average pronunciation time D is updated by Eq. (2),

D_new = (D × (U − 1) + (t_E(X) − t_S(X))) / U     (2)

where t_S(X) and t_E(X) are the start and end time of the pronunciation of word X, respectively.

3.2.2
Case that there is speaker S_p in S-DB, but not word W_pk in S-DB
In this case, the first step is to create a new node with the word W_pk in the linked list for the speaker S_p in S-DB. Then, 1 is assigned to the appearance frequency U of the new node, and the pronunciation time of the word W_pk is used as the initial average pronunciation time. Finally, insertion sorting is performed on this linked list according to the alphanumeric order.

3.2.3
Case that there is neither speaker S_p nor word W_pk in S-DB
To add a new speaker to S-DB, the first step is to increment the number of speakers p by 1 and insert the speaker into the existing speaker list of S-DB. A new node is then created in the linked list for the speaker S_p in S-DB. Then, 1 is assigned to the appearance frequency U of the new node, and the pronunciation time of the word W_pk is used as the initial average pronunciation time.

Proposed algorithm
4.1 Algorithm to modify incorrectly recognized words and timing information using S-DB
Figure 3 shows the overall structure of the proposed system. When a user speaks, the speech is translated into an Informatized Caption by a special Speech-to-Text engine such as IBM Watson, which can generate not only the text information but also speaker and pronunciation timing information. Then, the timing information of this Informatized Caption is modified through the proposed algorithm with the original caption. Specifically, the inputs of the proposed algorithm are the original caption, the Informatized Caption, the discrete voice signal, S-DB (speaker ID, word, pronunciation time, and appearance frequency), the threshold value to recognize the start and end time of a word in the discrete voice signal, and the search range in the Informatized Caption to find the word matched with the original caption and to correct the incorrectly recognized word and its timing information. The output of the proposed algorithm is the modified Informatized Caption with its timing information. Here, S-DB updates occur only when there is no incorrectly recognized word. If an incorrectly recognized word is found between correctly recognized words, the number of incorrectly recognized words is counted. The wrong words are then replaced using the original caption, and the timing information is calculated as in Fig. 4 or Algorithm 2 for each case.
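The three S-DB update cases of Section 3.2 can be sketched in Python as below. This is a minimal sketch, not the paper's implementation: a dict of per-speaker sorted lists (maintained with `bisect`) stands in for the paper's NULL-terminated linked lists, and all names are illustrative.

```python
import bisect

class SDB:
    """Speaker pronunciation time database sketch.  Each speaker maps to
    entries [word, average time D, appearance frequency U] kept in
    alphanumeric order of words."""

    def __init__(self):
        self.speakers = {}  # speaker id -> sorted list of entries

    def update(self, speaker, word, t_start, t_end):
        duration = t_end - t_start
        # case 3.2.3: unseen speaker -> create an empty word list for it
        entries = self.speakers.setdefault(speaker, [])
        keys = [e[0] for e in entries]
        i = bisect.bisect_left(keys, word)
        if i < len(entries) and entries[i][0] == word:
            # case 3.2.1: speaker and word exist -> running average, Eq. (2)
            e = entries[i]
            e[2] += 1
            e[1] = (e[1] * (e[2] - 1) + duration) / e[2]
        else:
            # case 3.2.2: new word -> insert node in sorted position with U = 1
            entries.insert(i, [word, duration, 1])
```

Using a binary-searched insertion point reproduces the effect of the paper's insertion sort while keeping the per-speaker word list ordered at all times.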

Case 0: No incorrectly recognized word
In this case, the final output is the same as the output from the IBM Watson API. Therefore, the Informatized Caption is used only to update S-DB.

Case 1: The number of incorrectly recognized words is only one
In the case of Fig. 5, the number of incorrectly recognized words is only one, and the required operation is simply to replace the wrong word B with the correct one and find its timing information. If there is a minimum time t_MIN between t_E(A) and t_S(C) at which the value of the discrete voice signal exceeds the threshold T, it can be taken as the start time of the correct word B. Otherwise, the start time of the correct word B is assumed to be the same as the end time of the previous word, t_E(A). Likewise, if there is a maximum time t_MAX between t_E(A) and t_S(C) at which the value of the discrete voice signal exceeds the threshold T, it can be taken as the end time of the correct word B. Otherwise, the end time of the correct word B is assumed to be the same as the start time of the next word, t_S(C).

Case 2: The number of incorrectly recognized words is two
In the case of Fig. 6, there are two consecutive incorrectly recognized words B and C between the correct words A and D. Therefore, t_S(B) and t_E(C) can be estimated by the same algorithm as in Case 1 (Algorithm 3). Here, t_E(B) is assumed to be the same as t_S(C) for simplicity. The timing of t_E(B) (= t_S(C)) can be estimated by the linear estimation method or by the previous pronunciation times of the same speaker, D(S, B) and D(S, C), in S-DB. If neither D(S, B) nor D(S, C) is in S-DB, i.e., the pronunciation times of words B and C are unknown, the linear estimation method is used as in Eq. (3). If there is only D(S, B) in S-DB, t_E(B) is estimated by adding the average pronunciation time D(S, B) to t_S(B) as in Eq. (4).
If there is only D(S, C) in S-DB, t_E(B) is estimated by subtracting the average pronunciation time D(S, C) from t_E(C), as in Eq. (5).
If both D(S, B) and D(S, C) are in S-DB, Eq. (6), which uses the ratio of D(S, B) to D(S, C), is applied.
Finally, t_S(C) is estimated using t_E(B).
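Case 1's threshold search and Case 2's split-point selection can be sketched in Python as below. The function names are illustrative, and the equation forms follow the textual descriptions above (the equation bodies themselves are not visible in this text).

```python
def estimate_word_boundaries(signal, sample_rate, t_end_prev, t_start_next, threshold):
    """Case 1: find t_MIN (first sample above threshold T) and t_MAX (last
    such sample) between t_E(A) and t_S(C) in the discrete voice signal,
    falling back to the neighboring word boundaries when none exceed T."""
    i0, i1 = int(t_end_prev * sample_rate), int(t_start_next * sample_rate)
    above = [i for i in range(i0, i1) if abs(signal[i]) > threshold]
    t_start = above[0] / sample_rate if above else t_end_prev
    t_end = (above[-1] + 1) / sample_rate if above else t_start_next
    return t_start, t_end

def estimate_split_point(t_start_B, t_end_C, n_chars_B, n_chars_C, D_B=None, D_C=None):
    """Case 2: estimate t_E(B) (= t_S(C)) for two consecutive unrecognized
    words B and C, following the descriptions of Eqs. (3)-(6)."""
    if D_B is not None and D_C is not None:
        # Eq. (6): split by the ratio of the average pronunciation times
        return t_start_B + (t_end_C - t_start_B) * D_B / (D_B + D_C)
    if D_B is not None:
        return t_start_B + D_B      # Eq. (4): add D(S, B) to t_S(B)
    if D_C is not None:
        return t_end_C - D_C        # Eq. (5): subtract D(S, C) from t_E(C)
    # Eq. (3): linear estimation by character count
    return t_start_B + (t_end_C - t_start_B) * n_chars_B / (n_chars_B + n_chars_C)
```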

Case 3:
The number of incorrectly recognized words is three or more
In the case of Fig. 7, there are three or more consecutive incorrectly recognized words W_1, …, W_L between the correct words A and B. Therefore, t_S(W_1) and t_E(W_L) are estimated by the same algorithm as in Case 1 (Algorithm 3). Here, t_E(W_i) is assumed to be the same as t_S(W_i+1) for i = 1, …, L−1 for simplicity. The timing information from t_E(W_1) to t_S(W_L) can be estimated by the linear estimation method or by the previous pronunciation times of the same speaker, D(S, W_1), …, D(S, W_L), in S-DB. In the first scan over the incorrectly recognized words, the total number of characters in all incorrectly recognized words and the total pronunciation time of the words that belong to S-DB are calculated. Then, in the second scan, the timing information for each word is estimated using Eq. (7) or Eq. (8): if the word is not included in S-DB, its timing is calculated using Eq. (7); if it is included, Eq. (8) is used.
where W_q, W_q+1, …, W_r are the incorrectly recognized words.

The above equations can also be applied to calculate t_E(W_1), t_S(W_2), t_E(W_2), …, t_S(W_L).
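The two-scan procedure of Case 3 can be sketched as below. Since the bodies of Eqs. (7) and (8) are not visible in this text, one plausible reading is assumed: words found in S-DB keep their average pronunciation time (Eq. (8)), and the remaining gap is split among the other words in proportion to character count (Eq. (7)). All names are illustrative.

```python
def estimate_sequence_timings(t_start, t_end, words, avg_times):
    """Two-scan timing estimation for L consecutive unrecognized words.
    avg_times maps a word to its S-DB average pronunciation time for the
    current speaker; words absent from the dict are not in S-DB."""
    # first scan: total known pronunciation time and unknown character count
    known_time = sum(avg_times[w] for w in words if w in avg_times)
    unknown_chars = sum(len(w) for w in words if w not in avg_times)
    remaining = t_end - t_start - known_time
    # second scan: assign a duration and boundaries to each word
    timings, t = [], t_start
    for w in words:
        if w in avg_times:
            d = avg_times[w]                             # Eq. (8)
        else:
            d = remaining * len(w) / unknown_chars       # Eq. (7)
        timings.append((w, t, t + d))
        t += d
    return timings
```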

Mathematical validation of the convergence of the average pronunciation time under S-DB updates
The pronunciation time of a word generally varies with the person, his/her average speaking speed, and the neighboring words. For example, different people can produce different voices and different pronunciation times for the same word. Even the same person may speak hurriedly or in a relaxed manner. In addition, the same word spoken by the same person may have a different pronunciation time depending on the word located before or after it, such as the soft consonant phenomenon. In this paper, we make two assumptions to simplify the problem.
Assumption 1. The pronunciation time of a word is independent of the neighboring words.
Assumption 2. The pronunciation time of a word is an independent and identically distributed (i.i.d.) random variable drawn from a distribution with expected value m and finite variance σ².
Theorem 1. As the number of updates in S-DB increases with more pronunciations of the same word by the same speaker, the average pronunciation time of the word by the speaker (the D value in S-DB) converges.
(Proof) Assume that there have already been U(W_ij) − 1 updates in S-DB, with the average pronunciation time D_i(W_ij) of a word W_ij by a speaker S_i given by

D_i(W_ij) = (1/n) Σ_{k=1}^{n} X(W_ij, k),

where n = U(W_ij) − 1 and X(W_ij, k) is the k-th pronunciation time of the word W_ij by the speaker S_i. Then, the next update with the current pronunciation time of the word W_ij by the speaker S_i is done by the following rule:

D_i^new(W_ij) = (n · D_i(W_ij) + X(W_ij, n + 1)) / (n + 1).

By the Central Limit Theorem [11,14], the sampling distribution of the sample mean D_i^new(W_ij) approaches a normal distribution with mean m and variance σ²/U(W_ij) as the sample size U(W_ij) gets larger, regardless of the shape of the original population distribution. As the number of updates U(W_ij) in S-DB approaches infinity, the variance σ²/U(W_ij) approaches zero. Therefore, the pronunciation time D(W_ij) of the word W_ij by the speaker S_i converges to m, the expected value of D(W_ij). (Q.E.D.)

Fig. 7 A segment of an Informatized Caption with three or more consecutive incorrectly recognized words
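The convergence in Theorem 1 can be illustrated with a small simulation: repeatedly folding i.i.d. pronunciation times into the stored average via the running-average update rule drives the D value toward the true mean m. The distribution parameters and function name below are illustrative.

```python
import random

def updated_average(D, U, x):
    """One S-DB update step: fold the new pronunciation time x into the
    stored average D after the frequency count has been incremented to U."""
    return (D * (U - 1) + x) / U

random.seed(0)
m, sigma = 0.5, 0.1      # assumed true mean / std of the pronunciation time
D, U = 0.0, 0
for _ in range(100_000):
    U += 1
    D = updated_average(D, U, random.gauss(m, sigma))
# D is now close to m, as Theorem 1 predicts
```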

Experiment setup
There are many factors that reduce accuracy in speech recognition; the most critical one is noise. Seven types of noise were used in the experiments, as shown in Fig. 8. Six colors of noise [16] (gray, purple, blue, brown, pink, and white noise) were used for the first experiment, which shows how different types of noise affect the proposed method. Rain noise, a typical example of white noise captured from the sound of rain, was used for the second experiment, which shows how different levels (decibels) of noise affect the proposed method. The last experiment tests whether the proposed method works with various movie samples. In this experiment, each movie sample inherently includes natural noise, so no additional noise was added.
The correct timing information of each word in each test audio file was measured by checking the timing of each word while playing the audio file in the wave display panel of Adobe CC 2019. The measured values were recorded to two decimal places in seconds. To verify the accuracy of the timing information from the Informatized Caption, three kinds of experiments were performed. The accuracy measure used in the three experiments is given by Eq. (13).
Accuracy = |{x | x ∈ CRW and |t_S(x) − t*_S(x)| < 0.01 and |t_E(x) − t*_E(x)| < 0.01}| / (total number of test words) × 100     (13)

In Eq. (13), CRW denotes the set of correctly recognized words, |A| denotes the number of elements in a set A, and t*_S(x) and t*_E(x) denote the correct start and end times of the word x (timing unit: seconds). The first and second experiments were done with clean sound, namely English listening test audio files [8] recorded in a well-structured, noise-free environment. Detailed information for each test audio file is shown in Table 1. Each of the test files is mixed with noise at signal-to-noise ratios [18] of 20, 15, 10, and 5 dB, as shown in Fig. 9.
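The accuracy measure of Eq. (13) can be sketched in Python as below; the data layout (word position mapped to a (word, t_S, t_E) tuple) and function name are illustrative assumptions.

```python
def timing_accuracy(recognized, reference, tol=0.01):
    """Eq. (13): percentage of test words that are correctly recognized AND
    whose start/end times are within `tol` seconds of the measured ground
    truth.  Both arguments map word position -> (word, t_S, t_E)."""
    hits = 0
    for key, (word, ts, te) in reference.items():
        r = recognized.get(key)
        if r and r[0] == word and abs(r[1] - ts) < tol and abs(r[2] - te) < tol:
            hits += 1
    return 100.0 * hits / len(reference)
```

For example, if one of two reference words is recognized with both boundaries within 0.01 s, the accuracy is 50%.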
The first experiment is based on the colors of noise: the six noises were synthesized at different signal-to-noise ratios to make a total of 96 samples. The second experiment uses a total of 16 samples based on the sound of rain, a typical example of white noise heard in real life. The last experiment evaluates the performance in various real noisy environments with three genres of films, testing the accuracy of the timing information from Informatized Captions after speech recognition with the IBM Watson API. The films consist of 5 animation movies dubbed in a well-structured, noise-free environment, 5 horror and action movies with various sound effects, and 5 musical movies with much background music. The sample selection conditions for each genre are as follows: for animation, noise-free samples were chosen; for horror and action, samples with staccato-like noise; for musicals, samples with constant noise such as background music. The extracted time periods were 6 to 20 s long, reflecting the characteristics of each genre. Detailed information for each movie is shown in Table 2.

Experiment 1: Colors of noise
Figure 10 shows the accuracy of the proposed algorithm for different levels of each color of noise. The vertical axis shows the signal-to-noise ratio and the horizontal axis shows the timing accuracy. The average accuracy is 73.10% for white noise, 80.80% for pink noise, 81.20% for brown noise, 79.71% for blue noise, 79.96% for purple noise, and 81.60% for gray noise. Regardless of the noise level, the accuracy for white noise was the lowest. Since the proposed algorithm uses the results of speech recognition, its result may be less accurate when the noise level is very high and similar to the voice level.

Experiment 2: Levels of rain noise
Figure 11 shows the results of experiments with 4 test audio files after applying 4 different noise levels to each sample.
The vertical axis represents the accuracy and the horizontal axis represents the signal-to-noise ratio of the mixed audio files with varying noise intensity. Experimental results show that the timing accuracy of the Informatized Caption produced by the IBM Watson API is more than 84.80% for the English listening test audio files when the noise is small, and that the accuracy decreases significantly as the noise increases. The timing accuracy of the proposed algorithm is more than 95.70% for the same English listening test audio files. However, the accuracy of the proposed algorithm also drops when the noise level is very high and similar to the voice level, since the proposed algorithm compensates the result of the IBM Watson API.

Experiment 3: Various movie samples
The accuracy of the timing information of each sample movie for each genre is shown in Table 3 (animation movies), Table 4 (horror and action movies), and Table 5 (musical movies). The average accuracy of the IBM Watson method was 42.95% for all movies and 53.65% for animation, horror, and action movies. Since most time periods in musical movies include background music and singing voices, the speech recognition performance of IBM Watson is inherently very low there, and consequently the accuracy for musical movies is the lowest. The average accuracy of the proposed method is 66.35% for all movies and 81.09% for animation, horror, and action movies. The proposed method thus shows more than 1.5 times the performance of the IBM Watson method.

Concluding remarks
The IBM Watson API is one of the most popular speech recognition technologies today. It performs not only the translation from speech to text but also provides useful data such as word timing and speaker ID information, which together form an Informatized Caption. However, the IBM Watson API is vulnerable to noisy audio and is consequently not well suited to movie audio, where background music or special sound effects are mixed in. In this paper, a novel method of correcting incorrectly recognized words using the original caption was proposed to enhance the timing performance of Informatized Captions while updating the S-DB database in real time. A mathematical validation was given for the convergence of the average pronunciation time in S-DB as S-DB is updated continually. Experimental results show that improved speech recognition performance can be achieved in the presence of various types of noise, and three kinds of experiments were performed to verify the accuracy of the timing information produced by the proposed method. The average accuracy of the proposed method is 81.09% for animation, horror, and action movies and 66.35% for all movies, which means that the proposed method shows more than 1.5 times the performance of the original IBM Watson method.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material.
If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.