Abstract
IBM Watson is a representative speech recognition system that can automatically generate not only speech-to-text information but also speaker ID and timing information, which together are called an Informatized Caption. However, when there is noise in the voice signal fed to the IBM Watson API, recognition performance decreases significantly. This is easily observed in movies with background music and special sound effects. This paper aims to improve the inaccuracy of current Informatized Captions in noisy environments. A method of modifying incorrectly recognized words and a method of enhancing timing accuracy while updating a database in real time are suggested, based on the original caption and the Informatized Caption information. Experimental results show that the proposed method achieves 81.09% timing accuracy for 10 representative animation, horror and action movies.
1 Introduction
In recent years artificial intelligence has come into wide use in various fields [1, 2, 7, 12, 13, 17, 19]. Artificial intelligence currently encompasses a huge variety of subfields, ranging from general learning and perception to specific tasks such as playing chess, proving mathematical theorems, writing poetry, driving a car on a crowded street, and diagnosing diseases. Artificial intelligence is relevant to any intellectual task; it is truly a universal field [15]. One actively researched area is natural language processing by speech recognition. However, it is difficult for machines to speak, hear and read human language. Therefore, natural language processing and speech recognition are among the most difficult and important fields in artificial intelligence [6]. One of the most popular speech recognition technologies is the IBM Watson API [9]. Among captions in which speech is converted into characters, captions that include timing and speaker ID information are called Informatized Captions [3,4,5], and such captions can be generated using the IBM Watson API [10]. However, the IBM Watson API is susceptible to incorrect recognition when there is noise in the audio signal, especially for movies where background music or special sounds are used. Much research has been done to solve this problem. Nevertheless, the previous studies [3,4,5] still have limitations: speakers cannot be distinguished well when multiple speakers pronounce the same word differently [3], and the database with speaker pronunciation time information is assumed to be ready in advance [4, 5]. In this paper, a method of modifying incorrectly recognized words using the original caption is proposed to enhance the timing performance while updating the database in real time using the Informatized Caption information.
2 Background
2.1 Informatized caption
As shown in Fig. 1, the IBM Watson API [10] produces a caption that includes not only each recognized word but also its timing information (e.g. start time and end time) and speaker ID. This kind of caption is called an Informatized Caption. However, when speech recognition is performed on noisy sound, many words are recognized incorrectly, and the timing information in the caption is therefore inaccurate.
2.2 Linear estimation method
Modifying the incorrectly recognized words and their timing information with the help of the accurate original caption amounts to estimating the timing information for each incorrectly recognized word. One possible method is linear estimation based on the number of characters in each word. For example, when there are two incorrectly recognized words B′ and C′ between the correctly recognized words A and D, the words B′ and C′ can be replaced by the correct words B and C by comparison with the original caption, and the timing information for B and C can be estimated by Eq. (1), where V(X) denotes the number of characters in the word X. Here, tS(C) can be assumed to equal tE(B) to reduce the number of uncertain variables.
where V(X) means the number of characters in word X of the Informatized Caption.
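As an illustrative sketch (not code from the paper), the character-count-proportional estimation of Eq. (1) generalizes naturally to distributing the interval between the surrounding correct words among any number of replaced words:

```python
def linear_estimate(t_end_prev, t_start_next, words):
    """Distribute the gap between the surrounding correctly recognized
    words among the replaced words, proportionally to each word's
    character count V(X), in the spirit of Eq. (1)."""
    total_chars = sum(len(w) for w in words)
    span = t_start_next - t_end_prev
    timings, t = [], t_end_prev
    for w in words:
        dur = span * len(w) / total_chars
        timings.append((w, t, t + dur))  # (word, t_S, t_E)
        t += dur
    return timings
```

For two equal-length words B and C between tE(A) = 1.0 s and tS(D) = 2.0 s, this places the shared boundary tE(B) = tS(C) at 1.5 s, consistent with the assumption stated above.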
3 Speaker pronunciation time database (S-DB)
3.1 Database design
Figure 2 shows the model of the speaker pronunciation time database, S-DB, which consists of nodes for each speaker Sp. Each node is composed of a word ID Wpx, the average pronunciation time Dpx, and the appearance frequency Upx. The nodes for each speaker are kept in ascending alphanumeric order of the words and are connected in a linked list terminated by a NULL node.
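A minimal sketch of such a node (an illustration of the structure described above, not code from the paper) might look like:

```python
class SDBNode:
    """One entry in a speaker's linked list in S-DB: word ID W_pk,
    average pronunciation time D_pk, and appearance frequency U_pk."""
    def __init__(self, word, duration):
        self.word = word          # W_pk
        self.avg_time = duration  # D_pk, in seconds
        self.count = 1            # U_pk
        self.next = None          # next node in alphanumeric order; NULL at end
```

Each speaker Sp then owns one such list, with nodes kept sorted by word so that lookups and insertions can stop early.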
3.2 Real-time update method of S-DB
The construction and update of S-DB is divided into three cases: the case in which both speaker Sp and word Wpk exist in S-DB, the case in which speaker Sp exists but word Wpk does not, and the case in which neither speaker Sp nor word Wpk exists. The algorithm to update S-DB in real time in each case is as follows.
3.2.1 Case in which both speaker Sp and word Wpk are in S-DB
In this case, the first step is to find the corresponding word Wpk in the linked list for the speaker Sp in S-DB. Then the appearance frequency U is incremented by 1 and the average pronunciation time D is updated by Eq. (2).
where tS(X) and tE(X) are the start and end time of pronunciation of word X, respectively.
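As a sketch of the incremental-mean rule this update implies (an illustration consistent with the prose, since the display form of Eq. (2) is not reproduced here):

```python
def update_average(avg_time, count, new_duration):
    """Fold one new pronunciation time t_E(X) - t_S(X) into the stored
    average D, and increment the appearance frequency U by 1."""
    new_count = count + 1
    new_avg = (avg_time * count + new_duration) / new_count
    return new_avg, new_count
```

For example, folding a 4.0 s observation into a stored average of 2.0 s with U = 1 yields a new average of 3.0 s with U = 2.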
3.2.2 Case in which speaker Sp is in S-DB but word Wpk is not
In this case, the first step is to create a new node for the word Wpk in the linked list for the speaker Sp in S-DB. Then the appearance frequency U of the new node is set to 1 and the pronunciation time of the word Wpk is used as the initial average pronunciation time. Finally, insertion sorting is performed on this linked list according to alphanumeric order.
3.2.3 Case in which neither speaker Sp nor word Wpk is in S-DB
To register a new speaker in S-DB, the first step is to increment the number of speakers p by 1 and insert the new speaker into the existing speaker list of S-DB. Then a new node is created in the linked list for the speaker Sp, its appearance frequency U is set to 1, and the pronunciation time of the word Wpk is used as the initial average pronunciation time. Finally, insertion sorting is performed on the speaker list according to alphanumeric order.
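The three cases above can be sketched in one update routine. This is an illustration only: a plain Python dict and sorted list stand in for the paper's speaker list and linked list, and each entry is a `[word, avg_time, count]` triple.

```python
def update_sdb(sdb, speaker, word, duration):
    """Real-time S-DB update covering the three cases of Sect. 3.2.
    sdb maps speaker -> list of [word, avg_time, count] entries kept
    in alphanumeric order (a sorted list in place of the linked list)."""
    entries = sdb.setdefault(speaker, [])   # case 3.2.3: new speaker
    for entry in entries:
        if entry[0] == word:                # case 3.2.1: word exists
            entry[2] += 1                   # U += 1
            entry[1] += (duration - entry[1]) / entry[2]  # running mean D
            return
    entries.append([word, duration, 1])     # case 3.2.2: new word, U = 1
    entries.sort(key=lambda e: e[0])        # keep alphanumeric order
```

Note that `entry[1] += (duration - entry[1]) / entry[2]` after the increment is algebraically the same running mean as (D·U_old + d) / U_new.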
4 Proposed algorithm
4.1 Algorithm to modify incorrectly recognized word and timing information using S-DB
Figure 3 shows the overall structure of the proposed system. When a user speaks, the speech is translated into an Informatized Caption by a special Speech-to-Text engine such as IBM Watson, which can generate not only the text but also speaker and pronunciation timing information. Then the timing information of this Informatized Caption is modified by the proposed algorithm using the original caption. Specifically, the inputs of the proposed algorithm are the original caption, the Informatized Caption, the discrete voice signal, S-DB (i.e. speaker ID, word, pronunciation time, and appearance frequency), the threshold value used to recognize the start and end time of a word in the discrete voice signal, and the search range in the Informatized Caption for finding words matched with the original caption and modifying incorrectly recognized words and their timing information. The output of the proposed algorithm is the modified Informatized Caption with its timing information. S-DB updates occur only when there is no incorrectly recognized word. If incorrectly recognized words are found between correctly recognized words, their number is counted, the wrong words are replaced according to the original caption, and the timing information is calculated as in Fig. 4 or Algorithm 2 for each case.
4.2 Case 0: No incorrectly recognized word
In this case, the final output is the same as the output from the IBM Watson API. Therefore, the Informatized Caption is used only to update S-DB.
4.3 Case 1: The number of incorrectly recognized words is only one
In the case of Fig. 5, only one word is incorrectly recognized, and the required processing is simply to replace the wrong word B with the correct one and to find its timing information. If there is a minimum time tMIN between tE(A) and tS(C) at which the value of the discrete voice signal exceeds the threshold T, it can be taken as the start time of the correct word B. Otherwise, the start time of B is assumed to be the same as the end time of the previous word, tE(A). Likewise, if there is a maximum time tMAX between tE(A) and tS(C) at which the value of the discrete voice signal exceeds the threshold T, it can be taken as the end time of the correct word B. Otherwise, the end time of B is assumed to be the same as the start time of the next word, tS(C).
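A sketch of this boundary search (illustrative only; the signal representation is an assumption, with amplitude samples and their timestamps passed as parallel lists):

```python
def estimate_boundaries(signal, times, t_end_A, t_start_C, threshold):
    """Case 1: take the earliest and latest instants between t_E(A) and
    t_S(C) where |signal| exceeds the threshold T as t_S(B) and t_E(B);
    fall back to the neighbouring words' boundaries when nothing crosses."""
    active = [t for t, s in zip(times, signal)
              if t_end_A <= t <= t_start_C and abs(s) > threshold]
    t_start_B = min(active) if active else t_end_A  # t_MIN or t_E(A)
    t_end_B = max(active) if active else t_start_C  # t_MAX or t_S(C)
    return t_start_B, t_end_B
```

If the interval contains no sample above T, both fallbacks trigger and word B is pinned directly between its neighbours.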
4.4 Case 2: The number of incorrectly recognized words is two
In the case of Fig. 6, there are two consecutive incorrectly recognized words B and C between the correct words A and D. Therefore, tS(B) and tE(C) can be estimated using the same algorithm as in case 1 (Algorithm 3). Here, tE(B) is assumed to equal tS(C) for simplicity. The value of tE(B) (= tS(C)) can then be estimated either by the linear estimation method or from the speaker's previous pronunciation times D(S, B) and D(S, C) in S-DB. If neither D(S, B) nor D(S, C) exists in S-DB, i.e. the pronunciation times of words B and C are unknown, the linear estimation method of Eq. (3) is used.
where V(X) means the number of characters in word X of the Informatized Caption.
If only D(S, B) exists in S-DB, tE(B) is estimated by adding the average pronunciation time D(S, B) to tS(B), as in Eq. (4).
If only D(S, C) exists in S-DB, tE(B) is estimated by subtracting the average pronunciation time D(S, C) from tE(C), as in Eq. (5).
If both D(S, B) and D(S, C) exist in S-DB, Eq. (6), which uses the ratio of D(S, B) to D(S, C), is applied.
Finally, tS(C) is obtained from tE(B).
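The four-way choice among Eqs. (3)-(6) can be sketched as follows. Since the display forms of Eqs. (4)-(6) are not reproduced here, this is a reconstruction from the prose and the exact ratio form of Eq. (6) is an assumption:

```python
def estimate_te_B(t_s_B, t_e_C, word_B, word_C, d_B=None, d_C=None):
    """Case 2: estimate the shared boundary t_E(B) = t_S(C) between two
    replaced words; d_B, d_C are the S-DB averages D(S,B), D(S,C) or
    None when the speaker has no entry for that word."""
    if d_B is None and d_C is None:        # Eq. (3): linear estimation
        ratio = len(word_B) / (len(word_B) + len(word_C))
        return t_s_B + (t_e_C - t_s_B) * ratio
    if d_B is not None and d_C is None:    # Eq. (4): t_S(B) + D(S,B)
        return t_s_B + d_B
    if d_B is None:                        # Eq. (5): t_E(C) - D(S,C)
        return t_e_C - d_C
    ratio = d_B / (d_B + d_C)              # Eq. (6): ratio of D(S,B), D(S,C)
    return t_s_B + (t_e_C - t_s_B) * ratio
```

The returned value serves as both tE(B) and tS(C), per the simplifying assumption above.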
4.5 Case 3: The number of incorrectly recognized words is three or more
In the case of Fig. 7, there are three or more consecutive incorrectly recognized words W1, …, WL between the correct words A and B. Therefore, tS(W1) and tE(WL) are estimated using the same algorithm as in case 1 (Algorithm 3). Here, tE(Wi) is assumed to equal tS(Wi+1) for i = 1, …, L−1 for simplicity. The timing information from tE(W1) to tS(WL) can be estimated either by the linear estimation method or from the speaker's previous pronunciation times D(S, W1), …, D(S, WL) in S-DB. In a first scan of the incorrectly recognized words, the total number of characters in all incorrectly recognized words and the total pronunciation time of the words that belong to S-DB are calculated. In a second scan, the timing of each word is estimated using Eq. (7) or Eq. (8): Eq. (7) if the word is not in S-DB, and Eq. (8) if it is.
where Wq, Wq+1, …, Wr are the incorrectly recognized words.
The above equations can also be applied to calculate tE(W1), tS(W2), tE(W2), …, tS(WL).
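The two-pass procedure can be sketched as below. Since the display forms of Eqs. (7)-(8) are not reproduced here, the exact sharing rule is an assumption reconstructed from the prose: words found in S-DB keep their stored average times, and the remainder of the span is shared among the unknown words by character count.

```python
def estimate_case3(t_start, t_end, words, sdb_times):
    """Case 3: two-pass timing estimation for L >= 3 replaced words.
    sdb_times[w] is the speaker's stored average time D(S,w), or None.
    Pass 1: total the known times and the characters of unknown words.
    Pass 2: assign each word its stored time (cf. Eq. 8) or a
    character-proportional share of what remains (cf. Eq. 7)."""
    known = sum(sdb_times[w] for w in words if sdb_times[w] is not None)
    chars = sum(len(w) for w in words if sdb_times[w] is None)
    remaining = (t_end - t_start) - known
    timings, t = [], t_start
    for w in words:
        dur = sdb_times[w] if sdb_times[w] is not None \
              else remaining * len(w) / chars
        timings.append((w, t, t + dur))  # (word, t_S, t_E)
        t += dur
    return timings
```

Consecutive boundaries coincide by construction, matching the assumption tE(Wi) = tS(Wi+1).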
4.6 Mathematical validation of the convergence of average pronunciation time as S-DB update
The pronunciation time of a word generally varies with the speaker, the speaker's average speaking speed, and the neighbouring words. For example, different people produce different voices and different pronunciation times for the same word. Even the same person may be hurried or relaxed when speaking. In addition, the same word spoken by the same person may have a different pronunciation time depending on the word located before or after it, as in soft-consonant phenomena. In this paper, we make two assumptions to simplify the problem.
Assumption 1. The pronunciation time of a word is independent of the neighbouring words.
Assumption 2. The pronunciation time of a word is an independent and identically distributed (i.i.d.) random variable drawn from a distribution with expected value m and finite variance σ2.
Theorem 1. As the number of updates in S-DB increases with more pronunciations of the same word by the same speaker, the average pronunciation time of the word by that speaker (the D value in S-DB) converges.
(Proof) Assume that there have already been U(Wij) − 1 updates in S-DB, with average pronunciation time Di(Wij) of a word Wij by a speaker Si.
where n = U(Wij) − 1 and X(Wij, n) is the n-th pronunciation time of the word Wij by a speaker Si.
Then the next update with the current pronunciation time D(Wij) of the word Wij by speaker Si follows the rule:
By the Central Limit Theorem [11, 14], the sampling distribution of the sample mean, \( {D}_i^{new}\left({W}_{ij}\right) \), approaches a normal distribution with mean m and variance \( \frac{\sigma^2}{\mathrm{U}\left({W}_{ij}\right)} \) as the sample size U(Wij) grows, regardless of the shape of the population distribution. As the number of updates U(Wij) in S-DB approaches infinity, the variance \( \frac{\sigma^2}{\mathrm{U}\left({W}_{ij}\right)} \) approaches zero. Therefore, the pronunciation time D(Wij) of the word Wij by the speaker Si converges to m, the expected value of D(Wij). (Q.E.D.)
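Theorem 1 is easy to check numerically. The following sketch (illustrative, with an arbitrary Gaussian pronunciation-time model as the assumed distribution) applies the same incremental-mean rule as the S-DB update and watches the running average settle at the true mean:

```python
import random

def simulate_convergence(mean=0.5, sigma=0.1, n_updates=10000, seed=7):
    """Numerical check of Theorem 1: the running average D of i.i.d.
    pronunciation times converges to the expected value m as the
    number of S-DB updates U grows."""
    random.seed(seed)
    d, u = 0.0, 0
    for _ in range(n_updates):
        x = random.gauss(mean, sigma)  # one new pronunciation time
        u += 1
        d += (x - d) / u               # same incremental mean as Eq. (2)
    return d
```

After 10,000 updates the running average lies within a few thousandths of a second of m = 0.5 s, as the shrinking variance σ²/U predicts.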
5 Experimental results
5.1 Experiment setup
There are many factors that reduce accuracy in speech recognition, and the most critical one is noise. Seven types of noise have been used for the experiments, as shown in Fig. 8. Six colors of noise [16] (gray, purple, blue, brown, pink and white) have been used for the first experiment, which shows how different types of noise affect the proposed method. A typical example of white noise, rain noise captured from the sound of rain, has been used for the second experiment, which shows how different noise levels (in decibels) affect the proposed method. The last experiment tests whether the proposed method works with various movie samples; since each movie sample already includes natural noise, no additional noise was added.
The correct timing information of each word in each test audio file was measured by checking the timing of each word while playing the audio file in the wave display panel of Adobe CC 2019. The measured values are given to two decimal places in seconds. To verify the accuracy of the timing information in the Informatized Caption, three kinds of experiments were performed; the accuracy measure used in all three is Eq. (13).
In Eq. (13), CRW denotes the set of correctly recognized words, |A| denotes the number of elements in a set A, and \( {t}_S^{\ast } \) and \( {t}_E^{\ast } \) denote the correct start and end times of the word X (in seconds). The first and second experiments are performed with clean sound, namely English listening test audio files [8] recorded in a well-structured noise-free environment. Detailed information for each test audio file is shown in Table 1. Each of the test files is mixed with noise at signal-to-noise ratios [18] of 20, 15, 10 and 5 dB, as shown in Fig. 9. The first experiment is based on the colors of noise: the six noises were synthesized at the different signal-to-noise ratios to make a total of 96 samples. The second experiment uses a total of 16 samples based on the sound of rain, a typical example of white noise heard in real life.
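Eq. (13) itself is not reproduced in this text. Purely to illustrate the quantities it involves (the actual formula in the paper may differ), a hypothetical timing-error measure over CRW could compare each estimated boundary with the measured one, averaging over the 2|CRW| boundaries:

```python
def mean_boundary_error(caption, ground_truth):
    """Hypothetical stand-in for Eq. (13): mean absolute deviation of
    estimated start/end times from the measured ones, averaged over
    the 2|CRW| word boundaries. Both arguments map each correctly
    recognized word to its (t_S, t_E) pair in seconds."""
    crw = [w for w in caption if w in ground_truth]
    total = 0.0
    for w in crw:
        (s, e), (s_true, e_true) = caption[w], ground_truth[w]
        total += abs(s - s_true) + abs(e - e_true)
    return total / (2 * len(crw))
```

A smaller value means the Informatized Caption's timing is closer to the manually measured ground truth.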
The last experiment evaluates performance in various real noisy environments using three genres of film, testing the accuracy of the timing information in the Informatized Caption after speech recognition with the IBM Watson API. The films consist of 5 animation movies dubbed in a well-structured noise-free environment, 5 horror and action movies with various sound effects, and 5 musical movies with a great deal of background music. The sample selection conditions for each genre are as follows: for animation, noise-free samples were chosen; for horror and action, samples with staccato-like noise; and for musicals, samples with constant noise such as background music. The time periods extracted for this experiment were 6 to 20 s long, reflecting the characteristics of each genre. Detailed information for each movie is shown in Table 2.
5.2 Experimental results
5.2.1 Experiment 1: Samples with different colors of noise
Figure 10 shows the accuracy of the proposed algorithm for different levels of each color of noise. The vertical axis shows the signal-to-noise ratio and the horizontal axis shows the timing accuracy. The average accuracy is 73.10% for white noise, 80.80% for pink noise, 81.20% for brown noise, 79.71% for blue noise, 79.96% for purple noise, and 81.60% for gray noise. Regardless of the noise level, white noise gives the lowest accuracy. Since the proposed algorithm builds on the results of speech recognition, it can be less accurate when the noise level is too high and close to the voice level.
5.2.2 Experiment 2: A sample with different noises
Figure 11 shows the results for the 4 test audio files after applying 4 different noise levels to each sample. The vertical axis represents the accuracy and the horizontal axis the signal-to-noise ratio of the mixed audio files with varying noise intensity. The results show that the timing accuracy of the Informatized Caption produced by the IBM Watson API exceeds 84.80% for the English listening test audio files when the noise is small, but decreases significantly as the noise increases. The timing accuracy of the proposed algorithm exceeds 95.70% for the same files. However, the accuracy of the proposed algorithm also drops when the noise level is very high and close to the voice level, since the proposed algorithm compensates the result of the IBM Watson API.
5.2.3 Experiment 3: Various movie samples
The accuracy of the timing information for each sample movie is shown in Table 3 (animation movies), Table 4 (horror and action movies), and Table 5 (musical movies). The average accuracy of the IBM Watson method was 42.95% over all movies and 53.65% for animation, horror and action movies. Since most time periods in musical movies include background music and singing voices, the speech recognition performance of IBM Watson is inherently very low there, and consequently the accuracy for musical movies is the lowest. The average accuracy of the proposed method is 66.35% over all movies and 81.09% for animation, horror and action movies. The proposed method thus shows more than 1.5 times the performance of the IBM Watson method.
6 Concluding remarks
The IBM Watson API is one of the most popular speech recognition technologies today. It not only performs the translation from speech to text but also provides useful data such as word timing and speaker ID information, which together form an Informatized Caption. However, the IBM Watson API is very sensitive to noisy audio and is consequently ill-suited to movie audio in which background music or special sound effects are mixed in.
In this paper, a novel method of modifying incorrectly recognized words using the original caption has been proposed to enhance the timing performance of the Informatized Caption while updating the S-DB database in real time. A mathematical validation shows that the average pronunciation time in S-DB converges as S-DB is updated continually. Experimental results show that improved speech recognition performance can be achieved in the presence of various types of noise, and three kinds of experiments have been performed to verify the accuracy of the timing information produced by the proposed method. The average accuracy of the proposed method is 81.09% for animation, horror and action movies and 66.35% for all movies, which means that the proposed method shows more than 1.5 times the performance of the original IBM Watson method.
Abbreviations
- X: word
- Ts(X): Original caption of a word X
- Ts+(X): Informatized Caption of a word X
- tS(X): Start time of a word X
- tE(X): End time of a word X
- S-DB: Speaker pronunciation time database
- Sp: p-th speaker
- Wpk: k-th word of p-th speaker in S-DB
- Up(Wpk): Appearance frequency of k-th word of p-th speaker
- Dp(Wpk): Average pronunciation time of k-th word of p-th speaker
- V(X): The number of characters in a word X
- D(S, X): Average pronunciation time of a word X of speaker S
References
Alsamhi SH, Ma O, Ansari MS (2018) Artificial intelligence-based techniques for emerging robotics communication: a survey and future perspectives. arXiv preprint arXiv:1804.09671
Ban F, Wu D, Hei Y (2018) Combined forecasting model of urban water consumption based on adaptive filtering and BP neural network. International Journal of Social and Humanistic Computing 3(1):34–45. https://doi.org/10.1504/IJSHC.2018.095011
Choi YS, Park HM, Son YS, Jung JW (2017) Informatized caption enhancement based on IBM Watson API. Proceedings of KIIS Autumn Conference 27(2):105–106
Choi YS, Son YS, Jung JW (2018) Informatized caption enhancement based on IBM Watson API and speaker pronunciation time-DB. Computer Science & Information Technology – computer science conference proceedings :105-110
Choi YS, Son YS, Jung JW (2018) A method to enhance Informatized caption from IBM Watson API using speaker pronunciation time-DB. International Journal on Natural Language Computing 7(1):1–11
Chowdhury GG (2003) Natural language processing. Annu Rev Inf Sci Technol 37(1):51–89. https://doi.org/10.1002/aris.1440370103
Drigas AS, Argyri K, Vrettaros J (2009) Decade review (1999-2009): progress of application of artificial intelligence tools in student diagnosis. International Journal of Social and Humanistic Computing 1(2):175–191. https://doi.org/10.1504/IJSHC.2009.031006
English listening test audios by Korea Institute for Curriculum and Evaluation. http://www.kice.re.kr/main.do?s=suneung
Ferrucci DA (2012) Introduction to "this is Watson". IBM J Res Dev 56.3(4):1–15. https://doi.org/10.1147/JRD.2012.2184356
IBM Cloud Documentation. https://console.bluemix.net/docs/
Kipnis C, Varadhan SRS (1986) Central limit theorem for additive functionals of reversible Markov processes and applications to simple exclusions. Commun Math Phys 104(1):1–19. https://doi.org/10.1007/BF01210789
Kiumarsi B, Vamvoudakis KG, Modares H, Lewis FL (2018) Optimal and autonomous control using reinforcement learning: a survey. IEEE transactions on neural networks and learning systems 29(6):2042–2062. https://doi.org/10.1109/TNNLS.2017.2773458
Mata J, de Miguel I, Durán RJ et al (2017) Artificial intelligence (AI) methods in optical networks: a comprehensive survey. Optical switching and networking 28:43–57. https://doi.org/10.1016/j.osn.2017.12.006
Rosenblatt M (1956) A central limit theorem and a strong mixing condition. Proc Natl Acad Sci 42(1):43–47. https://doi.org/10.1073/pnas.42.1.43
Russell SJ, Norvig P (2016) Artificial intelligence: a modern approach. Pearson Education Limited, Malaysia
Shahamiri SR, Salim SSB (2014) Real-time frequency-based noise-robust automatic speech recognition using multi-nets artificial neural networks: a multi-views multi-learners approach. Neurocomputing 129:199–207. https://doi.org/10.1016/j.neucom.2013.09.040
Shickel B, Tighe PJ, Bihorac A, Rashidi P (2017) Deep EHR: a survey of recent advances in deep learning techniques for electronic health record (EHR) analysis. IEEE Journal of Biomedical and Health Informatics 22(5):1589–1604. https://doi.org/10.1109/JBHI.2017.2767063
Stallings W (2006) Data and computer communications eighth edition. Prentice Hall, New Jersey, pp.92–96
Tan WK, Hassanpour S, Rundell SD et al (2018) Comparison of natural language processing rules-based and machine-learning systems to identify lumbar spine imaging findings related to low Back pain. Acad Radiol 25:1422–1432. https://doi.org/10.1016/j.acra.2018.03.008
Acknowledgments
This research was partially supported by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2020-2020-0-01789) supervised by the IITP (Institute of Information & Communications Technology Planning & Evaluation); the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2020R1F1A1074974); the KIAT (Korea Institute for Advancement of Technology) grant funded by the Korea Government (MOTIE: Ministry of Trade, Industry and Energy) (No. N0001884, HRD program for Embedded Software R&D); the AURI (Korea Association of University, Research institute and Industry) grant funded by the Korea Government (MSS: Ministry of SMEs and Startups) (No. S2938281, HRD program for Enterprise linkages R&D); and the MSIT (Ministry of Science and ICT), Korea, under the National Program for Excellence in SW supervised by the IITP (2016-0-00017).
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Cite this article
Choi, YS., Kang, JG., Joo, J.W.J. et al. Real-time Informatized caption enhancement based on speaker pronunciation time database. Multimed Tools Appl 79, 35667–35688 (2020). https://doi.org/10.1007/s11042-020-09590-2