Abstract
Obtaining training material for rarely used English words and for common given names from countries where English is not spoken is difficult because of the time, storage, and cost involved. When personal privacy is also taken into account, language-independent (LI), lightweight speaker-dependent (SD) automatic speech recognition (ASR) is a convenient way to solve the problem. The dynamic time warping (DTW) algorithm is the state-of-the-art algorithm for small-footprint SD ASR in real-time applications with limited storage and small vocabularies. These applications include voice dialing on mobile devices, menu-driven recognition, and voice control in vehicles and robots. However, traditional DTW has several limitations, such as high computational complexity, coarse approximation induced by constraints, and inaccuracy. In this paper, we introduce the merge-weighted dynamic time warping (MWDTW) algorithm. This method defines a template confidence index for measuring the similarity between merged training data and testing data, while following the core DTW process. MWDTW is simple, efficient, and easy to implement. Through extensive experiments on three representative SD speech recognition datasets, we demonstrate that our method significantly outperforms DTW, DTW on merged speech data, and the hidden Markov model (HMM), and is also six times faster than DTW overall.
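For readers unfamiliar with the baseline, the core DTW process mentioned above can be sketched as follows. This is the textbook dynamic-programming recurrence only, not the MWDTW method, whose template confidence index is not specified in the abstract. The function name `dtw_distance`, the 1-D input sequences, and the absolute-difference local distance are assumptions chosen for illustration; a speech system would typically compare frames of feature vectors instead.

```python
from math import inf


def dtw_distance(a, b):
    """Classic DTW distance between two 1-D sequences (illustrative sketch)."""
    n, m = len(a), len(b)
    # cost[i][j] holds the minimal accumulated cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Local distance: absolute difference (an assumption for this sketch).
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = d + min(cost[i - 1][j],      # a advances, b repeats
                                 cost[i][j - 1],      # b advances, a repeats
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]
```

The nested loops fill an (n + 1) by (m + 1) table, which makes the quadratic time and memory cost of unconstrained DTW visible. That cost is the computational complexity limitation the abstract refers to.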
Additional information
This work was supported by the Research Plan Project of National University of Defense Technology under Grant No. JC13-06-01, and the OCRit Project made possible by the Global Leadership Round in Genomics & Life Sciences Grant (GL2).
About this article
Cite this article
Zhang, XL., Luo, ZG. & Li, M. Merge-Weighted Dynamic Time Warping for Speech Recognition. J. Comput. Sci. Technol. 29, 1072–1082 (2014). https://doi.org/10.1007/s11390-014-1491-0
DOI: https://doi.org/10.1007/s11390-014-1491-0