Abstract
Hidden Markov models (HMMs) have proved successful in several research areas, especially speech recognition. However, a major drawback of the HMM classifier is its sensitivity to initial parameters such as the number of states, which must be tuned carefully. Indeed, it is well known that a number of states that suits one utterance may not perform as well for other utterances of the same class. To address this problem and to account for some levels of data variability, we investigate a new hybrid framework for speech recognition that integrates the HMM classifier within the K-nearest neighbors (KNN) architecture. In this framework, each class of data is represented by several HMMs that differ in their numbers of states, and the KNN rule, with the Viterbi likelihood as the similarity measure, selects the K nearest models and the most represented class among them. To remove ambiguity during the decision step, we propose two different methods. The framework is evaluated on the UCI Spoken Arabic Digit dataset. The results show the effectiveness of our approach compared both with HMM and KNN baselines and with previous work on the same dataset.
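The decision rule described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `viterbi_log_likelihood` and `knn_over_models` are hypothetical, the HMMs are assumed to have discrete observations for brevity (the paper's models may well use continuous densities), and the tie-break shown is an assumed placeholder, since the abstract does not specify its two disambiguation methods.

```python
from collections import Counter


def viterbi_log_likelihood(obs, log_pi, log_A, log_B):
    """Log-probability of the single best state path (Viterbi score).

    Assumes a discrete-observation HMM given in log space:
    log_pi[s] initial, log_A[r][s] transition, log_B[s][o] emission.
    """
    n = len(log_pi)
    delta = [log_pi[s] + log_B[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        delta = [
            max(delta[r] + log_A[r][s] for r in range(n)) + log_B[s][o]
            for s in range(n)
        ]
    return max(delta)


def knn_over_models(scores, k):
    """Pick the most represented class among the k best-scoring HMMs.

    scores: list of (class_label, viterbi_score), one entry per HMM;
    each class contributes several HMMs with different numbers of states.
    """
    ranked = sorted(scores, key=lambda cs: cs[1], reverse=True)[:k]
    votes = Counter(label for label, _ in ranked)
    top = max(votes.values())
    tied = {label for label, count in votes.items() if count == top}
    # Assumed tie-break (not necessarily one of the paper's two methods):
    # among tied classes, return the one that owns the best-ranked model.
    for label, _ in ranked:
        if label in tied:
            return label


# Toy scores for five models covering two classes (illustrative values only).
scores = [("zero", -10.0), ("one", -12.0), ("zero", -15.0),
          ("one", -11.0), ("one", -30.0)]
print(knn_over_models(scores, k=3))
```

With these toy scores and K = 3, the three nearest models are one "zero" model and two "one" models, so the vote goes to "one" even though the single best model belongs to "zero". This is the behavior that distinguishes the KNN rule from simply taking the maximum-likelihood model.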
Cite this article
Hazmoune, S., Bougamouza, F., Mazouzi, S. et al. A new hybrid framework based on Hidden Markov models and K-nearest neighbors for speech recognition. Int J Speech Technol 21, 689–704 (2018). https://doi.org/10.1007/s10772-018-9535-4
DOI: https://doi.org/10.1007/s10772-018-9535-4