Abstract
This paper focuses on the impact of phonetic inaccuracies in acoustic training data on the performance of an automatic speech recognition (ASR) system. This is especially important when the training data is created in an automated way, because such data often contains errors in the form of wrong phonetic transcriptions. A series of experiments simulating various common errors in phonetic transcriptions is conducted on parts of the GlobalPhone data set (for Croatian, Czech and Russian). These experiments show the influence of various errors on different languages and acoustic models (Gaussian mixture models, deep neural networks). The impact of errors is also shown for real data obtained by our automated ASR creation process for Belarusian. The results show that the best performance is achieved by using the most accurate data; however, a certain amount of errors (up to 5%) has a relatively small impact on speech recognition accuracy.
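To make the idea of simulating transcription errors concrete, the following is a minimal sketch, not the authors' procedure. It assumes that errors are injected as random phone substitutions at a fixed rate. The function names, the toy phone inventory, and the choice of substitution as the only error type are all hypothetical and serve illustration only.

```python
import random


def corrupt_transcription(phones, inventory, error_rate, seed=0):
    """Return a copy of `phones` in which each phone is replaced, with
    probability `error_rate`, by a different phone drawn from `inventory`.

    Assumption: only substitution errors are modelled here; insertions
    and deletions are left out to keep the sketch short.
    """
    rng = random.Random(seed)  # seeded for reproducible corruption
    corrupted = []
    for phone in phones:
        if rng.random() < error_rate:
            # Choose any phone except the correct one.
            alternatives = [p for p in inventory if p != phone]
            corrupted.append(rng.choice(alternatives))
        else:
            corrupted.append(phone)
    return corrupted


def substitution_rate(reference, hypothesis):
    """Fraction of positions at which two equal-length phone sequences differ."""
    if len(reference) != len(hypothesis):
        raise ValueError("sequences must have the same length")
    if not reference:
        return 0.0
    mismatches = sum(r != h for r, h in zip(reference, hypothesis))
    return mismatches / len(reference)


if __name__ == "__main__":
    # Hypothetical toy inventory and transcription, not taken from GlobalPhone.
    inventory = ["a", "e", "i", "o", "u", "p", "t", "k", "s", "m"]
    reference = ["p", "a", "t", "e", "k", "a", "s", "i", "m", "o"] * 100
    noisy = corrupt_transcription(reference, inventory, error_rate=0.05)
    print(f"realised substitution rate: {substitution_rate(reference, noisy):.3f}")
```

Under these assumptions, a corrupted copy of the training transcriptions could be produced for each target error rate and used to train a separate acoustic model, so that recognition accuracy can be compared across error levels.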
Acknowledgements
This work was supported by the Technology Agency of the Czech Republic (Project No. TA04010199) and by the Student Grant Scheme 2017 of the Technical University in Liberec.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Safarik, R., Mateju, L. (2017). The Impact of Inaccurate Phonetic Annotations on Speech Recognition Performance. In: Ekštein, K., Matoušek, V. (eds) Text, Speech, and Dialogue. TSD 2017. Lecture Notes in Computer Science, vol 10415. Springer, Cham. https://doi.org/10.1007/978-3-319-64206-2_45
DOI: https://doi.org/10.1007/978-3-319-64206-2_45
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-64205-5
Online ISBN: 978-3-319-64206-2
eBook Packages: Computer Science, Computer Science (R0)