Abstract
Phonetic analysis of speech generally requires aligning audio samples to their phonetic transcription. This task can be performed manually for a few files, but as the corpus grows it becomes infeasibly time-consuming, which underscores the need for computational tools that perform forced alignment between speech and phonemes automatically. Given the scarcity of phonetic alignment tools for Brazilian Portuguese (BP), this work describes the development of a free phonetic alignment tool for BP using Kaldi, a toolkit that has been the state of the art for open-source speech recognition. Five acoustic models were trained with Kaldi and tested on phonetic alignment, with evaluation based on the phone boundary metric. The results show that the tool's performance is similar to that of some Kaldi-based aligners for other languages, and superior to that of an outdated phonetic aligner for BP based on the HTK toolkit.
Acknowledgment
We gratefully acknowledge NVIDIA Corporation for the donation of the Titan Xp GPU used for this research. The authors would also like to thank the CAPES and CNPq research funding agencies, and the Federal University of Pará (UFPA), under Edital nº 06/2019 – PIBIC/PROPESP, for the financial support.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Dias, A.L., Batista, C., Santana, D., Neto, N. (2020). Towards a Free, Forced Phonetic Aligner for Brazilian Portuguese Using Kaldi Tools. In: Cerri, R., Prati, R.C. (eds) Intelligent Systems. BRACIS 2020. Lecture Notes in Computer Science, vol 12319. Springer, Cham. https://doi.org/10.1007/978-3-030-61377-8_44
DOI: https://doi.org/10.1007/978-3-030-61377-8_44
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-61376-1
Online ISBN: 978-3-030-61377-8
eBook Packages: Computer Science (R0)