Towards a Free, Forced Phonetic Aligner for Brazilian Portuguese Using Kaldi Tools

Dias, Ana Larissa; Batista, Cassio; Santana, Daniel; Neto, Nelson

doi:10.1007/978-3-030-61377-8_44

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12319)

Included in the following conference series:

Brazilian Conference on Intelligent Systems

Abstract

Phonetic analysis of speech generally requires aligning audio samples to their phonetic transcription. This task can be performed manually for a handful of files, but as the corpus grows it becomes infeasibly time-consuming, which underscores the need for computational tools that perform such forced alignment between speech and phonemes automatically. Given the scarcity of phonetic alignment tools for Brazilian Portuguese (BP), this work describes the ongoing development of a free phonetic alignment tool for BP using Kaldi, a toolkit that has been the state of the art for open-source speech recognition. Five acoustic models were trained with Kaldi and tested in phonetic alignment, with evaluation carried out in terms of the phone boundary metric. The results show that the tool's performance is similar to that of some Kaldi-based aligners for other languages, and superior to that of an outdated phonetic aligner for BP based on the HTK toolkit.
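
The abstract names the phone boundary metric without defining it. A common reading is the absolute time difference between each boundary placed by the aligner and the corresponding boundary in a manual reference, often summarized as the share of boundaries falling within a tolerance. The Python sketch below illustrates that reading only; the function names, the millisecond timestamps, and the 25 ms and 50 ms tolerances are illustrative assumptions, not values taken from the paper.

```python
from statistics import mean


def phone_boundary_errors(reference_ms, hypothesis_ms):
    """Absolute difference (ms) between paired phone boundaries.

    Assumes both alignments cover the same phone sequence, so the
    boundaries pair up one-to-one (hypothetical helper, not from the paper).
    """
    if len(reference_ms) != len(hypothesis_ms):
        raise ValueError("alignments must contain the same number of boundaries")
    return [abs(r - h) for r, h in zip(reference_ms, hypothesis_ms)]


def fraction_within(errors_ms, tolerance_ms):
    """Share of boundaries whose error does not exceed the tolerance."""
    if not errors_ms:
        raise ValueError("no boundaries to evaluate")
    return sum(e <= tolerance_ms for e in errors_ms) / len(errors_ms)


# Made-up boundary times in milliseconds, for illustration only.
reference_ms = [100, 250, 400, 620]   # manual (reference) boundaries
hypothesis_ms = [110, 230, 450, 620]  # boundaries from a forced aligner

errors = phone_boundary_errors(reference_ms, hypothesis_ms)
print("errors (ms):", errors)
print("mean error (ms):", mean(errors))
print("within 25 ms:", fraction_within(errors, 25))
print("within 50 ms:", fraction_within(errors, 50))
```

Integer milliseconds are used here so that comparisons against the tolerance are exact; with floating-point seconds, an error that sits exactly on the tolerance could be counted either way.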

Acknowledgment

We gratefully acknowledge NVIDIA Corporation for the donation of the Titan Xp GPU used for this research. The authors would also like to thank the CAPES and CNPq research funding agencies, and the Federal University of Pará (UFPA) under Edital nº 06/2019 – PIBIC/PROPESP, for the financial support.

Author information

Authors and Affiliations

Institute of Exact and Natural Sciences, Federal University of Pará, Augusto Corrêa 1, Belém, 66075-110, Brazil
Ana Larissa Dias, Cassio Batista, Daniel Santana & Nelson Neto

Authors

Ana Larissa Dias
Cassio Batista
Daniel Santana
Nelson Neto

Corresponding author

Correspondence to Ana Larissa Dias.

Editor information

Editors and Affiliations

Federal University of São Carlos, São Carlos, Brazil
Ricardo Cerri
Federal University of ABC, Santo Andre, Brazil
Ronaldo C. Prati

About this paper

Cite this paper

Dias, A.L., Batista, C., Santana, D., Neto, N. (2020). Towards a Free, Forced Phonetic Aligner for Brazilian Portuguese Using Kaldi Tools. In: Cerri, R., Prati, R.C. (eds) Intelligent Systems. BRACIS 2020. Lecture Notes in Computer Science, vol 12319. Springer, Cham. https://doi.org/10.1007/978-3-030-61377-8_44

DOI: https://doi.org/10.1007/978-3-030-61377-8_44
Published: 13 October 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-61376-1
Online ISBN: 978-3-030-61377-8
eBook Packages: Computer Science, Computer Science (R0)
