Towards a Free, Forced Phonetic Aligner for Brazilian Portuguese Using Kaldi Tools

  • Conference paper
Intelligent Systems (BRACIS 2020)

Abstract

Phonetic analysis of speech generally requires aligning audio samples with their phonetic transcription. This task can be performed manually for a handful of files, but as the corpus grows it becomes infeasibly time-consuming, which underscores the need for computational tools that perform this speech-to-phoneme forced alignment automatically. Given the scarcity of phonetic alignment tools for Brazilian Portuguese (BP), this work describes the development of a free phonetic alignment tool for BP using Kaldi, a toolkit that has been the state of the art for open-source speech recognition. Five acoustic models were trained with Kaldi and tested in phonetic alignment, with evaluation based on the phone boundary metric. The results show that the tool's performance is similar to that of some Kaldi-based aligners for other languages and superior to that of an outdated phonetic aligner for BP based on the HTK toolkit.
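The abstract names the phone boundary metric without defining it. A common reading, assumed here and not taken from the paper, is the absolute time difference between each boundary produced by the aligner and the corresponding manually annotated boundary, summarized as the share of boundaries that fall within a tolerance. The minimal Python sketch below illustrates that reading on invented toy data; the segment layout, function names, phone labels, and tolerance values are all hypothetical.

```python
from typing import List, Tuple

# Hypothetical segment layout: (phone label, start in ms, end in ms).
Segment = Tuple[str, int, int]


def boundary_errors(reference: List[Segment], hypothesis: List[Segment]) -> List[int]:
    """Absolute difference (ms) between the end boundaries of matching phones."""
    ref_phones = [phone for phone, _, _ in reference]
    hyp_phones = [phone for phone, _, _ in hypothesis]
    if ref_phones != hyp_phones:
        raise ValueError("reference and hypothesis must share the same phone sequence")
    return [abs(r_end - h_end) for (_, _, r_end), (_, _, h_end) in zip(reference, hypothesis)]


def fraction_within(errors: List[int], tolerance_ms: int) -> float:
    """Share of boundaries whose error does not exceed the tolerance."""
    if not errors:
        return 0.0
    return sum(e <= tolerance_ms for e in errors) / len(errors)


# Invented toy alignment of one word; not data from the paper.
reference = [("k", 0, 80), ("a", 80, 210), ("z", 210, 300), ("a", 300, 450)]
hypothesis = [("k", 0, 90), ("a", 90, 240), ("z", 240, 300), ("a", 300, 500)]

errors = boundary_errors(reference, hypothesis)
for tolerance in (20, 40):
    print(f"within {tolerance} ms: {fraction_within(errors, tolerance):.2f}")
```

Working in integer milliseconds keeps the tolerance comparison exact. A real evaluation would read the boundaries from the aligner output and from hand-corrected annotations instead of hard-coded lists.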

Acknowledgment

We gratefully acknowledge NVIDIA Corporation for the donation of the Titan Xp GPU used for this research. The authors would also like to thank the CAPES and CNPq research funding agencies, and the Federal University of Pará (UFPA) under Edital nº 06/2019 – PIBIC/PROPESP, for the financial support.

Author information

Corresponding author

Correspondence to Ana Larissa Dias.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Dias, A.L., Batista, C., Santana, D., Neto, N. (2020). Towards a Free, Forced Phonetic Aligner for Brazilian Portuguese Using Kaldi Tools. In: Cerri, R., Prati, R.C. (eds) Intelligent Systems. BRACIS 2020. Lecture Notes in Computer Science, vol 12319. Springer, Cham. https://doi.org/10.1007/978-3-030-61377-8_44

  • DOI: https://doi.org/10.1007/978-3-030-61377-8_44

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-61376-1

  • Online ISBN: 978-3-030-61377-8

  • eBook Packages: Computer Science, Computer Science (R0)
