
Deep Learning Approaches for Sung Vowel Classification

  • Conference paper
  • First Online:
Artificial Intelligence in Music, Sound, Art and Design (EvoMUSART 2024)

Abstract

Phoneme classification is an important part of automatic speech recognition systems. However, phoneme classification during singing has received significantly less study. In this work, we investigate sung vowel classification, a subset of the phoneme classification problem. Many prior approaches to classifying spoken or sung vowels rely on spectral feature extraction, such as formants or Mel-frequency cepstral coefficients. We instead explore classifying sung vowels with deep neural networks trained directly on raw audio. Using VocalSet, a singing voice dataset performed by professional singers, we compare three neural models and two spectral models for classifying five sung Italian vowels performed with a variety of vocal techniques. Our neural models achieved accuracies between 68.4% and 79.6%, whereas our spectral models failed to discern vowels. Of the neural models, a fine-tuned transformer performed the strongest; however, a convolutional or recurrent model may provide satisfactory results in resource-limited scenarios. This result implies that neural approaches trained directly on raw audio, without extracting spectral features, are viable for singing phoneme classification and deserve further exploration.
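The sketch below illustrates the kind of raw-audio transformer approach the abstract describes: a pretrained speech transformer adapted for five-way vowel classification. It is a minimal, hypothetical example, not the authors' implementation; the wav2vec 2.0 checkpoint name, 16 kHz sampling rate, and vowel label ordering are assumptions, and in practice the classification head would first be fine-tuned on labeled VocalSet clips.

```python
# Hypothetical sketch: adapting a pretrained speech transformer for
# five-way sung vowel classification directly on raw audio.
# Checkpoint, sampling rate, and label set are illustrative assumptions.
import torch
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForSequenceClassification

VOWELS = ["a", "e", "i", "o", "u"]  # the five Italian vowels

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2ForSequenceClassification.from_pretrained(
    "facebook/wav2vec2-base", num_labels=len(VOWELS)
)  # the classification head is new and would be fine-tuned before use

def classify_vowel(path: str) -> str:
    """Predict the sung vowel in one audio clip (any sample rate, any channels)."""
    waveform, sr = torchaudio.load(path)
    # Resample and mix down to a mono 16 kHz signal, as wav2vec 2.0 expects.
    mono = torchaudio.functional.resample(waveform, sr, 16_000).mean(dim=0)
    inputs = extractor(mono.numpy(), sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return VOWELS[int(logits.argmax(dim=-1))]
```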

Parker Carlson: work done while at Oregon State University.


Notes

  1. https://sox.sourceforge.net/sox.html
  2. Each split of singers is either (two male, three female) or (three male, two female); see the illustrative split sketch after these notes.
  3. tensorflow.org
  4. github.com/yizhilll/MERT/blob/main/scripts/MERT_demo_inference.py
  5. pytorch.org
  6. github.com/huggingface/transformers
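As referenced in note 2, evaluation splits hold out singers rather than clips. The sketch below shows one way such a singer-disjoint split could be constructed; it is a hypothetical illustration, and the singer IDs and counts are placeholders rather than the actual VocalSet roster.

```python
# Hypothetical sketch of a singer-disjoint test split as described in note 2:
# each test fold holds out five singers, either (two male, three female)
# or (three male, two female). Singer IDs below are placeholders.
import random

def split_singers(male_ids, female_ids, n_test_male):
    """Hold out n_test_male male and (5 - n_test_male) female singers for testing."""
    test = random.sample(male_ids, n_test_male) + random.sample(female_ids, 5 - n_test_male)
    train = [s for s in male_ids + female_ids if s not in test]
    return train, test

# Placeholder singer IDs, not the actual VocalSet identifiers.
male_ids = [f"m{i}" for i in range(1, 12)]
female_ids = [f"f{i}" for i in range(1, 10)]
train_ids, test_ids = split_singers(male_ids, female_ids, n_test_male=2)
```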


Author information

Correspondence to Parker Carlson.



Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Carlson, P., Donnelly, P.J. (2024). Deep Learning Approaches for Sung Vowel Classification. In: Johnson, C., Rebelo, S.M., Santos, I. (eds) Artificial Intelligence in Music, Sound, Art and Design. EvoMUSART 2024. Lecture Notes in Computer Science, vol 14633. Springer, Cham. https://doi.org/10.1007/978-3-031-56992-0_5


  • DOI: https://doi.org/10.1007/978-3-031-56992-0_5

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-56991-3

  • Online ISBN: 978-3-031-56992-0

  • eBook Packages: Computer Science, Computer Science (R0)
