Abstract
Phoneme classification is an important part of automatic speech recognition systems; however, phoneme classification in singing has received significantly less attention. In this work, we investigate sung vowel classification, a subset of the phoneme classification problem. Many prior approaches to classifying spoken or sung vowels rely on spectral feature extraction, such as formants or Mel-frequency cepstral coefficients. We instead explore classifying sung vowels with deep neural networks trained directly on raw audio. Using VocalSet, a singing voice dataset performed by professional singers, we compare three neural models and two spectral models for classifying five sung Italian vowels performed with a variety of vocal techniques. We find that our neural models achieved accuracies between 68.4% and 79.6%, whereas our spectral models failed to discern the vowels. Of the neural models, a fine-tuned transformer performed the strongest; however, a convolutional or recurrent model may provide satisfactory results in resource-limited scenarios. This result implies that neural approaches trained directly on raw audio, without extracting spectral features, are viable for singing phoneme classification and deserve further exploration.
Parker Carlson: Work done while at Oregon State University.
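To make the raw-audio approach concrete, the sketch below fine-tunes a pretrained wav2vec 2.0 encoder (Baevski et al., referenced below) for five-way sung vowel classification directly from waveforms. This is an illustrative sketch under assumed choices, not the authors' exact pipeline: the Hugging Face checkpoint name, 16 kHz mono input, optimizer settings, and batching are all assumptions.

import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForSequenceClassification

VOWELS = ["a", "e", "i", "o", "u"]       # the five sung Italian vowels
CHECKPOINT = "facebook/wav2vec2-base"    # assumed pretrained checkpoint (illustrative only)

extractor = Wav2Vec2FeatureExtractor.from_pretrained(CHECKPOINT)
model = Wav2Vec2ForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=len(VOWELS))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)   # Adam optimizer (Kingma and Ba, referenced below)

def train_step(waveforms, labels):
    # waveforms: list of 1-D float arrays of raw 16 kHz mono audio; labels: vowel indices 0-4.
    inputs = extractor(waveforms, sampling_rate=16_000, return_tensors="pt", padding=True)
    outputs = model(**inputs, labels=torch.tensor(labels))   # cross-entropy loss computed by the classification head
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()

def predict(waveform):
    # Classify a single raw-audio clip and return the predicted vowel label.
    inputs = extractor([waveform], sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return VOWELS[int(logits.argmax(dim=-1))]

The spectral baselines contrasted in the abstract would instead feed extracted formant or MFCC features to a separate classifier; only the neural models operate directly on the waveform as above.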
Notes
- 2. Each split of singers is either (two male, three female) or (three male, two female).
References
Avramidis, K., Kratimenos, A., Garoufis, C., Zlatintsi, A., Maragos, P.: Deep convolutional and recurrent networks for polyphonic instrument classification from monophonic raw audio waveforms. In: ICASSP 2021–2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3010–3014 (2021). https://doi.org/10.1109/ICASSP39728.2021.9413479
Baevski, A., Zhou, Y., Mohamed, A., Auli, M.: wav2vec 2.0: a framework for self-supervised learning of speech representations. In: Advances in Neural Information Processing Systems, vol. 33, pp. 12449–12460 (2020). https://doi.org/10.48550/arXiv.2006.11477
Burrows, D.: Singing and saying. J. Musicology 7(3), 390–402 (1989). https://doi.org/10.2307/763607, http://www.jstor.org/stable/763607
Dabike, G.R., Barker, J.: Automatic lyric transcription from karaoke vocal tracks: resources and a baseline system. In: Proceedings of Interspeech, pp. 579–583. International Speech Communication Association (ISCA) (2019). https://doi.org/10.21437/Interspeech.2019-2378
De Wet, F., Weber, K., Boves, L., Cranen, B., Bengio, S., Bourlard, H.: Evaluation of formant-like features on an automatic vowel classification task. J. Acoust. Soc. Am. 116(3), 1781–1792 (2004). https://doi.org/10.1121/1.1781620
Demirel, E., Ahlbäck, S., Dixon, S.: Automatic lyrics transcription using dilated convolutional neural networks with self-attention. In: Proceedings of the International Joint Conference on Neural Networks (IJCNN), pp. 1–8 (2020). https://doi.org/10.1109/IJCNN48605.2020.9207052
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Burstein, J., Doran, C., Solorio, T. (eds.) Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Association for Computational Linguistics, Minneapolis, Minnesota (2019). https://doi.org/10.18653/v1/N19-1423, https://aclanthology.org/N19-1423
Gales, M., Young, S.: Application of hidden Markov models in speech recognition. Found. Trends Sig. Process. 1(3), 195–304 (2008). https://doi.org/10.1561/2000000004
Garofolo, J.S., et al.: TIMIT acoustic-phonetic continuous speech corpus (1993). https://doi.org/10.35111/17gk-bn40
Graves, A., Fernández, S., Schmidhuber, J.: Bidirectional LSTM networks for improved phoneme classification and recognition. In: Duch, W., Kacprzyk, J., Oja, E., Zadrożny, S. (eds.) ICANN 2005. LNCS, vol. 3697, pp. 799–804. Springer, Heidelberg (2005). https://doi.org/10.1007/11550907_126
Gupta, C., Yılmaz, E., Li, H.: Automatic lyrics alignment and transcription in polyphonic music: does background music help? In: Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 496–500. IEEE (2020). https://doi.org/10.1109/TASLP.2022.3190742
He, Y., et al.: Streaming end-to-end speech recognition for mobile devices. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6381–6385 (2019). https://doi.org/10.1109/ICASSP.2019.8682336
Hillenbrand, J., Gayvert, R.T.: Vowel classification based on fundamental frequency and formant frequencies. J. Speech Lang. Hear. Res. 36(4), 694–700 (1993). https://doi.org/10.1044/jshr.3604.694
Hsu, W.N., Bolte, B., Tsai, Y.H.H., Lakhotia, K., Salakhutdinov, R., Mohamed, A.: HuBERT: self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Trans. Audio, Speech, Lang. Process. 29, 3451–3460 (2021). https://doi.org/10.1109/TASLP.2021.3122291
Jha, M.V., Rao, P.: Assessing vowel quality for singing evaluation. In: National Conference on Communications (NCC 2012), pp. 1–5. IEEE (2012). https://doi.org/10.1109/NCC.2012.6176860
Karpagavalli, S., Chandra, E.: A review on automatic speech recognition architecture and approaches. J. Sig. Process. Image Process. Pattern Recogn. 9(4), 393–404 (2016). https://doi.org/10.14257/ijsip.2016.9.4.34
Kermanshahi, M.A., Akbari, A., Nasersharif, B.: Transfer learning for end-to-end ASR to deal with low-resource problem in Persian language. In: 26th International Computer Conference, Computer Society of Iran (CSICC), pp. 1–5 (2021). https://doi.org/10.1109/CSICC52343.2021.9420540
Khunarsal, P., Lursinsap, C., Raicharoen, T.: Singing voice recognition based on matching of spectrogram pattern. In: Proceedings of the International Joint Conference on Neural Networks (IJCNN), pp. 1595–1599 (2009). https://doi.org/10.1109/IJCNN.2009.5179014
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: 3rd International Conference on Learning Representations (ICLR) (2015). https://doi.org/10.48550/arXiv.1412.6980
Korkmaz, Y., Boyacı, A., Tuncer, T.: Turkish vowel classification based on acoustical and decompositional features optimized by genetic algorithm. Appl. Acoust. 154, 28–35 (2019). https://doi.org/10.1016/j.apacoust.2019.04.027
Korkmaz, Y., Boyaci, A.: Classification of Turkish vowels based on formant frequencies. In: Proceedings of the International Conference on Artificial Intelligence and Data Processing (IDAP), pp. 1–4 (2018). https://doi.org/10.1109/IDAP.2018.8620877
Kruspe, A.M.: Training phoneme models for singing with “songified” speech data. In: Proceedings of the 16th International Conference on Music Information Retrieval (ISMIR), pp. 336–342 (2015). https://archives.ismir.net/ismir2015/paper/000034.pdf
Kruspe, A.M.: Application of automatic speech recognition technologies to singing. Ph.D. thesis, Technische Universität Ilmenau, Ilmenau, Germany (2018). https://www.db-thueringen.de/receive/dbt_mods_00035065
Li, Y., et al.: MERT: acoustic music understanding model with large-scale self-supervised training (2023). https://doi.org/10.48550/arXiv.2306.00107
Mesaros, A., Virtanen, T.: Automatic recognition of lyrics in singing. EURASIP J. Audio, Speech, Music Process. 2010, 1–11 (2010). https://doi.org/10.1155/2010/546047
van den Oord, A., et al.: WaveNet: a generative model for raw audio (2016). https://doi.org/10.48550/arXiv.1609.03499
Ou, L., Gu, X., Wang, Y.: Transfer learning of wav2vec 2.0 for automatic lyric transcription. In: Proceedings of the 23rd International Conference on Music Information Retrieval (ISMIR), pp. 891–899 (2022). https://archives.ismir.net/ismir2022/paper/000107.pdf
Palaz, D., Collobert, R., Doss, M.M.: Estimating phoneme class conditional probabilities from raw speech signal using convolutional neural networks. In: Proceedings of Interspeech (2013). https://doi.org/10.21437/Interspeech.2013-438
Santana, I.A.P., et al.: Music4all: a new music database and its applications. In: International Conference on Systems, Signals and Image Processing (IWSSIP), pp. 399–404. IEEE (2020). https://doi.org/10.1109/IWSSIP48289.2020.9145170
Smule, Inc.: DAMP-MVP: Digital Archive of Mobile Performances - Smule Multilingual Vocal Performance 300x30x2 (2018). https://doi.org/10.5281/zenodo.2747436
Story, B.H.: Vowel acoustics for speaking and singing. Acta Acust. Acust. 90(4), 629–640 (2004)
Sundberg, J.: Articulatory differences between spoken and sung vowels in singers. STL-QPSR, KTH 1(1969), 33–46 (1969)
Sundberg, J.: Articulatory interpretation of the “singing formant”. J. Acoust. Soc. Am. 55(4), 838–844 (1974). https://doi.org/10.1121/1.1914609
Teytaut, Y., Roebel, A.: Phoneme-to-audio alignment with recurrent neural networks for speaking and singing voice. In: Proceedings of Interspeech, pp. 61–65. International Speech Communication Association (ISCA) (2021). https://doi.org/10.21437/interspeech.2021-1676
Tieleman, T., Hinton, G.: Lecture 6e rmsprop: divide the gradient by a running average of its recent magnitude. In: COURSERA: Neural Networks for Machine Learning, vol. 4, pp. 26–31 (2012). https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
Verma, P., Berger, J.: Audio transformers: transformer architectures for large scale audio understanding. Adieu convolutions (2021). https://doi.org/10.48550/arXiv.2105.00335
Weiss, R., Brown, W., Jr., Morris, J.: Singer's formant in sopranos: fact or fiction? J. Voice 15(4), 457–468 (2001). https://doi.org/10.1016/s0892-1997(01)00046-7
Wilkins, J., Seetharaman, P., Wahl, A., Pardo, B.: VocalSet: a singing voice dataset. In: Proceedings of the 19th International Conference on Music Information Retrieval (ISMIR), pp. 468–474 (2018). https://doi.org/10.5281/zenodo.1193957
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Carlson, P., Donnelly, P.J. (2024). Deep Learning Approaches for Sung Vowel Classification. In: Johnson, C., Rebelo, S.M., Santos, I. (eds) Artificial Intelligence in Music, Sound, Art and Design. EvoMUSART 2024. Lecture Notes in Computer Science, vol 14633. Springer, Cham. https://doi.org/10.1007/978-3-031-56992-0_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-56991-3
Online ISBN: 978-3-031-56992-0
eBook Packages: Computer Science, Computer Science (R0)