
Audio super-resolution via vision transformer

Research

Journal of Intelligent Information Systems

Abstract

Audio super-resolution refers to techniques that improve the quality of audio signals, usually by means of bandwidth extension methods, whereby enhancement is obtained by expanding the phase and the spectrogram of the input audio traces. These techniques are therefore highly relevant in all those cases where audio traces lack significant parts of the audible spectrum. In many cases, the input signal contains only the low-band frequencies (the easiest to capture with low-quality recording equipment), while the high band must be generated. In this paper, we illustrate the techniques implemented in a bandwidth-extension system that works on musical tracks and generates the high-band frequencies starting from the low-band ones. The system, called ViT Super-resolution (\(\textit{ViT-SR}\)), features an architecture based on a Generative Adversarial Network and a Vision Transformer model. In particular, two versions of the architecture, which work on different input frequency ranges, are presented. The experiments reported in the paper prove the effectiveness of our approach: in particular, they demonstrate that it is possible to faithfully reconstruct the high-band signal of an audio file given only its low-band spectrum as input, including the harmonics occurring in the audio tracks, which are usually difficult to generate synthetically and contribute significantly to the final perceived sound quality.
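To make the setting concrete, the sketch below shows the low-band/high-band decomposition that bandwidth extension operates on: the lower half of the spectrogram is what a model such as \(\textit{ViT-SR}\) would receive as input, and the upper half is what it must generate. This is a minimal illustrative example only, not the \(\textit{ViT-SR}\) implementation (the official source code is linked in the Notes below); the file name, sampling rate, STFT parameters, and the half-spectrum cutoff are all assumptions made for illustration.

    # Illustrative sketch only, NOT the ViT-SR implementation: it shows the
    # low-band / high-band spectrogram split that bandwidth extension works on.
    # All parameter values (file name, sr, n_fft, hop_length, cutoff) are assumed.
    import numpy as np
    import librosa

    # Load a mono track at an assumed sampling rate.
    audio, sr = librosa.load("track.wav", sr=16000, mono=True)

    # Complex short-time Fourier transform: (1 + n_fft/2) frequency bins x frames.
    stft = librosa.stft(audio, n_fft=1024, hop_length=256)
    magnitude, phase = np.abs(stft), np.angle(stft)

    # Split the frequency axis in half: the lower bins form the observed low
    # band (the model input); the upper bins form the high band that a
    # super-resolution model must reconstruct (the training target).
    cutoff = magnitude.shape[0] // 2
    low_band = magnitude[:cutoff, :]
    high_band = magnitude[cutoff:, :]

    print(low_band.shape, high_band.shape)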




Availability of data and material

Not Applicable.

Notes

  1. We would like to thank one of the anonymous Reviewers for pointing out this method to us.

  2. The source code for \(\textit{ViT-SR Small}\) and \(\textit{ViT-SR}\) is freely available at https://github.com/simona-nistico/ViT-SR.


Acknowledgements

The authors gratefully thank the anonymous Reviewers for their very useful comments and suggestions, which allowed us to significantly improve the quality of the paper.

Funding

This work has been partially supported by PNRR FAIR - Future AI Research (PE00000013), Spoke 9 - Green-aware AI, under the PNRR program funded by the EU in the context of NextGenerationEU.

Author information


Contributions

All authors contributed to the study conception and design. Material preparation, data collection and analysis were performed by Simona Nisticò, Luigi Palopoli and Adele Romano. The first draft of the manuscript was written by all authors and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Simona Nisticò.

Ethics declarations

Ethics approval

Not Applicable.

Consent to participate

Not Applicable.

Consent for publication

Not Applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This is the extended version of the paper S. Nisticò, L. Palopoli, A. P. Romano, "Audio Super-Resolution via Vision Transformer", appearing in the proceedings of the ISMIS conference, Cosenza, 2022.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Nisticò, S., Palopoli, L. & Romano, A.P. Audio super-resolution via vision transformer. J Intell Inf Syst (2023). https://doi.org/10.1007/s10844-023-00833-w

