MPEG-1 psychoacoustic model emulation using multiscale convolutional neural networks

Kemper, Guillermo; Sanchez, Alonso; Serpa, Sergio

doi:10.1007/s11042-023-15949-y

Published: 05 June 2023

Volume 83, pages 6963–6974, (2024)

Multimedia Tools and Applications

Abstract

The Moving Picture Experts Group-1 (MPEG-1) perceptual audio compression scheme is a successful family of audio codecs described in standard ISO/IEC 11172-3. Currently, there is no general framework for emulating the MPEG-1 psychoacoustic model or any other psychoacoustic model, even though such models are a core component of many perceptual codecs. This work presents a convolutional neural network that emulates psychoacoustic model 1 of the MPEG-1 standard, termed “MCNN-PM” (Multiscale Convolutional Neural Network – Psychoacoustic Model). The network is then integrated into the MPEG-1 Layer I codec. With audio quality measured by the objective difference grade (ODG), the MCNN-PM MPEG-1 Layer I codec outperforms the original MPEG-1 Layer I codec by up to 17% at 96 kbps and 14% at 128 kbps, and performs almost equally at 192 kbps. These results show that convolutional neural networks are a viable alternative to standard psychoacoustic models and can be used successfully as part of perceptual audio codecs.
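To make the idea of a multiscale convolutional emulator concrete, the following is a minimal sketch and not the authors' architecture. It assumes a 512-sample input frame (the FFT length psychoacoustic model 1 uses in Layer I) and one output value per each of the 32 subbands, standing in for the signal-to-mask ratios the standard model would supply to bit allocation. The scales, filter count, filter length, pooling choice and the random, untrained weights are all hypothetical and chosen only for illustration.

```python
import math
import random

FRAME_LEN = 512     # analysis window of psychoacoustic model 1 in Layer I
NUM_SUBBANDS = 32   # MPEG-1 subband count; one output value per subband


def downsample(x, factor):
    """Average non-overlapping groups of `factor` samples (a coarser time scale)."""
    usable = len(x) - len(x) % factor
    return [sum(x[i:i + factor]) / factor for i in range(0, usable, factor)]


def conv_relu_maxpool(x, kernel):
    """Slide the kernel over x (valid positions, CNN-style correlation),
    apply ReLU, and keep the largest response (global max pooling)."""
    k = len(kernel)
    best = 0.0  # ReLU floor: negative responses never exceed this
    for i in range(len(x) - k + 1):
        response = sum(kernel[j] * x[i + j] for j in range(k))
        best = max(best, response)
    return best


def multiscale_features(frame, kernels, scales=(1, 2, 4)):
    """Apply the same filters at several time scales and pool each response."""
    feats = []
    for s in scales:
        coarse = downsample(frame, s)
        feats.extend(conv_relu_maxpool(coarse, k) for k in kernels)
    return feats


def predict_subband_values(frame, kernels, weights, bias):
    """Linear layer mapping the pooled features to one value per subband."""
    feats = multiscale_features(frame, kernels)
    return [sum(w * f for w, f in zip(row, feats)) + b
            for row, b in zip(weights, bias)]


rng = random.Random(0)
# Hypothetical, untrained parameters: 8 filters of length 9, 3 scales.
kernels = [[rng.gauss(0, 1) for _ in range(9)] for _ in range(8)]
weights = [[rng.gauss(0, 1) for _ in range(8 * 3)] for _ in range(NUM_SUBBANDS)]
bias = [0.0] * NUM_SUBBANDS

# A 1 kHz test tone sampled at 44.1 kHz.
frame = [math.sin(2 * math.pi * 1000 * n / 44100) for n in range(FRAME_LEN)]
values = predict_subband_values(frame, kernels, weights, bias)
print(len(values))  # 32
```

In a trained emulator the parameters would be fitted so that the 32 outputs match what the standard psychoacoustic model produces for the same frame; the sketch shows only the forward pass.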

Data availability

The original audio files used in the present work can be made available upon reasonable request to the corresponding author.

References

Agustsson E, Mentzer F, Tschannen M, Cavigelli L, Timofte R, Benini L, Van Gool L (2017). Soft-to-hard vector quantization for end-to-end learned compression of images and neural networks. arXiv preprint arXiv:1704.00648, 3.
Ananthabhotla I, Ewert S, Paradiso JA (2019). Towards a Perceptual Loss: Using a Neural Network Codec Approximation as a Loss for Generative Audio Models. In Proceedings of the 27th ACM International Conference on Multimedia (pp. 1518-1525).
Bourtsoulatze E, Kurka DB, Gündüz D (2019) Deep joint source-channel coding for wireless image transmission. IEEE Transac Cog Commun Network 5(3):567–579
Cui Z, Chen W, Chen Y (2016). Multi-scale convolutional neural networks for time series classification. arXiv preprint arXiv:1603.06995.
Gârbacea C, van den Oord A, Li Y, Lim FS, Luebs A, Vinyals O, Walters TC (2019). Low bit-rate speech coding with VQ-VAE and a WaveNet decoder. In ICASSP 2019–2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 735–739). IEEE.
He K, Zhang X, Ren S, Sun J (2015). Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision (pp. 1026-1034).
ISO/IEC (1993). Information technology — Coding of moving pictures and associated audio for digital storage media at up to about 1,5 Mbit/s — Part 3: Audio. (Standard No. 11172-3)
ISO/IEC (2006). Information technology — Generic coding of moving pictures and associated audio information — Part 7: Advanced Audio Coding (AAC) (Standard No. 13818-7)
Kankanahalli S (2018). End-to-end optimized speech coding with deep neural networks. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 2521–2525). IEEE.
Kingma DP, Ba J (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
León LFA, Kemper-Vásquez G, Telles J (2011). A novel fuzzy logic-based metric for audio quality assessment: Objective audio quality assessment. In CONATEL 2011 (pp. 1–10). IEEE.
Lim W, Jang I, Beack S, Sung J, Lee T (2022). End-to-end Stereo Audio Coding Using Deep Neural Networks. In 2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC) (pp. 860–864). IEEE.
Min G, Zhang C, Zhang X, Tan W (2019). Deep Vocoder: Low Bit Rate Compression of Speech with Deep Autoencoder. In 2019 IEEE International Conference on Multimedia & Expo Workshops (ICMEW) (pp. 372–377). IEEE.
Recommendation ITU-R BS.1387 (1998): Method for objective measurements of perceived audio quality, ITU-R BS.1387
Zhen K, Lee MS, Sung J, Beack S, Kim M (2020) Psychoacoustic calibration of loss functions for efficient end-to-end neural audio coding. IEEE Sign Proc Lett 27:2159–2163
Zhen K, Lee MS, Sung J, Beack S, Kim M (2020). Efficient and Scalable Neural Residual Waveform Coding with Collaborative Quantization. In ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 361–365). IEEE.

Acknowledgements

The authors would like to thank the Dirección de Investigación of Universidad Peruana de Ciencias Aplicadas for logistical support in carrying out this work.

Code availability

The MATLAB development environment was used to develop the scripts and functions for the MCNN-PM audio codec, via an academic MATLAB license.

Funding

The authors did not receive support from any organization for the submitted work.

Author information

Authors and Affiliations

Faculty of Engineering - School of Electronic Engineering, Universidad Peruana de Ciencias Aplicadas, Lima, Peru
Guillermo Kemper, Alonso Sanchez & Sergio Serpa

Corresponding author

Correspondence to Guillermo Kemper.

Ethics declarations

Conflict of interest

The authors have no conflicts of interest to declare that are relevant to the content of this article.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Kemper, G., Sanchez, A. & Serpa, S. MPEG-1 psychoacoustic model emulation using multiscale convolutional neural networks. Multimed Tools Appl 83, 6963–6974 (2024). https://doi.org/10.1007/s11042-023-15949-y

Received: 28 October 2021
Revised: 15 February 2023
Accepted: 29 May 2023
Published: 05 June 2023
Issue Date: January 2024
DOI: https://doi.org/10.1007/s11042-023-15949-y
