MPEG-1 psychoacoustic model emulation using multiscale convolutional neural networks

Multimedia Tools and Applications

Abstract

The Moving Picture Experts Group-1 (MPEG-1) perceptual audio compression scheme is a successful family of audio codecs described in the standard ISO/IEC 11172-3. Currently, there is no general framework for emulating the MPEG-1 psychoacoustic models, or any other psychoacoustic model, even though such models are a core component of many perceptual codecs. This work presents a convolutional neural network that emulates psychoacoustic model 1 from the MPEG-1 standard, termed “MCNN-PM” (Multiscale Convolutional Neural Network – Psychoacoustic Model). The network is then integrated into the MPEG-1 Layer I codec in place of the standard model. With audio quality evaluated using the objective difference grade (ODG), the MCNN-PM MPEG-1 Layer I codec outperforms the original MPEG-1 Layer I codec by up to 17% at 96 kbps and 14% at 128 kbps, and performs almost equally at 192 kbps. These results show that convolutional neural networks are a viable alternative to standard psychoacoustic models and can be used successfully as part of perceptual audio codecs.
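As a rough illustration of the "multiscale" idea named in the abstract, the following minimal sketch runs several convolution branches with different kernel lengths over the same input frame and stacks their outputs. It is not the MCNN-PM architecture: the kernel lengths, the averaging weights, the toy input frame, and the function names `conv1d_same`, `relu` and `multiscale_features` are all assumptions made for illustration only.

```python
# Illustrative sketch only: kernel sizes, weights and the 8-sample input are
# hypothetical and are NOT taken from the paper.

def conv1d_same(signal, kernel):
    """1-D cross-correlation, zero-padded so output length equals input length.

    Assumes an odd kernel length so the window is centred on each sample.
    """
    half = len(kernel) // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(signal))
    ]


def relu(values):
    """Rectified linear unit applied element-wise."""
    return [max(0.0, v) for v in values]


def multiscale_features(signal, kernels):
    """Run one convolution branch per kernel and stack the branch outputs."""
    return [relu(conv1d_same(signal, k)) for k in kernels]


# Hypothetical averaging kernels at three scales (lengths 1, 3 and 5).
kernels = [[1.0 / n] * n for n in (1, 3, 5)]
frame = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 4.0, 0.0]  # toy input frame
features = multiscale_features(frame, kernels)
```

Each branch sees the same frame at a different resolution: the length-1 branch passes the frame through unchanged, while the longer kernels smooth it over wider neighbourhoods. In a trained network the kernel weights would be learned rather than fixed.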

Data availability

The original audio files used in the present work are available from the corresponding author upon reasonable request.

Acknowledgements

The authors would like to thank the Dirección de Investigación of Universidad Peruana de Ciencias Aplicadas for the logistical support provided to carry out this work.

Code availability

The scripts and functions for the MCNN-PM audio codec were developed in the MATLAB environment under an academic MATLAB license.

Funding

The authors did not receive support from any organization for the submitted work.

Author information

Corresponding author

Correspondence to Guillermo Kemper.

Ethics declarations

Conflict of interest

The authors have no conflicts of interest to declare that are relevant to the content of this article.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Kemper, G., Sanchez, A. & Serpa, S. MPEG-1 psychoacoustic model emulation using multiscale convolutional neural networks. Multimed Tools Appl 83, 6963–6974 (2024). https://doi.org/10.1007/s11042-023-15949-y

  • DOI: https://doi.org/10.1007/s11042-023-15949-y
