Abstract
The Moving Picture Experts Group-1 (MPEG-1) perceptual audio compression scheme is a successful family of audio codecs described in the ISO/IEC 11172–3 standard. Currently, there is no general framework for emulating either the MPEG-1 psychoacoustic model or any other psychoacoustic model, even though such a model is a core component of many perceptual codecs. This work presents an implementation of a convolutional neural network that emulates psychoacoustic model 1 from the MPEG-1 standard, termed “MCNN-PM” (Multiscale Convolutional Neural Network – Psychoacoustic Model). The network is then integrated into the MPEG-1 Layer I codec. Using the objective difference grade (ODG) to evaluate audio quality, the MCNN-PM MPEG-1 Layer I codec outperforms the original MPEG-1 Layer I codec by up to 17% at 96 kbps and 14% at 128 kbps, and performs almost equally at 192 kbps. This work shows that convolutional neural networks are a viable alternative to standard psychoacoustic models and can be used successfully as part of perceptual audio codecs.
Data availability
The original audio files used in the present work are available from the corresponding author upon reasonable request.
References
Agustsson E, Mentzer F, Tschannen M, Cavigelli L, Timofte R, Benini L, Van Gool L (2017). Soft-to-hard vector quantization for end-to-end learned compression of images and neural networks. arXiv preprint arXiv:1704.00648.
Ananthabhotla I, Ewert S, Paradiso JA (2019). Towards a Perceptual Loss: Using a Neural Network Codec Approximation as a Loss for Generative Audio Models. In Proceedings of the 27th ACM International Conference on Multimedia (pp. 1518-1525).
Bourtsoulatze E, Kurka DB, Gündüz D (2019) Deep joint source-channel coding for wireless image transmission. IEEE Transac Cog Commun Network 5(3):567–579
Cui Z, Chen W, Chen Y (2016). Multi-scale convolutional neural networks for time series classification. arXiv preprint arXiv:1603.06995.
Gârbacea C, van den Oord A, Li Y, Lim FS, Luebs A, Vinyals O, Walters TC (2019). Low bit-rate speech coding with VQ-VAE and a WaveNet decoder. In ICASSP 2019–2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 735–739). IEEE.
He K, Zhang X, Ren S, Sun J (2015). Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision (pp. 1026-1034).
ISO/IEC (1993). Information technology — Coding of moving pictures and associated audio for digital storage media at up to about 1,5 Mbit/s — Part 3: Audio. (Standard No. 11172–3)
ISO/IEC (2006). Information technology — Generic coding of moving pictures and associated audio information — Part 7: Advanced Audio Coding (AAC) (Standard No. 13818–7)
Kankanahalli S (2018). End-to-end optimized speech coding with deep neural networks. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 2521–2525). IEEE.
Kingma DP, Ba J (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
León LFA, Kemper-Vásquez G, Telles J (2011). A novel fuzzy logic-based metric for audio quality assessment: Objective audio quality assessment. In CONATEL 2011 (pp. 1–10). IEEE.
Lim W, Jang I, Beack S, Sung J, Lee T (2022). End-to-end Stereo Audio Coding Using Deep Neural Networks. In 2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC) (pp. 860–864). IEEE.
Min G, Zhang C, Zhang X, Tan W (2019). Deep Vocoder: Low Bit Rate Compression of Speech with Deep Autoencoder. In 2019 IEEE International Conference on Multimedia & Expo Workshops (ICMEW) (pp. 372–377). IEEE.
Recommendation ITU-R BS.1387 (1998): Method for objective measurements of perceived audio quality, ITU-R BS.1387
Zhen K, Lee MS, Sung J, Beack S, Kim M (2020) Psychoacoustic calibration of loss functions for efficient end-to-end neural audio coding. IEEE Sign Proc Lett 27:2159–2163
Zhen K, Lee MS, Sung J, Beack S, Kim M (2020). Efficient and Scalable Neural Residual Waveform Coding with Collaborative Quantization. In ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 361–365). IEEE.
Acknowledgements
The authors would like to thank the Dirección de Investigación of Universidad Peruana de Ciencias Aplicadas for its logistical support in carrying out this work.
Code availability
The MATLAB development environment was used to develop the scripts and functions for the MCNN-PM audio codec, via an academic MATLAB license.
Funding
The authors did not receive support from any organization for the submitted work.
Ethics declarations
Conflict of interest
The authors have no conflicts of interest to declare that are relevant to the content of this article.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Kemper, G., Sanchez, A. & Serpa, S. MPEG-1 psychoacoustic model emulation using multiscale convolutional neural networks. Multimed Tools Appl 83, 6963–6974 (2024). https://doi.org/10.1007/s11042-023-15949-y