
An effective analysis of deep learning based approaches for audio based feature extraction and its visualization

Published in Multimedia Tools and Applications

Abstract

Visualizations help decipher latent patterns in music and foster a deeper understanding of a song’s characteristics. This paper offers a critical analysis of the effectiveness of several state-of-the-art deep neural networks for visualizing music. Multiple implementations of autoencoders and genre classifiers are explored for extracting meaningful features from audio tracks, and novel techniques are devised to map these features to the parameters that drive visualizations. The methodologies are designed so that the visualizations remain responsive to the music while providing a distinct visual experience for each song.
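The pipeline the abstract describes — extract latent audio features, then map them to visualization parameters — can be illustrated with a minimal sketch. This is not the paper's actual model: the trained autoencoder's encoder is stood in for by a fixed random projection, and the parameter names (`hue`, `brightness`, `speed`) are hypothetical examples of visualization drivers.

```python
import numpy as np

def spectrogram(signal, frame=256, hop=128):
    """Magnitude spectrogram via a windowed short-time FFT."""
    frames = [signal[i:i + frame] * np.hanning(frame)
              for i in range(0, len(signal) - frame + 1, hop)]
    return np.abs(np.fft.rfft(np.array(frames), axis=1))

def encode(spec, dim=8, seed=0):
    """Stand-in encoder: a fixed random projection of the
    time-averaged spectrum to a low-dimensional latent vector
    (a trained autoencoder would learn this mapping instead)."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((spec.shape[1], dim))
    return spec.mean(axis=0) @ w

def to_visual_params(latent):
    """Squash latent features into [0, 1] so they can drive
    visualization parameters; the names here are illustrative."""
    p = 1.0 / (1.0 + np.exp(-latent / (np.abs(latent).max() + 1e-9)))
    return {"hue": p[0], "brightness": p[1], "speed": p[2]}

# Synthetic 1-second, 8 kHz tone as a stand-in for an audio track.
t = np.linspace(0, 1, 8000, endpoint=False)
audio = np.sin(2 * np.pi * 440 * t)
params = to_visual_params(encode(spectrogram(audio)))
```

Because the projection is fixed per latent dimension, different songs yield different latent vectors and hence different parameter values, which is how such a scheme can give each track a unique visual signature while staying responsive to the audio content.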




Author information


Corresponding author

Correspondence to Dhiraj.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Dhiraj, Biswas, R. & Ghattamaraju, N. An effective analysis of deep learning based approaches for audio based feature extraction and its visualization. Multimed Tools Appl 78, 23949–23972 (2019). https://doi.org/10.1007/s11042-018-6706-x

