Abstract
Visualizations help reveal latent patterns in music and give a deeper understanding of a song’s characteristics. This paper critically analyzes how effective several state-of-the-art deep neural networks are at visualizing music. Several autoencoder and genre-classifier implementations are explored for extracting meaningful features from audio tracks. Novel techniques are devised to map these audio features to the parameters that drive the visualizations. These methods are designed so that the visualizations respond to the music and also provide a distinct visual experience for each song.
Cite this article
Dhiraj, Biswas, R. & Ghattamaraju, N. An effective analysis of deep learning based approaches for audio based feature extraction and its visualization. Multimed Tools Appl 78, 23949–23972 (2019). https://doi.org/10.1007/s11042-018-6706-x
DOI: https://doi.org/10.1007/s11042-018-6706-x