Basic filters for convolutional neural networks applied to music: Training or design?

  • Monika Dörfler
  • Thomas Grill
  • Roswitha Bammer
  • Arthur Flexer
S.I.: Deep learning for music and audio


When convolutional neural networks are used to tackle learning problems based on music or other time series, raw one-dimensional data are commonly preprocessed to obtain spectrogram or mel-spectrogram coefficients, which are then used as input to the actual neural network. In this contribution, we investigate, both theoretically and experimentally, the influence of this preprocessing step on the network's performance, and we ask whether replacing it with adaptive or learned filters applied directly to the raw data can improve learning success. The theoretical results show that it is in principle possible to approximately reproduce mel-spectrogram coefficients by applying adaptive filters and then time-averaging the squared amplitudes. We also conducted extensive experimental work on the task of singing voice detection in music. The results of these experiments show that, for classification based on convolutional neural networks, the features obtained from adaptive filter banks followed by time-averaging the squared modulus of the filters' output perform better than the canonical Fourier transform-based mel-spectrogram coefficients. Alternative adaptive approaches with center frequencies or time-averaging lengths learned from training data perform equally well.
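
The filter-bank pipeline described in the abstract (band-pass filtering of the raw signal, squared modulus, then time-averaging) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the triangular mel-spaced frequency responses, the Hann averaging window, and all parameter defaults are assumptions chosen for illustration only.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def hz_to_mel(f):
    # Common mel-scale formula (one of several conventions in use).
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_band_edges(n_bands, f_min, f_max):
    # n_bands + 2 frequencies in Hz, equally spaced on the mel scale.
    # Band k spans edges[k]..edges[k + 2] and is centered at edges[k + 1].
    mels = np.linspace(hz_to_mel(f_min), hz_to_mel(f_max), n_bands + 2)
    return mel_to_hz(mels)


def filterbank_features(signal, sample_rate, n_bands=40, f_min=50.0,
                        f_max=8000.0, avg_len=1024, hop=512):
    """Hypothetical helper: filter bank -> |.|^2 -> time-averaging."""
    n = len(signal)
    spectrum = np.fft.fft(signal)
    freqs = np.fft.fftfreq(n, d=1.0 / sample_rate)
    edges = mel_band_edges(n_bands, f_min, f_max)

    # Assumed averaging window: a Hann window normalized to unit sum.
    averaging = np.hanning(avg_len)
    averaging /= averaging.sum()

    features = []
    for k in range(n_bands):
        lo, center, hi = edges[k], edges[k + 1], edges[k + 2]
        rising = (freqs - lo) / (center - lo)
        falling = (hi - freqs) / (hi - center)
        # Triangular weight on positive frequencies only, zero elsewhere.
        weight = np.clip(np.minimum(rising, falling), 0.0, None)
        # Square root so that the squared magnitude response is the triangle.
        band = np.fft.ifft(spectrum * np.sqrt(weight))
        energy = np.abs(band) ** 2
        # Weighted time average of the energy, evaluated every `hop` samples.
        frames = sliding_window_view(energy, avg_len)[::hop]
        features.append(frames @ averaging)
    return np.array(features)  # shape: (n_bands, n_frames)


if __name__ == "__main__":
    sr = 22050
    t = np.arange(sr) / sr
    tone = np.sin(2.0 * np.pi * 1000.0 * t)
    feats = filterbank_features(tone, sr)
    print(feats.shape[0], int(np.argmax(feats.mean(axis=1))))
```

In this sketch the filtering is done by multiplication in the Fourier domain, which implies circular convolution over the whole signal; a learned or adaptive variant would instead treat the filter coefficients, center frequencies, or averaging lengths as trainable parameters.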


Machine learning · Convolutional neural networks · Adaptive filters · Gabor multipliers · Mel-spectrogram · End-to-end learning



This research has been supported by the Vienna Science and Technology Fund (WWTF) through Project MA14-018.

Compliance with ethical standards

Conflict of interest

The authors declare that they have no conflict of interest.



Copyright information

© The Natural Computing Applications Forum 2018

Authors and Affiliations

  1. Faculty of Mathematics, University of Vienna, Vienna, Austria
  2. Austrian Research Institute for Artificial Intelligence (OFAI), Vienna, Austria
