
Basic filters for convolutional neural networks applied to music: Training or design?


When convolutional neural networks are used to tackle learning problems based on music or other time series, the raw one-dimensional data are commonly preprocessed to obtain spectrogram or mel-spectrogram coefficients, which are then used as input to the actual neural network. In this contribution, we investigate, both theoretically and experimentally, the influence of this preprocessing step on the network’s performance and ask whether replacing it with adaptive or learned filters applied directly to the raw data can improve learning success. The theoretical results show that approximately reproducing mel-spectrogram coefficients by applying adaptive filters and subsequently time-averaging the squared amplitudes is in principle possible. We also conducted extensive experimental work on the task of singing voice detection in music. These experiments show that, for classification based on convolutional neural networks, the features obtained from adaptive filter banks followed by time-averaging the squared modulus of the filters’ output perform better than the canonical Fourier-transform-based mel-spectrogram coefficients. Alternative adaptive approaches with center frequencies or time-averaging lengths learned from training data perform equally well.
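The feature pipeline described above, filtering the raw signal, taking the squared modulus, and time-averaging, can be sketched in a few lines. The following toy NumPy version is only an illustration of the idea, not the implementation used in the experiments: the random filters stand in for the adaptive or learned ones, and a plain moving average stands in for the time-averaging window.

```python
import numpy as np

def filterbank_features(signal, filters, hop, avg_len):
    """Toy sketch: convolve the raw signal with a bank of filters,
    take squared magnitudes, time-average, then sub-sample in time."""
    window = np.ones(avg_len) / avg_len                  # simple moving average
    feats = []
    for h in filters:
        out = np.convolve(signal, h, mode="same")        # filtering
        power = np.abs(out) ** 2                         # squared modulus
        smoothed = np.convolve(power, window, mode="same")  # time-averaging
        feats.append(smoothed[::hop])                    # sub-sample in time
    return np.stack(feats)

# Usage: random filters standing in for learned/adaptive ones
rng = np.random.default_rng(0)
x = rng.standard_normal(1024)
bank = [rng.standard_normal(64) for _ in range(8)]
F = filterbank_features(x, bank, hop=32, avg_len=64)
print(F.shape)  # -> (8, 32): one time-averaged power trajectory per filter
```

The result has the same shape as a (mel-)spectrogram patch, which is what allows the two kinds of features to be swapped as CNN input.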




Notes

  1. This observation seems to have served as one motivation to introduce the so-called scattering transform, which consists of the repeated composition of convolution, a nonlinearity in the form of the absolute value, and time-averaging. In that framework, mel-spectrogram coefficients are interpreted as first-order scattering coefficients.


References

  1. Abreu LD, Romero JL (2017) MSE estimates for multitaper spectral estimation and off-grid compressive sensing. IEEE Trans Inf Theory 63(12):7770–7776

  2. Andén J, Mallat S (2014) Deep scattering spectrum. IEEE Trans Signal Process 62(16):4114–4128

  3. Anselmi F, Leibo JZ, Rosasco L, Mutch J, Tacchetti A, Poggio TA (2013) Unsupervised learning of invariant representations in hierarchical architectures. CoRR arXiv:1311.4158

  4. Balazs P, Dörfler M, Jaillet F, Holighaus N, Velasco G (2011) Theory, implementation and applications of nonstationary Gabor frames. J Comput Appl Math 236(6):1481–1496

  5. Balazs P, Dörfler M, Kowalski M, Torrésani B (2013) Adapted and adaptive linear time-frequency representations: a synthesis point of view. IEEE Signal Process Mag 30(6):20–31

  6. Bammer R, Dörfler M (2017) Invariance and stability of Gabor scattering for music signals. In: 2017 international conference on sampling theory and applications (SampTA). IEEE, pp 299–302

  7. Boulanger-Lewandowski N, Bengio Y, Vincent P (2012) Modeling temporal dependencies in high-dimensional sequences: application to polyphonic music generation and transcription. arXiv preprint arXiv:1206.6392

  8. Choi K, Fazekas G, Sandler M, Cho K (2018) The effects of noisy labels on deep convolutional neural networks for music tagging. IEEE Trans Emerg Top Comput Intell 2(2):139–149

  9. Choi K, Fazekas G, Sandler M (2016) Automatic tagging using deep convolutional neural networks. In: Proceedings of the 17th international society for music information retrieval conference (ISMIR)

  10. Chuan CH, Herremans D (2018) Modeling temporal tonal relations in polyphonic music through deep networks with a novel image-based representation. In: Thirty-second AAAI conference on artificial intelligence

  11. Dieleman S, Brakel P, Schrauwen B (2011) Audio-based music classification with a pretrained convolutional network. In: 12th international society for music information retrieval conference (ISMIR 2011). University of Miami, pp 669–674

  12. Dieleman S, Schrauwen B (2014) End-to-end learning for music audio. In: 2014 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 6964–6968

  13. Dörfler M (2001) Time-frequency analysis for music signals: a mathematical approach. J New Music Res 30(1):3–12

  14. Dörfler M, Bammer R, Grill T (2017) Inside the spectrogram: convolutional neural networks in audio processing. In: International conference on sampling theory and applications (SampTA). IEEE, pp 152–155

  15. Dörfler M, Torrésani B (2010) Representation of operators in the time-frequency domain and generalized Gabor multipliers. J Fourier Anal Appl 16(2):261–293

  16. Feichtinger HG, Kozek W (1998) Quantization of TF lattice-invariant operators on elementary LCA groups. In: Feichtinger HG, Strohmer T (eds) Gabor analysis and algorithms, applied and numerical harmonic analysis. Birkhäuser, Boston, pp 233–266

  17. Feichtinger HG, Nowak K (2003) A first survey of Gabor multipliers. In: Feichtinger HG, Strohmer T (eds) Advances in Gabor analysis, applied and numerical harmonic analysis. Birkhäuser, Boston, pp 99–128

  18. Goodfellow I, Bengio Y, Courville A (2016) Deep learning. MIT Press, Cambridge

  19. Grill T, Schlüter J (2015) Music boundary detection using neural networks on combined features and two-level annotations. In: Proceedings of the 16th international society for music information retrieval conference (ISMIR 2015), Malaga, Spain, pp 531–537

  20. Grohs P, Wiatowski T, Bölcskei H (2016) Deep convolutional neural networks on cartoon functions. In: 2016 IEEE international symposium on information theory (ISIT). IEEE, pp 1163–1167

  21. Holighaus N, Dörfler M, Velasco GA, Grill T (2013) A framework for invertible, real-time constant-Q transforms. IEEE Trans Audio Speech Lang Process 21(4):775–785

  22. Humphrey EJ, Bello JP (2012) Rethinking automatic chord recognition with convolutional neural networks. In: 2012 11th international conference on machine learning and applications (ICMLA), vol 2. IEEE, pp 357–362

  23. Humphrey EJ, Montecchio N, Bittner R, Jansson A, Jehan T (2017) Mining labeled data from web-scale collections for vocal activity detection in music. In: Proceedings of the 18th international society for music information retrieval conference (ISMIR), Suzhou, China

  24. Kingma D, Ba J (2015) Adam: a method for stochastic optimization. In: Proceedings of the 3rd international conference on learning representations (ICLR), San Diego, USA

  25. Korzeniowski F, Widmer G (2016) A fully convolutional deep auditory model for musical chord recognition. In: 2016 IEEE 26th international workshop on machine learning for signal processing (MLSP). IEEE, pp 1–6

  26. LeCun Y, Bottou L, Bengio Y, Haffner P (1998) Gradient-based learning applied to document recognition. Proc IEEE 86(11):2278–2324

  27. Lee H, Pham P, Largman Y, Ng AY (2009) Unsupervised feature learning for audio classification using convolutional deep belief networks. In: Advances in neural information processing systems, pp 1096–1104

  28. Leglaive S, Hennequin R, Badeau R (2015) Singing voice detection with deep recurrent neural networks. In: 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 121–125

  29. Lehner B, Schlüter J, Widmer G (2018) Online, loudness-invariant vocal detection in mixed music signals. IEEE/ACM Trans Audio Speech Lang Process 26(8):1369–1380

  30. Malik M, Adavanne S, Drossos K, Virtanen T, Ticha D, Jarina R (2017) Stacked convolutional and recurrent neural networks for music emotion recognition. arXiv preprint arXiv:1706.02292

  31. Mallat S (2012) Group invariant scattering. Commun Pure Appl Math 65(10):1331–1398

  32. Mallat S (2016) Understanding deep convolutional networks. Philos Trans R Soc Lond A Math Phys Eng Sci 374(2065)

  33. Schlüter J, Böck S (2013) Musical onset detection with convolutional neural networks. In: 6th international workshop on machine learning and music (MML), Prague, Czech Republic

  34. Schlüter J, Böck S (2014) Improved musical onset detection with convolutional neural networks. In: Proceedings of the IEEE international conference on acoustics, speech and signal processing (ICASSP 2014), Florence, Italy

  35. Schlüter J, Grill T (2015) Exploring data augmentation for improved singing voice detection with neural networks. In: Proceedings of the 16th international society for music information retrieval conference (ISMIR 2015), Malaga, Spain

  36. Ullrich K, Schlüter J, Grill T (2014) Boundary detection in music structure analysis using convolutional neural networks. In: Proceedings of the 15th international society for music information retrieval conference (ISMIR 2014), Taipei, Taiwan

  37. Waldspurger I (2015) Wavelet transform modulus: phase retrieval and scattering. Ph.D. thesis, École normale supérieure, Paris

  38. Waldspurger I (2017) Exponential decay of scattering coefficients. In: 2017 international conference on sampling theory and applications (SampTA), pp 143–146

  39. Wiatowski T, Grohs P, Bölcskei H (2017) Energy propagation in deep convolutional neural networks. arXiv preprint arXiv:1704.03636

  40. Wiatowski T, Tschannen M, Stanic A, Grohs P, Bölcskei H (2016) Discrete deep feature extraction: a theory and new architectures. In: Proceedings of the international conference on machine learning, pp 2149–2158



This research has been supported by the Vienna Science and Technology Fund (WWTF) through Project MA14-018.

Author information



Corresponding author

Correspondence to Thomas Grill.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Appendix A: Proof of Theorem 1

To cover the situation described in Theorem 1, we assume that the original spectrogram is sub-sampled; in other words, we start the computations concerning a signal f from

$$\begin{aligned} S_0 ( \alpha l, \beta k)= |{\mathcal {V}} _g f (\alpha l, \beta k) |^2 = |{\mathcal {F}} (f\cdot T_{\alpha l} g)(\beta k)|^2. \end{aligned}$$

The proof is based on the observation that the mel-spectrogram can be written via the action of so-called STFT or Gabor multipliers, cf. [17], on any given function in the sense of a bilinear form. Before deriving this correspondence, we first introduce this important class of operators.

Given a window function g, time- and frequency-sub-sampling parameters \(\alpha , \beta\), respectively, and a function \(\mathbf{{m}}: {\mathbb {Z}} \times {\mathbb {Z}} \mapsto {\mathbb {C}}\), the corresponding Gabor multiplier \(G^{\alpha ,\beta }_{g, \mathbf{{m}}}\) is defined as

$$\begin{aligned} G^{\alpha ,\beta }_{g, \mathbf{{m}}} f = \sum _k \sum _l \mathbf{m} (k,l) \langle f, M_{\beta k} T_{\alpha l} g\rangle M_{\beta k} T_{\alpha l} g . \end{aligned}$$
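The definition can be made concrete in a finite, discrete sketch on \({\mathbb {Z}}_N\) (all function names and parameter choices below are our own illustration, not part of the proof): the multiplier weights the coefficients \(\langle f, M_{\beta k} T_{\alpha l} g\rangle\) by \(\mathbf{m}(k,l)\) and resynthesizes with the same atoms.

```python
import numpy as np

def gabor_atom(g, a, freq, N):
    """The atom M_freq T_a g on Z_N: translate g by a, then modulate at frequency freq."""
    t = np.arange(N)
    return np.roll(g, a) * np.exp(2j * np.pi * freq * t / N)

def gabor_multiplier(f, g, m, alpha, beta):
    """G^{alpha,beta}_{g,m} f = sum_{k,l} m[k,l] <f, M_{beta k} T_{alpha l} g> M_{beta k} T_{alpha l} g."""
    N = len(f)
    out = np.zeros(N, dtype=complex)
    for k in range(m.shape[0]):
        for l in range(m.shape[1]):
            atom = gabor_atom(g, alpha * l, beta * k, N)
            out += m[k, l] * np.vdot(atom, f) * atom  # vdot conjugates its first argument
    return out

# Sanity check: with m = 1 on the full lattice (alpha = beta = 1) and a Dirac window,
# the atoms form a tight frame and the multiplier acts as N times the identity.
N = 16
rng = np.random.default_rng(0)
f = rng.standard_normal(N) + 1j * rng.standard_normal(N)
g = np.zeros(N); g[0] = 1.0
Gf = gabor_multiplier(f, g, np.ones((N, N)), alpha=1, beta=1)
print(np.allclose(Gf, N * f))  # -> True
```

The sanity check at the end confirms that with a trivial mask the multiplier reduces to the frame operator, as expected from the definition.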

We next derive the expression of a mel-spectrogram by an appropriately chosen Gabor multiplier. Using sub-sampling factors \(\alpha\) in time and \(\beta\) in frequency as before, we start from (4) and reformulate as follows:

$$\begin{aligned} {{\text {MS}}}_{g}(f) (b,\nu )=&\sum _k |{\mathcal {F}} (f\cdot T_b g)(\beta k)|^2 \cdot \varLambda _\nu (\beta k)\\ =&\sum _k \langle f, M_{\beta k} T_b g\rangle \overline{\langle f, M_{\beta k} T_b g\rangle } \varLambda _\nu (\beta k)\\ =&\left\langle \sum _k \varLambda _\nu ( \beta k) \langle f, M_{\beta k} T_b g\rangle M_{\beta k} T_b g , f\right\rangle \\ =&\left\langle \sum _k \sum _l \mathbf{m} (k,l) \langle f, M_{\beta k} T_{\alpha l} g\rangle M_{\beta k} T_{\alpha l} g , f\right\rangle \end{aligned}$$

with \(\mathbf{m} (k,l) = \delta (\alpha l-b)\varLambda _\nu (\beta k)\). We see that the mel-coefficients can thus be interpreted via a Gabor multiplier: \({{\text {MS}}}_{g}(f) (b,\nu ) = \langle G^{\alpha ,\beta }_{g, \mathbf{{m}}}f, f \rangle\).
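This identification can be checked numerically in a small discrete model on \({\mathbb {Z}}_N\). The weight sequence below is a random stand-in for a mel filter \(\varLambda _\nu\), and all parameter choices are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(1)
N, alpha, beta = 32, 4, 2
t = np.arange(N)
f = rng.standard_normal(N) + 1j * rng.standard_normal(N)
g = np.exp(-0.5 * ((t - N / 2) / 4) ** 2)   # Gaussian-like window
lam = rng.random(N // beta)                 # stand-in for a mel weight Lambda_nu
b = 2 * alpha                               # time position b on the lattice (l = 2)

def atom(a, k):
    """M_{beta k} T_a g on Z_N."""
    return np.roll(g, a) * np.exp(2j * np.pi * beta * k * t / N)

# Left-hand side: weighted squared STFT magnitudes, MS_g(f)(b, nu)
ms = sum(lam[k] * abs(np.vdot(atom(b, k), f)) ** 2 for k in range(N // beta))

# Right-hand side: <G^{alpha,beta}_{g,m} f, f> with m(k,l) = delta(alpha*l - b) * Lambda(beta*k)
Gf = np.zeros(N, dtype=complex)
for k in range(N // beta):
    for l in range(N // alpha):
        m_kl = lam[k] if alpha * l == b else 0.0
        a = atom(alpha * l, k)
        Gf += m_kl * np.vdot(a, f) * a
print(np.allclose(ms, np.vdot(f, Gf).real))  # -> True: the bilinear form reproduces MS
```

Since the mask is supported on a single time position, only the atoms at \(\alpha l = b\) contribute, exactly as in the derivation above.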

The next step is to switch to an alternative operator representation. Indeed, as shown in [16], every operator H can equally be written by means of its spreading function \(\eta _H\) as

$$\begin{aligned} Hf (t) = \int _x \int _\xi \eta _H (x,\xi ) f (t-x) e^{2\pi i t \xi }{\mathrm{d}}\xi {\mathrm{d}}x. \end{aligned}$$

We note that two operators \(H_1\), \(H_2\) are equal if and only if their spreading functions coincide, see [15, 16] for details.

As shown in [15], a Gabor multiplier’s spreading function \(\eta ^{\alpha ,\beta }_{{g, \mathbf{m}} }\) is given by

$$\begin{aligned} \eta ^{\alpha ,\beta }_{{g, \mathbf{m}} } (x,\xi ) = {\mathcal {M}} (x,\xi ) {\mathcal {V}} _g g(x,\xi ), \end{aligned}$$

where \({\mathcal {M}} (x,\xi )\) denotes the \((\beta ^{-1}, \alpha ^{-1})\)-periodic symplectic Fourier transform of \(\mathbf{m}\), i.e.,

$$\begin{aligned} {\mathcal {M}} (x,\xi ) = \mathcal {F}_s ( \mathbf{m} )(x,\xi ) = \sum _k\sum _l \mathbf{m} (k,l) e^{-2\pi i (\alpha l \xi - \beta kx )}. \end{aligned}$$

We now equally rewrite the time-averaging operation applied to a filtered signal, as defined in (6), as a Gabor multiplier. As before, we set \(\check{h}_\nu (t) = \overline{h_\nu (-t)}\) and have

$$\begin{aligned} {{\text {FB}}}_{h_\nu }(f) (b,\nu )&=\sum _l |(f*h_\nu )(\alpha l)|^2 \cdot \varpi _\nu (\alpha l-b) = \sum _l |\sum _n f(n) \check{h}_\nu (n-\alpha l)|^2 \cdot \varpi _\nu (\alpha l-b)\\&=\sum _k \sum _l |\langle f, M_{\beta k} T_{\alpha l} \check{h}_\nu \rangle |^2 \cdot \varpi _\nu (\alpha l-b)\delta (\beta k)= \langle G^{\alpha ,\beta }_{\check{h}_{\nu }, \mathbf{m}_F} f, f\rangle . \end{aligned}$$

with \(\mathbf{m}_F (k,l) = T_b \varpi _\nu (l) \delta (\beta k)\). To obtain the error estimate in Corollary 1, first note that by straightforward computation using the operators’ representation by their spreading functions as in (12)

$$\begin{aligned}&|{{\text {MS}}}_{g}(f) (b,\nu )-{{\text {FB}}}_{h_\nu }(f)(b,\nu )| = \left| \left\langle \left( G^{\alpha ,\beta }_{g, \mathbf{m}} - G^{\alpha ,\beta }_{\check{h}_{\nu }, \mathbf{m}_F}\right) f, f\right\rangle \right| \nonumber \\&\quad = \left| \left\langle \left( \eta ^{\alpha ,\beta }_{g, \mathbf{m}} - \eta ^{\alpha ,\beta }_{\check{h}_{\nu }, \mathbf{m}_F}\right) , {\mathcal {V}}_f f\right\rangle \right| \le \left\| \eta ^{\alpha , \beta }_{g, \mathbf{m}} - \eta ^{\alpha , \beta }_{\check{h}_{\nu } ,\mathbf{m}_F}\right\| \cdot \Vert f\Vert _2^2 \end{aligned}$$

and we can estimate the error by the difference of the spreading functions. We write the sampled version of \(\varLambda _\nu\) using the Dirac comb Ш\(_\beta\): \((\)Ш\(_\beta \varLambda _\nu ) (t) = \sum _k \varLambda _\nu (t) \delta (t-\beta k)\), and analogously for \(\varpi _\nu\) using Ш\(_\alpha\), to obtain \(\mathbf{m} =T_b \delta (\alpha l) \cdot\)Ш\(_\beta \varLambda _\nu\) and \(\mathbf{m}_F =\)Ш\(_\alpha T_b \varpi _\nu \cdot \delta (\beta k)\). Applying the symplectic Fourier transform (14) to \(\mathbf{m}\) then gives:

$$\begin{aligned} {\mathcal {M}}^\nu (x,\xi ) = \sum _k \varLambda _\nu (\beta k)\, e^{2\pi i \beta k x} \cdot e^{-2\pi i b \xi } = {\mathcal {F}}^{-1}(Ш_\beta \varLambda _\nu ) (x) \cdot e^{-2\pi i b \xi }. \end{aligned}$$
Now it is a well-known fact that the Fourier transform turns sampling with sampling interval \(\beta\) into periodization by \(1/\beta\), in other words, into a convolution with Ш\(_{\frac{1}{\beta }}\):



$$\begin{aligned} {\mathcal {M}}^\nu (x,\xi ) = \sum _l T_{\frac{l}{\beta } }{\mathcal {F}}^{-1} (\varLambda _\nu ) (x) \cdot e^{-2\pi i b \xi }. \end{aligned}$$
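The sampling-to-periodization correspondence invoked here has a direct discrete analogue that can be checked numerically. The following toy NumPy illustration (not part of the proof) multiplies a signal by a Dirac comb of period \(\beta\) and compares its DFT with the \(N/\beta\)-periodization of the original spectrum.

```python
import numpy as np

rng = np.random.default_rng(2)
N, beta = 24, 3
x = rng.standard_normal(N)

# "Sampling" on Z_N: keep every beta-th value, zero elsewhere (multiply by a Dirac comb)
comb = np.zeros(N); comb[::beta] = 1.0
X_sampled = np.fft.fft(x * comb)

# Periodization of the spectrum: average of beta copies shifted by N/beta
X = np.fft.fft(x)
X_periodized = sum(np.roll(X, r * N // beta) for r in range(beta)) / beta

print(np.allclose(X_sampled, X_periodized))  # -> True
```

The coarser the sampling (larger \(\beta\)), the closer the periodized spectral copies come to overlapping, which is exactly the aliasing effect discussed in Remark 5.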

Completely analogous considerations for \(\varpi _\nu\) and Ш\(_\alpha\) lead to the periodization of \(\mathcal {F}(\varpi _\nu )\) and thus the following expression for the symplectic Fourier transform of \(\mathbf{m}_F\):

$$\begin{aligned} {\mathcal {M}}^\nu _F(x,\xi ) = \sum _l T_{\frac{l}{\alpha } }{\mathcal {F}}(\varpi _\nu ) (\xi ) \cdot e^{-2\pi i b \xi }. \end{aligned}$$

Plugging these expressions into (13) gives the bound (8).
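Analogously to the mel-spectrogram case, the identification of the time-averaged filterbank features \({{\text {FB}}}\) with the bilinear form of a Gabor multiplier, used in the derivation above, can be verified in a small discrete model. Here b = 0, so the mask \(\mathbf{m}_F\), supported on k = 0, reduces to weights on the translated atoms \(T_{\alpha l} \check{h}\); the filter and averaging weights below are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(3)
N, alpha = 32, 4
f = rng.standard_normal(N)
h = np.zeros(N); h[:8] = rng.standard_normal(8)   # a short filter embedded in Z_N
w = rng.random(N // alpha)                        # stand-in for the averaging window (b = 0)

# Direct form: time-averaged squared modulus of the filtered signal at the lattice points
conv = np.real(np.fft.ifft(np.fft.fft(f) * np.fft.fft(h)))   # circular convolution f * h
fb_direct = sum(abs(conv[alpha * l]) ** 2 * w[l] for l in range(N // alpha))

# Gabor-multiplier form: with m_F supported on k = 0, only the translated atoms
# T_{alpha l} h_check with h_check(t) = conj(h(-t)) contribute
h_check = np.conj(np.roll(h[::-1], 1))            # h(-t) on Z_N, conjugated
fb_gabor = sum(abs(np.vdot(np.roll(h_check, alpha * l), f)) ** 2 * w[l]
               for l in range(N // alpha))
print(np.allclose(fb_direct, fb_gabor))  # -> True
```

This mirrors the computation in (6): the filtered samples \((f*h_\nu )(\alpha l)\) are exactly the inner products \(\langle f, T_{\alpha l}\check{h}_\nu \rangle\).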

Remark 5

It is interesting to interpret the action of an operator in terms of its spreading function. In view of (12), the spreading function determines the amount of shift in time and frequency that the operator imposes on a function. For Gabor multipliers with well-concentrated window functions, the amount of shifting is moderate and determined by the window’s eccentricity. At the same time, the aliasing effects introduced by coarse sub-sampling are reflected in the periodic nature of \({\mathcal {M}}\). Since the amount of aliasing is governed by the sub-sampling density in frequency, determined by \(\beta\), for \(\mathcal {F}^{-1} (\varLambda _\nu )\), and by the sub-sampling density in time, determined by \(\alpha\), for \(\mathcal {F}(\varpi _\nu )\), the overall approximation quality deteriorates with increasing sub-sampling factors.


Cite this article

Dörfler, M., Grill, T., Bammer, R. et al. Basic filters for convolutional neural networks applied to music: Training or design? Neural Comput & Applic 32, 941–954 (2020).



  • Machine learning
  • Convolutional neural networks
  • Adaptive filters
  • Gabor multipliers
  • Mel-spectrogram
  • End-to-end learning