
Trainable Windowing Coefficients in DNN for Raw Audio Classification

  • Conference paper
Cloud Computing, Big Data & Emerging Topics (JCC-BD&ET 2020)

Abstract

An artificial neural network for audio classification is proposed that incorporates the windowing of raw audio and the computation of the power spectrogram as network layers. The windowing layer is initialized with a Hann window and its weights are adapted during training, while the non-trainable weights of the spectrogram-calculation layer are initialized with the discrete Fourier transform coefficients. Tests are performed on the Speech Commands dataset. The results show that adapting the windowing coefficients produces a moderate accuracy improvement. It is concluded that the gradient of the error function can be propagated through the neural calculation of the power spectrum, and that training the windowing layer improves the model's ability to generalize.
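
As a minimal sketch of the front-end described above, the TensorFlow/Keras layer below frames raw audio, multiplies each frame by a trainable window initialized with Hann coefficients, and computes the power spectrum through fixed DFT weights. The class name, frame sizes, and the small classifier head in the usage example are illustrative assumptions, not the authors' exact architecture.

```python
# Sketch only: names and hyperparameters are assumptions, not the paper's setup.
import numpy as np
import tensorflow as tf


class TrainableWindowPowerSpectrum(tf.keras.layers.Layer):
    """Frames raw audio, applies a trainable (Hann-initialized) window,
    and computes the power spectrum with fixed DFT coefficients."""

    def __init__(self, frame_length=512, frame_step=256, **kwargs):
        super().__init__(**kwargs)
        self.frame_length = frame_length
        self.frame_step = frame_step

    def build(self, input_shape):
        n = self.frame_length
        # Trainable windowing coefficients, initialized with a Hann window.
        self.window = self.add_weight(
            name="window",
            shape=(n,),
            initializer=lambda shape, dtype=None: tf.cast(
                np.hanning(n), dtype or tf.float32
            ),
            trainable=True,
        )
        # Fixed (non-trainable) DFT coefficients for the n//2 + 1 positive bins.
        k = np.arange(n // 2 + 1)[:, None]  # frequency index
        t = np.arange(n)[None, :]           # sample index
        angle = -2.0 * np.pi * k * t / n
        self.dft_real = tf.constant(np.cos(angle).T, dtype=tf.float32)
        self.dft_imag = tf.constant(np.sin(angle).T, dtype=tf.float32)

    def call(self, waveform):
        # waveform: (batch, samples) -> frames: (batch, n_frames, frame_length)
        frames = tf.signal.frame(waveform, self.frame_length, self.frame_step)
        windowed = frames * self.window  # broadcasts over batch and frames
        # Power spectrum |X_k|^2 = Re^2 + Im^2. Every step is differentiable,
        # so the loss gradient reaches the window coefficients.
        re = tf.einsum("bfn,nk->bfk", windowed, self.dft_real)
        im = tf.einsum("bfn,nk->bfk", windowed, self.dft_imag)
        return re * re + im * im


# Example use on 1 s of 16 kHz audio with a placeholder classifier head.
inputs = tf.keras.Input(shape=(16000,))
spec = TrainableWindowPowerSpectrum()(inputs)
x = tf.keras.layers.Flatten()(tf.math.log(spec + 1e-6))
outputs = tf.keras.layers.Dense(35, activation="softmax")(x)  # 35 commands
model = tf.keras.Model(inputs, outputs)
```

Because the windowed frame enters the spectrum only through differentiable matrix products and a squaring, backpropagation yields a well-defined gradient for every window coefficient, which is the property the paper's conclusion relies on.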

Notes

  1. https://github.com/tensorflow/tensorflow/tree/master/tensorflow/examples/speech_commands.

Acknowledgements

We gratefully acknowledge the support of NVIDIA Corporation through the NVIDIA GPU Grant Program.

Author information

Correspondence to Mario Alejandro García.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

García, M.A., Destéfanis, E.A., Rosset, A.L. (2020). Trainable Windowing Coefficients in DNN for Raw Audio Classification. In: Rucci, E., Naiouf, M., Chichizola, F., De Giusti, L. (eds) Cloud Computing, Big Data & Emerging Topics. JCC-BD&ET 2020. Communications in Computer and Information Science, vol 1291. Springer, Cham. https://doi.org/10.1007/978-3-030-61218-4_11

  • DOI: https://doi.org/10.1007/978-3-030-61218-4_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-61217-7

  • Online ISBN: 978-3-030-61218-4

  • eBook Packages: Computer Science, Computer Science (R0)
