Abstract
An artificial neural network for audio classification is proposed that incorporates the windowing of raw audio and the computation of the power spectrogram as network layers. The windowing layer is initialized with a Hann window, and its weights are adapted during training. The non-trainable weights of the spectrogram-calculation layer are initialized with the discrete Fourier transform coefficients. Tests are performed on the Speech Commands dataset. Results show that adapting the windowing coefficients yields a moderate accuracy improvement. It is concluded that the gradient of the error function can be propagated through the neural calculation of the power spectrum, and that training the windowing layer improves the model's ability to generalize.
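The architecture described above can be sketched with differentiable operations only. The following is an illustrative NumPy sketch, not the authors' implementation: the frame length and all variable names are assumptions, and the trainable/non-trainable split is indicated by comments rather than an actual optimizer.

```python
import numpy as np

N = 256  # frame length (assumed for illustration)

# Windowing layer: trainable weights, initialized with a (periodic) Hann window.
window = 0.5 * (1.0 - np.cos(2.0 * np.pi * np.arange(N) / N))

# Spectrogram layer: non-trainable weights holding the real and
# imaginary DFT coefficient matrices.
n = np.arange(N)
k = n.reshape(-1, 1)
dft_real = np.cos(-2.0 * np.pi * k * n / N)
dft_imag = np.sin(-2.0 * np.pi * k * n / N)

def power_spectrum(frame):
    """Window the frame, apply the DFT as two matrix products,
    and sum the squared real and imaginary parts."""
    x = frame * window          # element-wise windowing layer
    re = dft_real @ x           # real part of the DFT
    im = dft_imag @ x           # imaginary part of the DFT
    return re ** 2 + im ** 2    # power spectrum of the frame

# Sanity check against NumPy's FFT on a random frame.
rng = np.random.default_rng(0)
frame = rng.standard_normal(N)
reference = np.abs(np.fft.fft(frame * window)) ** 2
print(np.allclose(power_spectrum(frame), reference))  # True
```

Because every step is an element-wise product or a matrix product, the gradient of a downstream loss can flow back through the fixed DFT weights into the windowing coefficients, which is the property the paper exploits.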
Acknowledgements
We gratefully acknowledge the support of NVIDIA Corporation through the NVIDIA GPU Grant Program.
Copyright information
© 2020 Springer Nature Switzerland AG
Cite this paper
García, M.A., Destéfanis, E.A., Rosset, A.L. (2020). Trainable Windowing Coefficients in DNN for Raw Audio Classification. In: Rucci, E., Naiouf, M., Chichizola, F., De Giusti, L. (eds) Cloud Computing, Big Data & Emerging Topics. JCC-BD&ET 2020. Communications in Computer and Information Science, vol 1291. Springer, Cham. https://doi.org/10.1007/978-3-030-61218-4_11
Print ISBN: 978-3-030-61217-7
Online ISBN: 978-3-030-61218-4
eBook Packages: Computer Science (R0)