Abstract
An artificial neural network for audio classification is proposed that incorporates the windowing of raw audio and the computation of the power spectrogram as network layers. The windowing layer is initialized with a Hann window, and its weights are adapted during training. The non-trainable weights of the spectrogram-calculation layer are initialized with the discrete Fourier transform coefficients. Tests are performed on the Speech Commands dataset. Results show that adapting the windowing coefficients yields a moderate accuracy improvement. It is concluded that the gradient of the error function can be propagated through the neural calculation of the power spectrum, and that training the windowing layer improves the model's ability to generalize.
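The architecture described above can be sketched with differentiable operations only. The following is an illustrative NumPy sketch, not the authors' implementation: the frame length and all variable names are assumptions, and the trainable/non-trainable split is indicated by comments rather than an actual optimizer.

```python
import numpy as np

N = 256  # frame length (assumed for illustration)

# Windowing layer: trainable weights, initialized with a (periodic) Hann window.
window = 0.5 * (1.0 - np.cos(2.0 * np.pi * np.arange(N) / N))

# Spectrogram layer: non-trainable weights holding the real and
# imaginary DFT coefficient matrices.
n = np.arange(N)
k = n.reshape(-1, 1)
dft_real = np.cos(-2.0 * np.pi * k * n / N)
dft_imag = np.sin(-2.0 * np.pi * k * n / N)

def power_spectrum(frame):
    """Window the frame, apply the DFT as two matrix products,
    and sum the squared real and imaginary parts."""
    x = frame * window          # element-wise windowing layer
    re = dft_real @ x           # real part of the DFT
    im = dft_imag @ x           # imaginary part of the DFT
    return re ** 2 + im ** 2    # power spectrum of the frame

# Sanity check against NumPy's FFT on a random frame.
rng = np.random.default_rng(0)
frame = rng.standard_normal(N)
reference = np.abs(np.fft.fft(frame * window)) ** 2
print(np.allclose(power_spectrum(frame), reference))  # True
```

Because every step is an element-wise product or a matrix product, the gradient of a downstream loss can flow back through the fixed DFT weights into the windowing coefficients, which is the property the paper exploits.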
Acknowledgements
We gratefully acknowledge the support of NVIDIA Corporation through the NVIDIA GPU Grant Program.
Copyright information
© 2020 Springer Nature Switzerland AG
Cite this paper
García, M.A., Destéfanis, E.A., Rosset, A.L. (2020). Trainable Windowing Coefficients in DNN for Raw Audio Classification. In: Rucci, E., Naiouf, M., Chichizola, F., De Giusti, L. (eds) Cloud Computing, Big Data & Emerging Topics. JCC-BD&ET 2020. Communications in Computer and Information Science, vol 1291. Springer, Cham. https://doi.org/10.1007/978-3-030-61218-4_11
Print ISBN: 978-3-030-61217-7
Online ISBN: 978-3-030-61218-4
eBook Packages: Computer Science (R0)