Multimedia Tools and Applications, Volume 78, Issue 3, pp 3705–3722

Spectrogram based multi-task audio classification

  • Yuni Zeng
  • Hua Mao
  • Dezhong Peng
  • Zhang Yi


Audio classification is regarded as a great challenge in pattern recognition. Although audio classification tasks are usually treated as independent, many of them are inherently related, for example speaker accent classification and speaker identification. In this paper, we propose a Deep Neural Network (DNN)-based multi-task model that exploits such relationships and handles multiple audio classification tasks simultaneously. We term our model the gated Residual Networks (GResNets) model, since it integrates Deep Residual Networks (ResNets) with a gate mechanism, which extracts better representations between tasks than Convolutional Neural Networks (CNNs) do. Specifically, two multiplied convolutional layers replace two feed-forward convolutional layers in the ResNets. We tested our model on multiple audio classification tasks and found that our multi-task model achieves higher accuracy than task-specific models that are trained separately.
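The gate mechanism can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes one plausible reading of "two multiplied convolutional layers", namely that a feature convolution is multiplied elementwise by a sigmoid-activated gate convolution and the product is added to the skip connection. The function names (`conv1d_same`, `gated_residual_block`, `linear_head`), the 1-D input, and all numeric values are hypothetical and chosen only to keep the example small.

```python
import math


def conv1d_same(x, kernel):
    """1-D cross-correlation with zero padding; kernel length must be odd."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(x))
    ]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gated_residual_block(x, feature_kernel, gate_kernel):
    """Assumed form (not confirmed by the abstract):
    y = x + conv_f(x) * sigmoid(conv_g(x))."""
    features = conv1d_same(x, feature_kernel)
    gates = [sigmoid(g) for g in conv1d_same(x, gate_kernel)]
    return [xi + f * g for xi, f, g in zip(x, features, gates)]


def linear_head(shared, weights):
    """Task-specific output layer: one score per row of weights."""
    return [sum(w * s for w, s in zip(row, shared)) for row in weights]


# Toy input standing in for one row of a spectrogram (made-up values).
x = [0.0, 1.0, 2.0, 1.0, 0.0]

# Shared trunk: a single gated residual block.
shared = gated_residual_block(
    x, feature_kernel=[0.0, 1.0, 0.0], gate_kernel=[0.0, 0.0, 0.0]
)

# Two hypothetical tasks read the same shared representation.
accent_scores = linear_head(
    shared, [[1.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0, 0.0]]
)
speaker_scores = linear_head(shared, [[0.0, 1.0, 0.0, 1.0, 0.0]])
```

With an identity feature kernel and an all-zero gate kernel, every gate equals 0.5, so the block returns 1.5 times its input. In a multi-task setting, the trunk would be trained on the sum of the per-task losses, so that both heads shape the shared representation.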


Multi-task learning, Convolutional neural networks, Deep residual networks, Audio classification



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2017

Authors and Affiliations

  1. Machine Intelligence Laboratory, College of Computer Science, Sichuan University, Chengdu, People’s Republic of China
