
Multiple Predominant Instruments Recognition in Polyphonic Music Using Spectro/Modgd-gram Fusion

Published in: Circuits, Systems, and Signal Processing

Abstract

Identification of multiple predominant instruments in polyphonic music is addressed using convolutional neural networks (CNNs) operating on the Mel-spectrogram, the modgd-gram, and their fusion. The modgd-gram, a visual representation, is obtained by stacking the modified group delay functions of consecutive frames. The CNN learns distinctive local characteristics from these visual representations and assigns each input to the instrument classes it contains. The proposed system is systematically evaluated on the IRMAS dataset: the networks are trained on fixed-length audio excerpts and recognize multiple predominant instruments in variable-length test files. A wave-generative adversarial network (WaveGAN) architecture is also employed to generate audio files for data augmentation. We experimented with three fusion techniques: early fusion, mid-level fusion, and late (score-level) fusion. The late-fusion experiment reports micro and macro F1 scores of 0.69 and 0.62, respectively, which are 7.81% and 12.73% higher than those obtained by the state-of-the-art model of Han et al. [17]. The architectural choice of CNNs with score-level fusion of the Mel-spectrogram and modgd-gram thus has merit in recognizing the predominant instruments in polyphonic music.
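For readers who want to prototype the representation, the sketch below shows one way a modgd-gram might be computed; it is not the authors' exact implementation. The parameter values (the exponents alpha and gamma, the cepstral lifter length, n_fft, and hop) and the input file example.wav are illustrative assumptions; the Mel-spectrogram branch can be produced with librosa.feature.melspectrogram [27].

```python
# Minimal modgd-gram sketch. Assumptions: alpha, gamma, lifter, n_fft,
# hop, and "example.wav" are illustrative, not the paper's settings.
import numpy as np
import librosa

def modified_group_delay(frame, n_fft=1024, alpha=0.9, gamma=0.9, lifter=30):
    """Modified group delay function of one windowed frame."""
    n = np.arange(len(frame))
    X = np.fft.rfft(frame, n_fft)          # spectrum of x(n)
    Y = np.fft.rfft(n * frame, n_fft)      # spectrum of n*x(n)
    # Cepstrally smoothed magnitude spectrum S(k), used in place of
    # |X(k)|^2 in the group delay denominator.
    log_mag = np.log(np.abs(X) + 1e-10)
    cep = np.fft.irfft(log_mag)            # real cepstrum, length n_fft
    cep[lifter:-lifter] = 0.0              # keep low-quefrency part
    S = np.exp(np.fft.rfft(cep).real)
    tau = (X.real * Y.real + X.imag * Y.imag) / (S ** (2 * gamma) + 1e-10)
    return np.sign(tau) * np.abs(tau) ** alpha

def modgdgram(y, n_fft=1024, hop=512):
    """Stack per-frame MODGD functions column-wise into a modgd-gram."""
    frames = librosa.util.frame(y, frame_length=n_fft, hop_length=hop)
    win = np.hanning(n_fft)
    return np.stack([modified_group_delay(win * f, n_fft)
                     for f in frames.T], axis=1)

y, sr = librosa.load("example.wav", sr=22050, duration=3.0)
gram = modgdgram(y)   # shape: (n_fft // 2 + 1, n_frames)
```

The cepstral smoothing of the magnitude spectrum is what distinguishes the modified group delay from the raw group delay, which is otherwise dominated by spikes caused by zeros close to the unit circle [29].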

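The three fusion strategies named in the abstract differ in where the Mel-spectrogram and modgd-gram streams are combined: early fusion merges the input representations, mid-level fusion merges intermediate CNN feature maps, and late fusion merges classifier scores. The sketch below illustrates score-level (late) fusion together with aggregation of window-level scores over a variable-length test file; the model names spec_model and modgd_model, the equal fusion weight w, and the mean aggregation are illustrative assumptions rather than the paper's exact procedure.

```python
# Sketch of score-level (late) fusion of two CNN branches.
# spec_model and modgd_model are hypothetical trained Keras models whose
# predict() returns per-window class probabilities.
import numpy as np

def late_fusion_predict(spec_model, modgd_model,
                        spec_windows, modgd_windows,
                        w=0.5, threshold=0.5):
    p_spec = spec_model.predict(spec_windows)     # (n_windows, n_classes)
    p_modgd = modgd_model.predict(modgd_windows)  # (n_windows, n_classes)
    fused = w * p_spec + (1.0 - w) * p_modgd      # weighted score fusion
    clip_score = fused.mean(axis=0)               # aggregate over windows
    return (clip_score >= threshold).astype(int)  # multi-label decision
```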

Data availability

The datasets analyzed during the current study are available at https://www.upf.edu/web/mtg/irmas

References

  1. M. Airaksinen, L. Juvela, P. Alku, O. Räsänen, Data augmentation strategies for neural network F0 estimation. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK (2019)

  2. R. Ajayakumar, R. Rajan, Predominant instrument recognition in polyphonic music using GMM-DNN framework. In: Proceedings of International Conference on Signal Processing and Communications (SPCOM), 1-5 (2020)

  3. G. Atkar, P. Jayaraju, Speech synthesis using generative adversarial network for improving readability of Hindi words to recuperate from dyslexia. Neural Comput. Appl. 1-10 (2021)

  4. J.J. Bosch, J. Janer, F. Fuhrmann, P. Herrera, A comparison of sound segregation techniques for predominant instrument recognition in musical audio signals. In: Proceedings of 13th International Society for Music Information Retrieval Conference (ISMIR), 552-564 (2012)

  5. C. Chen, Q. Li, A multimodal music emotion classification method based on multi-feature combined network classifier. Math. Probl. Eng. 2020 (2020)

  6. S. Davis, P. Mermelstein, Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust. Speech Signal Process. 28(4), 357–366 (1980)


  7. A. Diment, P. Rajan, T. Heittola, T. Virtanen, Modified group delay feature for musical instrument recognition. In: Proceedings of 10th International Symposium on Computer Music Multidisciplinary Research (CMMR), Marseille, France, 431-438 (2013)

  8. T.-B. Do, H.-H. Nguyen, T.-T.-N. Nguyen, H. Vu, T.-T.-H. Tran, T.-L. Le, Plant identification using score-based fusion of multi-organ images. In: Proceedings of 9th International Conference on Knowledge and Systems Engineering (KSE), 191-196 (2017)

  9. C. Donahue, J.J. McAuley, M. Puckette, Adversarial audio synthesis. In: Proceedings of International Conference on Learning Representations (ICLR), 1-16 (2019)

  10. Z. Duan, J. Han, B. Pardo, Multi-pitch streaming of harmonic sound mixtures. IEEE/ACM Trans. Audio Speech Language Process. 22(1), 138–150 (2013)


  11. F. Fuhrmann, P. Herrera, Polyphonic instrument recognition for exploring semantic similarities in music. In: Proceedings of 13th International Conference on Digital Audio Effects (DAFx-10), 1-8 (2010)

  12. J. Gao, P. Li, Z. Chen, J. Zhang, A survey on deep learning for multimodal data fusion. Neural Comput. 32(5), 829–864 (2020). https://doi.org/10.1162/neco_a_01273


  13. D. Ghosal, M.H. Kolekar, Music genre recognition using deep neural networks and transfer learning. In: Proceedings of Interspeech, 2087-2091 (2018)

  14. X. Glorot, Y. Bengio, Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS), 249-256 (2010). JMLR Workshop and Conference Proceedings

  15. I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, A. Courville, Improved training of Wasserstein GANs. In: Proceedings of Neural Information Processing Systems (NIPS) (2017)

  16. S. Gururani, C. Summers, A. Lerch, Instrument activity detection in polyphonic music using deep neural networks. In: Proceedings of International Society for Music Information Retrieval Conference (ISMIR), 569-576 (2018)

  17. Y. Han, J. Kim, K. Lee, Deep convolutional neural networks for predominant instrument recognition in polyphonic music. IEEE/ACM Trans. Audio Speech Language Process. 25(1), 208–221 (2017)


  18. B. Hariharan, P. Arbeláez, R. Girshick, J. Malik, Hypercolumns for object segmentation and fine-grained localization. In: Proceedings of the IEEE conference on computer vision and pattern recognition, 447-456 (2015)

  19. T. Heittola, A. Klapuri, T. Virtanen, Musical instrument recognition in polyphonic audio using source-filter model for sound separation. In: Proceedings of International Society for Music Information Retrieval Conference (ISMIR), 327-332 (2009)

  20. J.S. Gómez, J. Abeßer, E. Cano, Jazz solo instrument classification with convolutional neural networks, source separation, and transfer learning. In: Proceedings of International Society for Music Information Retrieval Conference (ISMIR), 577-584 (2018)

  21. T. Kitahara, M. Goto, K. Komatani, T. Ogata, H.G. Okuno, Instrument identification in polyphonic music: feature weighting to minimize influence of sound overlaps. EURASIP J. Adv. Signal Process. 2007, 1–15 (2006)


  22. A. Kratimenos, K. Avramidis, C. Garoufis, A. Zlatintsi, P. Maragos, Augmentation methods on monophonic audio for instrument classification in polyphonic music. In: Proceedings of 28th European Signal Processing Conference, 156-160 (2021). IEEE

  23. J. Kong, J. Kim, J. Bae, HiFi-GAN: generative adversarial networks for efficient and high fidelity speech synthesis. Adv. Neural Inf. Process. Syst. 33, 17022–17033 (2020)


  24. P. Li, J. Qian, T. Wang, Automatic instrument recognition in polyphonic music using convolutional neural networks. arXiv:1511.05520 (2015)

  25. C.-J. Lin, C.-H. Lin, S.-Y. Jeng, Using feature fusion and parameter optimization of dual-input convolutional neural network for face gender recognition. Appl. Sci. (2020). https://doi.org/10.3390/app10093166


  26. A. Madhu, S. Kumaraswamy, Data augmentation using generative adversarial network for environmental sound classification. In: Proceedings of 27th European Signal Processing Conference, 1-5 (2019). IEEE

  27. B. McFee, C. Raffel, D. Liang, D.P.W. Ellis, M. McVicar, E. Battenberg, O. Nieto, librosa: audio and music signal analysis in Python. In: Proceedings of the 14th Python in Science Conference, 18-24 (2015). https://doi.org/10.25080/Majora-7b98e3ed-003

  28. S. Motamed, P. Rogalla, F. Khalvati, Data augmentation using generative adversarial networks (GANs) for GAN-based detection of pneumonia and COVID-19 in chest X-ray images. Inform. Med. Unlocked 27, 100779 (2021)


  29. H.A. Murthy, B. Yegnanarayana, Group delay functions and its applications in speech technology. Sadhana 36(5), 745–782 (2011)


  30. A.V. Oppenheim, R.W. Schafer, Discrete Time Signal Processing (Prentice Hall Inc, New Jersey, 1990)


  31. S. Oramas, F. Barbieri, O. Nieto Caballero, X. Serra, Multimodal deep learning for music genre classification. Trans. Int. Soc. Music Inf. Retr. 1(1), 4-21 (2018)

  32. D. O'Shaughnessy, Speech Communication: Human and Machine (Universities Press, 1987)

  33. L. Perez, J. Wang, The effectiveness of data augmentation in image classification using deep learning. arXiv:1712.04621 (2017)

  34. J. Pons, O. Slizovskaia, R. Gong, E. Gomez, X. Serra, Timbre analysis of music audio signals with convolutional neural networks. In: Proceedings of 25th European Signal Processing Conference, 2744-2748 (2017). IEEE

  35. K. Racharla, V. Kumar, C.B. Jayant, A. Khairkar, P. Harish, Predominant musical instrument classification based on spectral features. In: Proceedings of 7th International Conference on Signal Processing and Integrated Networks (SPIN), 617-622 (2020)

  36. R. Rajan, H.A. Murthy, Two-pitch tracking in co-channel speech using modified group delay functions. Speech Commun. 89, 37–46 (2017)


  37. R. Rajan, H.A. Murthy, Group delay based melody monopitch extraction from music. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 186-190 (2013)

  38. R. Rajan, Estimating pitch in speech and music using modified group delay functions. Ph.D. dissertation, Indian Institute of Technology, Madras (2017)

  39. R. Rajan, H.A. Murthy, Music genre classification by fusion of modified group delay and melodic features. In: Proceedings of Twenty-third National Conference on Communications (NCC), 1-6 (2017). https://doi.org/10.1109/NCC.2017.8077056

  40. R. Rajan, H.A. Murthy, Melodic pitch extraction from music signals using modified group delay functions. In: Proceedings of National Conference on Communications (NCC), 1-5 (2013). IEEE

  41. L.C. Reghunath, R. Rajan, Transformer-based ensemble method for multiple predominant instruments recognition in polyphonic music. EURASIP J. Audio Speech Music Process. 2022, 11 (2022). https://doi.org/10.1186/s13636-022-00245-8

  42. L.C. Reghunath, R. Rajan, Attention-based predominant instruments recognition in polyphonic music. In: Proceedings of 18th Sound and Music Computing Conference (SMC), 199-206 (2021)

  43. J. Sebastian, H.A. Murthy, Group delay-based music source separation using deep recurrent neural networks. In: Proceedings of International Conference on Signal Processing and Communications (SPCOM), 1-5 (2016). IEEE

  44. M. Seeland, P. Mäder, Multi-view classification with convolutional neural networks. PLoS ONE 16, e0245230 (2021). https://doi.org/10.1371/journal.pone.0245230


  45. O. Slizovskaia, E. Gomez Gutierrez, G. Haro Ortega, Automatic musical instrument recognition in audiovisual recordings by combining image and audio classification strategies. In: Proceedings of 13th Sound and Music Computing Conference (SMC), 442-447 (2016)

  46. M. Sukhavasi, S. Adapa, Music theme recognition using CNN and self-attention. arXiv:1911.07041 (2019)

  47. M. Uzair, N. Jamil, Effects of hidden layers on the efficiency of neural networks. In: Proceedings of IEEE 23rd International Multitopic Conference (INMIC), 1-6 (2020)

  48. W. Yao, A. Moumtzidou, C.O. Dumitru, A. Stelios, I. Gialampoukidis, S. Vrochidis, M. Datcu, I. Kompatsiaris, Early and late fusion of multiple modalities in Sentinel imagery and social media retrieval. In: Proceedings of International Conference on Pattern Recognition (ICPR) (2021)

  49. D. Yu, H. Duan, J. Fang, B. Zeng, Predominant instrument recognition based on deep neural network with auxiliary classification. IEEE/ACM Trans. Audio Speech Language Process. 28, 852–861 (2020)


  50. M.D. Zeiler, R. Fergus, Visualizing and understanding convolutional networks. In: Proceedings of European Conference on Computer Vision (ECCV), 818-833 (2014)


Author information


Contributions

Lekshmi C. R. and Rajeev Rajan jointly designed, implemented, and interpreted the computer simulations and prepared the manuscript. Rajeev Rajan implemented the modgd-gram algorithm.

Corresponding authors

Correspondence to C. R. Lekshmi or Rajan Rajeev.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Lekshmi, C.R., Rajeev, R. Multiple Predominant Instruments Recognition in Polyphonic Music Using Spectro/Modgd-gram Fusion. Circuits Syst Signal Process 42, 3464–3484 (2023). https://doi.org/10.1007/s00034-022-02278-y



  • DOI: https://doi.org/10.1007/s00034-022-02278-y
