Abstract
Identification of multiple predominant instruments in polyphonic music is addressed using convolutional neural networks (CNNs) applied to the Mel-spectrogram, the modgd-gram, and their fusion. The modgd-gram is a visual representation obtained by stacking the modified group delay functions of consecutive frames. The CNN learns distinctive local characteristics from the visual representation and classifies each instrument into the group to which it belongs. The proposed system is evaluated on the IRMAS dataset. We trained our networks on fixed-length audio excerpts to recognize multiple predominant instruments in variable-length test files. A wave-generative adversarial network (WaveGAN) architecture is also employed to generate audio files for data augmentation. We experimented with three fusion techniques: early fusion, mid-level fusion, and late (score-level) fusion. The late fusion experiment reports micro and macro F1 scores of 0.69 and 0.62, respectively. These scores are 7.81% and 12.73% higher than those obtained by the state-of-the-art Han’s model. The results indicate that a CNN with score-level fusion of the Mel-spectrogram and modgd-gram is well suited to recognizing predominant instruments in polyphonic music.
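As a rough illustration of score-level (late) fusion and of the two reported metrics, the sketch below averages per-class scores from two hypothetical models (one fed Mel-spectrograms, one fed modgd-grams), thresholds the fused scores into multi-label predictions, and computes micro and macro F1. The equal fusion weight, the 0.5 threshold, the toy scores, and all function names are assumptions made for illustration; the abstract does not specify the fusion rule or the decision threshold actually used.

```python
from typing import List, Sequence, Tuple


def late_fusion(scores_a: Sequence[float], scores_b: Sequence[float],
                weight_a: float = 0.5) -> List[float]:
    """Weighted average of per-class scores from two models (assumed fusion rule)."""
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(scores_a, scores_b)]


def to_labels(scores: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Mark a class as present when its fused score reaches the threshold (assumed 0.5)."""
    return [1 if s >= threshold else 0 for s in scores]


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true: Sequence[Sequence[int]],
                   y_pred: Sequence[Sequence[int]]) -> Tuple[float, float]:
    """Micro F1 pools counts over all classes; macro F1 averages per-class F1."""
    n_classes = len(y_true[0])
    tp = [0] * n_classes
    fp = [0] * n_classes
    fn = [0] * n_classes
    for true_row, pred_row in zip(y_true, y_pred):
        for c, (t, p) in enumerate(zip(true_row, pred_row)):
            if t and p:
                tp[c] += 1
            elif p and not t:
                fp[c] += 1
            elif t and not p:
                fn[c] += 1
    micro = f1_from_counts(sum(tp), sum(fp), sum(fn))
    macro = sum(f1_from_counts(tp[c], fp[c], fn[c]) for c in range(n_classes)) / n_classes
    return micro, macro


# Toy per-class scores for two clips and three instrument classes (made-up values).
mel_scores = [[0.9, 0.2, 0.4], [0.3, 0.8, 0.1]]
modgd_scores = [[0.7, 0.1, 0.7], [0.2, 0.6, 0.6]]
y_true = [[1, 0, 1], [0, 1, 1]]

y_pred = [to_labels(late_fusion(a, b)) for a, b in zip(mel_scores, modgd_scores)]
micro_f1, macro_f1 = micro_macro_f1(y_true, y_pred)
print(y_pred, round(micro_f1, 3), round(macro_f1, 3))
```

Micro F1 weights every instrument occurrence equally, so frequent classes dominate it, whereas macro F1 weights every class equally; this is one reason the two reported scores can differ.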
Data availability
The datasets analyzed during the current study are available at https://www.upf.edu/web/mtg/irmas
References
M. Airaksinen, L. Juvela, P. Alku, O. Räsänen, Data augmentation strategies for neural network F0 estimation. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 10-15, Brighton, UK (2019)
R. Ajayakumar, R. Rajan, Predominant instrument recognition in polyphonic music using GMM-DNN framework. In: Proceedings of International Conference on Signal Processing and Communications (SPCOM), 1-5 (2020)
G. Atkar, P. Jayaraju, Speech synthesis using generative adversarial network for improving readability of Hindi words to recuperate from dyslexia. Neural Computing and Applications, 1-10 (2021)
J.J. Bosch, J. Janer, F. Fuhrmann, P. Herrera, A comparison of sound segregation techniques for predominant instrument recognition in musical audio signals. In: Proceedings of 13th International Society for Music Information Retrieval Conference (ISMIR) 552-564 (2012)
C. Chen, Q. Li, A multimodal music emotion classification method based on multi-feature combined network classifier. Math. Probl. Eng. 2020 (2020)
S. Davis, P. Mermelstein, Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust. Speech Signal Process. 28(4), 357–366 (1980)
A. Diment, P. Rajan, T. Heittola, T. Virtanen, Modified group delay feature for musical instrument recognition. In: Proceedings of 10th International Symposium on Computer Music Multidisciplinary Research (CMMR), Marseille, France, 431-438 (2013)
T.-B. Do, H.-H. Nguyen, T.-T.-N. Nguyen, H. Vu, T.-T.-H. Tran, T.-L. Le, Plant identification using score-based fusion of multi-organ images. In: Proceedings of 9th International Conference on Knowledge and Systems Engineering (KSE), 191-196 (2017)
C. Donahue, J.J. McAuley, M. Puckette, Adversarial audio synthesis. In: Proceedings of International Conference on Learning Representations (ICLR), 1-16 (2019)
Z. Duan, J. Han, B. Pardo, Multi-pitch streaming of harmonic sound mixtures. IEEE/ACM Trans. Audio Speech Language Process. 22(1), 138–150 (2013)
F. Fuhrmann, P. Herrera, Polyphonic instrument recognition for exploring semantic similarities in music. In: Proceedings of 13th International Conference on Digital Audio Effects DAFx10, pp. 1-8 (2010)
J. Gao, P. Li, Z. Chen, J. Zhang, A survey on deep learning for multimodal data fusion. Neural Comput. 32(5), 829–864 (2020). https://doi.org/10.1162/neco_a_01273
D. Ghosal, M.H. Kolekar, Music genre recognition using deep neural networks and transfer learning. In: Proceedings of Interspeech, 2087-2091 (2018)
X. Glorot, Y. Bengio, Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the thirteenth International conference on artificial intelligence and statistics, 249-256 (2010). JMLR Workshop and Conference Proceedings
I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, A. Courville, Improved training of wasserstein GANs. In: Proceedings of Neural Information Processing System (NIPS) (2017)
S. Gururani, C. Summers, A. Lerch, Instrument activity detection in polyphonic music using deep neural networks. In: Proceedings of International Society for Music Information Retrieval Conference (ISMIR), 569-576 (2018)
Y. Han, J. Kim, K. Lee, Deep convolutional neural networks for predominant instrument recognition in polyphonic music. IEEE/ACM Trans Audio Speech Language Process. 25(1), 208–221 (2017)
B. Hariharan, P. Arbeláez, R. Girshick, J. Malik, Hypercolumns for object segmentation and fine-grained localization. In: Proceedings of the IEEE conference on computer vision and pattern recognition, 447-456 (2015)
T. Heittola, A. Klapuri, T. Virtanen, Musical instrument recognition in polyphonic audio using source-filter model for sound separation. In: Proceedings of International Society of Music Information Retrieval Conference (ISMIR), 327-332 (2009)
G.C. Juan, A. Jakob, E. Cano, Jazz solo instrument classification with convolutional neural networks, source separation, and transfer learning. In: Proceedings of International Society for Music Information Retrieval Conference (ISMIR), 577-584 (2018)
T. Kitahara, M. Goto, K. Komatani, T. Ogata, H.G. Okuno, Instrument identification in polyphonic music: feature weighting to minimize influence of sound overlaps. EURASIP J. Adv. Signal Process. 2007, 1–15 (2006)
A. Kratimenos, K. Avramidis, C. Garoufis, A. Zlatintsi, P. Maragos, Augmentation methods on monophonic audio for instrument classification in polyphonic music. In: Proceedings of 28th European Signal Processing Conference, 156-160 (2021). IEEE
J. Kong, J. Kim, J. Bae, Hifi-gan: generative adversarial networks for efficient and high fidelity speech synthesis. Adv. Neural Inf. Process. Syst. 33, 17022–17033 (2020)
P. Li, J. Qian, T. Wang, Automatic instrument recognition in polyphonic music using convolutional neural networks. arXiv:1511.05520 (2015)
C.-J. Lin, C.-H. Lin, S.-Y. Jeng, Using feature fusion and parameter optimization of dual-input convolutional neural network for face gender recognition. Appl. Sci. (2020). https://doi.org/10.3390/app10093166
A. Madhu, S. Kumaraswamy, Data augmentation using generative adversarial network for environmental sound classification. In: Proceedings of 27th European Signal Processing Conference, 1-5 (2019). IEEE
B. McFee, C. Raffel, D. Liang, D. Ellis, M. Mcvicar, E. Battenberg, O. Nieto, librosa: Audio and music signal analysis in python, pp. 18-24 (2015). https://doi.org/10.25080/Majora-7b98e3ed-003
S. Motamed, P. Rogalla, F. Khalvati, Data augmentation using generative adversarial networks (gans) for gan-based detection of pneumonia and covid-19 in chest x-ray images. Inf. Med. Unlock. 27, 100779 (2021)
H.A. Murthy, B. Yegnanarayana, Group delay functions and its applications in speech technology. Sadhana 36(5), 745–782 (2011)
A.V. Oppenheim, R.W. Schafer, Discrete Time Signal Processing (Prentice Hall Inc, New Jersey, 1990)
S. Oramas, F. Barbieri, O. Nieto Caballero, X. Serra, Multimodal deep learning for music genre classification. Trans. Int. Soc. Music Inf. 4-21 (2018)
D. O’Shaughnessy, Speech communication: human and machine. Universities press, 1-5 (1987)
L. Perez, J. Wang, The effectiveness of data augmentation in image classification using deep learning. arXiv:1712.04621 (2017)
J. Pons, O. Slizovskaia, R. Gong, E. Gomez, X. Serra, Timbre analysis of music audio signals with convolutional neural networks. In: Proceedings of 25th European Signal Processing Conference, 2744-2748 (2017). IEEE
K. Racharla, V. Kumar, C.B. Jayant, A. Khairkar, P. Harish, Predominant musical instrument classification based on spectral features. In: Proceedings of 7th International Conference on Signal Processing and Integrated Networks (SPIN), 617-622 (2020)
R. Rajan, H.A. Murthy, Two-pitch tracking in co-channel speech using modified group delay functions. Speech Commun. 89, 37–46 (2017)
R. Rajan, H.A. Murthy, Group delay based melody monopitch extraction from music. In: Proceedings of the IEEE International Conference on Audio, Speech and Signal Processing, 186-190 (2013)
R. Rajan, Estimating pitch in speech and music using modified group delay functions. Ph.D. dissertation, Indian Institute of Technology, Madras (2017)
R. Rajan, H.A. Murthy, Music genre classification by fusion of modified group delay and melodic features. In: Proceedings of Twenty-third National Conference on Communications (NCC), 1-6 (2017). https://doi.org/10.1109/NCC.2017.8077056
R. Rajan, H.A. Murthy, Melodic pitch extraction from music signals using modified group delay functions. In: Proceedings of 2013 National Conference on Communications (NCC), pp. 1-5. IEEE, (2013)
L.C. Reghunath, R. Rajan, Transformer-based ensemble method for multiple predominant instruments recognition in polyphonic music. EURASIP J. Audio Speech Music Process. 11, 1–14 (2022). https://doi.org/10.1186/s13636-022-00245-8
L.C. Reghunath, R. Rajan, Attention-based predominant instruments recognition in polyphonic music. In: Proceedings of 18th Sound and Music Computing Conference (SMC), 199-206 (2021)
J. Sebastian, H.A. Murthy, Group delay-based music source separation using deep recurrent neural networks. In: Proceedings of International Conference on Signal Processing and Communications (SPCOM), 1-5 (2016). IEEE
M. Seeland, P. Mäder, Multi-view classification with convolutional neural networks. PLOS ONE 16, 0245230 (2021). https://doi.org/10.1371/journal.pone.0245230
O. Slizovskaia, E. Gomez Gutierrez, G. Haro Ortega, Automatic musical instrument recognition in audiovisual recordings by combining image and audio classification strategies. In: Proceedings of 13th Sound and Music Computing Conference (SMC), 442-447 (2016)
M. Sukhavasi, S. Adapa, Music theme recognition using cnn and self-attention. arXiv preprint arXiv:1911.07041 (2019)
M. Uzair, N. Jamil, Effects of hidden layers on the efficiency of neural networks. In: Proceedings of IEEE 23rd International Multitopic Conference (INMIC), 1-6 (2020). IEEE
W. Yao, A. Moumtzidou, C.O. Dumitru, A. Stelios, I. Gialampoukidis, S. Vrochidis, M. Datcu, I. Kompatsiaris, Early and late fusion of multiple modalities in sentinel imagery and social media retrieval. In: Proceedings of International Conference of Pattern Recognition (ICPR) (2021)
D. Yu, H. Duan, J. Fang, B. Zeng, Predominant instrument recognition based on deep neural network with auxiliary classification. IEEE/ACM Trans. Audio Speech Language Process. 28, 852–861 (2020)
M.D. Zeiler, R. Fergus, Visualizing and understanding convolutional networks. In: Proceedings of European Conference on Computer Vision (ECCV), 818-833 (2014)
Author information
Contributions
Lekshmi C. R and Rajeev Rajan jointly designed, implemented, and interpreted the computer simulations and prepared the manuscript. RR implemented the modgd-gram algorithm.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
About this article
Cite this article
Lekshmi, C.R., Rajeev, R. Multiple Predominant Instruments Recognition in Polyphonic Music Using Spectro/Modgd-gram Fusion. Circuits Syst Signal Process 42, 3464–3484 (2023). https://doi.org/10.1007/s00034-022-02278-y
DOI: https://doi.org/10.1007/s00034-022-02278-y