Abstract
There has been much research on estimating the direction of noise and speech sources, but few studies have addressed direction estimation for instrumental sound sources. In this study, we consider direction estimation for a single instrumental sound. Direction estimation by the multiple signal classification (MUSIC) method alone often produces large estimation errors. We therefore propose a technique that estimates the direction of instrumental sound sources by applying regression analysis with a convolutional neural network (CNN). Focusing on the overtone structure of instrumental sounds, we compute the MUSIC spectrum from the fundamental and harmonic components, which have relatively large amplitudes, and estimate the source direction with a CNN that takes these components as input. Simulations in a monaural environment demonstrate the effectiveness of the proposed method.
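As a rough illustration of the first stage of the pipeline the abstract describes, the sketch below computes a narrowband MUSIC pseudo-spectrum at a single harmonic frequency for a simulated linear microphone array. This is a minimal numpy sketch under assumed conditions: the array geometry, frequency, snapshot count, and all parameter values are illustrative, and the CNN regression stage that the paper applies to such spectra is not reproduced here.

```python
import numpy as np

def music_spectrum(X, n_sources, mic_positions, freq, angles, c=343.0):
    """Narrowband MUSIC pseudo-spectrum for a linear microphone array.

    X: (n_mics, n_snapshots) complex snapshots at frequency `freq` [Hz].
    """
    R = X @ X.conj().T / X.shape[1]        # spatial covariance matrix
    _, vecs = np.linalg.eigh(R)            # eigenvalues in ascending order
    En = vecs[:, :-n_sources]              # noise-subspace eigenvectors
    P = np.empty(len(angles))
    for i, th in enumerate(np.deg2rad(angles)):
        delays = mic_positions * np.sin(th) / c
        a = np.exp(-2j * np.pi * freq * delays)   # steering vector
        P[i] = 1.0 / np.abs(a.conj() @ En @ En.conj().T @ a)
    return P

# Illustrative demo: one source at 20 degrees, observed at an assumed
# harmonic component of 2 kHz with an 8-mic, 4 cm-spaced array.
rng = np.random.default_rng(0)
mics = np.arange(8) * 0.04
freq = 2000.0
true_deg = 20.0
a0 = np.exp(-2j * np.pi * freq * mics * np.sin(np.deg2rad(true_deg)) / 343.0)
s = rng.standard_normal(200) + 1j * rng.standard_normal(200)
X = np.outer(a0, s) + 0.01 * (rng.standard_normal((8, 200))
                              + 1j * rng.standard_normal((8, 200)))
angles = np.linspace(-90.0, 90.0, 181)
spectrum = music_spectrum(X, n_sources=1, mic_positions=mics,
                          freq=freq, angles=angles)
est_angle = angles[int(np.argmax(spectrum))]   # peak of the MUSIC spectrum
```

In the proposed method, pseudo-spectra of this kind would be evaluated at the fundamental and its harmonics and passed as input features to a CNN trained by regression, rather than taking the spectrum peak directly as above.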
Data Availability
The datasets generated during the current study are available from the corresponding author on reasonable request.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest related to this work.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Yamamoto, K., Ogihara, A. & Murata, H. Direction Estimation of Instrumental Sound Sources Using Regression Analysis by Convolutional Neural Network. Circuits Syst Signal Process 42, 7004–7021 (2023). https://doi.org/10.1007/s00034-023-02433-z