Abstract
Speaker recognition performance is compared on a comprehensive speech database for several voice-source models. The inverse problem of recovering the source from vowel segments is solved using a special speech-production model together with the voice-source models: the A-source, the piecewise-linear source, the nonparametric source, and the source found by the spectral relation method. In the first stage, we select the pulses for which the relative residual between the segmented pulse and its theoretical analog, computed with the speech-production model, is less than 0.25. For the selected pulses, a posteriori estimates of the determination error are computed and the final selection is made: only pulses whose a posteriori error estimate is below the accepted level of 0.3 are kept for recognition. In the parameter space found for each source model, a statistical model is built for each speaker and recognition is performed. For speaker recognition from a single vowel, the mean error is approximately 66% for the piecewise-linear source, 61% for the spectral relation method, and 33% for the A-source.
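The two-stage pulse selection described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the residual definition (Euclidean norm of the difference divided by the norm of the segmented pulse), the function names, and the input layout are assumptions, and the a posteriori error estimates are taken as precomputed inputs because the abstract does not say how they are obtained. Only the thresholds 0.25 and 0.3 come from the text.

```python
import numpy as np

RESIDUAL_MAX = 0.25  # stage 1 threshold, from the abstract
ERROR_MAX = 0.3      # stage 2 threshold, from the abstract


def relative_residual(segmented, modeled):
    """Assumed definition: ||segmented - modeled|| / ||segmented||."""
    segmented = np.asarray(segmented, dtype=float)
    modeled = np.asarray(modeled, dtype=float)
    return np.linalg.norm(segmented - modeled) / np.linalg.norm(segmented)


def select_pulses(segmented, modeled, posterior_errors):
    """Return indices of pulses that pass both selection stages.

    segmented, modeled: sequences of equal-length 1-D arrays, one pair per pulse
    posterior_errors:   hypothetical precomputed a posteriori error estimates
    """
    kept = []
    for i, (s, m, err) in enumerate(zip(segmented, modeled, posterior_errors)):
        # Stage 1: the model must reproduce the segmented pulse closely enough.
        if relative_residual(s, m) >= RESIDUAL_MAX:
            continue
        # Stage 2: the a posteriori error estimate must be below the accepted level.
        if err >= ERROR_MAX:
            continue
        kept.append(i)
    return kept


if __name__ == "__main__":
    pulse = np.array([3.0, 4.0])
    segmented = [pulse, pulse, pulse, pulse]
    modeled = [pulse, np.array([3.0, 3.0]), np.array([0.0, 4.0]), np.array([3.0, 3.0])]
    posterior_errors = [0.1, 0.5, 0.1, 0.2]
    print(select_pulses(segmented, modeled, posterior_errors))
```

In the toy data, the second pulse fails the a posteriori check and the third fails the residual check, so only the first and last pulses would go on to the recognition stage.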
Author information
Viktor Nikolaevich Sorokin. Born in 1938. Graduated from the Moscow Aviation Institute in 1963. Senior Research Fellow at the Institute for Information Transmission Problems of the Russian Academy of Sciences, Doctor of Physical and Mathematical Sciences. Author of three monographs (Theory of Speech Production, 1985; Speech Synthesis, 1992; Speech Processes, 2012) and more than 150 scientific papers. Scientific interests: the theory of speech production, automatic speech and speaker recognition, and speech synthesis.
Aleksandr Sergeevich Leonov. Born in 1948. Graduated from Moscow State University in 1972. Received candidate’s degree in 1975 and doctoral degree in 1988. Professor of the National Research Nuclear University (MEPhI). Scientific interests: mathematical physics, mathematical modeling, and methods for solving inverse and ill-posed problems. Author of three books and more than 140 scientific papers.
About this article
Cite this article
Sorokin, V.N., Leonov, A.S. Multisource Speech Analysis for Speaker Recognition. Pattern Recognit. Image Anal. 29, 181–193 (2019). https://doi.org/10.1134/S1054661818040260
DOI: https://doi.org/10.1134/S1054661818040260