Abstract
This paper considers the problem of recognizing a speaker from a given set of speakers, regardless of language and context. Recognition is tested on a database of spoken Russian numerals that contains speech segments from 216 men and 177 women, each of whom uttered 400 to 800 words. The speech was recorded with different types of microphones in different rooms at natural noise levels. Recognition is based on solving the inverse problem of finding the shape of the voice excitation pulse for each pitch period from the known speech segment. The pulse shape is defined as the inverse Fourier transform of the regularized ratio of the speech signal spectra on the intervals of the open and closed glottis. Recognition uses ten parameters: the pitch period; the duration of the open glottis interval; the times at which the source amplitude is maximum, minimum, or zero; the amplitude ratio of the minimum and maximum of the source pulse; three coefficients of the decomposition of the source function by the principal component method; and the vowel duration. With this procedure, for an utterance of a word that contains one vowel, the false rejection rate (FRR) for men is 1.7–5.4% and the false acceptance rate (FAR) is 5.4–7.1%. For women, FRR = 2–5.2% and FAR = 5.2–6.3%. The recognition error decreases as the number of vowels in the speech signal increases. With 10 vowels, FRR = 0.05–0.2% and FAR = 0.07–0.8% for men, and FRR = 0.09–0.2% and FAR = 0.17–2.1% for women.
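The regularized spectral ratio mentioned in the abstract can be illustrated with a short Python sketch. This is not the authors' implementation: the abstract does not state the form of the regularizer, so a Tikhonov-style damping term is assumed here. The function name `regularized_pulse`, the parameter `alpha`, and the scaling of `alpha` by the peak spectral power are hypothetical choices made only for illustration, and the two input segments are assumed to be already cut out of one pitch period.

```python
import numpy as np


def regularized_pulse(open_seg, closed_seg, alpha=1e-2, n_fft=None):
    """Estimate a pulse as the inverse FFT of a regularized spectral ratio.

    open_seg   -- samples from the open-glottis interval (assumed given)
    closed_seg -- samples from the closed-glottis interval (assumed given)
    alpha      -- hypothetical regularization weight, relative to peak power
    """
    open_seg = np.asarray(open_seg, dtype=float)
    closed_seg = np.asarray(closed_seg, dtype=float)
    if n_fft is None:
        n_fft = max(len(open_seg), len(closed_seg))

    # Spectra of the two segments, zero-padded to a common length.
    s_open = np.fft.rfft(open_seg, n=n_fft)
    s_closed = np.fft.rfft(closed_seg, n=n_fft)

    # Plain division s_open / s_closed is unstable where |s_closed| is small.
    # Assumed Tikhonov-style form: multiply by the conjugate and damp the
    # denominator with a positive term proportional to alpha.
    power = np.abs(s_closed) ** 2
    ratio = s_open * np.conj(s_closed) / (power + alpha * power.max())

    # Back to the time domain: the estimated pulse shape.
    return np.fft.irfft(ratio, n=n_fft)


# Synthetic illustration (not data from the paper): a smooth pulse passed
# through a decaying resonance by circular convolution.
n = np.arange(128)
resonance = np.exp(-0.05 * n) * np.cos(0.3 * n)
true_pulse = np.zeros(128)
true_pulse[10:40] = np.hanning(30)
observed = np.fft.irfft(np.fft.rfft(true_pulse) * np.fft.rfft(resonance), n=128)
estimate = regularized_pulse(observed, resonance, alpha=1e-10)
```

In this synthetic setting a very small `alpha` returns an estimate close to `true_pulse`, while a larger `alpha` suppresses frequency bins in which the closed-glottis spectrum is weak and therefore shrinks the estimate. How the regularization parameter is actually chosen in the paper is not stated in the abstract.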
Author information
Additional information
Viktor Nikolaevich Sorokin. Born in 1938. Graduated from the Moscow Aviation Institute in 1963. Doctor of Physical and Mathematical Sciences, Senior Research Fellow at the Institute for Information Transmission Problems of the Russian Academy of Sciences. Author of three monographs (Theory of Speech Production, 1985; Speech Synthesis, 1992; Speech Processes, 2012) and about 150 scientific papers. Scientific interests: the theory of speech production, automatic speech and speaker recognition, and speech synthesis.
Aleksandr Sergeevich Leonov. Born in 1948. Graduated from Moscow State University in 1972. Received his candidate’s degree in 1975 and his doctoral degree in 1988. Professor at the National Research Nuclear University (MEPhI). Scientific interests: mathematical physics, mathematical modeling, and methods for solving inverse and ill-posed problems. Author of three books and more than 140 scientific papers.
Vladimir Grigor’evich Trunov. Born in 1947. Graduated from the Moscow Institute of Physics and Technology in 1970. Defended his thesis in 1986. Senior Research Fellow at the Institute for Information Transmission Problems of the Russian Academy of Sciences. Author of more than 80 scientific papers. Scientific interests: pattern recognition, probability theory, mathematical statistics, and mathematical modeling of bioelectric processes.
About this article
Cite this article
Sorokin, V.N., Leonov, A.S. & Trunov, V.G. Speaker recognition regardless of context and language on a fixed set of competitors. Pattern Recognit. Image Anal. 26, 450–459 (2016). https://doi.org/10.1134/S105466181602022X