Abstract
The total variability model (TVM)/probabilistic linear discriminant analysis (PLDA) framework is one of the most popular methods for speaker verification. In this framework, i-vector representations are first extracted from utterances via an estimated TVM and then used to estimate the PLDA parameters for classification. Because the TVM and PLDA are estimated serially, any information lost in the TVM is inherited by the i-vectors and passed on to the PLDA classifier. Worse, the PLDA cannot compensate for this loss. To solve this problem, we propose a task-driven variability model (TDVM) that jointly estimates the TVM and the PLDA classifier. In this method, feedback from the PLDA steers the TVM solution toward the space with maximum between-class separation and minimum within-class variation. This space is also suitable for open-set testing, in which unenrolled speakers must be handled. Unlike most embedding methods, which extract embeddings through stacked network layers, the TDVM makes explicit assumptions about its latent variables, which improves the interpretability of speaker representation extraction. The proposed method is evaluated on the King-ASR-010 and VoxCeleb databases, and the experimental results show that the TDVM achieves better performance than the traditional TVM/PLDA and a VGG-M network trained with different cost functions.
References
D. Bansé, G.R. Doddington, D. Garcia-Romero, J.J. Godfrey, C.S. Greenberg, A.F. Martin, A. McCree, M. Przybocki, D.A. Reynolds, Summary and initial results of the 2013–2014 speaker recognition i-vector machine learning challenge, in Interspeech (2014), pp. 368–372
H. Bredin, Tristounet: triplet loss for speaker turn embedding, in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (2017), pp. 5430–5434
K. Chatfield, K. Simonyan, A. Vedaldi, A. Zisserman, Return of the devil in the details: delving deep into convolutional nets, in The British Machine Vision Conference (2014)
C. Chen, J. Han, Y. Pan, Speaker verification via estimating total variability space using probabilistic partial least squares, in Interspeech (2017), pp. 1537–1541
S. Chopra, R. Hadsell, Y. LeCun, Learning a similarity metric discriminatively, with application to face verification, in 2005 IEEE Conference on Computer Vision and Pattern Recognition (2005), pp. 539–546
S. Cumani, P. Laface, Scoring heterogeneous speaker vectors using nonlinear transformations and tied PLDA models. IEEE/ACM Trans. Audio Speech Lang. Process. 26(5), 995–1009 (2018)
N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, P. Ouellet, Front-end factor analysis for speaker verification. IEEE Trans. Audio Speech Lang. Process. 19(4), 788–798 (2011)
D. Garcia-Romero, C.Y. Espy-Wilson, Analysis of i-vector length normalization in speaker recognition systems, in Interspeech (2011), pp. 249–252
R.L. Gorsuch, Factor Analysis, 2nd edn. (Lawrence Erlbaum Associates, 1983)
J.H. Hansen, T. Hasan, Speaker recognition by machines and humans: a tutorial review. IEEE Signal Process. Mag. 32(6), 74–99 (2015)
Y.Z. Isik, H. Erdogan, R. Sarikaya, S-vector: a discriminative representation derived from i-vector for speaker verification, in European Signal Processing Conference (2015), pp. 2097–2101
P. Kenny, P. Ouellet, N. Dehak, V. Gupta, P. Dumouchel, A study of interspeaker variability in speaker verification. IEEE Trans. Audio Speech Lang. Process. 16(5), 980–988 (2008)
P. Kenny, G. Boulianne, P. Ouellet, P. Dumouchel, Joint factor analysis versus eigenchannels in speaker recognition. IEEE Trans. Audio Speech Lang. Process. 15(4), 1435–1447 (2007)
T. Kinnunen, H. Li, An overview of text-independent speaker recognition: from features to supervectors. Speech Commun. 52(1), 12–40 (2010)
King-ASR-010: Chinese mandarin speech recognition corpus (desktop)-digit string-200 speakers. http://en.speechocean.com/datacenter/details/41.html. Accessed 27 Nov 2019
C.D. Kolstad, L.S. Lasdon, Derivative evaluation and computational experience with large bilevel mathematical programs. J. Optim. Theory Appl. 65(3), 485–499 (1990)
M.A. Laskar, R.H. Laskar, Integrating DNN-HMM technique with hierarchical multi-layer acoustic model for text-dependent speaker verification. Circuits Syst. Signal Process. 38, 3548–3572 (2019)
L. van der Maaten, G.E. Hinton, Visualizing data using t-SNE. J. Mach. Learn. Res. 9(Nov), 2579–2605 (2008)
Z. Lei, Y. Yang, Maximum likelihood i-vector space using PCA for speaker verification, in Interspeech (2011), pp. 2725–2728
N. Li, M.W. Mak, SNR-invariant PLDA modeling in nonparametric subspace for robust speaker verification. IEEE/ACM Trans. Audio Speech Lang. Process. 23(10), 1648–1659 (2015)
J. Ma, V. Sethu, E. Ambikairajah, K.A. Lee, Twin model G-PLDA for duration mismatch compensation in text-independent speaker verification, in Interspeech (2016), pp. 1853–1857
J. Ma, V. Sethu, E. Ambikairajah, K.A. Lee, Generalized variability model for speaker verification. IEEE Signal Process. Lett. 25(12), 1775–1779 (2018)
N. Maghsoodi, H. Sameti, H. Zeinali, T. Stafylakis, Speaker recognition with random digit strings using uncertainty normalized HMM-based i-vectors. IEEE/ACM Trans. Audio Speech Lang. Process. 27(11), 1815–1825 (2019)
M.W. Mak, X. Pang, J.T. Chien, Mixture of PLDA for noise robust i-vector speaker verification. IEEE/ACM Trans. Audio Speech Lang. Process. 24(1), 130–142 (2016)
A. Nagrani, J. Chung, A. Zisserman, VoxCeleb: a large-scale speaker identification dataset, in Interspeech (2017), pp. 2616–2620
M.A. Nematollahi, S.A.R. Al-Haddad, Distant speaker recognition: an overview. Int. J. Humanoid Robot. 13(2), 1–45 (2016)
S.J.D. Prince, J.H. Elder, Probabilistic linear discriminant analysis for inferences about identity, in Proceedings of IEEE International Conference on Computer Vision (2007), pp. 1–8
D.A. Reynolds, T.F. Quatieri, R.B. Dunn, Speaker verification using adapted Gaussian mixture models. Digital Signal Process. 10, 19–41 (2000)
S.O. Sadjadi, M. Slaney, A.L. Heck, MSR identity toolbox v1.0: A MATLAB toolbox for speaker recognition research. Speech Lang. Process. Tech. Comm. Newsl. 1(4), 1–32 (2013)
S.E. Shepstone, K.A. Lee, H. Li, Z. Tan, S.H. Jensen, Total variability modeling using source-specific priors. IEEE/ACM Trans. Audio Speech Lang. Process. 24(3), 504–517 (2016)
A. Sinha, P. Malo, K. Deb, A review on bilevel optimization: from classical to evolutionary approaches and applications. IEEE Trans. Evol. Comput. 22(2), 276–295 (2018)
D. Snyder, D. Garcia-Romero, D. Povey, S. Khudanpur, Deep neural network embeddings for text-independent speaker verification, in Interspeech (2017), pp. 999–1003
D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, S. Khudanpur, X-vectors: robust DNN embeddings for speaker recognition, in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (2018), pp. 5329–5333
M. Tipping, C. Bishop, Probabilistic principal component analysis. J. R. Stat. Soc. Ser. B (Stat. Method) 61(3), 611–622 (1999)
R. Travadi, S. Narayanan, Efficient estimation and model generalization for the total variability model. Comput. Speech Lang. 53, 43–64 (2019)
R. Travadi, S. Narayanan, Total variability layer in deep neural network embeddings for speaker verification. IEEE Signal Process. Lett. 26(6), 893–897 (2019)
E. Variani, X. Lei, E. McDermott, I.L. Moreno, J. Gonzalez-Dominguez, Deep neural networks for small footprint text-dependent speaker verification, in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (2014), pp. 4080–4084
V. Vestman, T. Kinnunen, Supervector compression strategies to speed up i-vector system development (2018). arXiv preprint arXiv:1805.01156
J. Villalba, A. Miguel, A. Ortega, E. Lleida, Bayesian networks to model the variability of speaker verification scores in adverse environments. IEEE/ACM Trans. Audio Speech Lang. Process. 24(12), 2327–2340 (2016)
Y. Xu, I. Mcloughlin, Y. Song, Improved i-vector representation for speaker diarization. Circuits Syst. Signal Process. 35(9), 3393–3404 (2016)
Y. Yang, S. Wang, M. Sun, Y. Qian, K. Yu, Generative adversarial networks based x-vector augmentation for robust probabilistic linear discriminant analysis in speaker verification, in 2018 11th International Symposium on Chinese Spoken Language Processing (2018), pp. 205–209
Y.Q. Yu, L. Fan, W.J. Li, Ensemble additive margin softmax for speaker verification, in 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (2019), pp. 6046–6050
C. Zhang, K. Koishida, J.H.L. Hansen, Text-independent speaker verification based on triplet convolutional neural network embeddings. IEEE/ACM Trans. Audio Speech Lang. Process. 26(9), 1633–1644 (2018)
Acknowledgements
This research is supported by the National Natural Science Foundation of China under Grant No. U1736210.
A Appendix
A.1 Derivation of Lower-Level Objective Function
The log-likelihood function of the parameter \(\varvec{T}\) can be expressed as:
where \(Q(\varvec{w}_{s,h})\) is an auxiliary function. Since the logarithm \(f(x) = \log (x)\) is concave, Jensen's inequality gives \(\log ({\mathbb {E}}[x])\ge {\mathbb {E}}[\log (x)]\). Thus, (20) can be transformed into the following:
In (21), \(\varvec{M}_{s,h}|\varvec{w}_{s,h}\sim {\mathbb {N}}(\varvec{m} + \varvec{Tw}_{s,h},\varvec{\varPhi })\); \({\mathbb {E}}_{\varvec{w}}\) is short for \({\mathbb {E}}_{\varvec{w}_{s,h}\sim Q}\), where the subscript "\(\varvec{w}_{s,h}\sim Q\)" indicates that the expectation is taken with respect to \(\varvec{w}_{s,h}\) drawn from \(Q\). The terms that do not depend on \(\varvec{T}\) are then dropped, so the lower-level function \(f_L\) can be written as (9).
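As a sketch of the steps just described, the bound referred to as (20) and (21) plausibly takes the standard variational form below. This is a reconstruction under assumptions, not the authors' typeset equations: the integral notation, the sum over \(s,h\), and the factorization \(P(\varvec{M}_{s,h},\varvec{w}_{s,h};\varvec{T}) = P(\varvec{M}_{s,h}|\varvec{w}_{s,h};\varvec{T})P(\varvec{w}_{s,h})\) with a prior that does not depend on \(\varvec{T}\) are taken from the usual TVM formulation.

```latex
% Hedged reconstruction; \boldsymbol stands in for the article's bold symbols.
% Assumed form of (20): insert the auxiliary density Q(w_{s,h}).
\begin{align*}
\log P(\boldsymbol{M};\boldsymbol{T})
  &= \sum_{s,h}\log\int P(\boldsymbol{M}_{s,h},\boldsymbol{w}_{s,h};\boldsymbol{T})\,
     \mathrm{d}\boldsymbol{w}_{s,h} \\
  &= \sum_{s,h}\log\int Q(\boldsymbol{w}_{s,h})\,
     \frac{P(\boldsymbol{M}_{s,h},\boldsymbol{w}_{s,h};\boldsymbol{T})}
          {Q(\boldsymbol{w}_{s,h})}\,\mathrm{d}\boldsymbol{w}_{s,h}
\end{align*}
% Assumed form of (21): Jensen's inequality moves the log inside the expectation.
\begin{align*}
\log P(\boldsymbol{M};\boldsymbol{T})
  &\ge \sum_{s,h}\mathbb{E}_{\boldsymbol{w}}\!\left[
       \log\frac{P(\boldsymbol{M}_{s,h}|\boldsymbol{w}_{s,h};\boldsymbol{T})\,
                 P(\boldsymbol{w}_{s,h})}{Q(\boldsymbol{w}_{s,h})}\right] \\
  &= \sum_{s,h}\mathbb{E}_{\boldsymbol{w}}\!\left[
       \log P(\boldsymbol{M}_{s,h}|\boldsymbol{w}_{s,h};\boldsymbol{T})\right]
   + \sum_{s,h}\mathbb{E}_{\boldsymbol{w}}\!\left[
       \log\frac{P(\boldsymbol{w}_{s,h})}{Q(\boldsymbol{w}_{s,h})}\right]
\end{align*}
% Only the first sum depends on T. With M | w ~ N(m + T w, Phi):
\begin{align*}
\sum_{s,h}\mathbb{E}_{\boldsymbol{w}}\!\left[
  \log P(\boldsymbol{M}_{s,h}|\boldsymbol{w}_{s,h};\boldsymbol{T})\right]
  = -\frac{1}{2}\sum_{s,h}\mathbb{E}_{\boldsymbol{w}}\!\left[
    (\boldsymbol{M}_{s,h}-\boldsymbol{m}-\boldsymbol{T}\boldsymbol{w}_{s,h})^{\top}
    \boldsymbol{\varPhi}^{-1}
    (\boldsymbol{M}_{s,h}-\boldsymbol{m}-\boldsymbol{T}\boldsymbol{w}_{s,h})\right]
  + \text{const}
\end{align*}
```

Under these assumptions, dropping the constant and the sum involving only \(Q\) and the prior leaves a quadratic expression in \(\varvec{T}\), which is the kind of term the text says is retained in \(f_L\).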
A.2 Convergence Analysis
For the development supervector set, we have:
where \(\theta \) is the parameter set \(\{\varvec{T},\varvec{\varLambda },\varvec{\varPsi }\}\). Then, the logarithm of (22) can be expressed as:
Here, for each iteration index i of the TDVM, we set \(L(\theta ,\theta ^{(i)})\), \(H(\theta ,\theta ^{(i)})\) and \(G(\theta ,\theta ^{(i)})\) as follows:
Therefore, \(\sum _{s,h}\log P(\varvec{M}_{s,h};\theta )\) can be rewritten as follows:
In (25), \(\theta \) is set to \(\theta ^{(i+1)}\) and to \(\theta ^{(i)}\) in turn, and the two resulting expressions are subtracted to obtain the following formula:
For the first term, since \(\theta ^{(i+1)}\) maximizes \(L(\theta ,\theta ^{(i)})\) over \(\theta \), the following inequality can be obtained:
The second term can be transformed as follows:
The third term is similar to the second one, and therefore, the following inequality can be obtained:
According to (27)–(29), we have:
Thus, the value of the log-likelihood function does not decrease from one iteration to the next, and the TDVM algorithm converges.
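The monotonic behaviour argued above can be checked numerically on a much simpler model. The sketch below is not the TDVM and does not involve the PLDA terms \(\varvec{\varLambda }\) and \(\varvec{\varPsi }\): it runs plain EM on a single factor analysis model \(\varvec{x} = \varvec{Tw} + \varvec{\epsilon }\) with isotropic noise of known variance, an assumption made only to keep the update short, and records the marginal log-likelihood after every iteration. All names (`log_likelihood`, `em_step`, `run_demo`) and the dimensions are illustrative.

```python
import numpy as np


def log_likelihood(X, T, sigma2):
    """Marginal log-likelihood of the rows of X under x ~ N(0, T T^T + sigma2 * I)."""
    n, d = X.shape
    C = T @ T.T + sigma2 * np.eye(d)
    _, logdet = np.linalg.slogdet(C)
    quad = np.sum(X * np.linalg.solve(C, X.T).T)  # sum_n x_n^T C^{-1} x_n
    return -0.5 * (n * d * np.log(2.0 * np.pi) + n * logdet + quad)


def em_step(X, T, sigma2):
    """One EM update of T with the noise variance sigma2 held fixed (assumption)."""
    n, r = X.shape[0], T.shape[1]
    # E-step: w | x ~ N(cov @ T.T @ x / sigma2, cov), cov = (I + T^T T / sigma2)^{-1}
    cov = np.linalg.inv(np.eye(r) + T.T @ T / sigma2)
    Ew = X @ T @ cov / sigma2        # posterior means, one row per sample
    Eww = n * cov + Ew.T @ Ew        # sum over samples of E[w w^T]
    # M-step: T = (sum_n x_n E[w_n]^T) (sum_n E[w_n w_n^T])^{-1}
    return (X.T @ Ew) @ np.linalg.inv(Eww)


def run_demo(seed=0, iters=30, d=8, r=2, n=500, sigma2=0.1):
    """Fit T by EM on synthetic data and return the log-likelihood history."""
    rng = np.random.default_rng(seed)
    T_true = rng.normal(size=(d, r))
    X = rng.normal(size=(n, r)) @ T_true.T + np.sqrt(sigma2) * rng.normal(size=(n, d))
    T = rng.normal(size=(d, r))      # random starting point
    history = [log_likelihood(X, T, sigma2)]
    for _ in range(iters):
        T = em_step(X, T, sigma2)
        history.append(log_likelihood(X, T, sigma2))
    return history


if __name__ == "__main__":
    history = run_demo()
    print("first:", history[0], "last:", history[-1])
```

Each entry of the returned list should be no smaller than the previous one, up to floating-point error, which is the property the appendix argues for the TDVM itself through (27)–(29).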
Cite this article
Chen, C., Han, J. Task-Driven Variability Model for Speaker Verification. Circuits Syst Signal Process 39, 3125–3144 (2020). https://doi.org/10.1007/s00034-019-01315-7
DOI: https://doi.org/10.1007/s00034-019-01315-7