
Task-Driven Variability Model for Speaker Verification

Circuits, Systems, and Signal Processing

Abstract

The total variability model (TVM)/probabilistic linear discriminant analysis (PLDA) framework is one of the most popular methods for speaker verification. In this framework, i-vector representations are first extracted from utterances via an estimated TVM and then used to estimate the PLDA parameters for classification. Because the TVM and PLDA are estimated serially, the information loss in the TVM is inherited by the i-vectors and passed on to the PLDA classifier. More seriously, the PLDA cannot compensate for this loss. To solve this problem, we propose a task-driven variability model (TDVM) that jointly estimates the TVM and the PLDA classifier. In this method, feedback from the PLDA supervises the TVM solution so that it moves toward the space with maximum between-class separation and minimum within-class variation. This space is also suitable for open-set testing, in which unenrolled speakers must be handled. Unlike most embedding methods, which extract embedding representations via a stack of network structures, the TDVM contains explicit assumptions about latent variables, which improves the interpretability of speaker representation extraction. The proposed method is evaluated on the King-ASR-010 and VoxCeleb databases, and the experimental results show that the TDVM achieves better performance than the traditional TVM/PLDA and the VGG-M network with different cost functions.



Acknowledgements

This research is supported by the National Natural Science Foundation of China under Grant No. U1736210.

Author information

Corresponding author

Correspondence to Jiqing Han.


Appendix A

A.1 Derivation of Lower-Level Objective Function

The logarithmic likelihood function of parameter \(\varvec{T}\) can be expressed as:

$$\begin{aligned} \begin{aligned} L(\varvec{T})&= \sum _{s,h}\log P(\varvec{M}_{s,h};\varvec{T})\\&= \sum _{s,h}\log \sum _{\varvec{w}_{s,h}} P(\varvec{M}_{s,h},\varvec{w}_{s,h};\varvec{T})\\&= \sum _{s,h}\log \sum _{\varvec{w}_{s,h}}Q(\pmb w_{s,h})\frac{P(\varvec{M}_{s,h},\varvec{w}_{s,h};\varvec{T})}{Q(\pmb w_{s,h})} \end{aligned} \end{aligned}$$
(20)

where \(Q(\pmb w_{s,h})\) is an auxiliary distribution over \(\varvec{w}_{s,h}\). Since \(f(x) = \log (x)\) is a concave function, Jensen’s inequality gives \(\log (\mathbb E[x])\ge {\mathbb {E}}[\log (x)]\). Thus, (20) can be bounded from below as follows:

$$\begin{aligned} \begin{aligned} L(\varvec{T})&= \sum _{s,h}\log \sum _{\varvec{w}_{s,h}}Q(\varvec{w}_{s,h})\frac{P(\varvec{M}_{s,h},\varvec{w}_{s,h};\varvec{T})}{Q(\varvec{w}_{s,h})}\\&\ge \sum _{s,h}\sum _{\varvec{w}_{s,h}}Q(\varvec{w}_{s,h})\log \frac{P(\varvec{M}_{s,h},\varvec{w}_{s,h};\varvec{T})}{Q(\varvec{w}_{s,h})}\\&=\sum _{s,h}\sum _{\varvec{w}_{s,h}}Q(\varvec{w}_{s,h})\log \frac{P(\varvec{M}_{s,h}|\varvec{w}_{s,h};\varvec{T})P(\varvec{w}_{s,h})}{Q(\varvec{w}_{s,h})}\\&=\sum _{s,h}{\mathbb {E}}_{\varvec{w}}\Big [\log P(\varvec{M}_{s,h}|\varvec{w}_{s,h};\varvec{T})+\log P(\varvec{w}_{s,h})-\log Q(\varvec{w}_{s,h})\Big ]. \end{aligned} \end{aligned}$$
(21)

In (21), \(\varvec{M}_{s,h}|\varvec{w}_{s,h}\sim {\mathbb {N}}(\varvec{m} + \varvec{Tw}_{s,h},\varvec{\varPhi })\), and \({\mathbb {E}}_{\varvec{w}}\) is short for \(\mathbb E_{\varvec{w}_{s,h}\sim Q}\), where the subscript “\(\varvec{w}_{s,h}\sim Q\)” indicates that the expectation is taken with respect to \(\varvec{w}_{s,h}\) drawn from Q. Dropping the terms that do not depend on \(\varvec{T}\) yields the lower-level function \(f_L\) given in (9).
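
The bound in (21) can be checked numerically on a toy problem. The following is a minimal sketch, not the model from the paper: it assumes a latent variable that takes only three values (so the sum over \(\varvec{w}_{s,h}\) is finite), and the joint probabilities and the names `joint`, `q_arbitrary` and `q_posterior` are hypothetical. It illustrates that the lower bound never exceeds the log-likelihood and that the two coincide when Q is the exact posterior.

```python
import math

def log_marginal(joint):
    """log sum_w P(M, w), with the joint given as a list over latent values w."""
    return math.log(sum(joint))

def lower_bound(joint, q):
    """sum_w Q(w) * log(P(M, w) / Q(w)), skipping terms where Q(w) = 0."""
    return sum(qw * math.log(pw / qw) for pw, qw in zip(joint, q) if qw > 0.0)

# Hypothetical joint probabilities P(M, w) for one observation M and w in {0, 1, 2}.
joint = [0.10, 0.25, 0.05]

# An arbitrary auxiliary distribution Q(w), and the exact posterior P(w | M).
q_arbitrary = [0.5, 0.3, 0.2]
evidence = sum(joint)
q_posterior = [pw / evidence for pw in joint]

log_lik = log_marginal(joint)
bound_arbitrary = lower_bound(joint, q_arbitrary)
bound_posterior = lower_bound(joint, q_posterior)

print(f"log-likelihood     : {log_lik:.6f}")
print(f"bound, arbitrary Q : {bound_arbitrary:.6f}")
print(f"bound, posterior Q : {bound_posterior:.6f}")
```

Choosing Q as the posterior is what makes the bound tight, which is why EM-style procedures evaluate the posterior at the current parameters before maximizing the bound.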

A.2 Convergence Analysis

For the development supervector set, we have:

$$\begin{aligned} \begin{aligned} \prod _{s,h}P(\varvec{M}_{s,h};\theta )&=\frac{\prod _{s,h}P(\varvec{M}_{s,h},\varvec{w}_{s,h};\theta )}{\prod _{s,h}P(\varvec{w}_{s,h}|\varvec{M}_{s,h};\theta )}\\&=\frac{\prod _{s,h}P(\varvec{M}_{s,h},\varvec{w}_{s,h},\varvec{z}_{s};\theta )}{\prod _{s,h}P(\varvec{w}_{s,h}|\varvec{M}_{s,h};\theta )P(\varvec{z}_{s}|\varvec{w}_{s,h},\varvec{M}_{s,h};\theta )} \end{aligned} \end{aligned}$$
(22)

where \(\theta \) is the parameter set \(\{\varvec{T},\varvec{\varLambda },\varvec{\varPsi }\}\). Then, the logarithm of (22) can be expressed as:

$$\begin{aligned} \begin{aligned} \sum _{s,h}\log P(\varvec{M}_{s,h};\theta )=&\sum _{s,h}\log P(\varvec{M}_{s,h},\varvec{w}_{s,h},\varvec{z}_{s};\theta )\\&-\sum _{s,h}\log P(\varvec{w}_{s,h}|\varvec{M}_{s,h};\theta )\\&-\sum _{s,h}\log P(\varvec{z}_{s}|\varvec{w}_{s,h},\varvec{M}_{s,h};\theta ). \end{aligned} \end{aligned}$$
(23)

Here, for each iteration index i of the TDVM, we set \(L(\theta ,\theta ^{(i)})\), \(H(\theta ,\theta ^{(i)})\) and \(G(\theta ,\theta ^{(i)})\) as follows:

$$\begin{aligned} \left\{ \begin{aligned} L(\theta ,\theta ^{(i)}) =&\sum _{s,h}\sum _{\varvec{w}_{s,h}}\sum _{\varvec{z}_{s}}P(\varvec{w}_{s,h}|\varvec{M}_{s,h};\theta ^{(i)})P(\varvec{z}_{s}|\varvec{w}_{s,1},\dots ,\varvec{w}_{s,H_{s}};\theta ^{(i)}) \\&\qquad \cdot \log P(\varvec{M}_{s,h},\varvec{w}_{s,h},\varvec{z}_{s};\theta ) \\ H(\theta ,\theta ^{(i)}) =&\sum _{s,h}\sum _{\varvec{w}_{s,h}}P(\varvec{w}_{s,h}|\varvec{M}_{s,h};\theta ^{(i)})\log P(\varvec{w}_{s,h}|\varvec{M}_{s,h};\theta ) \\ G(\theta ,\theta ^{(i)}) =&\sum _{s,h}\sum _{\varvec{w}_{s,h}}\sum _{\varvec{z}_{s}}P(\varvec{w}_{s,h}|\varvec{M}_{s,h};\theta ^{(i)})P(\varvec{z}_{s}|\varvec{w}_{s,1},\dots ,\varvec{w}_{s,H_{s}};\theta ^{(i)}) \\&\qquad \cdot \log P(\varvec{z}_{s}|\varvec{w}_{s,h},\varvec{M}_{s,h};\theta ). \end{aligned} \right. \end{aligned}$$
(24)

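Taking the expectation of both sides of (23) with respect to the posteriors evaluated at \(\theta ^{(i)}\) leaves the left-hand side unchanged, since it does not depend on \(\varvec{w}_{s,h}\) or \(\varvec{z}_{s}\). Therefore, \(\sum _{s,h}\log P(\varvec{M}_{s,h};\theta )\) can be rewritten as follows: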

$$\begin{aligned} \sum _{s,h}\log P(\varvec{M}_{s,h};\theta )=L(\theta ,\theta ^{(i)})-H(\theta ,\theta ^{(i)})-G(\theta ,\theta ^{(i)}). \end{aligned}$$
(25)

In (25), \(\theta \) is set to \(\theta ^{(i+1)}\) and to \(\theta ^{(i)}\) in turn, with the second argument kept at \(\theta ^{(i)}\), and the two results are subtracted to obtain the following formula:

$$\begin{aligned} \begin{aligned}&\sum _{s,h}\log P(\varvec{M}_{s,h};\theta ^{(i+1)})-\sum _{s,h}\log P(\varvec{M}_{s,h};\theta ^{(i)})\\&\quad =\Big [L(\theta ^{(i+1)},\theta ^{(i)})-L(\theta ^{(i)},\theta ^{(i)})\Big ]-\Big [H(\theta ^{(i+1)},\theta ^{(i)})-H(\theta ^{(i)},\theta ^{(i)})\Big ]\\&\qquad -\Big [G(\theta ^{(i+1)},\theta ^{(i)})-G(\theta ^{(i)},\theta ^{(i)})\Big ]. \end{aligned} \end{aligned}$$
(26)

For the first term, since \(\theta ^{(i+1)}\) is chosen to maximize \(L(\theta ,\theta ^{(i)})\) over \(\theta \), the following inequality can be obtained:

$$\begin{aligned} L(\theta ^{(i+1)},\theta ^{(i)})-L(\theta ^{(i)},\theta ^{(i)}) \ge 0. \end{aligned}$$
(27)

The second term can be bounded using Jensen’s inequality as follows:

$$\begin{aligned} \begin{aligned}&~-\Big [H(\theta ^{(i+1)},\theta ^{(i)})-H(\theta ^{(i)},\theta ^{(i)})\Big ]\\&\quad =~-\sum _{s,h}\sum _{\varvec{w}_{s,h}}P(\varvec{w}_{s,h}|\varvec{M}_{s,h};\theta ^{(i)})\log \frac{P(\varvec{w}_{s,h}|\varvec{M}_{s,h};\theta ^{(i+1)})}{P(\varvec{w}_{s,h}|\varvec{M}_{s,h};\theta ^{(i)})}\\&\quad \ge ~-\sum _{s,h}\log \Big [\sum _{\varvec{w}_{s,h}}P(\varvec{w}_{s,h}|\varvec{M}_{s,h};\theta ^{(i)})\frac{P(\varvec{w}_{s,h}|\varvec{M}_{s,h};\theta ^{(i+1)})}{P(\varvec{w}_{s,h}|\varvec{M}_{s,h};\theta ^{(i)})}\Big ]\\&\quad =~-\sum _{s,h}\log \Big [\sum _{\varvec{w}_{s,h}}P(\varvec{w}_{s,h}|\varvec{M}_{s,h};\theta ^{(i+1)})\Big ]=0. \end{aligned} \end{aligned}$$
(28)

The third term is handled in the same way as the second one, and therefore, the following inequality can be obtained:

$$\begin{aligned} -[G(\theta ^{(i+1)},\theta ^{(i)})-G(\theta ^{(i)},\theta ^{(i)})] \ge 0. \end{aligned}$$
(29)
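
The non-negativity used for the second and third terms is the statement that the expected log-ratio between two distributions, taken under the first one, cannot be negative (a Kullback-Leibler divergence). The sketch below checks this on two hypothetical three-valued distributions that merely stand in for the posteriors at \(\theta ^{(i)}\) and \(\theta ^{(i+1)}\); the numbers and the function name are illustrative assumptions, not values from the paper.

```python
import math

def expected_log_ratio(p, q):
    """sum_w p(w) * log(p(w) / q(w)); assumes q(w) > 0 wherever p(w) > 0."""
    return sum(pw * math.log(pw / qw) for pw, qw in zip(p, q) if pw > 0.0)

# Hypothetical posteriors over a three-valued latent variable.
posterior_old = [0.6, 0.3, 0.1]  # stands in for P(w | M; theta^(i))
posterior_new = [0.4, 0.4, 0.2]  # stands in for P(w | M; theta^(i+1))

gap = expected_log_ratio(posterior_old, posterior_new)
same = expected_log_ratio(posterior_old, posterior_old)

print(f"gap between different posteriors: {gap:.6f}")
print(f"gap between identical posteriors: {same:.6f}")
```

The gap vanishes only when the two distributions agree, so these terms can only help the log-likelihood increase from one iteration to the next.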

According to (27)–(29), we have:

$$\begin{aligned} \begin{aligned}&\sum _{s,h}\log P(\varvec{M}_{s,h};\theta ^{(i+1)})-\sum _{s,h}\log P(\varvec{M}_{s,h};\theta ^{(i)})\ge 0. \end{aligned} \end{aligned}$$
(30)

Thus, the value of the log-likelihood function is non-decreasing over the iterations. Provided that the log-likelihood is bounded above, the sequence of its values converges, and in this sense the algorithm for the TDVM is convergent.
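
The monotonic behaviour in (30) can be observed on a much simpler model than the TDVM. The following is a minimal sketch under stated assumptions: a scalar analogue of the total variability model, x = m + t·w + noise with w ~ N(0, 1) and noise variance `phi`, where only the loading `t` is re-estimated by EM and `m` and `phi` are held fixed. The data are synthetic, and all names (`em_step`, `true_t`, `history`) are hypothetical; this is not the paper's joint TVM/PLDA update.

```python
import math
import random

def log_likelihood(xs, m, t, phi):
    """Marginal log-likelihood under x ~ N(m, t^2 + phi)."""
    var = t * t + phi
    return sum(-0.5 * (math.log(2.0 * math.pi * var) + (x - m) ** 2 / var) for x in xs)

def em_step(xs, m, t, phi):
    """One EM update of the loading t, with m and phi held fixed."""
    post_var = phi / (phi + t * t)                 # Var[w | x], same for every x
    num = 0.0
    den = 0.0
    for x in xs:
        post_mean = t * (x - m) / (phi + t * t)    # E[w | x]
        num += (x - m) * post_mean
        den += post_var + post_mean ** 2           # E[w^2 | x]
    return num / den

# Synthetic data from the assumed scalar model.
random.seed(0)
m, phi, true_t = 0.0, 0.5, 2.0
xs = [m + true_t * random.gauss(0.0, 1.0) + random.gauss(0.0, math.sqrt(phi))
      for _ in range(500)]

t = 0.1  # deliberately poor starting point
history = [log_likelihood(xs, m, t, phi)]
for _ in range(30):
    t = em_step(xs, m, t, phi)
    history.append(log_likelihood(xs, m, t, phi))

monotone = all(b >= a - 1e-9 for a, b in zip(history, history[1:]))
print(f"estimated |t| = {abs(t):.3f}")
print(f"log-likelihood non-decreasing: {monotone}")
```

The small tolerance in the monotonicity check only absorbs floating-point rounding once the updates have nearly converged.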


Cite this article

Chen, C., Han, J. Task-Driven Variability Model for Speaker Verification. Circuits Syst Signal Process 39, 3125–3144 (2020). https://doi.org/10.1007/s00034-019-01315-7
