Task-Driven Variability Model for Speaker Verification

Chen, Chen; Han, Jiqing

doi:10.1007/s00034-019-01315-7

Published: 27 November 2019

Circuits, Systems, and Signal Processing, Volume 39, pages 3125–3144 (2020)

Abstract

The total variability model (TVM)/probabilistic linear discriminant analysis (PLDA) framework is one of the most popular methods for speaker verification. In this framework, i-vector representations are first extracted from utterances via an estimated TVM and then used to estimate the PLDA parameters for classification. Because the TVM and PLDA are estimated serially, any information lost in the TVM is inherited by the i-vectors and passed on to the PLDA classifier, which cannot compensate for it. To solve this problem, we propose a task-driven variability model (TDVM) that jointly estimates the TVM and the PLDA classifier. In this method, feedback from the PLDA steers the TVM solution toward the space with maximum between-class separation and minimum within-class variation. This space is also suitable for open-set testing, in which unenrolled speakers must be handled. Unlike most embedding methods, which extract embeddings through a stack of network layers, the TDVM makes explicit assumptions about its latent variables, which makes the extraction of speaker representations easier to interpret. The proposed method is evaluated on the King-ASR-010 and VoxCeleb databases, and the experimental results show that the TDVM achieves better performance than the traditional TVM/PLDA framework and a VGG-M network trained with different cost functions.

Acknowledgements

This research is supported by the National Natural Science Foundation of China under Grant No. U1736210.

Author information

Authors and Affiliations

School of Computer Science and Technology, Harbin Institute of Technology, Harbin, 150001, Heilongjiang, China
Chen Chen & Jiqing Han

Corresponding author

Correspondence to Jiqing Han.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

A Appendix

A.1 Derivation of Lower-Level Objective Function

The log-likelihood function of the parameter $\varvec{T}$ can be expressed as:

$$\begin{aligned} \begin{aligned} L(\varvec{T})&= \sum _{s,h}\log P(\varvec{M}_{s,h};\varvec{T})\\&= \sum _{s,h}\log \sum _{\varvec{w}_{s,h}} P(\varvec{M}_{s,h},\varvec{w}_{s,h};\varvec{T})\\&= \sum _{s,h}\log \sum _{\varvec{w}_{s,h}}Q(\varvec{w}_{s,h})\frac{P(\varvec{M}_{s,h},\varvec{w}_{s,h};\varvec{T})}{Q(\varvec{w}_{s,h})} \end{aligned} \end{aligned}$$

(20)

where $Q(\varvec{w}_{s,h})$ is an auxiliary distribution over $\varvec{w}_{s,h}$. Since $f(x) = \log (x)$ is a concave function, Jensen’s inequality gives $\log ({\mathbb {E}}[x])\ge {\mathbb {E}}[\log (x)]$. Thus, (20) can be bounded from below as follows:

$$\begin{aligned} \begin{aligned} L(\varvec{T})&= \sum _{s,h}\log \sum _{\varvec{w}_{s,h}}Q(\varvec{w}_{s,h})\frac{P(\varvec{M}_{s,h},\varvec{w}_{s,h};\varvec{T})}{Q(\varvec{w}_{s,h})}\\&\ge \sum _{s,h}\sum _{\varvec{w}_{s,h}}Q(\varvec{w}_{s,h})\log \frac{P(\varvec{M}_{s,h},\varvec{w}_{s,h};\varvec{T})}{Q(\varvec{w}_{s,h})}\\&=\sum _{s,h}\sum _{\varvec{w}_{s,h}}Q(\varvec{w}_{s,h})\log \frac{P(\varvec{M}_{s,h}|\varvec{w}_{s,h};\varvec{T})P(\varvec{w}_{s,h})}{Q(\varvec{w}_{s,h})}\\&=\sum _{s,h}{\mathbb {E}}_{\varvec{w}}\Big [\log P(\varvec{M}_{s,h}|\varvec{w}_{s,h};\varvec{T})+\log P(\varvec{w}_{s,h})-\log Q(\varvec{w}_{s,h})\Big ]. \end{aligned} \end{aligned}$$

(21)

In (21), $\varvec{M}_{s,h}|\varvec{w}_{s,h}\sim {\mathbb {N}}(\varvec{m} + \varvec{Tw}_{s,h},\varvec{\varPhi })$, and ${\mathbb {E}}_{\varvec{w}}$ is short for ${\mathbb {E}}_{\varvec{w}_{s,h}\sim Q}$, where the subscript “$\varvec{w}_{s,h}\sim Q$” indicates that the expectation is taken with respect to $\varvec{w}_{s,h}$ drawn from $Q$. Dropping the terms that do not depend on $\varvec{T}$ gives the lower-level function $f_L$ in (9).
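As a numerical illustration of the bound in (21), the following minimal Python sketch evaluates both sides for a single observation with a discrete latent variable. It is not part of the derivation above: the joint probabilities are made-up values, the function names are hypothetical, and NumPy is assumed to be available. The sketch compares an arbitrary $Q$ with the posterior $P(w|M)$, for which the bound is expected to be tight.

```python
import numpy as np

def log_marginal(joint):
    """log P(M) = log sum_w P(M, w) for a discrete latent variable w."""
    return float(np.log(joint.sum()))

def lower_bound(joint, q):
    """Jensen lower bound: sum_w Q(w) * log(P(M, w) / Q(w))."""
    return float(np.sum(q * (np.log(joint) - np.log(q))))

# Hypothetical joint probabilities P(M, w) for one observed M and 4 latent states.
joint = np.array([0.10, 0.05, 0.20, 0.02])

q_uniform = np.full(4, 0.25)        # an arbitrary auxiliary distribution
q_posterior = joint / joint.sum()   # P(w | M), the choice that closes the gap

exact = log_marginal(joint)
bound_uniform = lower_bound(joint, q_uniform)
bound_posterior = lower_bound(joint, q_posterior)

print(f"log-likelihood      : {exact:.6f}")
print(f"bound, uniform Q    : {bound_uniform:.6f}")
print(f"bound, posterior Q  : {bound_posterior:.6f}")
```

The gap between the log-likelihood and the bound is the Kullback-Leibler divergence from $Q$ to the posterior, which is why the posterior choice removes it.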

A.2 Convergence Analysis

For the development supervector set, we have:

$$\begin{aligned} \begin{aligned} \prod _{s,h}P(\varvec{M}_{s,h};\theta )&=\frac{\prod _{s,h}P(\varvec{M}_{s,h},\varvec{w}_{s,h};\theta )}{\prod _{s,h}P(\varvec{w}_{s,h}|\varvec{M}_{s,h};\theta )}\\&=\frac{\prod _{s,h}P(\varvec{M}_{s,h},\varvec{w}_{s,h},\varvec{z}_{s};\theta )}{\prod _{s,h}P(\varvec{w}_{s,h}|\varvec{M}_{s,h};\theta )P(\varvec{z}_{s}|\varvec{w}_{s,h},\varvec{M}_{s,h};\theta )} \end{aligned} \end{aligned}$$

(22)

where $\theta $ is the parameter set $\{\varvec{T},\varvec{\varLambda },\varvec{\varPsi }\}$. Then, the logarithm of (22) can be expressed as:

$$\begin{aligned} \begin{aligned} \sum _{s,h}\log P(\varvec{M}_{s,h};\theta )=&\sum _{s,h}\log P(\varvec{M}_{s,h},\varvec{w}_{s,h},\varvec{z}_{s};\theta )\\&-\sum _{s,h}\log P(\varvec{w}_{s,h}|\varvec{M}_{s,h};\theta )\\&-\sum _{s,h}\log P(\varvec{z}_{s}|\varvec{w}_{s,h},\varvec{M}_{s,h};\theta ). \end{aligned} \end{aligned}$$

(23)

Here, for each iteration index i of the TDVM, we set $L(\theta ,\theta ^{(i)})$, $H(\theta ,\theta ^{(i)})$ and $G(\theta ,\theta ^{(i)})$ as follows:

$$\begin{aligned} \left\{ \begin{aligned} L(\theta ,\theta ^{(i)}) =&\sum _{s,h}\sum _{\varvec{w}_{s,h}}\sum _{\varvec{z}_{s}}P(\varvec{w}_{s,h}|\varvec{M}_{s,h};\theta ^{(i)})P(\varvec{z}_{s}|\varvec{w}_{s,1},\dots ,\varvec{w}_{s,H_{s}};\theta ^{(i)}) \\&\qquad \cdot \log P(\varvec{M}_{s,h},\varvec{w}_{s,h},\varvec{z}_{s};\theta ) \\ H(\theta ,\theta ^{(i)}) =&\sum _{s,h}\sum _{\varvec{w}_{s,h}}P(\varvec{w}_{s,h}|\varvec{M}_{s,h};\theta ^{(i)})\log P(\varvec{w}_{s,h}|\varvec{M}_{s,h};\theta ) \\ G(\theta ,\theta ^{(i)}) =&\sum _{s,h}\sum _{\varvec{w}_{s,h}}\sum _{\varvec{z}_{s}}P(\varvec{w}_{s,h}|\varvec{M}_{s,h};\theta ^{(i)})P(\varvec{z}_{s}|\varvec{w}_{s,1},\dots ,\varvec{w}_{s,H_{s}};\theta ^{(i)}) \\&\qquad \cdot \log P(\varvec{z}_{s}|\varvec{w}_{s,h},\varvec{M}_{s,h};\theta ). \end{aligned} \right. \end{aligned}$$

(24)

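Taking the expectation of both sides of (23) with respect to the posterior distributions of $\varvec{w}_{s,h}$ and $\varvec{z}_{s}$ under $\theta ^{(i)}$ leaves the left-hand side unchanged, since it does not depend on the latent variables. Therefore, $\sum _{s,h}\log P(\varvec{M}_{s,h};\theta )$ can be rewritten as follows: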

$$\begin{aligned} \sum _{s,h}\log P(\varvec{M}_{s,h};\theta )=L(\theta ,\theta ^{(i)})-H(\theta ,\theta ^{(i)})-G(\theta ,\theta ^{(i)}). \end{aligned}$$

(25)
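The decomposition in (25) can be checked numerically on a toy model. The sketch below is a simplification under stated assumptions, not the TDVM itself: it keeps a single discrete latent variable $w\in \{0,1\}$ (so there is no $\varvec{z}_{s}$ and no $G$ term), uses one scalar observation with equal priors and unit-variance Gaussian components, and treats $\theta $ as the pair of component means. All names and values are hypothetical.

```python
import numpy as np

def joint(m_obs, theta):
    """P(M, w; theta) for w in {0, 1}: equal priors, unit-variance Gaussians."""
    means = np.asarray(theta, dtype=float)
    lik = np.exp(-0.5 * (m_obs - means) ** 2) / np.sqrt(2.0 * np.pi)
    return 0.5 * lik

def posterior(m_obs, theta):
    """P(w | M; theta), obtained by normalising the joint over w."""
    p = joint(m_obs, theta)
    return p / p.sum()

def L_term(m_obs, theta, theta_i):
    """Expected complete-data log-likelihood under the posterior at theta_i."""
    return float(np.sum(posterior(m_obs, theta_i) * np.log(joint(m_obs, theta))))

def H_term(m_obs, theta, theta_i):
    """Expected log-posterior at theta under the posterior at theta_i."""
    return float(np.sum(posterior(m_obs, theta_i) * np.log(posterior(m_obs, theta))))

m_obs = 0.7                # hypothetical observation
theta = (-1.0, 2.0)        # parameters at which the likelihood is evaluated
theta_i = (0.3, 0.5)       # parameters that define the weighting posterior

log_lik = float(np.log(joint(m_obs, theta).sum()))
decomposed = L_term(m_obs, theta, theta_i) - H_term(m_obs, theta, theta_i)

print(f"log P(M; theta) = {log_lik:.6f}")
print(f"L - H           = {decomposed:.6f}")
```

The two printed values agree for any choice of `theta_i`, which reflects the fact that the left-hand side of (25) does not depend on the latent variables.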

In (25), $\theta $ is set to $\theta ^{(i+1)}$ and to $\theta ^{(i)}$ in turn, and the two results are subtracted to obtain the following formula:

$$\begin{aligned} \begin{aligned}&\sum _{s,h}\log P(\varvec{M}_{s,h};\theta ^{(i+1)})-\sum _{s,h}\log P(\varvec{M}_{s,h};\theta ^{(i)})\\&\quad =\Big [L(\theta ^{(i+1)},\theta ^{(i)})-L(\theta ^{(i)},\theta ^{(i)})\Big ]-\Big [H(\theta ^{(i+1)},\theta ^{(i)})-H(\theta ^{(i)},\theta ^{(i)})\Big ]\\&\qquad -\Big [G(\theta ^{(i+1)},\theta ^{(i)})-G(\theta ^{(i)},\theta ^{(i)})\Big ]. \end{aligned} \end{aligned}$$

(26)

For the first term, since $\theta ^{(i+1)}$ maximizes $L(\theta ,\theta ^{(i)})$ over $\theta $, the following inequality can be obtained:

$$\begin{aligned} L(\theta ^{(i+1)},\theta ^{(i)})-L(\theta ^{(i)},\theta ^{(i)}) \ge 0. \end{aligned}$$

(27)

The second term can be transformed as follows, where the inequality follows from Jensen’s inequality:

$$\begin{aligned} \begin{aligned}&~-\Big [H(\theta ^{(i+1)},\theta ^{(i)})-H(\theta ^{(i)},\theta ^{(i)})\Big ]\\&\quad =~-\sum _{s,h}\sum _{\varvec{w}_{s,h}}P(\varvec{w}_{s,h}|\varvec{M}_{s,h};\theta ^{(i)})\log \frac{P(\varvec{w}_{s,h}|\varvec{M}_{s,h};\theta ^{(i+1)})}{P(\varvec{w}_{s,h}|\varvec{M}_{s,h};\theta ^{(i)})}\\&\quad \ge ~-\sum _{s,h}\log \Big [\sum _{\varvec{w}_{s,h}}P(\varvec{w}_{s,h}|\varvec{M}_{s,h};\theta ^{(i)})\frac{P(\varvec{w}_{s,h}|\varvec{M}_{s,h};\theta ^{(i+1)})}{P(\varvec{w}_{s,h}|\varvec{M}_{s,h};\theta ^{(i)})}\Big ]\\&\quad =~-\sum _{s,h}\log \Big [\sum _{\varvec{w}_{s,h}}P(\varvec{w}_{s,h}|\varvec{M}_{s,h};\theta ^{(i+1)})\Big ]=0. \end{aligned} \end{aligned}$$

(28)

The third term is handled in the same way as the second one, and therefore the following inequality can be obtained:

$$\begin{aligned} -\Big [G(\theta ^{(i+1)},\theta ^{(i)})-G(\theta ^{(i)},\theta ^{(i)})\Big ] \ge 0. \end{aligned}$$

(29)

According to (27)–(29), we have:

$$\begin{aligned} \begin{aligned}&\sum _{s,h}\log P(\varvec{M}_{s,h};\theta ^{(i+1)})-\sum _{s,h}\log P(\varvec{M}_{s,h};\theta ^{(i)})\ge 0. \end{aligned} \end{aligned}$$

(30)

Thus, the value of the log-likelihood function never decreases during the iterative process. Provided that it is bounded above, the sequence of log-likelihood values converges, and the algorithm for the TDVM is convergent in this sense.
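The monotonic behaviour stated above can be observed on a much simpler latent-variable model. The sketch below is only an illustration under stated assumptions and is not the TDVM algorithm: it runs standard EM for probabilistic PCA, that is, a model of the form $\varvec{M}=\varvec{m}+\varvec{Tw}+\varvec{\epsilon }$ with $\varvec{w}\sim {\mathbb {N}}(\varvec{0},\varvec{I})$ and isotropic noise of variance $\sigma ^2$ in place of a general $\varvec{\varPhi }$, with no PLDA part. The data are synthetic, and the function names, dimensions and iteration count are hypothetical.

```python
import numpy as np

def log_likelihood(X, T, sigma2):
    """Marginal log-likelihood of centred rows of X under N(0, T T^T + sigma2 I)."""
    n, d = X.shape
    C = T @ T.T + sigma2 * np.eye(d)
    _, logdet = np.linalg.slogdet(C)
    quad = np.sum(X * np.linalg.solve(C, X.T).T)   # sum_n x_n^T C^{-1} x_n
    return float(-0.5 * (n * d * np.log(2.0 * np.pi) + n * logdet + quad))

def em_step(X, T, sigma2):
    """One EM iteration for probabilistic PCA on centred data X of shape (n, d)."""
    n, d = X.shape
    q = T.shape[1]
    # E-step: posterior moments of w for every row under the current parameters.
    M_inv = np.linalg.inv(T.T @ T + sigma2 * np.eye(q))
    Ew = X @ T @ M_inv                              # rows are E[w_n]
    sum_Eww = n * sigma2 * M_inv + Ew.T @ Ew        # sum_n E[w_n w_n^T]
    # M-step: closed-form updates of the loading matrix and the noise variance.
    T_new = (X.T @ Ew) @ np.linalg.inv(sum_Eww)
    resid = (np.sum(X ** 2)
             - 2.0 * np.sum((X @ T_new) * Ew)
             + np.trace(sum_Eww @ T_new.T @ T_new))
    return T_new, resid / (n * d)

# Synthetic data: 200 samples in 6 dimensions generated from a rank-2 model.
rng = np.random.default_rng(0)
T_true = rng.normal(size=(6, 2))
X = rng.normal(size=(200, 2)) @ T_true.T + 0.3 * rng.normal(size=(200, 6))
X = X - X.mean(axis=0)                              # m is fixed to the sample mean

T, sigma2 = rng.normal(size=(6, 2)), 1.0            # arbitrary initialisation
history = [log_likelihood(X, T, sigma2)]
for _ in range(30):
    T, sigma2 = em_step(X, T, sigma2)
    history.append(log_likelihood(X, T, sigma2))

print("first five log-likelihood values:", np.round(history[:5], 3))
```

In this setting each entry of `history` is expected to be at least as large as the previous one up to floating-point rounding, mirroring (30).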

About this article

Cite this article

Chen, C., Han, J. Task-Driven Variability Model for Speaker Verification. Circuits Syst Signal Process 39, 3125–3144 (2020). https://doi.org/10.1007/s00034-019-01315-7

Received: 25 June 2019
Revised: 18 November 2019
Accepted: 20 November 2019
Published: 27 November 2019
Issue Date: June 2020
DOI: https://doi.org/10.1007/s00034-019-01315-7
