
TRSD: A Time-Varying and Region-Changed Speech Database for Speaker Recognition

Published in: Circuits, Systems, and Signal Processing

Abstract

Performance degradation caused by intraspeaker variability is an active topic in speaker recognition, and accuracy loss over time is a well-documented phenomenon in the field. In China, many people move between their birthplaces and workplaces, and the different cultural environments and customs affect their pronunciation. This work focuses on the time-varying and region-changed factors caused by such population migration. The paper introduces a time-varying and region-changed speech database (TRSD) collected from 55 university students over 3 years, containing 3795 utterances in total. To study the impact of these factors on speaker identification and to uncover hidden causes of performance degradation, a series of experiments is conducted on the database: changes in characteristic parameters (pitch, intensity, formant and spectrogram) are analyzed and grouped by gender and birthplace, and the Gaussian mixture model-universal background model (GMM-UBM), a deep neural network model, i-vector/PLDA and x-vector/PLDA are evaluated on TRSD to provide reference performance. For the time-varying and region-changed factors, the paper also provides three corresponding solutions: speaker model adaptation, cepstral mean normalization and mel-frequency cepstrum coefficient normalization.
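Of the solutions named in the abstract, cepstral mean normalization (CMN) is the most self-contained: it subtracts the per-coefficient mean of the cepstral features over time, removing slowly varying channel and session effects. The sketch below is a minimal illustration of that operation on a synthetic MFCC matrix, not the paper's implementation; the function name and the toy 100-frame-by-13-coefficient shape are assumptions for the example.

```python
import numpy as np

def cepstral_mean_normalization(mfcc):
    """Subtract the per-coefficient mean over the time axis.

    mfcc: array of shape (n_frames, n_coeffs). Returns an array of the
    same shape whose columns each have (near-)zero mean, which removes
    stationary convolutive channel effects from the cepstral features.
    """
    return mfcc - mfcc.mean(axis=0, keepdims=True)

# Toy example: 100 frames x 13 coefficients with a nonzero offset,
# standing in for MFCCs extracted from one utterance.
rng = np.random.default_rng(0)
mfcc = rng.normal(loc=5.0, scale=2.0, size=(100, 13))

normalized = cepstral_mean_normalization(mfcc)
```

Because the mean is computed per utterance, CMN cancels any fixed spectral tilt introduced by the recording channel while leaving frame-to-frame dynamics intact, which is why it is a common first defense against session variability.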



Data availability

Data will be made available on reasonable request.


Acknowledgements

This work is supported by the Natural Science Foundation of China under Grants No. 61806078, No. 62076094 and No. 61976091, and by the Shanghai Science and Technology Program "Distributed and generative few-shot algorithm and theory research" under Grant No. 20511100600.

Author information


Corresponding author

Correspondence to Zhe Wang.

Ethics declarations

Conflict of interest

The manuscript has been approved by all authors for publication, and no conflict of interest exists in its submission.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Li, D., Liu, J., Wang, Z. et al. TRSD: A Time-Varying and Region-Changed Speech Database for Speaker Recognition. Circuits Syst Signal Process 41, 3931–3956 (2022). https://doi.org/10.1007/s00034-022-01964-1

