Attacking Speaker Recognition Systems with Phoneme Morphing

  • Conference paper
Computer Security – ESORICS 2019 (ESORICS 2019)

Part of the book series: Lecture Notes in Computer Science (LNSC, volume 11735)

Abstract

As voice interfaces become more widely available, they increasingly implement speaker recognition to provide both personalized functionality and security via authentication. In this paper, we present a method that transforms the voice of one person so that it resembles the voice of a victim and can be used to deceive speaker recognition systems into believing an utterance was spoken by the victim. The transformation requires only short audio recordings of the source and victim voices, and does not require the victim to speak specific words. We show that the attack can be improved by using a population of source voices, and we provide a metric to identify promising source voices within such a population.

We evaluate our attack along a set of dimensions, including the quantity, quality and type of known victim audio; verification and identification systems; white- and black-box models; and both over-the-wire and over-the-air access. We test the audio transformation on two proprietary models, (i) the Azure Speaker Recognition API and (ii) the Siri voice activation of an Apple iPhone, showing that individuals can easily be impersonated with as little as one minute of their audio, even when that audio is recorded in noisy conditions. With attempts from only three source voices, our attack achieves success rates of over 40% in the weakest-assumption scenario against the Azure Verification API and over 80% in all scenarios against Siri.

Notes

  1. Transcript summaries are available in Appendix A.

  2. https://youtube.com/watch?v=BOdLmxy06H0.

  3. https://azure.microsoft.com/en-us/services/cognitive-services/.

  4. We conducted our experiments against the Microsoft APIs in January 2019.

  5. We had to remove one phrase, “Houston we have had a problem”, as participants spoke the phrase as “Houston we have a problem”, a popular misconception.

Acknowledgements

This work was supported by a grant from Mastercard and the Engineering and Physical Sciences Research Council [grant numbers EP/N509711/1 and EP/P00881X/1].

Author information

Corresponding author

Correspondence to Henry Turner.

A Audio Collected

A.1 Commands

Command data comprised both utterances that could be presented to existing systems and commands used specifically by the Azure Speaker Recognition system for verification. The utterances recorded were as follows:

  1. Hey Siri (Repeated 4 times)

  2. Ok Google (Repeated 4 times)

  3. What is the weather like?

  4. What time is it?

  5. Who am I?

  6. How tall is the shard?

  7. My voice is stronger than passwords (Repeated 4 times)

  8. My password is not your business (Repeated 4 times)

  9. Apple juice tastes funny after toothpaste (Repeated 4 times)

  10. Houston we have had a problem (Repeated 4 times)

  11. You can activate security system now (Repeated 4 times)

  12. My voice is my password (Repeated 4 times)
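As a rough illustration, the command list above can be written as a small manifest and used to count recordings per participant. This is only a sketch: the names `COMMANDS` and `recordings_per_participant` are hypothetical, and it assumes that phrases without a "Repeated 4 times" annotation were recorded once. The exclusion reflects the note that the Houston phrase had to be removed.

```python
# Hypothetical manifest of the command utterances: (phrase, repetitions).
# Assumption: phrases with no "Repeated 4 times" annotation were recorded once.
COMMANDS = [
    ("Hey Siri", 4),
    ("Ok Google", 4),
    ("What is the weather like?", 1),
    ("What time is it?", 1),
    ("Who am I?", 1),
    ("How tall is the shard?", 1),
    ("My voice is stronger than passwords", 4),
    ("My password is not your business", 4),
    ("Apple juice tastes funny after toothpaste", 4),
    ("Houston we have had a problem", 4),
    ("You can activate security system now", 4),
    ("My voice is my password", 4),
]


def recordings_per_participant(commands, excluded=()):
    """Sum the repetitions of every phrase that is not excluded."""
    return sum(reps for phrase, reps in commands if phrase not in excluded)


# All listed recordings, then the count after dropping the removed phrase.
total = recordings_per_participant(COMMANDS)
usable = recordings_per_participant(
    COMMANDS, excluded={"Houston we have had a problem"}
)
```

Under these assumptions each participant records 36 command utterances, of which 32 remain once the Houston phrase is excluded.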

A.2 Conference

Conference talk transcripts were obtained from popular TED talks. Each transcript was shortened to approximately the first 6 minutes of the talk and then split into individual utterances, with the participant recording each utterance as a separate audio file. Five conference talk transcripts were used:

  1. Do schools kill creativity? by Sir Ken Robinson - www.ted.com/talks/ken_robinson_says_schools_kill_creativity/transcript

  2. Your body language may shape who you are by Amy Cuddy - www.ted.com/talks/amy_cuddy_your_body_language_shapes_who_you_are/transcript

  3. What makes a good life? by Robert Waldinger - www.ted.com/talks/robert_waldinger_what_makes_a_good_life_lessons_from_the_longest_study_on_happiness/transcript

  4. How great leaders inspire action by Simon Sinek - www.ted.com/talks/simon_sinek_how_great_leaders_inspire_action/transcript

  5. The power of vulnerability by Brené Brown - www.ted.com/talks/brene_brown_on_vulnerability/transcript
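The shortening and splitting of a transcript described above could be sketched as follows. This is a minimal illustration, not the procedure the authors used: the function name `shorten_and_split`, the speaking rate of 150 words per minute, and the choice to treat each sentence as one utterance are all assumptions made for the example.

```python
import re


def shorten_and_split(transcript, minutes=6.0, words_per_minute=150):
    """Keep roughly the first `minutes` of a talk as sentence-level utterances.

    Assumptions (not from the paper): a fixed speaking rate converts the
    time limit into a word budget, and utterances are split at ., ! or ?.
    """
    budget = int(minutes * words_per_minute)
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = [
        s.strip() for s in re.split(r"(?<=[.!?])\s+", transcript) if s.strip()
    ]
    utterances, used = [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        if used + n_words > budget:
            break  # stop before the word budget is exceeded
        utterances.append(sentence)
        used += n_words
    return utterances
```

Each returned string would then correspond to one prompt that a participant reads into a separate audio file.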

A.3 Cafe

Our conversation audio is derived from TED talks in which two people are having a conversation. A single speaker's audio was extracted from each transcript, and the transcript was shortened until it was approximately 6 minutes in length. Five conversation transcripts were used, derived from the following talks:

  1. SpaceX’s plan to fly you across the globe in 30 minutes - Gwynne Shotwell - https://www.ted.com/talks/gwynne_shotwell_spacex_s_plan_to_fly_you_across_the_globe_in_30_minutes/transcript

  2. How Netflix changed entertainment - Reed Hastings - https://www.ted.com/talks/reed_hastings_how_netflix_changed_entertainment_and_where_it_s_headed/transcript

  3. Mammoths resurrected, geoengineering and other thoughts from a futurist - Stewart Brand - https://www.ted.com/talks/stewart_brand_and_chris_anderson_mammoths_resurrected_geoengineering_and_other_thoughts_from_a_futurist/transcript

  4. The future we’re building and boring - Elon Musk - https://www.ted.com/talks/elon_musk_the_future_we_re_building_and_boring/transcript

  5. What everyday citizens can do to claim power on the internet - Fadi Chehadé - https://www.ted.com/talks/fadi_chehade_what_everyday_citizens_can_do_to_claim_power_on_the_internet/transcript

A.4 Enrolment

Enrolment audio was used to enrol individual speakers with the Azure Speaker Recognition API for identification. Participants were asked to read the first six paragraphs of the speech given by UK Prime Minister David Cameron at the start of the London 2012 Olympics. The speech is available on the UK government speeches website: https://www.gov.uk/government/speeches/pms-speech-at-olympics-press-conference

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Turner, H., Lovisotto, G., Martinovic, I. (2019). Attacking Speaker Recognition Systems with Phoneme Morphing. In: Sako, K., Schneider, S., Ryan, P. (eds) Computer Security – ESORICS 2019. ESORICS 2019. Lecture Notes in Computer Science, vol 11735. Springer, Cham. https://doi.org/10.1007/978-3-030-29959-0_23

  • DOI: https://doi.org/10.1007/978-3-030-29959-0_23

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-29958-3

  • Online ISBN: 978-3-030-29959-0

  • eBook Packages: Computer Science, Computer Science (R0)
