Abstract
As voice interfaces become more widely available, they increasingly implement speaker recognition to provide both personalized functionality and security via authentication. In this paper, we present a method that transforms the voice of one person so that it resembles the voice of a victim, such that it can be used to deceive speaker recognition systems into believing an utterance was spoken by the victim. The transformation requires only short audio recordings of the source and victim voices, and does not require specific words to be spoken by the victim. We show that the attack can be improved by using a population of source voices, and we provide a metric to identify promising source voices within such a population.
We evaluate our attack along several dimensions, including: varying quantities, qualities and types of known victim audio; verification and identification systems; white- and black-box models; and both over-the-wire and over-the-air access. We test the audio transformation on two proprietary models: (i) the Azure Speaker Recognition API and (ii) the Siri voice activation of an Apple iPhone, showing that individuals can easily be impersonated with as little as one minute of their audio, even when that audio is recorded in noisy conditions. With attempts from only three source voices, our attack achieves success rates of over 40% in the weakest-assumption scenario against the Azure Verification API and over 80% in all scenarios against Siri.
Notes
1. Transcript summaries are available in Appendix A.
2.
3.
4. We conducted our experiments against the Microsoft APIs in January 2019.
5. We had to remove one phrase, “Houston we have had a problem”, as participants spoke the phrase as “Houston we have a problem”, a popular misconception.
Acknowledgements
This work was supported by a grant from Mastercard and the Engineering and Physical Sciences Research Council [grant numbers EP/N509711/1 and EP/P00881X/1].
A Audio Collected
A.1 Commands
Command data comprised both utterances that could be presented to existing systems and the commands used specifically by the Azure Speaker Recognition system for verification. The utterances recorded were as follows:
1. Hey Siri (repeated 4 times)
2. Ok Google (repeated 4 times)
3. What is the weather like?
4. What time is it?
5. Who am I?
6. How tall is the Shard?
7. My voice is stronger than passwords (repeated 4 times)
8. My password is not your business (repeated 4 times)
9. Apple juice tastes funny after toothpaste (repeated 4 times)
10. Houston we have had a problem (repeated 4 times)
11. You can activate security system now (repeated 4 times)
12. My voice is my password (repeated 4 times)
A.2 Conference
Conference talk transcripts were obtained from popular TED talks. Each transcript was shortened to approximately the first six minutes of the talk and then split into individual utterances, with each utterance recorded as a separate audio file by the participant. The five conference talks used were:
1. Do schools kill creativity? by Sir Ken Robinson - www.ted.com/talks/ken_robinson_says_schools_kill_creativity/transcript
2. Your body language may shape who you are by Amy Cuddy - www.ted.com/talks/amy_cuddy_your_body_language_shapes_who_you_are/transcript
3. What makes a good life? by Robert Waldinger - www.ted.com/talks/robert_waldinger_what_makes_a_good_life_lessons_from_the_longest_study_on_happiness/transcript
4. How great leaders inspire action by Simon Sinek - www.ted.com/talks/simon_sinek_how_great_leaders_inspire_action/transcript
5. The power of vulnerability by Brené Brown - www.ted.com/talks/brene_brown_on_vulnerability/transcript
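As a rough illustration of the transcript preparation described above, the sketch below truncates a transcript to roughly its first six minutes and splits it into sentence-level utterances. The word-rate cutoff and the sentence-splitting heuristic are our own illustrative assumptions, not the exact procedure used in the paper:

```python
import re

# Assumed average speaking rate used to approximate "the first six minutes"
# of a talk in words; this figure is an illustrative assumption.
WORDS_PER_MINUTE = 150


def truncate_transcript(text, minutes=6, wpm=WORDS_PER_MINUTE):
    """Keep roughly the first `minutes` of speech, measured in words."""
    words = text.split()
    return " ".join(words[: minutes * wpm])


def split_into_utterances(text):
    """Naively split on sentence-final punctuation; each piece would then
    be recorded by a participant as a separate audio file."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


transcript = "Good morning. How are you? Great to be here! Thank you."
for i, utterance in enumerate(split_into_utterances(transcript), start=1):
    print(f"utterance_{i:03d}: {utterance}")
```

In practice, utterance boundaries could also be placed at phrase level rather than sentence level; the choice mainly affects how long each recorded file is.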
A.3 Cafe
Our conversation audio is derived from TED talks in which two people hold a conversation. A single speaker's audio was extracted from each transcript, and the transcript was shortened to approximately six minutes in length. The five conversation transcripts were derived from the following talks:
1. SpaceX’s plan to fly you across the globe in 30 minutes - Gwynne Shotwell - https://www.ted.com/talks/gwynne_shotwell_spacex_s_plan_to_fly_you_across_the_globe_in_30_minutes/transcript
2. How Netflix changed entertainment - Reed Hastings - https://www.ted.com/talks/reed_hastings_how_netflix_changed_entertainment_and_where_it_s_headed/transcript
3. Mammoths resurrected, geoengineering and other thoughts from a futurist - Stewart Brand - https://www.ted.com/talks/stewart_brand_and_chris_anderson_mammoths_resurrected_geoengineering_and_other_thoughts_from_a_futurist/transcript
4. The future we’re building and boring - Elon Musk - https://www.ted.com/talks/elon_musk_the_future_we_re_building_and_boring/transcript
5. What everyday citizens can do to claim power on the internet - Fadi Chehadé - https://www.ted.com/talks/fadi_chehade_what_everyday_citizens_can_do_to_claim_power_on_the_internet/transcript
A.4 Enrolment
Enrolment audio was used to enrol individual speakers with the Azure Speaker Recognition API for identification. Participants were asked to read the first six paragraphs of the speech given by UK Prime Minister David Cameron at the start of the London 2012 Olympics. The speech can be found on the UK government speeches website at the following URL: https://www.gov.uk/government/speeches/pms-speech-at-olympics-press-conference
Copyright information
© 2019 Springer Nature Switzerland AG
Cite this paper
Turner, H., Lovisotto, G., Martinovic, I. (2019). Attacking Speaker Recognition Systems with Phoneme Morphing. In: Sako, K., Schneider, S., Ryan, P. (eds) Computer Security – ESORICS 2019. ESORICS 2019. Lecture Notes in Computer Science(), vol 11735. Springer, Cham. https://doi.org/10.1007/978-3-030-29959-0_23
Print ISBN: 978-3-030-29958-3
Online ISBN: 978-3-030-29959-0