Abstract
As voice interfaces become more widely available, they increasingly implement speaker recognition to provide both personalized functionality and security via authentication. In this paper, we present a method that transforms the voice of one person so that it resembles the voice of a victim, such that it can be used to deceive speaker recognition systems into believing an utterance was spoken by the victim. The transformation requires only short audio recordings of the source and victim voices, and does not require specific words to be spoken by the victim. We show that the attack can be improved by using a population of source voices, and we provide a metric to identify promising source voices within such a population.
We evaluate our attack along several dimensions, including: varying quantities, qualities and types of known victim audio; verification and identification systems; white- and black-box models; and both over-the-wire and over-the-air access. We test the audio transformation on two proprietary models: (i) the Azure Speaker Recognition API and (ii) the Siri voice activation of an Apple iPhone, showing that individuals can easily be impersonated with as little as one minute of their audio, even when that audio is recorded in noisy conditions. With attempts from only three source voices, our attack achieves success rates of over 40% in the weakest-assumption scenario against the Azure Verification API and over 80% in all scenarios against Siri.
Notes
1. Transcript summaries are available in Appendix A.
2.
3.
4. We conducted our experiments against the Microsoft APIs in January 2019.
5. We had to remove one phrase, “Houston we have had a problem”, as participants spoke the phrase as “Houston we have a problem”, a popular misconception.
Acknowledgements
This work was supported by a grant from Mastercard and the Engineering and Physical Sciences Research Council [grant numbers EP/N509711/1 and EP/P00881X/1].
A Audio Collected
A.1 Commands
Command data comprised both utterances that could be presented to existing systems and the commands used specifically by the Azure Speaker Recognition system for verification. The utterances recorded were as follows:
1. Hey Siri (repeated 4 times)
2. Ok Google (repeated 4 times)
3. What is the weather like?
4. What time is it?
5. Who am I?
6. How tall is the Shard?
7. My voice is stronger than passwords (repeated 4 times)
8. My password is not your business (repeated 4 times)
9. Apple juice tastes funny after toothpaste (repeated 4 times)
10. Houston we have had a problem (repeated 4 times)
11. You can activate security system now (repeated 4 times)
12. My voice is my password (repeated 4 times)
A.2 Conference
Conference talk transcripts were obtained from popular TED talks. Each transcript was shortened to approximately the first six minutes of the talk and then split into individual utterances, with each utterance recorded as a separate audio file by the participant. The five conference talks used were:
1. Do schools kill creativity? by Sir Ken Robinson - www.ted.com/talks/ken_robinson_says_schools_kill_creativity/transcript
2. Your body language may shape who you are by Amy Cuddy - www.ted.com/talks/amy_cuddy_your_body_language_shapes_who_you_are/transcript
3. What makes a good life? by Robert Waldinger - www.ted.com/talks/robert_waldinger_what_makes_a_good_life_lessons_from_the_longest_study_on_happiness/transcript
4. How great leaders inspire action by Simon Sinek - www.ted.com/talks/simon_sinek_how_great_leaders_inspire_action/transcript
5. The power of vulnerability by Brené Brown - www.ted.com/talks/brene_brown_on_vulnerability/transcript
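As a rough illustration of the transcript preparation described above, the sketch below truncates a transcript to roughly its first six minutes and splits it into sentence-level utterances. The word-rate cutoff and the sentence-splitting heuristic are our own illustrative assumptions, not the exact procedure used in the paper:

```python
import re

# Assumed average speaking rate used to approximate "the first six minutes"
# of a talk in words; this figure is an illustrative assumption.
WORDS_PER_MINUTE = 150


def truncate_transcript(text, minutes=6, wpm=WORDS_PER_MINUTE):
    """Keep roughly the first `minutes` of speech, measured in words."""
    words = text.split()
    return " ".join(words[: minutes * wpm])


def split_into_utterances(text):
    """Naively split on sentence-final punctuation; each piece would then
    be recorded by a participant as a separate audio file."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


transcript = "Good morning. How are you? Great to be here! Thank you."
for i, utterance in enumerate(split_into_utterances(transcript), start=1):
    print(f"utterance_{i:03d}: {utterance}")
```

In practice, utterance boundaries could also be placed at phrase level rather than sentence level; the choice mainly affects how long each recorded file is.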
A.3 Cafe
Our conversation audio is derived from TED talks in which two people hold a conversation. A single speaker's audio was extracted from each transcript, and the transcript was shortened to approximately six minutes in length. The five conversation transcripts were derived from the following talks:
1. SpaceX’s plan to fly you across the globe in 30 minutes - Gwynne Shotwell - https://www.ted.com/talks/gwynne_shotwell_spacex_s_plan_to_fly_you_across_the_globe_in_30_minutes/transcript
2. How Netflix changed entertainment - Reed Hastings - https://www.ted.com/talks/reed_hastings_how_netflix_changed_entertainment_and_where_it_s_headed/transcript
3. Mammoths resurrected, geoengineering and other thoughts from a futurist - Stewart Brand - https://www.ted.com/talks/stewart_brand_and_chris_anderson_mammoths_resurrected_geoengineering_and_other_thoughts_from_a_futurist/transcript
4. The future we’re building and boring - Elon Musk - https://www.ted.com/talks/elon_musk_the_future_we_re_building_and_boring/transcript
5. What everyday citizens can do to claim power on the internet - Fadi Chehadé - https://www.ted.com/talks/fadi_chehade_what_everyday_citizens_can_do_to_claim_power_on_the_internet/transcript
A.4 Enrolment
Enrolment audio was used to enrol individual speakers with the Azure Speaker Recognition API for identification. Participants were asked to read the first six paragraphs of the speech given by UK Prime Minister David Cameron at the start of the London 2012 Olympics. The speech can be found on the UK government speeches website at the following URL: https://www.gov.uk/government/speeches/pms-speech-at-olympics-press-conference
Copyright information
© 2019 Springer Nature Switzerland AG
Cite this paper
Turner, H., Lovisotto, G., Martinovic, I. (2019). Attacking Speaker Recognition Systems with Phoneme Morphing. In: Sako, K., Schneider, S., Ryan, P. (eds) Computer Security – ESORICS 2019. ESORICS 2019. Lecture Notes in Computer Science(), vol 11735. Springer, Cham. https://doi.org/10.1007/978-3-030-29959-0_23
Print ISBN: 978-3-030-29958-3
Online ISBN: 978-3-030-29959-0