Solving Google’s Continuous Audio CAPTCHA with HMM-Based Automatic Speech Recognition

  • Shotaro Sano
  • Takuma Otsuka
  • Hiroshi G. Okuno
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8231)


CAPTCHAs play a critical role in maintaining the security of various Web services by distinguishing humans from automated programs and thereby preventing those services from being abused. CAPTCHAs are designed to block automated programs by presenting questions that are easy for humans but difficult for computers, e.g., recognition of visual digits or audio utterances. Recent audio CAPTCHAs, such as Google’s audio reCAPTCHA, present overlapping and distorted target voices over stationary background noise. We investigate the security of overlapping audio CAPTCHAs by developing a solver for audio reCAPTCHA. Our solver is built on speech recognition techniques using hidden Markov models (HMMs) and is implemented with the off-the-shelf Hidden Markov Model Toolkit (HTK). Our experiments revealed vulnerabilities in the current version of audio reCAPTCHA: the solver cracked 52% of the questions. We further show that the stationary background noise did not enhance security against our solver.


audio CAPTCHA · human interaction proof · reCAPTCHA · automatic speech recognition · hidden Markov model





Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Shotaro Sano (1)
  • Takuma Otsuka (1)
  • Hiroshi G. Okuno (1)

  1. Graduate School of Informatics, Kyoto University, Kyoto, Japan
