Journal on Multimodal User Interfaces

Volume 4, Issue 2, pp 61–79

Automatic fingersign-to-speech translation system

  • Marek Hrúz
  • Pavel Campr
  • Erinç Dikici
  • Ahmet Alp Kındıroğlu
  • Zdeněk Krňoul
  • Alexander Ronzhin
  • Haşim Sak
  • Daniel Schorno
  • Hülya Yalçın
  • Lale Akarun
  • Oya Aran
  • Alexey Karpov
  • Murat Saraçlar
  • Milos Železný
Original Paper


The aim of this paper is to support communication between two people, one hearing impaired and one visually impaired, by converting speech to fingerspelling and fingerspelling to speech. Fingerspelling is a subset of sign language that uses finger signs to spell out the letters of the spoken or written language. We aim to convert fingerspelled words to speech and vice versa. Several spoken and sign languages are considered: English, Russian, Turkish and Czech.


Keywords: Fingerspelling recognition · Speech recognition · Fingerspelling synthesis · Speech synthesis



Copyright information

© OpenInterface Association 2011

Authors and Affiliations

  • Marek Hrúz (1), corresponding author
  • Pavel Campr (1)
  • Erinç Dikici (2)
  • Ahmet Alp Kındıroğlu (2)
  • Zdeněk Krňoul (1)
  • Alexander Ronzhin (3)
  • Haşim Sak (2)
  • Daniel Schorno (5)
  • Hülya Yalçın (2)
  • Lale Akarun (2)
  • Oya Aran (4)
  • Alexey Karpov (3)
  • Murat Saraçlar (2)
  • Milos Železný (1)
  1. Faculty of Applied Sciences, University of West Bohemia, Pilsen, Czech Republic
  2. Bogazici University, Istanbul, Turkey
  3. SPIIRAS Institute, St. Petersburg, Russia
  4. Idiap Research Institute, Martigny, Switzerland
  5. STEIM, Amsterdam, Netherlands