Building the first comprehensive machine-readable Turkish sign language resource: methods, challenges and solutions

Abstract

This article describes the procedures employed during the development of the first comprehensive machine-readable Turkish Sign Language (TiD) resource: a bilingual lexical database and a parallel corpus between Turkish and TiD. In addition to sign-language-specific annotations (such as non-manual markers, classifiers and buoys) following the recently introduced TiD knowledge representation (Eryiğit et al. 2016), the parallel corpus also contains annotations of dependency relations, which makes it the first parallel treebank between a sign language and an auditory-vocal language.

Notes

  1. The Swedish Sign Language Corpus Project (2017) and Östling et al. (2017) present the first dependency treebank for a sign language (Swedish Sign Language).

  2. ELAN (EUDICO Linguistic Annotator) is a professional tool for the creation of complex annotations on video and audio resources and is widely used for sign language annotation. Other sign language annotation platforms also exist, such as iLex (Hanke and Storz 2008) and SignStream (Neidle et al. 2001).

  3. Camgöz et al. (2016) introduce a sign language recognition corpus consisting of TiD signs and phrases from the health and finance domains, and Selçuk-Şimşek and Çiçekli (2017) introduce a parallel dataset that relies solely on word order correspondences between TiD and Turkish.

  4. This MT system translates from written Turkish to avatar-animated TiD.

  5. The convention in sign language and Deaf studies is that the adjective Deaf (with a capital D) is used when it refers to the community, the culture and signers who identify themselves culturally as part of the Deaf community. The adjective deaf (with a small d) is used for the medical condition.

  6. CODA stands for Children of Deaf Adults. This acronym is used in sign language and Deaf studies to identify this special population. CODAs are special in that they are usually brought up as bilinguals: they can use both the sign language of their parents and the local spoken language.

  7. See Eryiğit et al. (2016) for a detailed description of the annotation scheme of TiD signs.

  8. Plural formation in sign languages does not always involve simple concatenative inflection, and the form of the plurals of signs may depend on a number of factors (Kubuş 2008; Steinbach 2012).

  9. The ELAN tier hierarchy for TiD is built on the “included in”, “time subdivision” and “symbolic subdivision” stereotypes, as exemplified in Fig. 2. The reader may refer to the ELAN guidelines at http://www.mpi.nl/corpus/manuals/manual-elan_ug.pdf for further details on tier stereotypes.

  10. Some Turkish sentences were difficult to translate into TiD. For instance, the sentence “How did you express this feeling of yours?” could not be translated directly into TiD, since the Deaf consultants reported that there are no signs for the notions “feeling” and “express”. It was translated as: “How was it? Tell me.” In such cases, the TiD tier also contains this second translation, appended to its gloss within square brackets (e.g. Fig. 7).

  11. In contrast to many other sign language annotation conventions (Crasborn et al. 2015; Johnston 2016), in our annotation scheme manual signs are not annotated based on whether they are articulated with the left/right hand or with the dominant/non-dominant hand. Therefore, two-handed signs, for instance, are annotated only on the MainFlow. We adopted this approach in order not to lose the atomicity of a sign, for the purposes of machine readability.

  12. It should be noted that it was not uncommon in the data for buoys and classifiers to be signed with the dominant hand. In those cases, these signs were annotated in the MainFlow.

  13. Classifiers are iconic signs; however, the iconic representation of an entity with a classifier may change from context to context and from language to language (Perniss et al. 2010; Zwitserlood 2012). In other words, different classifiers may represent the same entity in different contexts. For instance, PENCIL could be expressed with either an entity classifier or a handling classifier. If the handshape has the index finger selected, the index finger represents the pencil as an entity in the context. On the other hand, if the handshape is baby-O, then it represents holding the pencil.

  14. Note that this use of the term “incorporation” differs from its common use in theoretical linguistics, where it is interpreted as a morpho-syntactic operation that combines at least two syntactic heads into a complex word.

  15. Although a very recent study (Sulubacak et al. 2016a) focuses on mapping the Turkish dependency grammar to Universal Dependencies (Nivre et al. 2016), we preferred to follow Sulubacak et al. (2016b) because of our annotators’ experience with this framework, and we left the mapping to Universal Dependencies to future work.
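
As a rough illustration of the tier stereotypes mentioned in note 9, the following Python sketch reads a reduced, hypothetical ELAN-style XML fragment and reports each tier's parent and stereotype. Only the tier name MainFlow comes from the notes above; the other tier and type names are invented for illustration, and the element and attribute names (TIER, PARENT_REF, LINGUISTIC_TYPE, CONSTRAINTS) are assumed to follow the ELAN annotation format. The actual TiD hierarchy is the one given in Fig. 2 of the article and may differ from this sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily reduced ELAN-style fragment. "MainFlow" is the only
# tier name taken from the notes; all other names are invented placeholders.
EAF_SNIPPET = """
<ANNOTATION_DOCUMENT>
  <TIER TIER_ID="TiD" LINGUISTIC_TYPE_REF="utterance"/>
  <TIER TIER_ID="MainFlow" PARENT_REF="TiD" LINGUISTIC_TYPE_REF="sign"/>
  <TIER TIER_ID="NonManual" PARENT_REF="TiD" LINGUISTIC_TYPE_REF="marker"/>
  <TIER TIER_ID="Dependency" PARENT_REF="MainFlow" LINGUISTIC_TYPE_REF="relation"/>
  <LINGUISTIC_TYPE LINGUISTIC_TYPE_ID="utterance"/>
  <LINGUISTIC_TYPE LINGUISTIC_TYPE_ID="sign" CONSTRAINTS="Time_Subdivision"/>
  <LINGUISTIC_TYPE LINGUISTIC_TYPE_ID="marker" CONSTRAINTS="Included_In"/>
  <LINGUISTIC_TYPE LINGUISTIC_TYPE_ID="relation" CONSTRAINTS="Symbolic_Subdivision"/>
</ANNOTATION_DOCUMENT>
"""


def tier_stereotypes(eaf_xml):
    """Map each tier ID to a (parent tier ID, stereotype) pair.

    Top-level tiers have no parent and no stereotype, so both are None.
    """
    root = ET.fromstring(eaf_xml.strip())
    # The stereotype is stored on the linguistic type, not on the tier itself.
    constraints = {
        lt.get("LINGUISTIC_TYPE_ID"): lt.get("CONSTRAINTS")
        for lt in root.iter("LINGUISTIC_TYPE")
    }
    return {
        tier.get("TIER_ID"): (
            tier.get("PARENT_REF"),
            constraints.get(tier.get("LINGUISTIC_TYPE_REF")),
        )
        for tier in root.iter("TIER")
    }


def children_of(hierarchy, parent):
    """Return the sorted IDs of the tiers whose parent is `parent`."""
    return sorted(name for name, (p, _) in hierarchy.items() if p == parent)


if __name__ == "__main__":
    hierarchy = tier_stereotypes(EAF_SNIPPET)
    for name, (parent, stereotype) in hierarchy.items():
        print(name, parent, stereotype)
```

Keeping the stereotype lookup separate from the parent lookup mirrors the assumption that a stereotype belongs to a tier's type rather than to the tier, so several tiers can share one type.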

References

  1. Ahrenberg, L. (2007). Lines: An English-Swedish parallel treebank. In Proceedings of the 16th Nordic conference of computational linguistics, pp. 270–274, Tartu.

  2. Atalay, N. B., Oflazer, K. & Say, B. (2003). The annotation process in the Turkish treebank. In Proceedings of the 4th international workshop on linguistically interpreted corpora, pp. 33–38, Budapest.

  3. Boz, S., Özçelik, U., & Kaygusuz, Ç. (2013). Matematik 1 (4th ed.). Ankara: T.C. Milli Eğitim Bakanlığı Yayınları.

  4. Bungeroth, J. & Ney, H. (2004). Statistical sign language translation. In Proceedings of the 6th workshop on representation and processing of sign languages at the 4th international conference on language resources and evaluation, pp. 105–108, Lisbon.

  5. Bungeroth, J., Stein, D., Dreuw, P., Ney, H., Morrissey, S., Way, A. & Van Zijl, L. (2008). The ATIS Sign Language corpus. In Proceedings of the 6th international conference on language resources and evaluation, pp. 2943–2946, Marrakech.

  6. Bungeroth, J., Stein, D., Dreuw, P., Zahedi, M. & Ney, H. (2006). A German Sign Language corpus of the domain weather report. In Proceedings of the 5th international conference on language resources and evaluation, pp. 2000–2003, Genoa.

  7. Camgöz, N. C., Kindiroglu, A. A., Karabüklü, S., Kelepir, M., Özsoy, A. S. & Akarun, L. (2016). BosphorusSign: A Turkish Sign Language recognition corpus in health and finance domains. In Proceedings of the 10th international conference on language resources and evaluation, pp. 1383–1388, Portorož.

  8. Cmejrek, M., Curín, J., Hajic, J. & Havelka, J. (2005). Prague Czech-English dependency treebank: resource for structure-based MT. In Proceedings of the 11th annual conference of the European Association for Machine Translation, pp. 73–78, Budapest.

  9. Costello, B., Herrmann, A., Mantovan, L., Pfau, R. & Sverrisdottir, R. (2017). Section 3.10.1 Numerals. In Quer, J., Cecchetto, C., Donati, C., Geraci, C., Kelepir, M., Pfau, R. & Steinbach, M. (eds) SignGram Blueprint: A guide to sign language grammar writing, pp. 148–151, de Gruyter, Berlin, Boston.

  10. Crasborn, O. & Sloetjes, H. (2008). Enhanced ELAN functionality for sign language corpora. In Proceedings of the 3rd workshop on the representation and processing of sign languages: Construction and exploitation of sign language corpora at the 6th international conference on language resources and evaluation, pp. 39–43, Marrakech.

  11. Crasborn, O. A. & Zwitserlood, I. (2008). The corpus NGT: An online corpus for professionals and laymen. In Proceedings of the 3rd workshop on the representation and processing of sign languages: Construction and exploitation of sign language corpora at the 6th international conference on language resources and evaluation, pp. 44–49.

  12. Crasborn, O., Bank, R., Zwitserlood, I., van der Kooij, E., de Meijer, A., & Safar, A. (2015). Annotation conventions for the corpus NGT. Ms: Radboud University Nijmegen.

  13. Cuřín, J., Čmejrek, M., Havelka, J. & Kuboň, V. (2004). Building a parallel bilingual syntactically annotated corpus. In Proceedings of the international conference on natural language processing, pp. 168–176, Hyderabad.

  14. Dalkılıç, H., & Gölge, N. (2013). Hayat Bilgisi 1 (4th ed.). Ankara: T.C. Milli Eğitim Bakanlığı Yayınları.

  15. De Vos, C., van Zuilen, M., Crasborn, O. & Levinson, S. (2015). NGT interactive corpus. MPI for psycholinguistics, the language archive, https://hdl.handle.net/1839/00-0000-0000-0021-8357-B@view.

  16. Demiroğlu, R., & Gökahmetoğlu, E. (2013). Türkçe 1 (4th ed.). Ankara: T.C. Milli Eğitim Bakanlığı Yayınları.

  17. DeNeefe, S., Knight, K., Wang, W. & Marcu, D. (2007). What can syntax-based MT learn from phrase-based MT? In Proceedings of the joint conference on empirical methods in natural language processing and computational natural language learning, pp. 755–763, Prague.

  18. Dikyuva, H., Makaroğlu, B., & Arık, E. (2017). Turkish sign language grammar. Ankara: Ministry of Family and Social Policies Press.

  19. Eryiğit, G. (2007a). ITU validation set for Metu-Sabancı Turkish treebank.

  20. Eryiğit, G. (2007b). ITU treebank annotation tool. In Proceedings of the linguistic annotation workshop at the 40th annual meeting on association for computational linguistics, pp. 117–120, Prague.

  21. Eryiğit, G. (2014). ITU Turkish NLP web service. In Proceedings of the demonstrations at the 14th conference of the European chapter of the association for computational linguistics, pp. 1–4, Gothenburg.

  22. Eryiğit, C. (2017). Text to sign language machine translation system for Turkish. Ph.D. thesis, Istanbul Technical University, Istanbul.

  23. Eryiğit, G., Adalı, K., Torunoğlu-Selamet, D., Sulubacak, U. & Pamay, T. (2015). Annotation and extraction of multiword expressions in Turkish treebanks. In Proceedings of the human language technology conference at the North American Chapter of the Association for Computational Linguistics, pp. 70–76, Denver, CO.

  24. Eryiğit, C., Köse, H., Kelepir, M., & Eryiğit, G. (2016). Building machine-readable knowledge representations for Turkish Sign Language generation. Knowledge-Based Systems, 108, 179–194.

  25. Galley, M., Graehl, J., Knight, K., Marcu, D., DeNeefe, S., Wang, W. & Thayer, I. (2006). Scalable inference and training of context-rich syntactic translation models. In Proceedings of the 21st international conference on computational linguistics and the 44th annual meeting of the association for computational linguistics, pp. 961–968, Sydney.

  26. Galley, M., Hopkins, M., Knight, K. & Marcu, D. (2004). What’s in a translation rule? In Proceedings of the human language technology conference at the North American Chapter of the Association for Computational Linguistics, pp. 273–280, Boston, MA.

  27. Hanke, T. & Storz, J. (2008). iLex–a database tool for integrating sign language corpus linguistics and sign language lexicography. In Proceedings of the 3rd workshop on the representation and processing of sign languages at the 6th international conference on language resources and evaluation, pp. 64–67, Marrakech.

  28. Johnston, T. (2008). Corpus linguistics and signed languages: No lemmata, no corpus. In The 3rd workshop on the representation and processing of sign languages: Construction and exploitation of sign language corpora at the 6th international conference on language resources and evaluation, Marrakech.

  29. Johnston, T. (2016). Auslan corpus annotation guidelines. Centre for Language Sciences, Department of Linguistics, Macquarie University (Sydney) and La Trobe University (Melbourne), http://media.auslan.org.au/attachments/Auslan_Corpus_Annotation_Guidelines_November2016.pdf.

  30. Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., et al. (2007). Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th annual meeting of the association for computational linguistics: Companion volume—proceedings of the demo and poster sessions, pp. 177–180, Prague.

  31. Koizumi, A., Sagawa, H. & Takeuchi, M. (2002). An annotated Japanese Sign Language corpus. In Proceedings of the 3rd international conference on language resources and evaluation, pp. 927–930, Las Palmas.

  32. Kubuş, O. (2008). An analysis of Turkish Sign Language (TiD) phonology and morphology. Master’s thesis, Middle East Technical University, Ankara.

  33. Leeson, L., Saeed, J., Leonard, C., Macduff, A. & Byrne-Dunne, D. (2006). Moving heads and moving hands: Developing a digital corpus of Irish Sign Language: The ‘Signs of Ireland’ corpus development project. In Proceedings of the information technology and telecommunications conference, Carlow.

  34. Liddell, S. K. (2003). Grammar, gesture, and meaning in American Sign Language. Cambridge: Cambridge University Press.

  35. McCrae, J., Spohr, D. & Cimiano, P. (2011). Linking lexical resources and ontologies on the semantic web with Lemon. In The semantic web: Research and applications, pp. 245–259, Berlin, Springer.

  36. Megyesi, B., Dahlqvist, B., Pettersson, E. & Nivre, J. (2008). Swedish—Turkish parallel treebank. In Proceedings of the 6th international conference on language resources and evaluation, pp. 470–473, Marrakech.

  37. Miller, C. (2001). Section I: Some reflections on the need for a common sign notation. Sign Language & Linguistics, 4(1), 11–28.

  38. Neidle, C., Sclaroff, S., & Athitsos, V. (2001). Signstream: A tool for linguistic and computer vision research on visual-gestural language data. Behavior Research Methods, Instruments, & Computers, 33(3), 311–320. https://doi.org/10.3758/BF03195384.

  39. Nivre, J., de Marneffe, M.-C., Ginter, F., Goldberg, Y., Hajic, J., Manning, C. D., McDonald, R., Petrov, S., Pyysalo, S., Silveira, N., et al. (2016). Universal dependencies v1: A multilingual treebank collection. In Proceedings of the 10th international conference on language resources and evaluation, pp. 1659–1666, Portorož.

  40. Oflazer, K., Say, B., Hakkani-Tür, D. Z., & Tür, G. (2003). Building a Turkish treebank. In A. Abeillé (Ed.), Treebanks: Building and using parsed corpora (pp. 261–277). London: Kluwer.

  41. Östling, R., Börstell, C., Gärdenfors, M. & Wirén, M. (2017). Universal dependencies for Swedish Sign Language. In Proceedings of the 21st Nordic conference on computational linguistics, pp. 303–308, Gothenburg.

  42. Othman, A., Tmar, Z. & Jemni, M. (2012). Toward developing a very big sign language parallel corpus. In Proceedings of the international conference on computers for handicapped persons, pp. 192–199, Paris.

  43. Özsoy, S., Arık, E., Göksel, A., Kelepir, M. & Nuhbalaoğlu, D. (2013). Documenting Turkish sign language: A report on a research project. In Current directions in TiD research, pp. 55–70, Cambridge Scholars.

  44. Pamay, T., Sulubacak, U., Torunoğlu-Selamet, D. & Eryiğit, G. (2015). The annotation process of the ITU web treebank. In Proceedings of the 9th linguistic annotation workshop at the North American Chapter of the Association for computational linguistics, pp. 95–101, Denver, CO.

  45. Perniss, P., Thompson, R. L., & Vigliocco, G. (2010). Iconicity as a general property of language: Evidence from spoken and signed languages. Frontiers in Psychology, 227(1), 1–15.

  46. Pfau, R. & Quer, J. (2010). Nonmanuals: Their prosodic and grammatical roles. In Sign languages, pp. 381–402. Cambridge, Cambridge University Press.

  47. Prillwitz, S., Hanke, T., König, S., Konrad, R., Langer, G. & Schwarz, A. (2008). DGS corpus project–development of a corpus based electronic dictionary German Sign Language/German. In Proceedings of the 3rd workshop on the representation and processing of sign languages at the 6th international conference on language resources and evaluation, pp. 159–164, Marrakech.

  48. Şahin, M., Sulubacak, U. & Eryiğit, G. (2013). Redefinition of Turkish morphology using flag diacritics. In Proceedings of the 10th symposium on natural language processing, Phuket.

  49. Schembri, A., Fenlon, J., Rentelis, R., Reynolds, S., & Cormier, K. (2013). Building the British Sign Language corpus. Language Documentation & Conservation, 7, 136–154.

  50. Selçuk-Şimşek, M., & Çiçekli, I. (2017). Bidirectional machine translation between Turkish and Turkish Sign Language: A data-driven approach. International Journal on Natural Language Computing, 6(3), 33–46.

  51. Steinbach, M. (2012). Plurality. In Sign language: An international handbook, pp. 112–136, De Gruyter Mouton.

  52. Sulubacak, U. & Eryiğit, G. (2013). Representation of morphosyntactic units and coordination structures in the Turkish dependency treebank. In Proceedings of the 4th workshop on statistical parsing of morphologically rich languages at the conference on empirical methods on natural language processing, p. 129, Seattle, WA.

  53. Sulubacak, U., Gokirmak, M., Tyers, F., Çöltekin, Ç., Nivre, J. & Eryiğit, G. (2016a). Universal dependencies for Turkish. In Proceedings of the 26th international conference on computational linguistics, pp. 3444–3454, Osaka.

  54. Sulubacak, U., Pamay, T. & Eryiğit, G. (2016b). IMST: A revisited Turkish dependency treebank. In Proceedings of the 1st international conference on turkic computational linguistics at the international conference on computational linguistics and intelligent text processing, pp. 1–6, Konya.

  55. Sulubacak, U., & Eryigit, G. (2018). Implementing universal dependency, morphology, and multiword expression annotation standards for Turkish language processing. Turkish Journal of Electrical Engineering & Computer Sciences, 26(3), 1662–1672.

  56. Su, H.-Y., & Wu, C.-H. (2009). Improving structural statistical machine translation for sign language with small corpus using thematic role templates as translation memory. IEEE Transactions on Audio, Speech, and Language Processing, 17(7), 1305–1315.

  57. Swedish Sign Language Corpus Project, U. D. (2017). Universal dependencies for Swedish Sign Language. Stockholm University, https://www.ling.su.se/english/research/research-projects/sign-language/swedish-sign-language-corpus-project-1.59270.

  58. Tesnière, L. (1959). Eléments de syntaxe structurale. Paris: Librairie C. Klincksieck.

  59. Tinsley, J., Hearne, M. & Way, A. (2009). Exploiting parallel treebanks to improve phrase-based statistical machine translation. In Proceedings of the international conference on intelligent text processing and computational linguistics, pp. 318–331, Mexico City.

  60. Uchimoto, K., Zhang, Y., Sudo, K., Murata, M., Sekine, S. & Isahara, H. (2004). Multilingual aligned parallel treebank corpus reflecting contextual information and its applications. In Proceedings of the workshop on multilingual linguistic resources at the 42nd annual meeting on association of computational linguistics, pp. 63–70, Barcelona.

  61. Wallin, L. & Mesch, J. (2015). Swedish sign language corpus. In Proceedings of digging into signs workshop: Developing annotation standards for sign language corpora, London.

  62. Wittenburg, P., Brugman, H., Russel, A., Klassmann, A. & Sloetjes, H. (2006). ELAN: A professional framework for multimodality research. In Proceedings of the 5th international conference on language resources and evaluation, pp. 1556–1559, Genoa.

  63. Zwitserlood, I. (2012). Classifiers. In Sign languages: An international handbook, pp. 158–186, Mouton de Gruyter.

  64. Zwitserlood, I., Perniss, P., & Özyürek, A. (2012). An empirical investigation of expression of multiple entities in Turkish Sign Language (TİD): Considering the effects of modality. Lingua, 122(14), 1636–1667.

Acknowledgements

We are grateful for the support of our signers Jale Erdul, Elvan Tamyürek Özparlak and Neslihan Kurt, of our project advisors Prof. Dr. Sumru Özsoy and Hasan Dikyuva, and of our project members Pınar Uluer, Neziha Akalın, Kenan Kasarcı, Nevzat Kırgıç and Cüneyd Ancın. Finally, we want to thank our three reviewers for insightful comments and suggestions that helped us improve the final version of the article.

Author information

Corresponding author

Correspondence to Gülşen Eryiğit.

Additional information

This research is supported under the project “A Signing Avatar System for Turkish to Turkish Sign Language Machine Translation” by The Scientific and Technological Research Council of Turkey (TUBITAK) with a 1003 Grant (No. 114E263) and under the project “Sign-Hub” by the European Union’s Horizon 2020 research and innovation programme under Grant Agreement No. 693349.

The convention in sign linguistics is to use the acronyms of sign languages as they are used by the Deaf community, namely, with the capital letters of the sign language name in the local spoken language. Thus, TiD represents the first letters of the Turkish words Türk İşaret Dili ‘Turkish Sign Language’.

Appendix

See Tables 3 and 4.

Table 3 Distribution of the parts-of-speech tags in the TiD corpus
Table 4 Distribution of the non-manual markers in the TiD corpus

About this article

Cite this article

Eryiğit, G., Eryiğit, C., Karabüklü, S. et al. Building the first comprehensive machine-readable Turkish sign language resource: methods, challenges and solutions. Lang Resources & Evaluation 54, 97–121 (2020). https://doi.org/10.1007/s10579-019-09465-5

Keywords

  • Turkish sign language
  • TiD
  • Parallel dependency treebank
  • Turkish
  • Machine-readable
  • Parallel corpus