Language Resources and Evaluation, Volume 47, Issue 4, pp 973–1005

The Hebrew CHILDES corpus: transcription and morphological analysis

  • Aviad Albert
  • Brian MacWhinney
  • Bracha Nir
  • Shuly Wintner
Original Paper


We present a corpus of transcribed spoken Hebrew that reflects spoken interactions between children and adults. The corpus is an integral part of the CHILDES database, which distributes similar corpora for over 25 languages. We introduce a dedicated transcription scheme for the spoken Hebrew data that is sensitive to both the phonology and the standard orthography of the language. We also introduce a morphological analyzer that was specifically developed for this corpus. The analyzer adequately covers the entire corpus, producing detailed correct analyses for all tokens. Evaluation on a new corpus reveals high coverage as well. Finally, we describe a morphological disambiguation module that selects the correct analysis of each token in context. The result is a high-quality morphologically-annotated CHILDES corpus of Hebrew, along with a set of tools that can be applied to new corpora.


CHILDES, Hebrew, Transcription of spoken language, Morphological analysis, Morphological disambiguation



This research was supported by Grant No. 2007241 from the United States-Israel Binational Science Foundation (BSF). We are grateful to Hadass Zaidenberg, Maayan Bloch and Ezer Rasin for their meticulous lexicographic work, to Arnon Lazerson for developing the conversion script, and to Shai Gretz for helping with the manual annotation.



Copyright information

© Springer Science+Business Media Dordrecht 2013

Authors and Affiliations

  • Aviad Albert (1)
  • Brian MacWhinney (2)
  • Bracha Nir (3)
  • Shuly Wintner (4)

  1. Department of Linguistics, Tel Aviv University, Ramat Aviv, Israel
  2. Department of Psychology, Carnegie Mellon University, Pittsburgh, USA
  3. Department of Communication Sciences and Disorders, University of Haifa, Haifa, Israel
  4. Department of Computer Science, University of Haifa, Haifa, Israel
