The Hebrew CHILDES corpus: transcription and morphological analysis

  • Original Paper
Language Resources and Evaluation

Abstract

We present a corpus of transcribed spoken Hebrew that reflects spoken interactions between children and adults. The corpus is an integral part of the CHILDES database, which distributes similar corpora for over 25 languages. We introduce a dedicated transcription scheme for the spoken Hebrew data that is sensitive to both the phonology and the standard orthography of the language. We also introduce a morphological analyzer that was specifically developed for this corpus. The analyzer adequately covers the entire corpus, producing detailed correct analyses for all tokens. Evaluation on a new corpus reveals high coverage as well. Finally, we describe a morphological disambiguation module that selects the correct analysis of each token in context. The result is a high-quality morphologically-annotated CHILDES corpus of Hebrew, along with a set of tools that can be applied to new corpora.

Notes

  1. The MOR program was initially developed by Roland Hausser and Mitzi Morris. It is described in detail in Hausser (1989).

  2. Diacritics are produced through addition of overprinting Unicode characters and are not single Unicode characters.

  3. There are two other verbal templatic patterns, the passive counterparts of two of the five major binyanim. These are fully predictable from their active counterparts.

  4. Note that the features scat, root and ptn are the general verbal features that are propagated to the output (syntactic category, consonantal root and pattern, respectively). Unlike these, the features part, past, fut and imp are only required for the proper A-rule match within the designated subsections (participle/present tense, past tense, future tense and imperative forms, respectively).

  5. The categories are adjective, adverb, communicator, copula, existential, negation, numeral, onomatopoeia, preposition, pronoun, punctuation, quantifier, question, unknown, verb and vocalization.

  6. We did not measure inter-coder agreement, but we estimate that more than 90% of the ambiguous tokens were identically annotated by both lexicographers. Consolidating the differences was a quick and easy task.
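
Note 2 can be illustrated with a minimal Python sketch. It assumes that the overprinting characters are Unicode combining marks, and it uses a hypothetical token (the letter s followed by U+030C COMBINING CARON); the actual diacritics of the transcription scheme are not listed in this section. The helper name `group_combining` is an illustration only and is not part of the CHILDES tools.

```python
import unicodedata


def group_combining(text):
    """Group each base character with the combining marks that follow it."""
    groups = []
    for ch in text:
        # unicodedata.combining() returns a non-zero class for combining marks.
        if groups and unicodedata.combining(ch) != 0:
            groups[-1] += ch
        else:
            groups.append(ch)
    return groups


# Hypothetical token: "s" + COMBINING CARON, followed by "a" and "m".
token = "s\u030cam"
print(len(token))                   # 4 code points
print(len(group_combining(token)))  # 3 base-plus-diacritic units
```

Under this assumption, a diacritic-bearing letter occupies more than one code point, so character counts and alignments are likely safer when computed on the grouped units than on raw code points.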

References

  • Adam, G. (2002). From variable to optimal grammar: Evidence from language acquisition and language change. PhD thesis, Tel Aviv University.

  • Albert, A., Nir, B., MacWhinney, B., & Wintner, S. (2011). A morphologically-analyzed CHILDES corpus of Hebrew. Presented at the International Association for the Study of Child Language (IASCL).

  • Albert, A., Nir, B., MacWhinney, B., & Wintner, S. (2012). A morphologically annotated Hebrew CHILDES corpus. In Proceedings of the EACL-2012 workshop on computational models of language acquisition and loss.

  • Bannard, C., Lieven, E., & Tomasello, M. (2009). Early grammatical development is piecemeal and lexically specific. Proceedings of the National Academy of Sciences, 106(41), 17284–17289.

  • Bat-El, O. (1994). Stem modification and cluster transfer in modern Hebrew. Natural Language and Linguistic Theory, 12, 571–593.

  • Berman, R. A. (1979). Lexical decomposition and lexical unity in the expression of derived verbal categories in modern Hebrew. Afroasiatic Linguistics, 6, 1–26.

  • Berman, R. A. (1981). Language development and language knowledge: Evidence from the acquisition of Hebrew morphophonology. Journal of Child Language, 8, 609–626.

  • Berman, R. A. (1985). The acquisition of Hebrew. In D. I. Slobin (Ed.), The crosslinguistic study of language acquisition (pp. 255–372). Hillsdale, NJ: Lawrence Erlbaum Associates.

  • Berman, R. A. (2009). Children’s acquisition of compound constructions. In R. Lieber & P. Stekauer (Eds.), The Oxford handbook of compounding. USA: Oxford University Press.

  • Berman, R. A., & Ravid, D. (1986). Lexicalization of noun compounds. Hebrew Linguistics, 24, 5–22 (In Hebrew).

  • Berman, R. A., & Weissenborn, J. (1991). Acquisition of word order: A crosslinguistic study. Final Report. German-Israel Foundation for Research and Development (GIF).

  • Borensztajn, G., Zuidema, W., & Bod, R. (2009). Children’s grammars grow more abstract with age—evidence from an automatic procedure for identifying the productive units of language. Topics in Cognitive Science, 1, 175–188.

  • Borer, H. (1988). On the morphological parallelism between compounds and constructs. In G. Booij & J. van Marle (Eds.), Yearbook of morphology 1 (pp. 45–65). Dordrecht, Holland: Foris Publications.

  • Borer, H. (1996). The construct in review. In J. Lecarme, J. Lowenstamm & U. Shlonsky (Eds.), Studies in Afroasiatic grammar (pp. 30–61). The Hague: Holland Academic Graphics.

  • Brown, R. (1973). A first language: The early stages. Cambridge, MA: Harvard University Press.

  • Clark, E. V., & Berman, R. A. (1987). Types of linguistic knowledge: Interpreting and producing compound nouns. Journal of Child Language, 14(03), 547–567. doi:10.1017/S030500090001028X.

  • Crystal, D., Fletcher, P. J., & Garman, M. (1976). The grammatical analysis of language disability: A procedure for assessment and remediation. London: Edward Arnold. ISBN 0713158425.

  • Freudenthal, D., Pine, J., & Gobet, F. (2010). Explaining quantitative variation in the rate of optional infinitive errors across languages: A comparison of MOSAIC and the variational learning model. Journal of Child Language, 37(3), 643–669. ISSN 1469-7602. URL http://www.biomedsearch.com/nih/Explaining-quantitative-variation-in-rate/20334719.html.

  • Hausser, R. R. (1989). Principles of computational morphology. Technical report, Center for Machine Translation, Carnegie Mellon University.

  • Itai, A., & Wintner, S. (2008). Language resources for Hebrew. Language Resources and Evaluation, 42(1), 75–98.

  • Leben, W. R. (1973). Suprasegmental phonology. PhD thesis, Massachusetts Institute of Technology.

  • Leben, W. R. (1978). The representation of tone. In V. Fromkin (Ed.), Tone: A linguistic survey (pp. 177–220). New York: Academic.

  • Lee, L. L. (1974). Developmental sentence analysis. Evanston, IL: Northwestern University Press.

  • MacWhinney, B. (1996). The CHILDES system. American Journal of Speech-Language Pathology, 5, 5–14.

  • MacWhinney, B. (2000). The CHILDES project: Tools for analyzing talk (3rd ed.). Mahwah, NJ: Lawrence Erlbaum Associates.

  • MacWhinney, B. (2008). Enriching CHILDES for morphosyntactic analysis. In H. Behrens (Ed.), Corpora in language acquisition research: History, methods, perspectives (Trends in Language Acquisition Research, Vol. 6). Amsterdam: Benjamins.

  • McCarthy, J. J. (1986). OCP effects: Gemination and antigemination. Linguistic Inquiry, 17, 207–263.

  • Miller, J., & Chapman, R. (1983). SALT: Systematic analysis of language transcripts, user’s manual. Madison, WI: University of Wisconsin Press.

  • Miyata, S., Hirakawa, M., Itoh, K., MacWhinney, B., Oshima-Takane, Y., Otomo, K., et al. (2009). Constructing a new language measure for Japanese: Developmental sentence scoring for Japanese. In S. Miyata (Ed.), Development of a developmental index of Japanese and its application to speech developmental disorders. Report of the Grant-in-Aid for Scientific Research (B) (2006–2008) No. 18330141, pp. 15–66. Nagoya, Japan: Aichi Shukutoku University.

  • Miyata, S., & MacWhinney, B. (2011). The development of parallel language measures: The example of Japanese DSSJ. Presented at the International Association for the Study of Child Language (IASCL).

  • Nir, B., & Berman, R. A. (2010). Parts of speech as constructions: The case of Hebrew ‘adverbs’. Constructions and Frames, 2(2), 242–274.

  • Nir, B., MacWhinney, B., & Wintner, S. (2010). A morphologically-analyzed CHILDES corpus of Hebrew. In Proceedings of the seventh conference on international language resources and evaluation (LREC’10) (pp. 1487–1490). European Language Resources Association (ELRA). ISBN 2-9517408-6-7.

  • Ordan, N., & Wintner, S. (2005). Representing natural gender in multilingual lexical databases. International Journal of Lexicography, 18(3), 357–370.

  • Ornan, U. (1986). Phonemic script: A central vehicle for processing natural language—the case of Hebrew. Technical Report 88.181, IBM Research Center, Haifa, Israel.

  • Ornan, U. (1994). Basic concepts in “Romanization” of scripts. Technical Report LCL 94-5, Laboratory for Computational Linguistics, Technion, Haifa, Israel.

  • Ornan, U., & Katz, M. (1995). A new program for Hebrew index based on the Phonemic Script. Technical Report LCL 94-7, Laboratory for Computational Linguistics, Technion, Haifa, Israel.

  • Ravid, D. (2012). Spelling morphology: The psycholinguistics of Hebrew spelling. Berlin: Springer.

  • Ravid, D., Dressler, W. U., Nir-Sagiv, B., Korecky-Kröll, K., Souman, A., Rehfeldt, K., et al. (2008). Core morphology in child directed speech: Crosslinguistic corpus analyses of noun plurals. In H. Behrens (Ed.), Corpora in language acquisition research: Finding structure in data (pp. 25–60). Amsterdam: John Benjamins.

  • Sag, I., Baldwin, T., Bond, F., Copestake, A., & Flickinger, D. (2002). Multiword expressions: A pain in the neck for NLP. In Proceedings of the third international conference on intelligent text processing and computational linguistics (CICLING 2002), Mexico City, Mexico, pp. 1–15.

  • Sagae, K., Davis, E., Lavie, A., MacWhinney, B., & Wintner, S. (2007). High-accuracy annotation and parsing of CHILDES transcripts. In Proceedings of the ACL-2007 workshop on cognitive aspects of computational language acquisition (pp. 25–32), Prague, Czech Republic, June 2007. Association for Computational Linguistics. http://www.aclweb.org/anthology/W/W07/W07-0604.

  • Sagae, K., Davis, E., Lavie, A., MacWhinney, B., & Wintner, S. (2010). Morphosyntactic annotation of CHILDES transcripts. Journal of Child Language, 37(3), 705–729. doi:10.1017/S0305000909990407.

  • Sagae, K., MacWhinney, B., & Lavie, A. (2004). Automatic parsing of parent-child interactions. Behavior Research Methods, Instruments, and Computers, 36, 113–126.

  • Scarborough, H. S. (1990). Index of productive syntax. Applied Psycholinguistics, 11, 1–22.

  • Shimron, J. (Ed.). (2003). Language processing and acquisition in languages of Semitic, root-based, morphology (Language Acquisition and Language Disorders, No. 28). John Benjamins.

  • Slobin, D. I. (1985). The crosslinguistic study of language acquisition: The data. The crosslinguistic study of language acquisition. Hillsdale, NJ: Lawrence Erlbaum Associates. ISBN 9780898593679.

  • Ussishkin, A. (1999). The inadequacy of the consonantal root: Modern Hebrew denominal verbs and output–output correspondence. Phonology, 16(03), 401–442.

  • Waterfall, H. R., Sandbank, B., Onnis, L., & Edelman, S. (2010). An empirical generative framework for computational modeling of language acquisition. Journal of Child Language, 37(3), 671–703.

  • Wintner, S. (2004). Hebrew computational linguistics: Past and future. Artificial Intelligence Review, 21(2), 113–138. doi:10.1023/B:AIRE.0000020865.73561.bc.

  • Yona, S., & Wintner, S. (2008). A finite-state morphological grammar of Hebrew. Natural Language Engineering, 14(2), 173–190.

Acknowledgments

This research was supported by Grant No. 2007241 from the United States-Israel Binational Science Foundation (BSF). We are grateful to Hadass Zaidenberg, Maayan Bloch and Ezer Rasin for their meticulous lexicographic work, to Arnon Lazerson for developing the conversion script, and to Shai Gretz for helping with the manual annotation.

Author information

Corresponding author

Correspondence to Shuly Wintner.

About this article

Cite this article

Albert, A., MacWhinney, B., Nir, B. et al. The Hebrew CHILDES corpus: transcription and morphological analysis. Lang Resources & Evaluation 47, 973–1005 (2013). https://doi.org/10.1007/s10579-012-9214-z

  • DOI: https://doi.org/10.1007/s10579-012-9214-z
