Abstract
We present a corpus of transcribed spoken Hebrew that reflects spoken interactions between children and adults. The corpus is an integral part of the CHILDES database, which distributes similar corpora for over 25 languages. We introduce a dedicated transcription scheme for the spoken Hebrew data that is sensitive to both the phonology and the standard orthography of the language. We also introduce a morphological analyzer that was specifically developed for this corpus. The analyzer adequately covers the entire corpus, producing detailed correct analyses for all tokens. Evaluation on a new corpus reveals high coverage as well. Finally, we describe a morphological disambiguation module that selects the correct analysis of each token in context. The result is a high-quality morphologically-annotated CHILDES corpus of Hebrew, along with a set of tools that can be applied to new corpora.
Notes
The MOR program was initially developed by Roland Hausser and Mitzi Morris. It is described in detail in Hausser (1989).
Diacritics are produced through addition of overprinting Unicode characters and are not single Unicode characters.
There are two other verbal templatic patterns, the passive counterparts of two of the five major binyanim. These are fully predictable from their active counterparts.
Note that the features scat, root and ptn are the general verbal features that are propagated to the output (syntactic category, consonantal root and pattern, respectively). Unlike these, the features part, past, fut and imp are only required for the proper A-rule match within the designated subsections (participle/present tense, past tense, future tense and imperative forms, respectively).
The categories are adjective, adverb, communicator, copula, existential, negation, numeral, onomatopoeia, preposition, pronoun, punctuation, quantifier, question, unknown, verb and vocalization.
We did not measure inter-coder agreement, but we estimate that more than 90% of the ambiguous tokens were identically annotated by both lexicographers. Consolidating the differences was a quick and easy task.
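As a minimal illustration of the note above on diacritics, the following Python sketch shows how a base letter followed by a combining (overprinting) mark occupies two code points rather than one. The token and the choice of U+030C COMBINING CARON are hypothetical examples and are not taken from the corpus; `split_marks` is an illustrative helper, not part of any CHILDES tool.

```python
import unicodedata


def split_marks(text):
    """Separate base characters from combining (overprinting) marks."""
    # unicodedata.combining() returns 0 for base characters and a
    # non-zero canonical combining class for combining marks.
    base = [ch for ch in text if unicodedata.combining(ch) == 0]
    marks = [ch for ch in text if unicodedata.combining(ch) != 0]
    return "".join(base), marks


# Hypothetical token: "s" followed by U+030C COMBINING CARON.
token = "s\u030c"
base, marks = split_marks(token)

print(len(token))                            # number of code points in the token
print(base)                                  # base letters only
print([unicodedata.name(m) for m in marks])  # names of the overprinting marks
```

Scripts that count characters or compare strings in such transcripts may need to account for this, since one visible glyph can span several code points.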
References
Adam, G. (2002). From variable to optimal grammar: Evidence from language acquisition and language change. PhD thesis, Tel Aviv University.
Albert, A., Nir, B., MacWhinney, B., & Wintner, S. (2011). A morphologically-analyzed CHILDES corpus of Hebrew. Presented at The International Association for the Study of Child Language (IASCL).
Albert, A., Nir, B., MacWhinney, B., & Wintner, S. (2012). A morphologically annotated Hebrew CHILDES corpus. In Proceedings of the EACL-2012 workshop on computational models of language acquisition and loss.
Bannard, C., Lieven, E., & Tomasello, M. (2009). Early grammatical development is piecemeal and lexically specific. Proceedings of the National Academy of Sciences, 106(41), 17284–17289.
Bat-El, O. (1994). Stem modification and cluster transfer in Modern Hebrew. Natural Language and Linguistic Theory, 12, 571–593.
Berman, R. A. (1979). Lexical decomposition and lexical unity in the expression of derived verbal categories in Modern Hebrew. Afroasiatic Linguistics, 6, 1–26.
Berman, R. A. (1981). Language development and language knowledge: Evidence from the acquisition of Hebrew morphophonology. Journal of Child Language, 8, 609–626.
Berman, R. A. (1985). The acquisition of Hebrew. In D. I. Slobin (Ed.), The crosslinguistic study of language acquisition (pp. 255–372). Hillsdale, NJ: Lawrence Erlbaum Associates.
Berman, R. A. (2009). Children's acquisition of compound constructions. In R. Lieber & P. Stekauer (Eds.), The Oxford handbook of compounding. USA: Oxford University Press.
Berman, R. A., & Ravid, D. (1986). Lexicalization of noun compounds. Hebrew Linguistics, 24, 5–22 (In Hebrew).
Berman, R. A., & Weissenborn, J. (1991). Acquisition of word order: A crosslinguistic study. Final Report. German-Israel Foundation for Research and Development (GIF).
Borensztajn, G., Zuidema, W., & Bod, R. (2009). Children’s grammars grow more abstract with age—evidence from an automatic procedure for identifying the productive units of language. Topics in Cognitive Science, 1, 175–188.
Borer, H. (1988). On the morphological parallelism between compounds and constructs. In G. Booij & J. van Marle (Eds.), Yearbook of morphology 1 (pp. 45–65). Dordrecht, Holland: Foris Publications.
Borer, H. (1996). The construct in review. In J. Lecarme, J. Lowenstamm & U. Shlonsky (Eds.), Studies in Afroasiatic grammar (pp. 30–61). The Hague: Holland Academic Graphics.
Brown, R. (1973). A first language: The early stages. Cambridge, MA: Harvard University Press.
Clark, E. V., & Berman, R. A. (1987). Types of linguistic knowledge: Interpreting and producing compound nouns. Journal of Child Language, 14(03), 547–567. doi:10.1017/S030500090001028X.
Crystal, D., Fletcher, P. J., & Garman, M. (1976). The grammatical analysis of language disability: A procedure for assessment and remediation. London: Edward Arnold. ISBN 0713158425.
Freudenthal, D., Pine, J., & Gobet, F. (2010). Explaining quantitative variation in the rate of optional infinitive errors across languages: A comparison of MOSAIC and the variational learning model. Journal of Child Language, 37(3), 643–669. ISSN 1469-7602. URL http://www.biomedsearch.com/nih/Explaining-quantitative-variation-in-rate/20334719.html.
Hausser, R. R. (1989). Principles of computational morphology. Technical report, Center for Machine Translation, Carnegie Mellon University.
Itai, A., & Wintner, S. (2008). Language resources for Hebrew. Language Resources and Evaluation, 42(1), 75–98.
Leben, W. R. (1973). Suprasegmental phonology. PhD thesis, Massachusetts Institute of Technology.
Leben, W. R. (1978). The representation of tone. In V. Fromkin (Ed.), Tone: A linguistic survey (pp. 177–220). New York: Academic.
Lee, L. L. (1974). Developmental sentence analysis. Evanston, IL: Northwestern University Press.
MacWhinney, B. (1996). The CHILDES system. American Journal of Speech Language Pathology, 5, 5–14.
MacWhinney, B. (2000). The CHILDES project: Tools for analyzing talk (3rd ed.). Mahwah, NJ: Lawrence Erlbaum Associates.
MacWhinney, B. (2008). Enriching CHILDES for morphosyntactic analysis. In H. Behrens (Ed.), Corpora in language acquisition research: History, methods, perspectives (Trends in Language Acquisition Research, Vol. 6). Amsterdam: John Benjamins.
McCarthy, J. J. (1986). OCP effects: Gemination and antigemination. Linguistic Inquiry, 17, 207–263.
Miller, J., & Chapman, R. (1983). SALT: Systematic analysis of language transcripts, user’s manual. Madison, WI: University of Wisconsin Press.
Miyata, S., Hirakawa, M., Itoh, K., MacWhinney, B., Oshima-Takane, Y., Otomo, K., et al. (2009). Constructing a new language measure for Japanese: Developmental sentence scoring for Japanese. In S. Miyata (Ed.), Development of a developmental index of Japanese and its application to speech developmental disorders. Report of the Grant-in-Aid for Scientific Research (B) (2006–2008) No. 18330141, pp. 15–66. Nagoya, Japan: Aichi Shukutoku University.
Miyata, S., & MacWhinney, B. (2011). The development of parallel language measures: The example of Japanese DSSJ. Presented at The International Association for the Study of Child Language (IASCL).
Nir, B., & Berman, R. A. (2010). Parts of speech as constructions: The case of Hebrew ‘adverbs’. Constructions and Frames, 2(2), 242–274.
Nir, B., MacWhinney, B., & Wintner, S. (2010). A morphologically-analyzed CHILDES corpus of Hebrew. In Proceedings of the seventh conference on international language resources and evaluation (LREC’10) (pp. 1487–1490). European Language Resources Association (ELRA). ISBN 2-9517408-6-7.
Ordan, N., & Wintner, S. (2005). Representing natural gender in multilingual lexical databases. International Journal of Lexicography, 18(3), 357–370.
Ornan, U. (1986). Phonemic script: A central vehicle for processing natural language—the case of Hebrew. Technical Report 88.181, IBM Research Center, Haifa, Israel.
Ornan, U. (1994). Basic concepts in “Romanization” of scripts. Technical Report LCL 94-5, Laboratory for Computational Linguistics, Technion, Haifa, Israel.
Ornan, U., & Katz, M. (1995). A new program for Hebrew index based on the Phonemic Script. Technical Report LCL 94-7, Laboratory for Computational Linguistics, Technion, Haifa, Israel.
Ravid, D. (2012). Spelling morphology: The psycholinguistics of Hebrew spelling. Berlin: Springer.
Ravid, D., Dressler, W. U., Nir-Sagiv, B., Korecky-Kröll, K., Souman, A., Rehfeldt, K., et al. (2008). Core morphology in child directed speech: Crosslinguistic corpus analyses of noun plurals. In H. Behrens (Ed.), Corpora in language acquisition research: Finding structure in data (pp. 25–60). Amsterdam: John Benjamins.
Sag, I., Baldwin, T., Bond, F., Copestake, A., & Flickinger, D. (2002). Multiword expressions: A pain in the neck for NLP. In Proceedings of the third international conference on intelligent text processing and computational linguistics (CICLING 2002), Mexico City, Mexico, pp. 1–15.
Sagae, K., Davis, E., Lavie, A., MacWhinney, B., & Wintner, S. (2007). High-accuracy annotation and parsing of CHILDES transcripts. In Proceedings of the ACL-2007 workshop on cognitive aspects of computational language acquisition (pp. 25–32), Prague, Czech Republic, June 2007. Association for Computational Linguistics. http://www.aclweb.org/anthology/W/W07/W07-0604.
Sagae, K., Davis, E., Lavie, A., MacWhinney, B., & Wintner, S. (2010). Morphosyntactic annotation of CHILDES transcripts. Journal of Child Language, 37(3), 705–729. doi:10.1017/S0305000909990407.
Sagae, K., MacWhinney, B., & Lavie, A. (2004). Automatic parsing of parent-child interactions. Behavior Research Methods, Instruments, and Computers, 36, 113–126.
Scarborough, H. S. (1990). Index of productive syntax. Applied Psycholinguistics, 11, 1–22.
Shimron, J. (Ed.). (2003). Language processing and acquisition in languages of Semitic, root-based, morphology (Language Acquisition and Language Disorders, No. 28). Amsterdam: John Benjamins.
Slobin, D. I. (1985). The crosslinguistic study of language acquisition: The data. Hillsdale, NJ: Lawrence Erlbaum Associates. ISBN 9780898593679.
Ussishkin, A. (1999). The inadequacy of the consonantal root: Modern Hebrew denominal verbs and output–output correspondence. Phonology, 16(03), 401–442.
Waterfall, H. R., Sandbank, B., Onnis, L., & Edelman, S. (2010). An empirical generative framework for computational modeling of language acquisition. Journal of Child Language, 37(3), 671–703.
Wintner, S. (2004). Hebrew computational linguistics: Past and future. Artificial Intelligence Review, 21(2), 113–138. doi:10.1023/B:AIRE.0000020865.73561.bc.
Yona, S., & Wintner, S. (2008). A finite-state morphological grammar of Hebrew. Natural Language Engineering, 14(2), 173–190.
Acknowledgments
This research was supported by Grant No. 2007241 from the United States-Israel Binational Science Foundation (BSF). We are grateful to Hadass Zaidenberg, Maayan Bloch and Ezer Rasin for their meticulous lexicographic work, to Arnon Lazerson for developing the conversion script, and to Shai Gretz for helping with the manual annotation.
Cite this article
Albert, A., MacWhinney, B., Nir, B. et al. The Hebrew CHILDES corpus: transcription and morphological analysis. Lang Resources & Evaluation 47, 973–1005 (2013). https://doi.org/10.1007/s10579-012-9214-z
DOI: https://doi.org/10.1007/s10579-012-9214-z