The Hebrew CHILDES corpus: transcription and morphological analysis

Albert, Aviad; MacWhinney, Brian; Nir, Bracha; Wintner, Shuly

doi:10.1007/s10579-012-9214-z

Original Paper
Published: 14 February 2013

Volume 47, pages 973–1005 (2013)
Language Resources and Evaluation

Aviad Albert¹,
Brian MacWhinney²,
Bracha Nir³ &
Shuly Wintner⁴

Abstract

We present a corpus of transcribed spoken Hebrew that reflects spoken interactions between children and adults. The corpus is an integral part of the CHILDES database, which distributes similar corpora for over 25 languages. We introduce a dedicated transcription scheme for the spoken Hebrew data that is sensitive to both the phonology and the standard orthography of the language. We also introduce a morphological analyzer that was specifically developed for this corpus. The analyzer adequately covers the entire corpus, producing detailed correct analyses for all tokens. Evaluation on a new corpus reveals high coverage as well. Finally, we describe a morphological disambiguation module that selects the correct analysis of each token in context. The result is a high-quality morphologically-annotated CHILDES corpus of Hebrew, along with a set of tools that can be applied to new corpora.

Notes

The MOR program was initially developed by Roland Hausser and Mitzi Morris. It is described in detail in Hausser (1989).
Diacritics are produced through addition of overprinting Unicode characters and are not single Unicode characters.
There are two other verbal templatic patterns, the passive counterparts of two of the five major binyanim. These are fully predictable from their active counterparts.
Note that the features scat, root and ptn are the general verbal features that are propagated to the output (syntactic category, consonantal root and pattern, respectively). Unlike these, the features part, past, fut and imp are only required for the proper A-rule match within the designated subsections (participle/present tense, past tense, future tense and imperative forms, respectively).
The categories are adjective, adverb, communicator, copula, existential, negation, numeral, onomatopoeia, preposition, pronoun, punctuation, quantifier, question, unknown, verb and vocalization.
We did not measure inter-coder agreement, but we estimate that more than 90% of the ambiguous tokens were identically annotated by both lexicographers. Consolidating the differences was a quick and easy task.
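
The note on diacritics above can be illustrated with a minimal Python sketch. It shows how a base letter followed by an overprinting (combining) Unicode character is stored as two code points even though it renders as one glyph. The sample string (Latin "e" plus U+0301 COMBINING ACUTE ACCENT) and the helper name split_marks are assumptions chosen for illustration; the notes here do not say which diacritics the transcription scheme actually uses.

```python
import unicodedata


def split_marks(text: str) -> list[tuple[str, list[str]]]:
    """Group each base character with the combining marks that follow it."""
    groups: list[tuple[str, list[str]]] = []
    # NFD decomposes precomposed letters into base letter + combining marks.
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch) and groups:
            # Non-zero combining class: attach the mark to the previous base.
            groups[-1][1].append(unicodedata.name(ch))
        else:
            groups.append((ch, []))
    return groups


# Hypothetical sample: "e" followed by U+0301 COMBINING ACUTE ACCENT.
sample = "e\u0301"
print(len(sample))          # number of code points, not of visible glyphs
print(split_marks(sample))
```

Tools that count characters or compare strings on such transcripts may therefore need to treat a base letter and its marks as one unit, or to normalize consistently before comparison.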

References

Adam, G. (2002). From variable to optimal grammar: Evidence from language acquisition and language change. PhD thesis, Tel Aviv University.
Albert, A., Nir, B., MacWhinney, B., & Wintner, S. (2011). A morphologically-analyzed CHILDES corpus of Hebrew. Presented at The International Association of the Study of Child Language (IASCL).
Albert, A., Nir, B., MacWhinney, B., & Wintner, S. (2012). A morphologically annotated Hebrew CHILDES corpus. In Proceedings of the EACL-2012 workshop on computational models of language acquisition and loss.
Bannard, C., Lieven, E., & Tomasello, M. (2009). Early grammatical development is piecemeal and lexically specific. Proceedings of the National Academy of Sciences, 106(41), 17284–17289.
Bat-El, O. (1994). Stem modification and cluster transfer in modern Hebrew. Natural Language and Linguistic Theory, 12, 571–593.
Berman, R. A. (1979). Lexical decomposition and lexical unity in the expression of derived verbal categories in modern Hebrew. Afroasiatic Linguistics, 6, 1–26.
Berman, R. A. (1981). Language development and language knowledge: Evidence from the acquisition of Hebrew morphophonology. Journal of Child Language, 8, 609–626.
Berman, R. A. (1985). The acquisition of Hebrew. In D. I. Slobin (Ed.), The crosslinguistic study of language acquisition (pp. 255–372). Hillsdale, NJ: Lawrence Erlbaum Associates.
Berman, R. A. (2009). Children's acquisition of compound constructions. In R. Lieber & P. Stekauer (Eds.), The Oxford handbook of compounding. USA: Oxford University Press.
Berman, R. A., & Ravid, D. (1986). Lexicalization of noun compounds. Hebrew Linguistics, 24, 5–22 (In Hebrew).
Berman, R. A., & Weissenborn, J. (1991). Acquisition of word order: A crosslinguistic study. Final Report. German-Israel Foundation for Research and Development (GIF).
Borensztajn, G., Zuidema, W., & Bod, R. (2009). Children’s grammars grow more abstract with age—evidence from an automatic procedure for identifying the productive units of language. Topics in Cognitive Science, 1, 175–188.
Borer, H. (1988). On the morphological parallelism between compounds and constructs. In G. Booij & J. van Marle (Eds.), Yearbook of morphology 1 (pp. 45–65). Dordrecht, Holland: Foris Publications.
Borer, H. (1996). The construct in review. In J. Lecarme, J. Lowenstamm & U. Shlonsky (Eds.), Studies in Afroasiatic grammar (pp. 30–61). The Hague: Holland Academic Graphics.
Brown, R. (1973). A first language: The early stages. Cambridge, MA: Harvard University Press.
Clark, E. V., & Berman, R. A. (1987). Types of linguistic knowledge: Interpreting and producing compound nouns. Journal of Child Language, 14(03), 547–567. doi:10.1017/S030500090001028X.
Crystal, D., Fletcher, P. J., & Garman, M. (1976). The grammatical analysis of language disability: A procedure for assessment and remediation. London: Edward Arnold. ISBN 0713158425.
Freudenthal, D., Pine, J., & Gobet, F. (2010). Explaining quantitative variation in the rate of optional infinitive errors across languages: A comparison of mosaic and the variational learning model. Journal of Child Language, 37(3), 643–69. ISSN 1469-7602. URL http://www.biomedsearch.com/nih/Explaining-quantitative-variation-in-rate/20334719.html.
Hausser, R. R. (1989). Principles of computational morphology. Technical report, Center for Machine Translation, Carnegie Mellon University.
Itai, A., & Wintner, S. (2008). Language resources for Hebrew. Language Resources and Evaluation, 42(1), 75–98.
Leben, W. R. (1973). Suprasegmental phonology. PhD thesis, Massachusetts Institute of Technology.
Leben, W. R. (1978). The representation of tone. In V. Fromkin (Ed.), Tone: A linguistic survey (pp. 177–220). New York: Academic Press.
Lee, L. L. (1974). Developmental sentence analysis. Evanston, IL: Northwestern University Press.
MacWhinney, B. (1996). The CHILDES system. American Journal of Speech Language Pathology, 5, 5–14.
MacWhinney, B. (2000). The CHILDES project: Tools for analyzing talk (3rd ed.). Mahwah, NJ: Lawrence Erlbaum Associates.
MacWhinney, B. (2008). Enriching CHILDES for morphosyntactic analysis. In H. Behrens (Ed.), Corpora in language acquisition research: History, methods, perspectives (Trends in Language Acquisition Research, Vol. 6). Amsterdam: Benjamins.
McCarthy, J. J. (1986). OCP effects: Gemination and antigemination. Linguistic Inquiry, 17, 207–263.
Miller, J., & Chapman, R. (1983). SALT: Systematic analysis of language transcripts, user’s manual. Madison, WI: University of Wisconsin Press.
Miyata, S., Hirakawa, M., Itoh, K., MacWhinney, B., Oshima-Takane, Y., Otomo, K., et al. (2009). Constructing a new language measure for Japanese: Developmental sentence scoring for Japanese. In S. Miyata (Ed.), Development of a developmental index of Japanese and its application to speech developmental disorders. Report of the Grant-in-Aid for Scientific Research (B) (2006–2008) No. 18330141, pp. 15–66. Nagoya, Japan: Aichi Shukutoku University.
Miyata, S., & MacWhinney, B. (2011). The development of parallel language measures: The example of Japanese DSSJ. Presented at The International Association of the Study of Child Language (IASCL).
Nir, B., & Berman, R. A. (2010). Parts of speech as constructions: The case of Hebrew ‘adverbs’. Constructions and Frames, 2(2), 242–274.
Nir, B., MacWhinney, B., & Wintner, S. (2010). A morphologically-analyzed CHILDES corpus of Hebrew. In Proceedings of the seventh conference on international language resources and evaluation (LREC’10) (pp. 1487–1490). European Language Resources Association (ELRA). ISBN 2-9517408-6-7.
Ordan, N., & Wintner, S. (2005). Representing natural gender in multilingual lexical databases. International Journal of Lexicography, 18(3), 357–370.
Ornan, U. (1986). Phonemic script: A central vehicle for processing natural language—the case of Hebrew. Technical Report 88.181, IBM Research Center, Haifa, Israel.
Ornan, U. (1994). Basic concepts in “Romanization” of scripts. Technical Report LCL 94-5, Laboratory for Computational Linguistics, Technion, Haifa, Israel.
Ornan, U., & Katz, M. (1995). A new program for Hebrew index based on the Phonemic Script. Technical Report LCL 94-7, Laboratory for Computational Linguistics, Technion, Haifa, Israel.
Ravid, D. (2012). Spelling morphology: The psycholinguistics of Hebrew spelling. Berlin: Springer.
Ravid, D., Dressler, W. U., Nir-Sagiv, B., Korecky-Kröll, K., Souman, A., Rehfeldt, K., et al. (2008). Core morphology in child directed speech: Crosslinguistic corpus analyses of noun plurals. In H. Behrens (Ed.), Corpora in language acquisition research: Finding structure in data (pp. 25–60). Amsterdam: John Benjamins.
Sag, I., Baldwin, T., Bond, F., Copestake, A., & Flickinger, D. (2002). Multiword expressions: A pain in the neck for NLP. In Proceedings of the third international conference on intelligent text processing and computational linguistics (CICLING 2002), Mexico City, Mexico, pp. 1–15.
Sagae, K., Davis, E., Lavie, A., MacWhinney, B., & Wintner, S. (2007). High-accuracy annotation and parsing of CHILDES transcripts. In Proceedings of the ACL-2007 workshop on cognitive aspects of computational language acquisition (pp. 25–32), Prague, Czech Republic, June 2007. Association for Computational Linguistics. http://www.aclweb.org/anthology/W/W07/W07-0604.
Sagae, K., Davis, E., Lavie, A., MacWhinney, B., & Wintner, S. (2010). Morphosyntactic annotation of CHILDES transcripts. Journal of Child Language, 37(3), 705–729. doi:10.1017/S0305000909990407.
Sagae, K., MacWhinney, B., & Lavie, A. (2004). Automatic parsing of parent-child interactions. Behavior Research Methods, Instruments, and Computers, 36, 113–126.
Scarborough, H. S. (1990). Index of productive syntax. Applied Psycholinguistics, 11, 1–22.
Shimron, J. (Ed.). (2003). Language processing and acquisition in languages of semitic, root-based, morphology. Number 28 in language acquisition and language disorders. John Benjamins.
Slobin, D. I. (1985). The crosslinguistic study of language acquisition: The data. The crosslinguistic study of language acquisition. Hillsdale, NJ: Lawrence Erlbaum Associates. ISBN 9780898593679.
Ussishkin, A. (1999). The inadequacy of the consonantal root: Modern Hebrew denominal verbs and output–output correspondence. Phonology, 16(03), 401–442.
Waterfall, H. R., Sandbank, B., Onnis, L., & Edelman, S. (2010). An empirical generative framework for computational modeling of language acquisition. Journal of Child Language, 37(3), 671–703.
Wintner, S. (2004). Hebrew computational linguistics: Past and future. Artificial Intelligence Review, 21(2), 113–138. doi:10.1023/B:AIRE.0000020865.73561.bc.
Yona, S., & Wintner, S. (2008). A finite-state morphological grammar of Hebrew. Natural Language Engineering, 14(2), 173–190.

Acknowledgments

This research was supported by Grant No. 2007241 from the United States-Israel Binational Science Foundation (BSF). We are grateful to Hadass Zaidenberg, Maayan Bloch and Ezer Rasin for their meticulous lexicographic work, to Arnon Lazerson for developing the conversion script, and to Shai Gretz for helping with the manual annotation.

Author information

Authors and Affiliations

Department of Linguistics, Tel Aviv University, Ramat Aviv, Israel
Aviad Albert
Department of Psychology, Carnegie Mellon University, Pittsburgh, PA, USA
Brian MacWhinney
Department of Communication Sciences and Disorders, University of Haifa, Haifa, Israel
Bracha Nir
Department of Computer Science, University of Haifa, Haifa, Israel
Shuly Wintner

Corresponding author

Correspondence to Shuly Wintner.

About this article

Cite this article

Albert, A., MacWhinney, B., Nir, B. et al. The Hebrew CHILDES corpus: transcription and morphological analysis. Lang Resources & Evaluation 47, 973–1005 (2013). https://doi.org/10.1007/s10579-012-9214-z

Published: 14 February 2013
Issue Date: December 2013
DOI: https://doi.org/10.1007/s10579-012-9214-z
