Language Resources and Evaluation, Volume 47, Issue 4, pp 973–1005

The Hebrew CHILDES corpus: transcription and morphological analysis

  • Aviad Albert
  • Brian MacWhinney
  • Bracha Nir
  • Shuly Wintner
Original Paper

DOI: 10.1007/s10579-012-9214-z

Cite this article as:
Albert, A., MacWhinney, B., Nir, B. et al. Lang Resources & Evaluation (2013) 47: 973. doi:10.1007/s10579-012-9214-z

Abstract

We present a corpus of transcribed spoken Hebrew that reflects spoken interactions between children and adults. The corpus is an integral part of the CHILDES database, which distributes similar corpora for over 25 languages. We introduce a dedicated transcription scheme for the spoken Hebrew data that is sensitive to both the phonology and the standard orthography of the language. We also introduce a morphological analyzer that was specifically developed for this corpus. The analyzer adequately covers the entire corpus, producing detailed correct analyses for all tokens. Evaluation on a new corpus reveals high coverage as well. Finally, we describe a morphological disambiguation module that selects the correct analysis of each token in context. The result is a high-quality morphologically-annotated CHILDES corpus of Hebrew, along with a set of tools that can be applied to new corpora.

Keywords

  • CHILDES
  • Hebrew
  • Transcription of spoken language
  • Morphological analysis
  • Morphological disambiguation

Copyright information

© Springer Science+Business Media Dordrecht 2013

Authors and Affiliations

  • Aviad Albert (1)
  • Brian MacWhinney (2)
  • Bracha Nir (3)
  • Shuly Wintner (4)

  1. Department of Linguistics, Tel Aviv University, Ramat Aviv, Israel
  2. Department of Psychology, Carnegie Mellon University, Pittsburgh, USA
  3. Department of Communication Sciences and Disorders, University of Haifa, Haifa, Israel
  4. Department of Computer Science, University of Haifa, Haifa, Israel