
A morphologically annotated longitudinal corpus of spoken Czech child–adult interactions

Language Resources and Evaluation

Abstract

The paper presents a longitudinal corpus of transcribed spontaneous child–adult interactions in Czech. It consists of 99,388 tokens in 42,103 utterances produced by seven children between approximately 1.5 and 3.5 years of age, and 238,211 tokens in 61,252 utterances produced by their close caregivers in everyday situations at home. The corpus covers the children’s language production from a mean length of utterance of 1.01 words up to 5.33 words. The recorded period ranges from 11 to 27 months for individual children. The transcripts of both child and adult utterances were lemmatized and tagged using MorphoDiTa, a tool for automatic morphological analysis of Czech. The annotation was transformed into the MOR format used within CHILDES, a database dedicated to corpora of first language acquisition. Detailed manual checking was performed on the annotation of all children’s utterances. Data from three children were used to compare part-of-speech classification before and after manual checking; data from one child were additionally analyzed for differences in morphological tagging proper. The number of differences was rather low, with (expected) limitations in part-of-speech classification for uninflected words, annotation of homonymous forms, and annotation of child-specific words. The corpus represents an important contribution to research on child language, with special significance for Slavic languages and other morphologically rich inflecting languages, which are still underrepresented in the study of first language acquisition.
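The length measure mentioned in the abstract can be illustrated with a minimal sketch. It assumes that an utterance is a plain string and that a word is any whitespace-separated unit; the function name and the toy sample are hypothetical and are not taken from the corpus or from the authors' script. The last two lines simply divide the token counts reported above by the utterance counts; because tokens are not necessarily identical to words, these ratios are only a rough aggregate and not the per-recording values of 1.01 to 5.33.

```python
from typing import Iterable


def mean_length_of_utterance(utterances: Iterable[str]) -> float:
    """Return the mean number of whitespace-separated words per utterance.

    Empty utterances are skipped (an assumption of this sketch).
    """
    lengths = [len(u.split()) for u in utterances if u.split()]
    if not lengths:
        raise ValueError("no non-empty utterances")
    return sum(lengths) / len(lengths)


# Hypothetical toy sample (invented for illustration, not corpus data).
sample = ["mama", "to je pes", "já chci pití"]
mlu = mean_length_of_utterance(sample)

# Aggregate tokens-per-utterance ratios from the counts in the abstract.
child_ratio = 99388 / 42103
adult_ratio = 238211 / 61252
```

Under these assumptions, the aggregate ratios give a quick sanity check on the reported corpus size, while `mean_length_of_utterance` would be applied per recording to trace development over time.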


Data availability

The Chroma corpus is openly available in the CHILDES database (https://childes.talkbank.org/) and in the LINDAT database (https://lindat.cz/) in several versions.

Code availability

The MorphoDiTa software used for the morphological tagging is free software available at https://ufal.mff.cuni.cz/morphodita. The custom Python script written by JS is available at https://github.com/slamajak/The-Chroma-corpus.git.

Notes

  1. Information on available MOR grammars is given on the web page of the project TalkBank, which provides various data repositories (including child language banks) as well as CHAT, CLAN, and MOR manuals. Specifically, see https://talkbank.org/morgrams/.

  2. Available at http://lindat.mff.cuni.cz/services/morphodita/.

  3. See https://github.com/ufal/morphodita.

  4. See https://www.korpus.cz/.

  5. The dictionary, published by the Czech Language Institute of the Czech Academy of Sciences, is available at https://slovnikcestiny.cz/, and its dictionary-making principles were published in the monograph edited by Kochová and Opavská (2016). Entries starting with a- to f- have already been published, and entries starting with h- to j- are at different stages of revisions and editing. One of the co-authors of this paper is a co-author of the dictionary, which granted us access to unpublished entries as well and helped us ensure that the POS-classification in the corpus is in accordance with the most widely accepted POS treatment of moot words in the Czech linguistic tradition.

References

  • Akademický slovník současné češtiny [The Academic Dictionary of Contemporary Czech]. (2017–2023). Czech Language Institute of the Czech Academy of Sciences. https://slovnikcestiny.cz/

  • Brown, R. (1973). A first language: The early stages. Harvard University Press.

  • Chromá, A. & Matiasovitsová, K. (2022). CoCzeFLA Chroma 2022.07. LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University. http://hdl.handle.net/11234/1-4772

  • Chromá, A., Sláma, J., Matiasovitsová, K., & Treichelová, J. (2023a). CoCzeFLA Chroma 2023.07. LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University. http://hdl.handle.net/11234/1-5183

  • Chromá, A., Sláma, J., Matiasovitsová, K., & Treichelová, J. (2023b). Chromá Czech Corpus. CHILDES [www.childes.talkbank.org]. https://doi.org/10.21415/3ZNE-HX03.

  • Hajič, J. (2004). Disambiguation of Rich Inflection: Computational Morphology of Czech. Univerzita Karlova v Praze / Nakladatelství Karolinum.

  • Hnátková, M., Křen, M., Procházka, P., & Skoumalová, H. (2014). The SYN-series corpora of written Czech. In N. Calzolari, K. Choukri, T. Declerck, H. Loftsson, B. Maegaard, J. Mariani, A. Moreno, J. Odijk, & S. Piperidis (Eds.), Proceedings of the 9th international conference on language resources and evaluation (LREC 2014) (pp. 160–164). European Language Resources Association.

  • Kochová, P., & Opavská, Z. (Eds.). (2016). Kapitoly z koncepce Akademického slovníku současné češtiny. Ústav pro jazyk český AV ČR, v. v. i.

  • Komárek, M., Kořenský, J., Petr, J., & Veselková, J. (Eds.). (1986). Mluvnice češtiny 2: Tvarosloví. Academia.

  • MacWhinney, B. (2000). The CHILDES project: Tools for analyzing talk (3rd ed.). Lawrence Erlbaum Associates.

  • Matiasovitsová, K., Čechová, P., Sláma, J., Homolková, K., & Smolík, F. (in press). Mean length of utterance in Czech toddlers: Validity estimates and comparison of words, morphemes and syllables. Journal of Speech, Language, and Hearing Research.

  • Matiasovitsová, K., Čechová, P., Sláma, J., Treichelová, J., & Smolík, F. (2023). The validity of a transcript-based measure of child language development in Czech. In P. Gappmayr, & J. Kellogg (Eds.), BUCLD 47: Proceedings of the 47th annual Boston University Conference on Language Development (pp. 533–547). Cascadilla Press.

  • Mooney, A., Bean, A., & Sonntag, A. M. (2021). Language sample collection and analysis in people who use augmentative and alternative communication: Overcoming obstacles. American Journal of Speech-Language Pathology, 30(1), 47–62.

  • Potratz, J. R., Gildersleeve-Neumann, C., & Redford, M. A. (2022). Measurement properties of Mean Length of Utterance in school-age children. Language, Speech, and Hearing Services in Schools, 53(4), 1088–1100. https://doi.org/10.1044/2022_LSHSS-21-00115

  • Quirk, R., Greenbaum, S., Leech, G., & Svartvik, J. (1985). A comprehensive grammar of the English language. Longman.

  • Santos, A. L., Généreux, M., Cardoso, A., Agostinho, C., & Abalada, S. (2014). A corpus of European Portuguese child and child-directed speech. In N. Calzolari, K. Choukri, T. Declerck, H. Loftsson, B. Maegaard, J. Mariani, A. Moreno, J. Odijk, & S. Piperidis (Eds.), Proceedings of the 9th international conference on language resources and evaluation (LREC 2014) (pp. 1488–1491). European Language Resources Association. https://repositorio.ul.pt/handle/10451/30661

  • Scarborough, H. S. (1990). Index of productive syntax. Applied Psycholinguistics, 11(1), 1–22. https://doi.org/10.1017/S0142716400008262

  • Spoustová, D., Hajič, J., Raab, J., & Spousta, M. (2009). Semi-supervised training for the averaged perceptron POS tagger. In A. Lascarides, C. Gardent, & J. Nivre (Eds.), Proceedings of the 12th conference of the European chapter of the ACL (pp. 763–771). Association for Computational Linguistics.

  • Straková, J., Straka, M., & Hajič, J. (2014). Open-source tools for morphology, lemmatization, POS tagging and named entity recognition. In K. Bontcheva, & J. Zhu (Eds.), Proceedings of 52nd annual meeting of the association for computational linguistics: System demonstrations (pp. 13–18). Association for Computational Linguistics. https://doi.org/10.3115/v1/P14-5003

  • Štícha, F. (2018). Velká akademická gramatika spisovné češtiny: 1. Morfologie: Druhy slov / Tvoření slov. Část 1. Academia.

  • Vondráček, M. (1998). Citoslovce a částice – hranice slovního druhu. Naše Řeč, 81(1), 29–37.

Acknowledgements

The creation of the Chroma corpus has been supported by the Ministry of Education, Youth and Sports of the Czech Republic, projects No. LM2018101 and LM2023062 (LINDAT/CLARIAH-CZ). The morphological annotation of the corpus has been supported by the funding program Grant Schemes at Charles University (CZ.02.2.69/0.0/0.0/19_073/0016935). We are grateful to the following students who participated in the transcription of recordings, the revision of transcripts, and the manual checking of the automatic morphological annotation (in alphabetical order): Markéta Baslová, Kateřina Bělehrádková, Tereza Binderová, Barbora Blahnová, Iurii Bochkov, Jan Henyš, Alžběta Macháčková, Anna Marklová, Martin Pavlíček, Jan Pinc, Tereza Šátavová, Denisa Šebestová, Jana Segi Lukavská, Kateřina Šimková, Leona Straková, Tomáš Treichel, Štěpánka Tvrdíková, and Martina Vokáčová. We also thank our collaborator Petra Čechová, who participated in the process of morphological annotation, and our mentor, Filip Smolík, whose contribution to the entire project has been invaluable.

Funding

This work was supported by the Ministry of Education, Youth and Sports of the Czech Republic (Grants No. LM2018101 and LM2023062 LINDAT/CLARIAH-CZ) and by the funding program Grant Schemes at Charles University (CZ.02.2.69/0.0/0.0/19_073/0016935).

Author information

Contributions

Conceptualization and supervision [AC, KM]; project administration [AC, KM, JT]; data curation [AC, KM, JT]; methodology [KM, JS]; software [JS]; formal analysis [KM, JS, AC]; writing the original draft [AC, JS]; writing—review and editing [AC, JS, KM, JT].

Corresponding author

Correspondence to Anna Chromá.

Ethics declarations

Conflict of interest

The authors have no relevant financial or non-financial interests to disclose.

Ethical approval

Approval from the Research Ethics Committee of Charles University was obtained in 2018 under No. 2018UKFF01620.

Informed consent

For each participating child, both parents as well as other participating caregivers gave written informed consent for the use and publication of their anonymized data.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file1 (DOCX 14 kb)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Chromá, A., Sláma, J., Matiasovitsová, K. et al. A morphologically annotated longitudinal corpus of spoken Czech child–adult interactions. Lang Resources & Evaluation (2024). https://doi.org/10.1007/s10579-023-09710-y

  • DOI: https://doi.org/10.1007/s10579-023-09710-y
