Information structure in African languages: corpora and tools

Abstract

In this paper, we describe tools and resources for the study of African languages developed at the Collaborative Research Centre 632 “Information Structure”. These include deeply annotated data collections of 25 sub-Saharan languages that are described together with their annotation scheme, as well as the corpus tool ANNIS, which provides unified access to a broad variety of annotations created with a range of different tools. With the application of ANNIS to several African data collections, we illustrate its suitability for the purpose of language documentation, distributed access, and the creation of data archives.

Notes

  1. Tense/Aspect/Modality, cf. the discussion of auxiliary focus in Hyman and Watters (1984).

  2. We use the open source database management system PostgreSQL (http://www.postgresql.org).

  3. In the Hausar Baka corpus, nominal chunks are currently not annotated, so \( \mathsf{CHUNK}=\text{``}\mathsf{NC}\text{''} \) substitutes for a variety of templates matching nominal chunks.

References

  1. Brants, T., & Plaehn, O. (2000). Interactive corpus annotation. In Proceedings of the second international conference on language resources and evaluation (LREC-2000) (pp. 453–459). Athens, Greece.

  2. Busemann, A., & Busemann, K. (2008). Toolbox self-training. Tech. rep., Summer Institute of Linguistics (SIL). http://www.sil.org/ (Version 1.5.4 Oct 2008).

  3. Chafe, W. L. (1976). Givenness, contrastiveness, definiteness, subjects, topics and point of view. In C. N. Li (Ed.), Subject and topic (pp. 27–55). New York: Academic Press.

  4. Chiarcos, C., Dipper, S., Götze, M., Leser, U., Lüdeling, A., Ritz, J., & Stede, M. (2008). A flexible framework for integrating annotations from different tools and tag sets. Traitement Automatique des Langues, 49(2), 271–293.

  5. Crysmann, B. (2009). Autosegmental representations in an HPSG of Hausa. In Proceedings of the ACL-IJCNLP workshop on grammar engineering across frameworks (GEAF 2009) (pp. 28–36). Singapore.

  6. Dipper, S. (2005). XML-based stand-off representation and exploitation of multi-level linguistic annotation. In R. Eckstein & R. Tolksdorf (Eds.), Proceedings of Berliner XML Tage (pp. 39–50).

  7. Dipper, S., & Götze, M. (2005). Accessing heterogeneous linguistic data—generic XML-based representation and flexible visualization. In Proceedings of the 2nd language and technology conference 2005 (pp. 23–30). Poznan, Poland.

  8. Dipper, S., Götze, M., & Skopeteas, S. (Eds.) (2007). Information structure in cross-linguistic corpora: Annotation guidelines for phonology, morphology, syntax, semantics, and information structure. Interdisciplinary Studies on Information Structure 7. Potsdam: Universitätsverlag Potsdam.

  9. Fiedler, I. (2009). Contrastive topic marking in Gbe. In Current issues in unity and diversity of languages. Collection of papers selected from the CIL 18 (pp. 295–308). Seoul: The Linguistic Society of Korea.

  10. Fiedler, I., Hartmann, K., Reineke, B., Schwarz, A., & Zimmermann, M. (2010). Subject focus in West African languages. In M. Zimmermann & C. Féry (Eds.), Information structure: Theoretical, typological, and experimental perspectives (pp. 234–257). Oxford: Oxford University Press.

  11. Green, M., & Jaggar, P. (2003). Ex-situ and in-situ focus in Hausa: syntax, semantics and discourse. In J. Lecarme (Ed.), Research in Afroasiatic grammar 2 (current issues in linguistic theory) (pp. 187–213). Amsterdam: John Benjamins.

  12. Hartmann, K., & Zimmermann, M. (2007a). Focus strategies in Chadic: The case of Tangale revisited. Studia Linguistica, 61(2), 95–129.

  13. Hartmann, K., & Zimmermann, M. (2007b). In place—Out of place? Focus in Hausa. In K. Schwabe & S. Winkler (Eds.), On information structure, meaning and form: Generalizing across languages (pp. 365–403). Amsterdam: John Benjamins.

  14. Hartmann, K., & Zimmermann, M. (2009). Morphological focus marking in Gùrùntùm (West Chadic). Lingua, 119(9), 1340–1365.

  15. Hellwig, B., Van Uytvanck, D., & Hulsbosch, M. (2008). ELAN Linguistic annotator. Tech. rep., Max Planck Institute. http://www.lat-mpi.eu/tools/elan/ (June 13, 2011).

  16. Hyman, L., & Watters, J. (1984). Auxiliary focus. Studies in African Linguistics, 15, 233–273.

  17. Krifka, M. (2008). Basic notions of information structure. Acta Linguistica Hungarica, 55, 243–276.

  18. Müller, C., & Strube, M. (2006). Multi-level annotation of linguistic data with MMAX2. In S. Braun, K. Kohn, & J. Mukherjee (Eds.), Corpus technology and language pedagogy: New resources, new tools, new methods (pp. 197–214). Frankfurt: Peter Lang.

  19. Newman, P. (2000). The Hausa language. An encyclopedic reference grammar. New Haven: Yale University Press.

  20. O’Donnell, M. (2000). RSTTool 2.4—A markup tool for rhetorical structure theory. In Proceedings of the international natural language generation conference (INLG’2000) (pp. 253–256). Mitzpe Ramon, Israel.

  21. Orasan, C. (2003). PALinkA: a highly customisable tool for discourse annotation. In Proceedings of the 4th SIGdial workshop on discourse and dialogue (pp. 39–43). Sapporo, Japan.

  22. Randell, R., Bature, A., & Schuh, R. (1998). Hausar Baka. http://www.humnet.ucla.edu/humnet/aflang/hausarbaka/ (June 13, 2011).

  23. Schmidt, T. (2004). Transcribing and annotating spoken language with EXMARaLDA. In Proceedings of the LREC-workshop on XML based richly annotated corpora, Lisbon 2004 (pp. 69–74). Paris: ELRA.

  24. Schwarz, A. (2010). Verb-and-predication focus markers in Gur. In I. Fiedler & A. Schwarz (Eds.), The expression of information structure. A documentation of its diversity across Africa (Typological Studies in Language 91) (pp. 287–314). Amsterdam/Philadelphia: John Benjamins.

  25. Schwarz, A., & Fiedler, I. (2007). Narrative focus strategies in Gur and Kwa. In E. Aboh, K. Hartmann, & M. Zimmermann (Eds.), Focus strategies in African languages. The interaction of focus and grammar in Niger-Congo and Afro-Asiatic (pp. 267–286). Berlin: Mouton de Gruyter.

  26. Skopeteas, S., Fiedler, I., Hellmuth, S., Schwarz, A., Stoel, R., Fanselow, G., Féry, C., & Krifka, M. (2006). Questionnaire on information structure (QUIS). Interdisciplinary studies on information structure 4. Potsdam: Universitätsverlag Potsdam.

  27. Witten, I. H., & Frank, E. (2005). Data mining: Practical machine learning tools and techniques (2nd ed.). San Francisco: Morgan Kaufmann.

  28. Zeldes, A., Ritz, J., Lüdeling, A., & Chiarcos, C. (2009). A search tool for multi-layer annotated corpora. In Proceedings of corpus linguistics 2009. Liverpool, UK.

  29. Zimmermann, M. (2008). Contrastive focus and emphasis. Acta Linguistica Hungarica, 55, 347–360.

  30. Zipser, F., & Romary, L. (2010). A model oriented approach to the mapping of annotation formats using standards. In Proceedings of the workshop on language resource and language technology standards, LREC 2010 (pp. 7–18). Malta.

Author information

Corresponding author

Correspondence to Julia Ritz.

Additional information

The Collaborative Research Centre 632 “Information Structure: the linguistic means for structuring utterances, sentences and texts” is funded by the German Research Foundation. The project associations are as follows: A5 (Focus from a cross-linguistic perspective, Mira Grubic, Malte Zimmermann), B1 (Gur and Kwa languages, Ines Fiedler, Katharina Hartmann, Anne Schwarz), B2 (Chadic languages, Katharina Hartmann), D1 (Linguistic database, Christian Chiarcos, Julia Ritz, Amir Zeldes).

About this article

Cite this article

Chiarcos, C., Fiedler, I., Grubic, M. et al. Information structure in African languages: corpora and tools. Lang Resources & Evaluation 45, 361–374 (2011). https://doi.org/10.1007/s10579-011-9153-0

Keywords

  • African language resources
  • Pragmatics
  • Corpus search infrastructure