
ChoCo: a multimodal corpus of the Choctaw language


This article presents a general-use corpus for Choctaw, an American indigenous language (ISO 639-2: cho, endonym: Chahta). The corpus contains audio, video, and text resources, and many of the texts are also translated into English. Both the Oklahoma Choctaw and the Mississippi Choctaw variants of the language are represented. The data set supports documentation of this threatened language and gives researchers and language teachers access to a diverse collection of resources.







  4. The irony is not lost on the authors that due to poor character support in LaTeX, these symbols in the present document also use non-standard encodings and special fonts.












Many thanks to Timothy Vizthum for his invaluable work correcting the OCR errors in the Allen Wright lexicon. The first author was supported by a National GEM Consortium fellowship and a USC Graduate Research Enhancement fellowship. This work was sponsored in part by the U.S. Army Research Laboratory (ARL) under contract number W911NF-14-D-0005. Statements and opinions expressed and content included do not necessarily reflect the position or the policy of the United States Government, and no official endorsement should be inferred. We thank the anonymous reviewers for their helpful suggestions and feedback on this work.

Author information



Corresponding author

Correspondence to Jacqueline Brixey.


About this article


Cite this article

Brixey, J., Artstein, R. ChoCo: a multimodal corpus of the Choctaw language. Lang Resources & Evaluation 55, 241–257 (2021).



  • American indigenous language
  • Endangered languages
  • Multimodal
  • Corpus