CINWA (database of terminology for cultivated plants in indigenous languages of northwestern South America): introducing a resource for research in ethnobiology, anthropology, historical linguistics, and interdisciplinary research on the neolithic transition in South America

This article introduces CINWA, a freely accessible online database of terminology for cultivated plants in indigenous languages of South America based on FAIR principles for scientific data management and stewardship. In the pre-release version we present here, CINWA assembles more than 2700 terms from more than 60 indigenous languages of northwestern South America, and coverage will be continuously expanded. CINWA is primarily designed for use in historical linguistics to explore patterns of lexical borrowing that might be used as a proxy for tracing the pathways by which knowledge of individual cultivated plants and the associated know-how spread from speech community to speech community in pre-Columbian South America. In spite of intensifying research, this is still unclear for most cultivars as the locales of initial cultivation are heterogeneous and spatially diffuse. However, possible uses of the CINWA database are manifold and go beyond this research question. The database can be used as a resource for ethnobiological and comparative anthropological research on South American communities, South American agricultural ecosystems and practices, and for studies in lexical borrowing, language contact, and historical linguistics broadly.


Introduction
This article introduces CINWA, a freely accessible online database of terminology for cultivated plants in the languages of South America. Its focus lies on the northern and western parts of the continent. Here, in the early Holocene, people in various locales began experimenting with the tending of plants and the selection of phenotypes that had properties that were beneficial to them (Pearsall, 2015; Piperno, 2011). From these early beginnings, South America would become one of the regions of the world that witnessed the development of full-blown agriculture and a myriad of agricultural systems, ranging from large-scale agriculture and sophisticated technologies in the Andes and lowland areas such as the Llanos de Moxos to no less refined agroforestry systems in lowland Amazonia. These have shaped, and continue to shape, the forest ecologies of the lowlands in significant ways (Balée, 2013; Denevan, 2001).
Along with the development of such farming and agroforestry regimes, knowledge systems associated with agricultural techniques, and especially with individual plants suitable for cultivation in different environments, have developed. These knowledge systems are or were enshrined in the many indigenous languages of the South American continent. The continent originally boasted, according to some estimates, as many as 1000 individual languages and more than 100 independent linguistic lineages (Campbell, 1997; Kaufman, 1990). Many of these languages, and with them the knowledge regarding the physical environment of their speakers, have already become dormant (Harrison, 2007), but a multitude of local languages are still spoken, some of them severely endangered (Crevels, 2012; Urban, 2021).
The state of documentation of the indigenous languages of South America varies in quality and quantity. For a significant number of languages, including some that were abandoned as a means of daily communication early on, there are wordlists or dictionaries that document the vocabulary and, thereby, often also significant parts of the knowledge of their speakers about the world surrounding them and their daily activities, including, for food producers, the cultivation of plants. Present-day indigenous people of South America know of almost 200 plants that are cultivated or semi-cultivated in different parts of the continent (Denevan, 2001). While some of these are widespread across the entire continent, others are specific to the particular climatic and ecological conditions that the plant needs to thrive. Some are tied intimately to specific cultural practices, such as the use of annatto, made from the seeds of Bixa orellana, which are valued for their bright red color as a dye and raw material for body paint in many cultures of lowland South America.
Documentation of regional indigenous terminology of cultivated plants, whether rudimentary or advanced, is often not readily accessible to interested parties. The terminological distinctions indigenous languages make and the associated knowledge of the speakers are enshrined in sources that are typically far-flung and mostly not available in digital format. In the case of many sources, printed copies are available in only a few libraries around the world. So far, this has made it next to impossible to obtain a quick and reliable overview of terminology associated with cultivated plants in the indigenous languages of South America. In particular, the development of comparative perspectives is hindered by the fragmented way in which relevant information is distributed over different, often hard to get or "grey" literature. In addition, most of the sources present the information utilizing the national languages of present-day South American nation states, Spanish and Portuguese, as metalanguages, and not all interested parties may have a sufficient command of these.
One of the general purposes of CINWA is to facilitate such comparative perspectives for a variety of research purposes. We provide a more detailed yet non-exhaustive list of possible applications of CINWA in Sect. 7, after having contextualized the database with the CLLD framework which it utilizes and with comparable comparative linguistic sources (Sect. 2), and after providing an overview of CINWA's coverage in terms of languages (Sect. 3) and cultivars (Sect. 4), its data collection and representation principles (Sect. 5), and the functionalities of the online interface of the database (Sect. 6).

The CLLD framework and comparable work
CINWA utilizes the CLLD (Cross-Linguistic Linked Data) framework (Forkel et al., 2018, 2019), which has established itself over the last years as the quasi-standard for online publication of linguistic data resources. CLLD relies on Linked Data principles for the processing and storage of linguistic data, and has to this end developed CLDF (Cross-Linguistic Data Format), which specifies standards for preparing linguistic data in interrelated .tsv files that can be stored and accessed in the long term.
Developers also furnish a web framework to build and display CLLD data from CLDF-formatted files in an aesthetically uniform way across different CLLD resources, thus ensuring a coherent and intuitive user experience.
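To make the idea of interrelated tabular files concrete, the following minimal Python sketch joins a forms table to a languages table and lists all names recorded for one plant. It is an illustration under stated assumptions, not the actual CINWA data layout: the column names (ID, Language_ID, Parameter_ID, Form) are modelled on CLDF conventions, while all identifiers such as `quechua_san_martin` and `bunchosia_armeniaca` are hypothetical. The three forms are those discussed in Sect. 7.

```python
import csv
import io

# Hypothetical miniature tables. The identifiers and the exact column layout
# are assumptions for illustration; they are not copied from the CINWA dataset.
LANGUAGES_TSV = (
    "ID\tName\n"
    "chayahuita\tChayahuita\n"
    "quechua_san_martin\tSan Martín Quechua\n"
    "mochica\tMochica\n"
)

FORMS_TSV = (
    "ID\tLanguage_ID\tParameter_ID\tForm\n"
    "1\tchayahuita\tbunchosia_armeniaca\toshon\n"
    "2\tquechua_san_martin\tbunchosia_armeniaca\tushun\n"
    "3\tmochica\tbunchosia_armeniaca\tošórre\n"
)


def read_table(text):
    """Parse one tab-separated table into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def forms_for_plant(forms, languages, plant_id):
    """Return (language name, form) pairs for one plant, in table order."""
    names = {row["ID"]: row["Name"] for row in languages}
    return [
        (names[row["Language_ID"]], row["Form"])
        for row in forms
        if row["Parameter_ID"] == plant_id
    ]


languages = read_table(LANGUAGES_TSV)
forms = read_table(FORMS_TSV)
peanut_butter_fruit = forms_for_plant(forms, languages, "bunchosia_armeniaca")
```

Because each form row points to a language and a plant by identifier, the same tables can serve both a per-language and a per-plant view without duplicating data.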
CINWA is, to the best of our knowledge, the first online database dedicated specifically to names for cultivated plants for any part of the world. However, comparable work in a broader sense does exist. On the one hand, there are databases that bring together lexical and structural information on American languages, also in a CLLD framework: CSD (Rankin et al., 2016) deals with the comparative lexicon of the Siouan languages of North America, while TuLeD (Ferraz Gerardi et al., 2021) provides lexical data for Tupian, one of the largest language families of lowland South America. Both CSD and TuLeD seek to cover information on the vocabulary of the languages of the two families generally, and both also include a few plant terms. Of more general scope is the venerable Intercontinental Dictionary Series (Key & Comrie, 2021), which was begun in the 1980s and is now also available in CLLD format. The Intercontinental Dictionary Series provides vocabularies of around 1500 words for a wide range of languages of the world, but, reflecting the research interests of the founding editor Mary Ritchie Key, one particular focus is on South American languages. The local flora and fauna are not well represented in the standardized datasets, however. On the other hand, with Tsammalex (Naumann et al., 2020), there is a database with an outlook and goals that are similar to those of CINWA. Tsammalex is a multilingual lexical database on plants and animals with a highly developed structure that incorporates information on the biological classification of the flora and fauna that the vocabulary designates, the terrestrial ecoregions of the world in which they occur, and, for some entries, images of the designated species. However, in terms of languages, reflecting the principal interests of its creators, Tsammalex at present focuses on African languages, mostly sub-Saharan.

Coverage: languages
In terms of coverage, CINWA, at its current state of development as reflected in the preliminary release version 0.9 that we describe here, focusses on the northwestern quadrant of South America, covering indigenous languages spoken from Colombia in the north to Bolivia in the south, and in all of the three major geographical regions nowadays recognized: coastal lowlands, Andean highlands, and the lowlands of Colombia, Ecuador, Peru and Bolivia to the east of the Andean cordilleras. CINWA pays special attention to this part of the continent at this stage because it harbors the greatest number of sites where, according to current knowledge, the initial cultivation of a significant number of South American cultivars took place (Piperno, 2011). For those researchers interested in the historical linguistics of plant domestication, this is therefore the most relevant region to look into. In further updates to the database, we plan to expand coverage to include data from languages beyond this region.
Like other parts of the continent, northwestern South America was and still is eminently multilingual. The preliminary release version of CINWA which we present here has data for 62 languages which belong to as many as 32 distinct language families (including isolates). These languages, together with their genealogical classification, are listed in Appendix A; their locations are shown in Fig. 1, which is a screenshot from CINWA's online interface, with combinations of color and symbol type used for different language families. For the genealogical classification, we rely on Glottolog 4.4 (Hammarström et al., 2021). Itself a CLLD resource, Glottolog is the most comprehensive and reliable source of information about the different languages, dialects, and language families of the world, and provides a conservative classification that reflects currently accepted groupings among the experts on the relevant languages.

Coverage: cultivars
We have relied on Denevan (2001, pp. 307-320, Appendix 1A) for a near-exhaustive list of cultivated plants of South America. The table in this appendix, which provides a common name, other names, a botanical classification, and comments on location and use of individual cultivars, forms the basis of our work with primary sources for CINWA as described in Sect. 5. Denevan's table is itself based on prior sources, and the author notes that his compilation includes "Domesticates, semi-domesticates, and other selected planted and protected (cultivated) plants, native to the Americas, and prehistoric or probably prehistoric in South America" and that species were included in the table if "[l]isted by at least two sources as domesticated, semidomesticated, or cultivated" (Denevan, 2001, p. 320, table note a). While an independent perusal of the botanical and ethnobotanical literature would surely have resulted in a different, perhaps expanded list, this would have been a research undertaking that would have exceeded available resources by far, and it would also have engendered its own perils given that we are a team of linguists rather than (ethno)botanists. Our impression was that Denevan's compilation comes as close to definitive as any with scope over South America as a whole, so that we found it preferable to adopt his compilation, which has been highlighted as particularly useful in reviews (Newson, 2002; Stadel, 2005), for our purposes. As adapted for use in CINWA, the list features 189 cultivated plants, along with additional information on each one, including English names, local vernacular names, binomial nomenclature, author citation, plant family, the region in which the plant grows, and its principal use or uses. Pertinent data are in Appendix B; we emphasize again that this list is not our work, but relies heavily on Denevan (2001).

Data collection and representation principles
As primary data sources, we draw on published dictionaries, wordlists, and sometimes, where available to us, also dedicated literature that deals with the ethnobotanical knowledge and practices of particular South American speech communities. We used our own expertise in South American languages to identify possible sources, but we additionally relied on Glottolog's extensive bibliography on the world's languages. Generally, we preferred modern dictionaries that have been compiled by lexicographers with training in linguistics and that utilize phonemic transcription principles. However, such high-quality sources are not available for all languages within the scope of CINWA as delimited in Sect. 3. Given that language loss is a phenomenon that has been going on since the European invasion and that is even accelerating (Crevels, 2012), for some languages it has not been possible to compile such sources. For a subset of languages that are no longer spoken, shorter and less complete documentary material may nevertheless be available, even though such sources are often problematic in that they represent the sound structure of the described language poorly. Such sources also carry an increased risk of presenting inaccurate or vague information on the meaning of the recorded words. Cultivars may therefore not always be correctly identified. These problematic aspects notwithstanding, we have chosen to incorporate data from such sources since, in spite of their shortcomings, they are the best documentary material available for many a language that may harbor crucial evidence for historical processes related to plant cultivation in South America. Nevertheless, we urge CINWA users to treat such data critically and to be cognizant of the nature of the material and its particular characteristics that reflect the time of its creation.
We have not made it a policy to rely on only one source for data per language.For some languages, more than one source was consulted to increase coverage.
We have manually extracted the terminology of cultivated plants that the sources document by going through all relevant entries. In bilingual dictionaries, i.e. those that feature a section in which an English, Spanish, or Portuguese lexical item is translated to an indigenous language, we have used that section to identify the relevant entries in the complementary section that translates lemmata in the indigenous language to the European-based vernacular. As a matter of principle, we have always taken the forms of items and all additional information (part of speech, gloss, etc.) from the latter section. The reason is that in our experience this section often provides either identical or more complete information vis-à-vis the section that translates from a national language of European origin or English to the indigenous language, which for some of the sources we used is no more than a mere finder list.
Where dictionaries only feature one section that is lemmatized according to native language forms, we have had to look through the entire dictionary in search of cultivar terminology.

A significant issue is that the name of a particular cultivar in the local variety of Spanish, and to a minor extent also Portuguese, often differs from region to region. Sometimes this is a matter of distinctions related to the boundaries of modern nation states. But the problem trickles down further to the regional level, and sometimes cross-cuts national boundaries. Such terminological variation in the former colonial European languages is a significant obstacle to the precise identification of dictionary or wordlist entries. We have chosen to take as our point of departure the names mentioned in the column "Other Names" in Denevan's (2001) appendix, on which we also based our list of cultivated plants in Appendix B. To be maximally inclusive and to reduce the chances of overlooking data, we have made it our policy to always look up all the local names mentioned in all dictionaries, even in cases where we had some prior knowledge of what the name of the plant in the local variety of Spanish might be in each individual case.
One important principle that CINWA follows, not least in response to the above issue, is rich annotation: for each entry, CINWA provides the original entry found in the utilized primary source, including a part of speech code and the gloss, where available, relevant annotations on the characteristics of the plant, cultural associations that prevail in the respective speech communities, and other information. These entries have only been abbreviated if they were excessively long and contained information that was not judged directly relevant to the name of the cultivar and its use in the speech community. Experience shows that such information, while unsystematic across sources, can be crucial in different research contexts; therefore, to fulfil its aim to facilitate comparative perspectives on the terminology for cultivated plants for a variety of research purposes, we considered it a high priority to (partially) include the original dictionary gloss.
All lexical items are represented in CINWA in the orthography of the respective sources. We provide an additional phonological representation where the source itself does so, but otherwise do not attempt to infer standardized phonological representations for our datasets; while this would be unproblematic for some sources, for others it would be an enterprise fraught with difficulties that would require a significant amount of philological preparation and even then be associated with considerable uncertainty. We also adopt the citation forms as given in the sources rather than trying to reduce these to stems or roots, but we have not taken over alternate forms that sources provide (for instance, different genders or noun-classifier combinations).
The sources that we utilized to create CINWA rely on a multitude of different orthographies for representing the sound structure of the indigenous languages. Early sources tend to simply utilize the Latin alphabet in an uncritical and problematic way that potentially obscures or distorts the phonology of the described language. We nevertheless include such sources, as per the general principles above, in situations where we have no more adequate modern sources that use an orthography based on phonemic principles. Such more recent sources typically use an alphabet that is adapted to the phonology of the described languages, and may contain special characters that are not part of the standard Latin alphabet. Often, diacritics are used; in other cases, special phonetic characters from the International Phonetic Alphabet are used alongside standard characters; combinations are also frequent. One problem that arises is that in Unicode, the same character can be created in multiple ways. For instance, an <o> with a dot below and a circumflex diacritic, <ộ>, can be created in at least four ways. On the one hand, the character is available precomposed in Unicode (U+1ED9). But there are alternatives: the same result can be achieved by using the precomposed o with dot below <ọ> (U+1ECD) and adding a circumflex as a combining character (U+1ECD + U+0302), by using the precomposed o with circumflex <ô> (U+00F4) and adding a dot below as a combining character (U+00F4 + U+0323), or by building it up on the basis of <o> without diacritics (U+006F) and adding both circumflex and dot below through combining characters (U+006F + U+0302 + U+0323). Depending on the font used, the complex characters thus created may be displayed in slightly different ways, but, more importantly, they might behave differently in searches in data processing software, so that a search for a character created in one way might fail to find instances created by a different strategy. We therefore have prepared a list of characters with diacritics that are used in CINWA and the way they were generated. Our workflow involved a step in which coders, every time they encountered a special character in a source they were about to evaluate for CINWA, checked whether this particular special character was already used in one of the datasets that had already been extracted. If so, they proceeded to create the special character in the same manner. If the character in question had not been used before in the data for any other language, they added it to the list along with its name in Unicode and the way they created it, so that it can be created in the same way for future datasets. While we placed most emphasis on uniformity in the way each single character was created, we have made it a habit to work as much as possible with precomposed characters that are made available by Unicode (approximating Unicode NFC, see Moran & Cysouw, 2018, p. 28), and to employ combining characters only where this is not possible.
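The search problem and the bookkeeping step described above can be illustrated with Python's standard `unicodedata` module. The sketch below is not the tooling used for CINWA; it merely shows, for four encodings of o with circumflex and dot below (U+1ED9), that the raw strings differ while their NFC-normalized forms coincide. The `register` helper is a hypothetical stand-in for the manually maintained character list.

```python
import unicodedata

# Four canonically equivalent encodings of o with circumflex and dot below.
VARIANTS = {
    "precomposed": "\u1ED9",
    "dot-below base + combining circumflex": "\u1ECD\u0302",
    "circumflex base + combining dot below": "\u00F4\u0323",
    "plain o + two combining marks": "\u006F\u0302\u0323",
}


def describe(text):
    """List the code points of a string with their Unicode names."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


def register(registry, sequence):
    """Return the encoding on record for this character, recording it if new.

    Hypothetical helper: the registry is keyed by the fully decomposed (NFD)
    form, so canonically equivalent encodings share a single entry.
    """
    key = unicodedata.normalize("NFD", sequence)
    return registry.setdefault(key, sequence)


# As raw strings the four variants are all different, so a naive search for
# one of them would miss the other three.
raw_distinct = len(set(VARIANTS.values()))

# After NFC normalization they collapse into the single precomposed character.
nfc_distinct = len({unicodedata.normalize("NFC", v) for v in VARIANTS.values()})

# The first encoding recorded becomes the one every later coder reuses.
registry = {}
first = register(registry, VARIANTS["precomposed"])
later = register(registry, VARIANTS["plain o + two combining marks"])
```

Normalizing every string to one form before storage or comparison is the automated counterpart of the manual convention; a registry of this kind additionally documents which characters occur in the data at all.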

Features and functionalities of the CINWA online interface
As is true of CLLD databases generally, CINWA allows its users to access and visualize the data in different ways that cater to the anticipated heterogeneous needs of its future users. We describe the main ways of accessing the database here. From the start page, which provides a general short description of the database, some basic statistics and a citation recommendation, the user is able to navigate to three main sections: one that represents the data per language, one that represents the data per plant, and bibliographic data.
The "Languages" view allows the user to access the list of languages covered in CINWA together with associated metadata as also listed in Appendix A. Each language name leads to a full list of entries available for that language, so that researchers interested in the nomenclature of a certain language can access all names for cultivated plants that are represented for that language. The resulting list (exemplified in the screenshot in Fig. 2 for Jebero, a language of the Peruvian Amazon, with data from Valenzuela Bismarck, 2012) features, as described in Sect. 5, an (abbreviated) representation of the original gloss as found in the sources we consulted as well as a link to the bibliographic data of that source.

In turn, this view also has the English names of available plants linked to a view in which users can conveniently access all names for that particular plant across languages covered in CINWA. This is exemplified in the screenshot in Fig. 3 for the arrowroot (Canna edulis).
Together with the visual representation of the distribution of the terms and the languages in which they are found on the associated map, this facilitates initial comparison for similarities, which is one of the prime intended uses of the database.

Possible applications
CINWA's possible uses are manifold. Most generally, it is a resource for scholars interested in ethnobotany and its relation to linguistics and culture (Maffi, 2001). The database may serve as an initial source of information for researchers interested in the cultivated plants and associated names of a particular indigenous community or a particular part of South America. However, given its inherently comparative nature, we expect CINWA to be of greatest use for researchers interested in comparative perspectives on cultivar terminology.
These perspectives can take different forms: since names are the "foundation of linguistic ethnobiology" (Hunn & Brown, 2011, p. 319), ethnobiologists might be interested in the relationships between communities and the cultivars they tend as reflected in the lexical distinctions they enshrine. We imagine that our principle of rich annotation will be particularly useful here, as this will allow researchers to assess the polysemy structure associated with individual cultivars in individual languages, which may reveal cultural associations but also reflect the diachronic development of the system (Brown, 1986).
There is also a multitude of possible applications for more properly historical linguistic questions. One such topic is so-called marking reversals (Witkowski & Brown, 1983), a process of linguistic acculturation in which a native word for a known referent is recruited, together with a modifier, to designate a newly introduced referent (typically flora and fauna). As this new referent grows in importance to such an extent that the original referent ends up with a modifier, the original term comes to refer to the item of acculturation exclusively.
Such questions often relate intimately to South American prehistory. In fact, answering them might contribute significantly to elucidating the vectors of spread of cultivars, usefully complementing the archaeological record, which is still sparse in many regions, not least because of the rapid decomposition of organic material in hot humid environments. One such question pertains to the spread of agriculture and the cultivation of particular plants in pre-Columbian South America, and this is the main goal which we had in mind when creating CINWA. For instance, it is known that the coastal valleys of present-day northern Peru harbor some of the earliest evidence for cultivars whose initial focus of cultivation is thought to lie to the east of the Andes (Dillehay et al., 2007), and the Huancabamba depression, a sector of the Andes characterized by reduced altitude and width that makes east-west transversal movement relatively easy, could have served as a convenient corridor for the introduction of cultivars from the east (Kaulicke, 2020). Data such as that found in CINWA, in fact, furnishes evidence that this is the case: in the upper Amazon region of northern Peru, indigenous languages know the peanut butter fruit tree (Bunchosia armeniaca) by names like oshon in Chayahuita (Hart, 1988), and this term has been borrowed into the lowland Quechua variety of San Martín as ushun (Park et al., 1976). In this case, we can be very confident about borrowing and its directionality not just because of the general formal similarity, but also because the form was adapted to the three-vowel system of Quechua by replacement of mid vowels (provided that Chayahuita was the direct source of borrowing), and especially because cognates of the word are absent in any other Quechua varieties (and indeed, the tree would not grow in the regions where these are spoken). But there is more: in the extinct Mochica language of coastal Peru on the other side of the Andes, Brüning (2004) documented the form <ošórre> in work with the last speakers of the language. Consistent with a more remote borrowing history, this form shows greater distance from the present-day Upper Amazonian forms (though the final <e> is probably Spanish-induced, reducing the difference basically to the quality of the sonorant). In this sense, CINWA data can also contribute to resolving questions of the origin and spread of individual cultivars that have already been amply debated, but not yet answered conclusively.
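The vowel adaptation invoked in the oshon/ushun example can be stated as a simple rule. The following Python sketch is a deliberately simplistic illustration that operates on orthographic strings; the mapping of o to u follows the example above, whereas the parallel mapping of e to i is our assumption about how a mid front vowel would be treated in a three-vowel system, and the function name is hypothetical.

```python
# Mid vowels are replaced by the corresponding high vowels. The o -> u mapping
# follows the oshon/ushun example; e -> i is an assumed parallel case.
MID_TO_HIGH = str.maketrans({"o": "u", "e": "i"})


def adapt_to_three_vowel_system(form):
    """Return the form as it would look after raising all mid vowels."""
    return form.translate(MID_TO_HIGH)


predicted = adapt_to_three_vowel_system("oshon")
matches_attested = predicted == "ushun"
```

Applied to the Chayahuita form, the rule yields exactly the attested San Martín Quechua form, which is what makes the proposed direction of borrowing plausible; applied in reverse, the rule would be indeterminate, since a Quechua u could correspond to either u or o in the donor language.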
Some related questions have occupied researchers from different disciplines for a long time already, and CINWA is eminently suited to contribute to resolving these. One of these is whether the introduction of the banana to South America pre- or postdates European contact (Nordenskiöld, 1922; Jeffreys, 1963; Langdon, 1993; Balée, 2013, pp. 94-95), and linguistic arguments have played a role in that debate. Another open question that should be of renewed interest since Ioannidis et al.'s (2020) finding of a genomic signature of Native American-Polynesian contact events is the origin and geographical extension of use of South American terms for the sweet potato in Polynesia. Polynesian designations are so strikingly similar to those in some South American languages, prominently of the Quechuan family, that they have been understood as a signature of precisely such contact events (Adelaar, 1998). One issue is that the distribution of the relevant term is hitherto not known to have extended to those regions of South America in coastal Colombia that are most likely implicated in the contact events. Thanks to its scope, CINWA will allow a more rigorous assessment than was possible before.

Anticipated future development
Here we present a preliminary yet fully functional pre-release version (0.9) of CINWA. We plan to expand CINWA in the following phase of development in several ways. On the one hand, while we already have achieved a fairly dense coverage that allows researchers to look into research questions on the basis of a principled empirical dataset, we aim to extend coverage further with the ultimate long-term goal of complete coverage of indigenous languages of northwestern South America. On the other hand, we plan to implement a system of coding for the morphological structure of the items, as certain types of morphological structure, in particular what in English would be endocentric compounds, are often of importance in ethnobiological classification (in ethnobiological nomenclature, so-called "composite labels", "binomial terms" or "secondary lexemes"; Hunn & Brown, 2011).
Finally, we plan to implement a system of assessing similarities of forms between families that is similar to cognate coding in databases that cover lexical variation within language families (e.g. Dellert et al., 2020; Ferraz Gerardi et al., 2021). In our case, the goal would, however, rather be to establish a rating system of formal similarities between terms that might hint at prehistoric borrowing of designations across languages and language families, which sometimes occurs repeatedly and then gives rise to so-called "Wanderwörter", i.e. widespread formally similar terms that reflect different patterns of repeated borrowing (Haynie et al., 2014). With this aim in mind, we plan to develop a rating scheme that would take into account the known phonological histories of the involved language families and principles of loanword adaptation (e.g. Uffmann, 2015) to derive hypotheses as to borrowing pathways and networks.
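To give a rough idea of what a purely formal similarity rating could look like, the sketch below computes a normalized Levenshtein (edit distance) similarity between two orthographic forms. This is a generic baseline that we use for illustration only; it is not the rating scheme envisaged above, which would additionally draw on phonological histories and loanword adaptation principles, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimal number of insertions, deletions and
    substitutions of single characters needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (0 if char_a == char_b else 1)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Similarity between 0 and 1, where 1 means identical forms."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - edit_distance(a, b) / longest


distance = edit_distance("oshon", "ushun")  # two vowel substitutions
score = similarity("oshon", "ushun")
```

A measure of this kind treats every character difference alike, so the regular vowel raising in oshon/ushun counts as heavily as an arbitrary mismatch; a rating scheme informed by known sound correspondences would weight such expected differences less.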

Conclusions
In this article, we describe and officially release CINWA 0.9, a new database and resource that aims to cover the terminology for cultivated plants in South American indigenous languages, focusing on the northwest of the continent; we have laid out its rationale, layout, and structure and outlined possible applications. We have explained the CLLD framework on which the database draws and discussed comparable work (Sect. 2), and discussed CINWA's coverage in terms of languages (Sect. 3) and cultivated plants (Sect. 4). The preliminary release version 0.9 provides around 2700 such terms, covering 62 languages from 32 families for a subset of around 189 cultivated plants known to the indigenous societies of South America. We have described the principles of data collection and curation, emphasizing our rich annotation approach (Sect. 5). We have described the functionalities of the CINWA online interface (Sect. 6) and the possible applications it aims to facilitate (Sect. 7). In the coming years, we aim to expand CINWA further in terms of data coverage and functionality (Sect. 8).
Appendix A: languages in CINWA 0.9 with associated metadata

Fig. 2 CINWA 0.9 data for the Jebero language of the Upper Amazon (extracted from the primary source Valenzuela Bismarck, 2012)