1 Introduction

Central Pame (pbs, cent2154) is an Oto-Pamean (Otomanguean) language spoken in Mexico, around 200 km north of Mexico City as the crow flies. Although the language can be regarded as threatened (e.g. by Hammarström et al., 2021), it still has several thousand speakers in and around Santa María Acapulco (San Luis Potosí), and it is still acquired by children as an L1 in this and several smaller communities. Despite a notable documentation effort by SIL linguists around 70 years ago (see Gibson 1950a, 1950b, 1956), the language remains severely underdocumented and lacks a published (sketch) grammar or dictionary.

As in other languages in the family (e.g. Chichimec, see Angulo, 1933, Navarro & Zoé, 2018, Palancar & Avelino, 2019; Herce, 2022), arguably the most notable characteristic of Central Pame is the extraordinary complexity of its inflectional morphological system. While nominal inflection (based on prefixes, tone changes, stem alternations, and suffixes, and organized into numerous inflection classes) has at least been described and analyzed in one publication (Gibson & Bartholomew, 1979), the verbal inflectional system has not been the subject of any dedicated investigation. Unpublished descriptions by SIL linguists from the 1950s (most notably Gibson, 1950c and Olson, 1955) clearly attest to the system’s extraordinary complexity and unusual organization. These descriptions, however, are not adequate for modern computational analyses (they contain mistakes and inconsistencies), and they represent an older state of the language that might differ in important respects from contemporary speech (quite notable changes can occur in unstandardized languages in comparatively short periods of time, see O’Shannessy, 2005; Feist & Palancar, 2021).

For these reasons, the documentation of contemporary Central Pame verbal inflection is timely and likely to yield insights relevant to the quantitative exploration of paradigmatic morphology. Given the extreme challenges that the system poses for learnability and use, the main goal of this paper is to analyze paradigmatic predictability relations in the system quantitatively (i.e. to address what has come to be known as the Paradigm Cell-Filling Problem, PCFP, Ackerman et al., 2009). This is achieved through a purpose-built inflected lexicon in which every one of the 12528 word forms has been independently elicited and checked multiple times with native speakers.

This paper is structured as follows: Section 2 provides a basic introduction to Central Pame phonology and phonological processes, and an exposition of the phonological transcription choices adopted throughout this paper. Section 3 provides a basic qualitative description of the broad characteristics of Central Pame verbal inflection: features and values, morphological layers, most general patterns, etc. Section 4 presents the verbal inflected lexicon of Central Pame (VeLePa), and explains how it was compiled. Section 5 is the core of the paper and contains a quantitative analysis of every inflectional layer, and of the whole word. Section 6 discusses issues beyond paradigmatic predictability, namely how speakers can infer the lexical and morphosyntactic values from a word form. Finally, Section 7 summarizes and concludes the paper, and presents some avenues for future research.

2 Central Pame phonology and (morpho-)phonology

The description of Central Pame phonology by Gibson (1956) is the most complete one available to date and still captures quite accurately the synchronic sound patterns of the language. Although much research still needs to be done on phonetic aspects of the language, my experience is that the basic contrasts and processes identified by Gibson still apply, particularly in older speakers. I therefore only seldom deviate from her original analysis here, to which I am heavily indebted.

Table 1 shows the phonemic inventory of Central Pame: 20 simple consonants and 10 vowels (5 qualities + nasality). Some analytical matters could substantially expand the basic inventory in Table 1. Consonants followed by a glottal are generally considered complex segments, that is pʰ, tˀ, t͡sʰ, kˀ, etc. in related languages (e.g. Berthiaume, 2003, Knapp Ring, 2008), and VʔV and VhV sequences might also be considered single creaky and breathy vowels respectively, or ‘rearticulated’ vowels (see Avelino, 2010).Footnote 1 These phonetic details await investigation in Central Pame, but they are orthogonal to the morphological issues that constitute my focus here. Of special morphological relevance is the distinction between short and long consonants, which is phonemic in intervocalic environments. This contrast occurs in most consonants (i.e. p/pː, t/tː, etc.) and is indicated orthographically throughout this paper with a doubling of the consonant (e.g. apa [short] vs appa [long]). Where consonant length is not contrastive (e.g. in consonant clusters ampa, or word initially pa), consonants are transcribed as short.

Table 1 Phonological inventory of Central Pame (phonemic length and tone not shown)

Also of great morphological relevance are tone and stress, which are intertwined in the language so that only stressed syllables bear one of three contrastive tones: High (indicated with an acute accent á), Low (indicated with a grave accent à), and Falling (indicated through a circumflex accent â). These contrasts, however, are further limited to those words where stress is found on the final syllable, which coincides with the stem. When stress occurs in the penultimate or antepenultimate syllable, i.e. in the prefix, only the high tone can occur.

There are several phonological processes relevant to morphology. The most productive and influential one is a process of palatalization, by which a close front vowel /i/ triggers the palatalization of any following alveolar or velar consonant or consonant cluster. In this phonological environment, therefore, the contrast between alveolar and velar points of articulation (see Table 1) is neutralized, with both series becoming post-palatal instead (e.g. i+t = ik̘, i+k = ik̘). These neutralized consonants are transcribed here as velar (see Table 2) because i) this is closest to their phonetic point of articulation, and ii) this follows local practice and Gibson’s (1956) original phonemic transcription. Depending on the consonant, a palatal glide (written “y” here) can follow the palatalized consonant. Although a prefix of the form /i/ always triggers stem palatalization, some exceptional forms show a palatalized stem in the absence of any prefix (e.g. protect.2SG.IMP ∅-kyònt), which means that this morphophonological process falls short of full predictability.

Table 2 Neutralization of alveolar and velar consonants after /i/

Similar to palatalization, a process of labialization also occurs in many stems, by which prefixes containing the back vowel /o/ generate a labial-velar glide (written “u” here) on the following stem (e.g. hunt.3SG.PST ko-ŋguǽʔæ < ko + ŋgǽʔæ). This process is somewhat morphologized as well because labialization also appears in some zero-prefixed forms (e.g. hunt.1SG.IRR ∅-ŋguǽʔæ) and occurs in a partially unpredictable subset of verbal stems.

One last morphophonological process with important morphological repercussions is the fusion of stems and suffixes. It is a pervasive trait of Central Pame that the various inflectional suffixes in the language never add an extra syllable to the word, instead fusing with the ends of stems into a sequence that is permissible by the language as a syllable coda. Thus, while all suffixes are free to concatenate unproblematically with stems whose final sound is a vowel (e.g. -kkò + i = -kkòi, -kkò + t = -kkòt, -kkò + n = -kkòn, etc.), they undergo various morphophonological changes when they combine with consonant-final stems (e.g. -ttòŋ + i = -ttòi, -ttòŋ + t = -ttònt, -ttòŋ + n = -ttòn, etc.). Although these changes are predictable if the underlying (i.e. unsuffixed) form of the stem is known, they sometimes generate morphological neutralizations (indicated in Table 3 through identical shades of gray) comparable to those in Table 2.

Table 3 Some morphophonological changes as a result of suffixation

As Table 3 shows, a stem ending in -V (see ‘belittle’), one ending in -Vŋ (see ‘protect’), and one ending in -Vi (see ‘send’) all surface as -Vi after the addition of a 1DU.INC suffix -i. It thus becomes impossible to deduce the underlying/SG form of the stem from this suffixed form. A stem ending in -V, in -Vŋ, and in -Vn, in turn, all surface as -Vn when a 1PL.INC suffix -n is added, which again generates uncertainty as to what the form of the stem is in other unsuffixed or differently-suffixed word forms. For the sake of brevity, I do not provide here an exhaustive analysis of underlying and derived forms for the palatalization, labialization, or suffixal fusion of different stems. These are part and parcel of the morphological predictability relations analyzed in Sect. 5 of this paper. The whole inflected lexicon is freely accessible, so the interested reader is free to inspect all forms individually.
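The fusion outcomes just described can be made concrete with a minimal Python sketch. It encodes only the outcomes stated in the text (-Vŋ + i, -Vŋ + t, -Vŋ + n, -Vi + i, -Vn + n) and assumes plain concatenation everywhere else; the function names `fuse` and `neutralizations` are hypothetical, and the sketch is not meant as a full account of Table 3.

```python
from collections import defaultdict


def fuse(stem: str, suffix: str) -> str:
    """Attach a suffix to a stem without adding a syllable.

    Only the outcomes spelled out in the text are modelled; any other
    combination falls back to concatenation (an assumption of this sketch).
    """
    if stem.endswith("ŋ"):
        base = stem[:-1]
        if suffix == "i":   # -Vŋ + i -> -Vi
            return base + "i"
        if suffix == "t":   # -Vŋ + t -> -Vnt
            return base + "nt"
        if suffix == "n":   # -Vŋ + n -> -Vn
            return base + "n"
    if suffix in ("i", "n") and stem.endswith(suffix):
        return stem         # -Vi + i -> -Vi, -Vn + n -> -Vn
    return stem + suffix


def neutralizations(endings, suffix):
    """Group underlying stem endings by the surface ending they yield."""
    by_surface = defaultdict(list)
    for ending in endings:
        by_surface[fuse(ending, suffix)].append(ending)
    return dict(by_surface)


# Examples given in the text
print(fuse("kkò", "i"), fuse("ttòŋ", "i"), fuse("ttòŋ", "t"), fuse("ttòŋ", "n"))

# Schematic endings: three underlying shapes collapse into one surface shape
print(neutralizations(["V", "Vŋ", "Vi"], "i"))
print(neutralizations(["V", "Vŋ", "Vn"], "n"))
```

Because several underlying endings map to the same surface ending, the mapping cannot be inverted, which is the source of the uncertainty discussed above.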

3 Central Pame verbal inflectional morphology in a nutshell

As some of the morphophonological insights of Sect. 2 have already begun to reveal, verbal inflection in Central Pame is organized in quite an unusual way, with prefixes, suffixes, stem alternations, and tone-stress alternations all playing a role in the signalling of inflectional features and values. I refer to these as (morphological) layers in the rest of this paper. Central Pame verbs, the category on which I focus in this paper, inflect for person, number, and clusivity of the subject (S/A) and the objects (O and IO), as well as for different values of TAM. The usual values are distinguished for person (1, 2, 3) and for clusivity (INC, EX). For number, three values are distinguished (SG, DU, PL), and for TAM, six categories exist that I refer to as present (PRS), past (PST), irrealis (IRR), subordinate (SUB), future (FUT), and imperative (IMP).Footnote 2 The latter has 2nd person forms exclusively. Like other languages in its family, Central Pame lacks nonfinite forms. Although adjectival and agent-noun derivations do exist, I consider these adjectives and nouns respectively, not verbal forms, and hence members of a different paradigm.

Direct-object and indirect object morphology in the language is comparatively regular morphologically (see Brunner, 2016, and Table 4), with only minor morphophonological changes and no lexically-determined inflection class distinctions. Unlike the rest of the morphology (i.e. S/A and TAM, which is often cumulated and expressed together through several morphological layers), the morphology for O and IO is separative (does not change due to TAM or the person or number of the agent), and orthogonal to the one analyzed in this paper.

Table 4 Some O-indexing morphology in Central Pame

Although some intransitive verbs (see ‘cough’ in Table 4) do index their S argument with the same morphology that usually expresses the object of transitive verbs (see ‘find’ and ‘dry’) this is an exceptional pattern limited to a few verbs which have an experiencer S. Generally, S and A are marked (along with TAM) in the same broad way morphologically speaking: through a complex combination of prefixes, stem alternation, tone-stress alternations, and suffixes. This does not detract from the fact that intransitive and transitive verbs tend to inflect according to different inflection classes, which means that S and A are often indexed through somewhat different morphology. This might lead us to describe the argument indexing of Central Pame as tripartite (rather than nominative-accusative) were it not for the fact that the S argument in different verbs, or the A marking in different verbs, are also often indexed with different morphology due to arbitrary inflection class distinctions.

The paradigm of the verb ‘find’ is displayed in Table 5 to illustrate S/A+TAM inflection. Tone (indicated with bold italic vs normal font) distinguishes 1/3.nonPRS from the rest of the paradigm (i.e. PRS+2.nonPRS). Stem alternation (leaving palatalization and labialization aside) distinguishes three more orthogonal domains in the paradigm: 3PL vs PRS+1IRR/SUB+1/2FUT vs rest. Prefixes (the most active inflectional layer) indicate most TAM and person distinctions (but not all, consider the triplet , , , where tone and stem alternations are the ones generating whole-word morphological contrasts). Suffixes express most number distinctions (consider ki-kkyèhe, ki-kkyèhei, ki-kkyèhen) but again not consistently (cf. wa-kkèhe, wa-kkèhei,-kʔèhe).

Table 5 Paradigm of ‘find’

The division of labour between inflectional layers is, therefore, complex. However, even more so is the fact that within every single one of these layers, verb-to-verb morphological differences are commonplace. Inflection classes, thus, can be identified in the behaviour of prefixes, stem alternations, tone-stress alternations, and suffixes individually (see Baerman, 2013; Ackerman & Malouf, 2013: 447–453, Mansfield, 2016; Beniamine, 2018: 106–123 and Parker & Sims, 2020 for analyses of other similarly-structured systems). The classes in every layer cross-classify, so that the overall (i.e. whole-word) number of inflection classes is extremely large. At the whole-word level, almost every single verb in the database underlying this paper ends up being morphologically unique, which is one of the reasons to pursue a layer-by-layer analysis here instead.

Table 6 shows an illustrative subset of prefixes of those Central Pame prefixal classes with 3 or more members in my sample. It shows very clearly that, given that hardly any prefix identifies its inflection class unmistakably (but see e.g. the 2.PRS la- in ‘leave.tr’), and given that some classes are entirely made up of prefixes that also appear in other classes (e.g. the largest one), speakers of the language face an important challenge when predicting the form of one cell from the form of another. This is what is known as the Paradigm Cell Filling Problem (PCFP, Ackerman et al., 2009).

Table 6 Prefixes of PRS and PST forms of the major conjugation classes in Central Pame

In other layers, even though the properties of the particular subsystem may vary substantially, the basic situation regarding predictive uncertainty is comparable. Table 7 shows the tone-stress of all PRS and PST tense cells, with -H, -F, and -L indicating high, falling, and low tone respectively in the last syllable (e.g. lappái ‘1SG.PRS.send’, lassô ‘1SG.PRS.tie’, lakkò ‘1SG.PRS.belittle’), and H- indicating high tone-stress in the penult (e.g. láppo ‘1SG.PRS.give’). One of the most characteristic traits of Central Pame tone-stress is that around two thirds of verbs (the first four classes in Table 7) have an invariable tone-stress across the paradigm, while the remaining verbs fall into a large number of smaller classes. Despite the much lower inflectability of tone-stress in the language, thus, the fact remains that since all cells can adopt any tone-stress value, and all can be involved in alternations, speakers also face considerable uncertainty when predicting whether a given verb has lexical (i.e. unchanging) or inflectional (i.e. alternating) stress, and hence predicting the tone-stress of one cell from another.

Table 7 Tone-stress of PRS and PST cells of the major Central Pame tone-stress classes

To continue with a succinct overview of every inflectional layer in the language, Table 8 presents the same partial paradigms of the largest stem alternation classes (those with 4 or more verbs in my database). Only stem onsets are provided (i.e. the stem up to but excluding the stem vowel) because it is here that most of the stem alternation changes take place (although about a dozen verbs show SG/DU vs PL suppletion, in which case not only the stem but also prefixal and tone-stress class can change). Given that, as in other layers, one and the same stem onset can occur in multiple classes (e.g. 1.PRS /pp/ in ‘borrow’ vs in ‘exchange’, or 2PRS /kky/ in ‘copy’ vs ‘buy’ vs ‘laugh’), speakers also face a PCFP when predicting the stem in one cell from the stem in another cell.

Table 8 Stem onsets of PRS and PST cells of the major stem-alternation classes. Note that columns capture classes as defined segmentally and not as defined by alternation patterns

As the forms in Table 8 show, the stem in the 3PL is the most variable one, being sometimes based on the glottalization (i.e. +/h/ or +/ʔ/) of the stem found elsewhere in the paradigm (see ‘copy’, ‘erect’, ‘call.self’), sometimes on the addition of an alveolar (i.e. /l/+ or /t/+) to that stem (see ‘bury’, ‘answer’), on combinations of both strategies (see ‘buy’, ‘laugh’), and also on other changes (see ‘borrow’) or no changes (see ‘exchange’). Outside these cells, alternation patterns often rely on stem consonant length. In a manner resembling consonant gradation in Finnic and Sami languages, an (unnatural) set of cells has the lenis and another set the fortis version of the stem (see ‘borrow’, ‘copy’, ‘buy’, ‘laugh’). Also like in some of these Uralic languages, there are two different paradigmatic distributions of gradation. Only the more frequent one is shown in Table 8, but a lenis consonant in the PRS and a fortis one in the PST is also common.

Finishing the tour around Central Pame morphological layers, word endingsFootnote 3 (see Table 9) can also exhibit inflection class behavior. Unpredictability here can be the result of the morphophonological processes explained in Table 3 and of suffixal allomorphy. Due to the morphophonological adjustment (for example, that ∅+n gives n but ŋ+n also gives n), it is uncertain, given a 1PL.INC.PRS in -n whether the 1SG.PRS might be -∅ as in ‘erect’, or -ŋ as in ‘burn’. Unpredictability can also result from different suffixes being added in different verb classes. Notice that there is a 2DU suffix -k in the classes of ‘frightened’, ‘be.busy’, ‘hurry’, or ‘run’, but no such suffix in the classes of ‘erect’, ‘burn’, ‘deny’, or ‘deceive2’. Notice similarly, that there is a 3PL suffix -t in the inflection class of ‘frightened’, ‘fast’, and ‘run’, but no suffix in the other classes. Unlike other layers, suffixes are mostly concerned with subject person and particularly number marking and tend to be invariable across TAM.

Table 9 Word endings of PRS and PST cells of the major stem-coda classes

A cautionary note is in order concerning the extra-morphological motivation of the inflectional classifications that have been briefly introduced throughout this Section. Although we are restricting our focus to morphological traits in this paper, it is not the case that, as in perfectly canonical inflection classes (Corbett, 2009), membership is completely unmotivated. The most salient extramorphological generalization is that verbs in certain classes (the prefixal conjugations of ‘accuse’, ‘able.to’, and ‘leave.tr’) are transitive but those of other classes are intransitive. Regarding suffixal classes, those with the less common suffixes 3PL -t or 2DU -k are also composed overwhelmingly of intransitive verbs, while those without these suffixes are overwhelmingly transitive. Despite this or other possible extramorphological predictors of inflectional classification, the remainder of this paper is concerned with morphological properties and generalizations exclusively.

4 Building VeLePa: an inflected lexicon of Central Pame

The data presented in Sect. 3, as well as the quantitative analysis presented in Sect. 5, rely on an original inflected lexicon which documents all S/A and TAM forms of 216 lexemes. Given the size of Pame paradigms, with 58 cells,Footnote 4 this resource contains 12528 word forms and is hence in the top half (32nd of 94) by size among non-(Indo-)European languages at present (see Kirov et al. 2018). Most of these words (9315, 74%) have no homophones, and they are never systematically syncretic with other values, unlike e.g. English go, which serves as 1SG.PRS, 2SG.PRS, 1PL.PRS, 2PL.PRS, and 3PL.PRS alike. All these forms were collected through elicitation from two speakers of the language: a 30-year-old male speaker (M30), who was my main informant, and a 45-year-old female speaker (F45). Elicitation took place both face-to-face (in Santa María Acapulco and Cárdenas, San Luis Potosí, Mexico) and through regular online fieldwork sessions over the last four years. Every single one of the 12528 inflected forms it contains has been independently elicited (i.e. not extrapolated or predicted from other elicited forms, as is common practice in this kind of resource), and has been checked, often multiple times, to ensure its accuracy, particularly in cases of irregular, variable, unexpected, or seemingly “aberrant” forms.
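The figures above can be cross-checked with a short Python sketch. The breakdown of the 58 cells is my reconstruction from the feature values listed in Sect. 3 (11 person-number-clusivity combinations in five TAM categories, plus three 2nd person imperatives); the cell labels themselves are a hypothetical notation, not necessarily the one used in VeLePa.

```python
from itertools import product

# Person, number, and clusivity combinations (clusivity only in 1st person non-singular)
PERSON_NUMBER = [
    "1SG", "1DU.INC", "1DU.EX", "1PL.INC", "1PL.EX",
    "2SG", "2DU", "2PL",
    "3SG", "3DU", "3PL",
]
# TAM categories with a full set of person-number cells
TAM_FULL = ["PRS", "PST", "IRR", "SUB", "FUT"]

cells = [f"{pn}.{tam}" for tam, pn in product(TAM_FULL, PERSON_NUMBER)]
# The imperative has 2nd person forms exclusively
cells += [f"{pn}.IMP" for pn in PERSON_NUMBER if pn.startswith("2")]

N_LEXEMES = 216
n_forms = N_LEXEMES * len(cells)
share_without_homophones = 9315 / n_forms

print(len(cells), n_forms, f"{share_without_homophones:.0%}")
```

Under these assumptions the enumeration yields the 58 cells and 12528 word forms reported in the text.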

The present inflected lexicon of Central Pame, named VeLePa, has been designed and built to allow for accurate computational assessments of morphological complexity. The absence of extrapolated forms prevents the underestimation of morphological complexity (e.g. due to missing possible irregularities in otherwise predictable word forms). Two more practices have been adopted to also avoid the overestimation of morphological complexity:

The first concerns the need to abstract away from systematic interspeaker differences. Although the forms from my two informants are most commonly the same, whenever consistent morphological differences were found between them, the variant preferred by the younger (main) informant has been selected for consistency. Some examples of this are: i) F45 shows a morphophonological process whereby /aʔɛ/ and /ahɛ/ sequences show anticipatory vowel quality assimilation to /ɛʔɛ/ and /ɛhɛ/ respectively, while M30 does not. Thus, for example, ‘draw.1SG.PRS’ is lɛʔɛ̀s for F45 but laʔɛ̀s for M30. The latter was favoured and encoded in the inflected lexicon. ii) F45 has a prefix i- in the 1PL and 2PL irrealis of Conjugation 3, while M30 has wi-. Thus, for example, ‘be.busy.1PL.EX.IRR’ is ikyénʔ for F45 but wikyénʔ for M30; the latter has been chosen.

The second concerns ensuring within-lexeme consistency as well in cases of variability and overabundance (i.e. multiple acceptable forms, see Thornton, 2012). In unstandardized languages like Central Pame, substantial variation exists concerning the concrete form to be used for particular values. Often, variation spans multiple forms in the paradigm. It becomes essential, in these cases, to understand which of these forms belong together from the point of view of paradigmatic structure in the language. Consider, for example, the case of the PST forms of German backen ‘bake’. The verb can be inflected as a strong one (buk bukst buk buken bukt buken) and as a weak one (backte backtest backte backten backtet backten). When describing and analyzing German verbal paradigmatic structure, it is crucial to register either the former set of forms, or the latter (or both), but not to mix the two (e.g. buk backtest backte backten backtet backten), as this would constitute a misrepresentation of the system and lead to an incorrect understanding of predictive relations in the language (1SG.PST and 3SG.PST are always syncretic in German, 1SG.PST and 2SG.PST are always mutually predictable, etc.). While these relations are well-known (and inflection generally simpler) in German and other well-researched languages, it required a great amount of work to ensure consistency in the case of Central Pame: observing trends and exceptions at every morphological layer, checking apparent exceptions through repeated elicitation, checking the (un)grammaticality of potential alternative word forms, etc. After this painstaking cleaning process, however, the documented forms can be taken to constitute a faithful representation of morphological complexity in Pame verbal inflection.

In cases of multi-lexeme overabundance, between-lexeme consistency is also needed to avoid the spurious complexification of the morphological relations. Consider, for example, the Spanish past subjunctive (Hanna, 2012; Rosemeyer & Schwenter, 2019). Every verb allows for two synonymous realizations of this tense (e.g. amaras and amases for the 2SG of ‘love’). Documenting some verbs with the former form (i.e. -ra) and some with the latter (-se) would introduce a spurious inflection class distinction in the language where none exists. For this reason, in cases of multiple well-formed candidates of this type, one variant was selected systematically across verbs too.Footnote 5

The practices and quality controls described in this section are seldom incorporated in the build-up of inflectional databases in underdocumented languages. This is understandable, since it is an extremely time-consuming process and the quantitative analysis of morphological predictability relations tends to be a somewhat niche secondary use of these primary resources. Documentary linguists, hence, cannot generally be expected to have these considerations in mind when they produce an inflectional database of their language of study. The fact remains, however, that, in the absence of these controls, morphological complexity could be systematically overestimated in some languages, particularly in underresearched and unstandardised languages. This has very notable implications for cross-linguistic research on morphological-paradigmatic complexity (Ackerman & Malouf, 2013; Stump & Finkel, 2013; Beniamine, 2018), as well as for the general validity of results and claims regarding the greater morphological complexity of low-contact languages with small speech communities (Kusters, 2003; McWhorter, 2007; Lupyan & Dale, 2010; Trudgill, 2011). Although, as the remainder of this paper shows, very high levels of complexity do hold for Central Pame, it should be carefully explored whether the extraordinary complexity reported for some other underresearched inflectional systems could be to some extent an artefact of variation.

To further enhance the usefulness of this resource for computational morphologists, I have also supplemented it with estimates of the token (usage) frequency of the different lemmas and of the different paradigm cells. Usage frequency, as is well known, varies dramatically both between cells and between lemmas, and this uneven distribution in the input is bound to have a significant impact on which forms can be rote-learned and which predictive relations can(not) play a role in language users’ productive inflectional system. Frequencies were estimated in the following way: I registered the lemma and morpho-syntactic feature value array of every single verb form (1171 in total) in the extant Central Pame corpus (texts in Gibson, 1950a; Gibson et al., 1963; Gibson, 1966, and Hurch, 2022). This corpus is, of course, small (few texts and hence limited thematic diversity) and unbalanced due to the prevalence of narrative over dialogue. To overcome some of these limitations and achieve a more balanced estimate, I asked my two main informants about the frequencyFootnote 6 with which they use the various lemmas in my database (it is well-known that subjective frequencies correlate very strongly with actual frequencies, see Carroll, 1971, Balota et al., 2001) and I averagedFootnote 7 objective (i.e. corpus) and subjective (i.e. speaker-provided) frequencies. With respect to cell frequencies, in addition, I applied a corrective index to raise the frequencies of non-PST tenses and non-3rd persons to compensate for the overrepresentation of narration in the extant corpus. The resulting lemma and cell frequency estimates (summarized in Fig. 1) are likely to be imperfect but are the best I can do given the current state of Central Pame documentation.
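The general shape of this estimation procedure can be illustrated with a hedged Python sketch. Everything specific in it is an assumption made for illustration: the arithmetic mean over relative frequencies, the corrective factor of 1.5, the toy counts, and the function names `relative`, `combine`, and `correct_cells` are all hypothetical and do not reproduce the actual averaging procedure or corrective index used for VeLePa.

```python
def relative(scores):
    """Rescale non-negative scores so that they sum to 1."""
    total = sum(scores.values())
    return {key: value / total for key, value in scores.items()}


def combine(corpus_counts, subjective_scores):
    """Average corpus-based and speaker-provided relative frequencies.

    The plain arithmetic mean is an assumption of this sketch.
    """
    objective = relative(corpus_counts)
    subjective = relative(subjective_scores)
    lemmas = sorted(set(objective) | set(subjective))
    return {
        lemma: (objective.get(lemma, 0.0) + subjective.get(lemma, 0.0)) / 2
        for lemma in lemmas
    }


def correct_cells(cell_freqs, factor=1.5):
    """Raise non-PST and non-3rd-person cells, then renormalize.

    The factor 1.5 is a placeholder, not the corrective index of the paper.
    """
    raised = {}
    for cell, freq in cell_freqs.items():
        weight = 1.0
        if not cell.endswith(".PST"):
            weight *= factor
        if not cell.startswith("3"):
            weight *= factor
        raised[cell] = freq * weight
    return relative(raised)


# Toy input: corpus counts and subjective scores for two lemmas
lemma_freqs = combine({"find": 3, "send": 1}, {"find": 1, "send": 1})
# Toy input: a narrative-heavy corpus overrepresents 3rd person past forms
cell_freqs = correct_cells({"3SG.PST": 0.6, "1SG.PRS": 0.4})
print(lemma_freqs, cell_freqs)
```

The point of the sketch is only that both steps keep the estimates on a common relative scale, so that lemma and cell frequencies remain comparable after correction.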

Fig. 1 Frequencies of lemmas (tokens per million words, top panel) and of cells (proportion of verb tokens, bottom panel)

The inflected lexicon VeLePa is made freely available (CC Attribution 4.0 International) online at https://osf.io/xhyzm/?view_only=3e7ea64cd07c4dd994cd69bac71c9485.

5 Quantitative analysis of Central Pame verb inflection, predictability, and the PCFP

On the basis of this VeLePa inflected lexicon, I will analyze the predictive complexity of Central Pame verbal morphology in this section. The analyses have been conducted with Qumín (Beniamine, 2018) and Inflectional Networks (Sims, 2020). The former is a set of Python scriptsFootnote 8 that extract form-to-form morphological alternations which are then used to calculate Information-Theoretic measures of uncertainty like conditional entropies, to cluster inflection classes, etc. The latter is a set of R scriptsFootnote 9 that calculate Graph-Theoretic measures of the complexity and allomorphic overlap of inflection classes.

Two comments are in order regarding the application of these tools in the present paper. First, while Inflectional Networks is designed to work on what Stump & Finkel (2013) call ‘distinguishers’, i.e. the material that is variable across the paradigm and explicitly distinguishes some inflected forms from others within one lexeme, Qumín is explicitly designed to work on whole unsegmented forms and to infer distinguishers, as it were, from form-to-form morphological contrasts. Although for this paper I will often work with word subparts rather than whole wordforms, I consider this largely unproblematic. This is an analytical choice largely imposed by the nature of the system, and the application of Qumín to subword elements is a usage already envisaged by Beniamine for comparable multilayered inflectional systems (e.g. Navajo, Russian, and Chatino, see Beniamine 2018:106–122).

The second comment concerns the size of the present Pame lexicon. The reliability of the complexity metrics that these or other similar tools generate is, of course, dependent on the amount of data they have available, that is, on the size of the lexicon they are fed. Ideally, and maybe typically, research with these tools involves somewhat larger databases than the one presented in this paper. A smaller database simply means a lower degree of confidence that the measurements reported in this paper are representative of the language as a whole. That said, databases in the low hundreds by number of lexemes are by no means uncommon in the aforementioned quantitative literature (e.g. Kirov et al., 2018), and the curation of this resource specifically for computational-morphological use provides a reasonable degree of confidence that VeLePa does allow meaningful quantitative explorations into the structure of Central Pame verbs. The overall aim of this section will thus be to obtain overviews and measurements of the morphological complexity of each inflectional layer of the language’s verbal inflection, and to discover their broad paradigmatic organizational principles.

5.1 The morphological complexity of prefixes

The 216 verbs in VeLePa classify into 22 different prefixal inflection classes,Footnote 10 whose number of members follows a Zipf (1935) distribution, with a few classes accruing a majority of verbs plus a long tail of low type frequency or singleton classes. It is a quite unusual characteristic of Central Pame that every inflection class has a pattern of syncretisms different from every other class. Looking back at the subparadigms in Table 6, the class of ‘be.busy’ is the only one to syncretize 2SG.PRS with 3PL.PRS, the class of ‘leave.tr’ is the only one to syncretize 2SG.PST and 3SG.PST, etc. This constitutes a notable complication for language learners’ induction of morphomic paradigm cells (i.e. which cells or value contrasts might be associated with morphological differences; see Boyé & Schalchli, 2019).

A second observation that emerges from the quantitative analysis of prefixes is that, although they are the most active of all layers, in that they make the largest number of morphosyntactic distinctions, some feature-values never correspond to different prefixes. The 11 person-number paradigm cells that every tense inflects for are reduced to 8 ‘morphomic cells’ (see Stump & Finkel, 2013, Boyé & Schalchli, 2019) that represent the maximal possibilities for contrast in this morphological layer. As displayed in Table 10, clusivity (i.e. exclusive vs inclusive in first person dual and plural) is never distinguished through prefixal marking. In addition, SG and DU are also always expressed through the same prefixal morphology in the third person.Footnote 11 In some tenses (PST and FUT), 1SG and 1DU are also always identical to each other when it comes to prefixes, as well as the 1DU and 2DU, and 1PL and 2PL in FUT. The result of these systematic syncretisms and predictabilities is 39 areas of perfect prefixal interpredictability (aka ‘distillations’, see Stump & Finkel, 2013) within the 58-cell paradigm of Central Pame. Note that this number, higher than that of any other inflectional system analyzed by Stump and Finkel (2013), and higher than that of the other inflectional layers analyzed throughout this section, is indicative of a very high complexity.

Table 10 Interpredictability areas in Central Pame verb prefixes. Different numbers correspond to different zones of interpredictability (non-contiguous ones are colored)
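The notion of an interpredictability area can be illustrated with a minimal sketch. The Python snippet below uses invented exponents for three hypothetical classes and four cells (it is not VeLePa data, and the names PARADIGMS, determines and distillations are assumptions made for illustration). Two cells are placed in the same area when the exponent in either one fixes the exponent in the other for every class.

```python
# Hypothetical toy data: class -> exponent per cell (not taken from VeLePa).
PARADIGMS = {
    "class_A": {"1SG": "ta", "1DU": "ta", "2SG": "ki", "3SG": "wa"},
    "class_B": {"1SG": "to", "1DU": "to", "2SG": "ki", "3SG": "wa"},
    "class_C": {"1SG": "ti", "1DU": "ti", "2SG": "ni", "3SG": "la"},
}

def determines(src, tgt, paradigms):
    """True if the exponent in cell `src` always fixes the exponent in `tgt`."""
    seen = {}
    for cells in paradigms.values():
        if seen.setdefault(cells[src], cells[tgt]) != cells[tgt]:
            return False
    return True

def distillations(paradigms):
    """Group cells that are mutually predictable across all classes."""
    cells = list(next(iter(paradigms.values())))
    groups = []
    for cell in cells:
        for group in groups:
            # Mutual predictability is an equivalence relation, so comparing
            # against one representative of each group is sufficient.
            if determines(cell, group[0], paradigms) and determines(group[0], cell, paradigms):
                group.append(cell)
                break
        else:
            groups.append([cell])
    return groups

AREAS = distillations(PARADIGMS)
print(AREAS)
```

In this toy system the four cells collapse into two areas: 1SG with 1DU (always syncretic) and 2SG with 3SG (never syncretic, but perfectly interpredictable).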

Between these 39 areas or cells that are not mutually predictable, we can calculate conditional entropies to estimate the predictive uncertainty that Central Pame language users face to predict some prefixes from others. Conditional entropies were calculated by feeding to Qumín the morphology of the relevant layer (the one in Table 6 in this case). Results are shown in Table 11. Averaging across all cells we find the system average to be 0.58 bits. The most difficult predictive relation is guessing the 1.PL.PRS from the 3SG.SUB (1.85 bits), due to the high allomorphic diversity of the former cell (ta-, to-, ti-, ∅-, wa-, see Table 6), which is also the most difficult cell to predict from other cells (0.96). PRS forms are, as Table 11 shows (see distillations 1 through 8), the most difficult to predict overall, along with the 1.Irrealis (i.e. distillations 16–18), 1.SUB (distillations 14–26), and 2.IMP (distillations 17–39). This is largely due to the neutralization of allomorphic distinctions between the two largest inflection classes outside of this domain (see Table 6), which makes it complicated to predict, for example, the right form of the 2SG.PRS kito from a 2SG.PST ni but not vice versa. The 3SG/DU.SUB is, in turn, the easiest form to predict from other cells (0.18). When we change the focus to cells’ usefulness in guessing other forms, the 2SG.IMP is the cell which makes this easiest (0.195), while the 3SG/DU.SUB is the cell which makes this most difficult (1.08). Note that there is a general inverse correlation between the difficulty of predicting a form (which increases, all else being equal, with increasing amounts of allomorphy) and the difficulty of predicting other forms from that form (which decreases with greater allomorphy).

Table 11 Conditional entropies between prefixal domains (darker shade indicates higher values of conditional entropy, Min=0, Max=1.85, Average=0.58, Median=0.53)
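The asymmetry of such predictive relations can be made concrete with a much-simplified calculation. The sketch below is not meant to reproduce Qumín's procedure; it merely computes a type-frequency-based conditional entropy over class-level exponents. The verb counts and the pairing of exponents are invented for illustration, and the name conditional_entropy is an assumption.

```python
from collections import Counter
from math import log2

# Hypothetical verbs: (exponent in cell A, exponent in cell B). Invented data
# in which two classes contrast in cell A but are neutralized in cell B.
VERBS = [("ki", "ni")] * 4 + [("to", "ni")] * 4 + [("la", "ta")] * 2

def conditional_entropy(pairs, given=0, target=1):
    """H(target | given) in bits, weighting every verb equally."""
    joint = Counter((p[given], p[target]) for p in pairs)
    marginal = Counter(p[given] for p in pairs)
    total = len(pairs)
    # Sum of p(g, t) * log2(1 / p(t | g)) over all attested pairs.
    return sum(
        (n / total) * log2(marginal[g] / n)
        for (g, _), n in joint.items()
    )

H_A_GIVEN_B = conditional_entropy(VERBS, given=1, target=0)
H_B_GIVEN_A = conditional_entropy(VERBS, given=0, target=1)
print(H_A_GIVEN_B, H_B_GIVEN_A)
```

With these invented counts, predicting cell A from cell B costs 0.8 bits, whereas predicting cell B from cell A costs 0 bits: the cell with more allomorphy is harder to predict but more useful as a predictor.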

Graph Theory offers a complementary way of exploring inflection-class structure and the morphological overlap between classes (see Sims, 2020). Figure 2A shows precisely this. The size of nodes represents the log type frequency of the class, and the connecting lines represent shared allomorphy, with thicker lines indicating more morphological overlap. The colour of the node indicates ‘betweenness centrality’, i.e. a measure of the neighbourhood density of classes, with darker shades of red corresponding to denser neighbourhoods. Figure 2A, thus, shows the considerable morphological overlap between the largest and second largest prefixal classes (those represented by ‘ask’ and ‘pay’ in Table 6, observe how their past forms no-, ni-, ndo- are identical), as well as the abundance of microclasses surrounding the third (‘run’) and fourth (‘fly’) largest inflection classes.
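The ingredients of such a graph can be sketched as follows. The snippet is only an illustration under stated assumptions: the past exponents no-, ni-, ndo- shared by ‘ask’ and ‘pay’ come from the text, but the remaining exponents, the type frequencies, and the names CLASSES and overlap_graph are invented. Betweenness centrality is not computed here.

```python
from itertools import combinations
from math import log

# Hypothetical classes: exponents per cell plus type frequency (invented).
CLASSES = {
    "ask": {"forms": ("to", "no", "ni", "ndo"), "verbs": 60},
    "pay": {"forms": ("ta", "no", "ni", "ndo"), "verbs": 40},
    "run": {"forms": ("la", "ta", "ki", "wa"), "verbs": 12},
}

def overlap_graph(classes):
    """Return {(class1, class2): number of cells with identical exponents}."""
    edges = {}
    for a, b in combinations(sorted(classes), 2):
        shared = sum(
            x == y for x, y in zip(classes[a]["forms"], classes[b]["forms"])
        )
        if shared:  # classes without shared allomorphy stay unconnected
            edges[(a, b)] = shared
    return edges

# Edge weight = amount of shared allomorphy; node size = log type frequency.
EDGES = overlap_graph(CLASSES)
NODE_SIZE = {name: log(info["verbs"]) for name, info in CLASSES.items()}
print(EDGES)
```

Here only ‘ask’ and ‘pay’ are connected, with a weight of 3 shared cells, while ‘run’ remains isolated.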

Fig. 2A Graph-Theoretic representation of Central Pame prefixal inflection classes (Color figure online)

Figure 2B shows, in turn, a distance matrix between the classes, and a hierarchical clustering of these into macro-classes following the UPGMA method (Sokal & Michener, 1958) first applied to inflectional classification by Bonami (2014), and continued by Beniamine (2018). On it we observe, as in Fig. 2A, the clustering of various classes into two or three larger macro-classes (see especially the larger clusters of morphologically similar micro-classes around ‘fly’ and ‘run’). As expected from the findings of Sims and Parker (2016) following Stump and Finkel’s (2013) Marginal Detraction Hypothesis, it tends to be the smaller classes that have more neighbours/overlaps with other classes. Verbs from small or singleton classes also tend to be more frequent on average than those belonging to larger classes. According to the frequency estimations of VeLePa (see Fig. 1), verbs from classes with 3 or more members have an estimated token frequency around three times lower than those from 2-member or singleton classes (1177 vs 3069 tokens per million words respectively), which probably allows for more of the latter’s word forms to be memorized compared to the former.
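The logic of UPGMA clustering can be sketched in a few lines. The following is a minimal illustration, not the implementation used to produce Fig. 2B: the four classes and their exponents are invented, the distance measure (proportion of cells with different exponents) is an assumption, and the names hamming and upgma are hypothetical.

```python
from itertools import combinations

# Hypothetical classes described by their exponents in four cells (invented).
CLASSES = {
    "A": ("to", "no", "ni", "ndo"),
    "B": ("ta", "no", "ni", "ndo"),
    "C": ("la", "ta", "ki", "wa"),
    "D": ("la", "ta", "ki", "ko"),
}

def hamming(x, y):
    """Proportion of cells in which two classes use different exponents."""
    return sum(a != b for a, b in zip(x, y)) / len(x)

def upgma(classes):
    """Average-linkage agglomerative clustering; returns the merge history."""
    clusters = {(name,): [name] for name in classes}
    merges = []
    while len(clusters) > 1:
        def avg_dist(pair):
            # UPGMA distance: mean over all cross-cluster member pairs.
            left, right = pair
            dists = [hamming(classes[a], classes[b])
                     for a in clusters[left] for b in clusters[right]]
            return sum(dists) / len(dists)
        left, right = min(combinations(sorted(clusters), 2), key=avg_dist)
        merges.append((left, right, avg_dist((left, right))))
        members = clusters.pop(left) + clusters.pop(right)
        clusters[tuple(sorted(members))] = members
    return merges

MERGES = upgma(CLASSES)
for step in MERGES:
    print(step)
```

In this toy example A and B merge first (distance 0.25), then C and D (0.25), and the two resulting macro-classes are joined last at distance 1.0.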

Fig. 2B Hierarchical clustering of prefixal conjugation class by similarity (Color figure online)

5.2 The morphological complexity of stems

Stem morphology conflates morphosyntactic values following a pattern quite similar to prefixal morphology. Taking into account only alternations which are fully morphological and logically orthogonal to prefixal morphology (i.e. excluding stem palatalization and labialization), the domains in Table 12 are the ones which may use different stems. As with prefixes, clusivity values are never distinguished by different morphology within 1DU or 1PL, and SG vs DU values are also never distinguished within 3. In addition to these generalizations shared with prefixes, SG and DU are also never distinguished within 1 or within 2. IRR and SUB tenses are also sometimes not distinguished. The result of these additional conflations is that a lower number of interpredictability domains/distillations is distinguished in stems (29) compared to prefixes (39).

Table 12 Interpredictability areas in Central Pame verb stems

In terms of conditional entropies (see Table 13), the average between these distillations (0.69 bits) is somewhat higher than in prefixes. The most salient trait is the pronounced split between 3PL cells (distillations 6, 12, 18, and 27) and the rest. The stem in the 3PL of any given tense is highly predictive of the 3PL stem in other tenses (entropy consistently < 0.3),Footnote 12 but is much less reliable as the predictor of non-3PL cells (entropy consistently > 1). Non-3PL stems, in turn, are comparatively highly informative of the stem in other non-3PL cells (entropy consistently < 1), with comparatively lower conditional entropy between those paradigmatic domains that tend to share the same stem (e.g. PRS+1.non-PRS, distillations 1–5, 13, 14, 19, 22, 23, see Table 5).

Table 13 Conditional entropies of stems (higher values correspond to darker shades, Min=0.02, Max=1.44, Average=0.69, Median=0.65)

While the number of distillations is lower than in prefixes, the number of inflection classes (93) is much higher, which is understandable given the principally lexical, rather than grammatical role of this layer. Despite this, and even if we ignore phonologically-triggered stem alternations due to palatalization or labialization (see Table 2), the vast majority of verbs in Central Pame (208 out of 216) have some stem alternation, with 3PL alternation being the most common of all (see Table 7). The number of phonotactically permissible stem onsets is comparatively restricted in Central Pame, so the large number of stem-onset classes is generally not due to paradigm-wide differences (e.g. the verb ‘lay down’ has a stem -dd-, while ‘exchange’ has a stem -pp-, ‘loosen’ has a stem -k- or -kh-, and ‘borrow’ has stem -pp-, -w-, -m-, or -b-). These morphologically non-overlapping classes exist (see the isolated classes in Fig. 3A) side-by-side with morphologically overlapping ones. Sometimes the overlap is comparatively small (see ‘borrow’ and ‘exchange’ in Table 7), which is represented by light connecting lines between relatively far-away classes, and sometimes it extends to most of the paradigm (e.g. ‘buy’ and ‘laugh’ in Table 7, which differ only in their 3PL stem), which is represented by the various small and tight clusters of classes in Fig. 3A. It is these overlaps, of course, that generate the morphological predictive uncertainties explored in Table 13, for example the fact that given a stem onset -tt- in the 1SG.PRS (as in ‘buy’ and ‘laugh’), the 3PL.PRS stem could be either -lh- (in the former) or -lʔ- (in the latter).

Fig. 3A Graph-Theoretic representation of Central Pame stem alternation classes (Color figure online)

Figure 3B (Qumín) shows a somewhat different picture than Fig. 3A (Sims, 2020) because of a somewhat different methodology. In Fig. 3B, segments that are shared across the paradigm are subtracted, so that for example ‘lay down’ and ‘exchange’ would be regarded as the same ∅ class (both are nonalternating stems) and only minimally different from ‘loosen’, which would have ∅ everywhere and just add -h- in some cells (in the 3PL). Under this procedure, many classes are based on a basic onset/segment to which other segments (or length) are added in a subset of cells (e.g. -ʔ-∼-tʔ- ‘answer’, -h-∼-lh- ‘be happy’, -k-∼-kh-∼-kː- ‘copy’, -k-∼-kʔ-∼-kː- ‘find’). These classes count as minimally alternating and, after subtracting the shared segment, tend to differ only in the 3PL or some similarly small domain in the paradigm (see ‘call self’ vs ‘be able’ in Table 7). These weakly alternating stems cluster into one macro-class at the top of Fig. 3B, while more radically alternating stem alternation classes (e.g. -pː-∼-w-∼-wː-∼-m-∼-b- ‘borrow’) show larger differences between them.

Fig. 3B Hierarchical clustering of stem alternation classes by similarity (Color figure online)

5.3 The morphological complexity of stress-tone

Stress and tone, intertwined in the language as explained in Sect. 2, constitute yet another morphological device that Central Pame has available to distinguish morphosyntactic values. As with the other morphological layers, I first present the way in which stress-tone splits the paradigm into different interpredictability domains or distillations (see Table 14). As in the previous layer(s), we find the conflation of EX and INC values within 1DU and 1PL, and of SG and DU values across tenses. Further expanding on the domain consolidations shown in Table 12, IRR, SUB and IMP are never distinguished when it comes to stress-tone. We also see some domains which extend across 1 and 3 in PRS and PST. All in all, 19 distillations are found in Central Pame stress-tone, a further reduction from the number of distillations in the previous layer.

Table 14 Interpredictability areas in tone-stress in Central Pame verbs

As in other layers, we explore in Table 15 the difficulty involved in predicting the tone-stress of one cell from that in another. The first interesting observation is that, despite the fact that the number of possible exponents is orders of magnitude lower than in the other morphological layers (only 4 different values of stress-tone are possible), the predictive challenge is not less complex, with the average conditional entropy between distillations (1.12 bits) actually substantially higher than in the other two layers. This is so also despite the fact that predictive uncertainty between SG/DU and PL within a given person-tense (i.e. distillations 6–7, 8–9, 10–11, 12–13, etc.) is very low because only suppletive verbs can differ here (e.g. exit.1SG.PST na-nnéheiŋ vs 1PL.INC na-lhèn, speak.1SG.PST ta-ttæ̃́ʔæ̃ vs 1PL.INC nda-ʔã̂õn, lie.1SG.PST ní-ggyaʔa vs 1PL.INC nda-bbàn, compare to non-suppletive steal.1SG.PST no-ppè vs 1PL.INC no-ppèn, go.1SG.PST ko-wwà vs 1PL.INC ko-wwàn, grind.1SG.PST ta-ndáhao vs 1PL.INC wi-ŋgyáhaon, etc.). Conditional entropies are also high despite the fact that, unlike in the other layers, two thirds of verbs are uninflectable (i.e. are invariable across the paradigm) regarding tone-stress.

Table 15 Conditional entropies of tone-stress (higher values correspond to darker shades, Min=0.04, Max=1.88, Average=1.12, Median=1.25)

Besides the abovementioned similarity of SG/DU and PL, Table 15 shows that there is less uncertainty involved in predicting stress-tone in the first person from the one in the third person or vice versa than in predicting second person stress-tone from that in either of the other persons. This is due to the fact that it is a common feature of many stem alternation patterns that tone-stress changes in 2 (particularly in non-PRS tenses) compared to the (‘default’) tone-stress found elsewhere in the paradigm (see Table 16).

Table 16 Some common patterns of tone-stress alternation in Central Pame

Despite the low number of different possible tone-stress values, only four, 44 different classes are found in VeLePa regarding this morphological trait. It is unsurprising, thus, that these classes have a considerable degree of morphological overlap (as shown for example by the forms in Table 16, note that all the verbs there have a low tone in 1.PST and 3.PST). Figure 4A provides a visual representation of this tonal overlap between classes. Notice that the largest classes (the four largest circles) are among the few which lack overlaps between themselves. This is so because of the invariant but different tone in each of them (see the classes of ‘clean’ [-L], ‘able.to’ [-H], ‘exchange’ [H-] and ‘heal.self’ [-F] in Table 7).Footnote 13 Further clustering into macro-classes is complicated (see Figs. 4A and 4B) due to the nature of the system but generally proceeds by grouping together classes which have the same ‘default’/majority tone (see e.g. how the fifth largest class/circle in 4A, corresponding to the class of ‘change’ in Table 7 is grouped closely with the largest one ‘clean’ because it has a -L tone everywhere except in the 2nd person non-present, which as mentioned above often has a different or reversed tone relative to the rest of the paradigm). Note that, interestingly, the class of ‘heal.self’ (the lighter-coloured big circle in Fig. 4A) only has weak morphological overlaps to other smaller classes because the falling tone never acts in Central Pame as the ‘default’/majority tone of alternating classes, and can only ever occur as the ‘reversed’ tone of 2 or 2.non-PRS (see Table 16).

Fig. 4A Graph-Theoretic representation of Central Pame tone alternation classes (Color figure online)

Fig. 4B Hierarchical clustering of tone alternation classes by similarity (Color figure online)

5.4 The morphological complexity of suffixes

Suffixes are also employed in Central Pame to signal morphosyntactic values. In particular, and unlike the rest of the inflectional layers, they serve to encode S/A number distinctions and clusivity. This can be observed clearly in the paradigmatic distribution of the various distillations (see Table 17), which unlike in the other layers are largely horizontal and therefore tend to collapse cells with the same number value (and generally also the same person value, but see distillations 1, 3, 10 and 11) across tenses.

Table 17 Interpredictability domains as per stem codas/suffixes in Central Pame verbs

‘Cry’ is the only verb with a PST vs rest stem-coda alternation (1SG.PRS la-wài vs 1SG.PST ta-mbàiŋʔ), and ‘leave’ is the only one with a person-based stem-coda alternation (1SG.PRS to-huáiŋ vs 2SG.PRS la-nhâ). Similarly, ‘live/be’ is the only verb with a 1PL.EX in -mʔ where the 1PL.INC has -n. Were it not for these three extremely irregular verbs there would be only 8 distillations in Central Pame suffix codas instead of 14, as 1=10, 3=11, 6=12, 7=13 and 1=6, 10=12, and 4=5. As the conditional entropies in Table 18 suggest, were it not for these two exceptions, distillations 1, 6, 10, and 12 would be merged into a single one. These (unsuffixed) SG cells are the most useful ones to predict the stem codas in other parts of the paradigm (average conditional entropy 0.45). Distillations 5, 4, and 2 are the least informative ones (average conditional entropy around 1.5), because the morphophonological operations described in Sect. 2 obscure the underlying/unsuffixed form of the stem.
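The reduction from 14 to 8 distillations can be checked mechanically by merging the distillations along the equalities listed above. The snippet below is a small sketch using a union-find structure; the function name count_merged is an assumption made for illustration.

```python
# Equalities between suffixal distillations, as listed in the text.
EQUALITIES = [(1, 10), (3, 11), (6, 12), (7, 13), (1, 6), (10, 12), (4, 5)]

def count_merged(n_items, equalities):
    """Count the groups left after merging items 1..n_items along `equalities`."""
    parent = {i: i for i in range(1, n_items + 1)}

    def find(i):
        # Follow parent links to the representative of i's group.
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for a, b in equalities:
        parent[find(a)] = find(b)
    return len({find(i) for i in parent})

REMAINING = count_merged(14, EQUALITIES)
print(REMAINING)  # 8
```

The equalities yield the merged groups {1, 6, 10, 12}, {3, 11}, {7, 13} and {4, 5}, which together with the four untouched distillations (2, 8, 9 and 14) give the 8 distillations mentioned in the text.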

Table 18 Conditional entropies of stem coda (higher values correspond to darker shades, Min=0, Max=2.38, Average=0.66, Median=0.47)

Looking at the ease with which forms are predicted, 2 (i.e. 1.DU.EX) is the most predictable one regarding stem coda (average conditional entropy 0.18), while 9 (i.e. the 3PL) is the least predictable one (entropy 1.61 on average) due to the unpredictable presence of a 3PL suffix -t or -nt, or its absence. 7/13 and 8 (i.e. 2DU and 2PL) are also comparatively difficult to predict because, as explained in Table 9, they are the locus of similar allomorphic complexities. The average conditional entropy of the whole layer is 0.66 bits, virtually identical to that of stem onsets.

The exploration of between-class morphological overlaps in Fig. 5A shows an interesting mix of the strongly-clustered and weakly-clustered structure that we saw in previous morphological layers (see Fig. 2A vs 3A), which is indicative of the mixed lexical-grammatical character of these stem codas. While some verbs contain stem codas which do not overlap with those of other classes (e.g. the -ʎ at the end of ‘bathe’, e.g. 1SG.PRS ti-wyâʔaiʎ, which is preserved under suffixation across all person-number values), those with more common codas and/or those which contain segments also present in suffixes (e.g. -ŋ, -ʔ, -∅, -n) all have strong morphological overlaps with each other as a result of the morphophonological changes discussed around Table 3. In addition, as explained in the discussion around Table 9, suffixal classes in the language consist of minor one-off deviations from the general default set of suffixes (see e.g. ‘deny’, ‘be busy’ and ‘run’ in Table 9, which have the same codas everywhere except in the 2DU: -∅, -k, -k respectively, and in the 3PL: -∅, -∅, -t respectively). These suffixal microclasses are hence only minimally different from each other (i.e. different often in just a single person-number form across tenses) and therefore cluster strongly together in Figs. 5A (see the tight clusters at the center and right) and 5B (top-left and bottom-right).

Fig. 5A Graph-Theoretic representation of Central Pame stem-coda classes (Color figure online)

Fig. 5B Hierarchical clustering of stem-coda classes by similarity (Color figure online)

5.5 Whole-word complexity

As briefly advanced in Sect. 3, there are good reasons to pursue a decompositional layer-by-layer analysis of Central Pame verbal inflection. The first is the ease of separation or segmentation of the morphology of different layers, and also their different albeit sometimes overlapping roles discussed throughout the present Sect. 5. Beyond these, the crucial observation is that, when we explore all the morphology within the word as a single unit, essentially all verbs and cells become unique and entropy reaches unwieldy levels. At the whole-word level, 214 distinct classes are found among the 216 verbs in VeLePa. With the exception of the verbs ‘able.to’ and ‘make.dry’, and ‘call.self’ and ‘understand/listen.each.other/self’, each verb in VeLePa represents an inflectional microclass of its own when whole word forms are explored. Similarly, at the paradigm-cell level, every single cell constitutes a distillation of its own (see Table 19); that is, no cell is perfectly mutually predictable with any other cell in the paradigm.

Table 19 Interpredictability domains as per whole-word morphology in Central Pame verbs

When we explore the whole-word conditional entropies between all these cells (Table 20), these recapitulate the predictive uncertainties of individual layers that were observed in Sects. 5.1 through 5.4. We observe very clearly, for example, the greater ease of predicting a 3PL cell from another 3PL cell, or a non-3PL cell from another non-3PL cell, compared to predicting across these categories. This recapitulates the stem-alternation-related uncertainties discussed around Table 13. Another aspect which clearly emerges in whole-word conditional entropies is the greater ease of predicting between cells with the same person and number value rather than across (notice the diagonal lighter-shade lines in Table 20). This derives from the fact that cells sharing person-number almost invariably share their suffixal morphology, as discussed in Sect. 5.4. Aspects from other layers are visually more difficult to apprehend in Table 20 but are also present. For example, the greater predictability between first and third person cells (e.g. 1 and 9, 1 and 20, 1 and 31) compared to predictions from and to second person (notice the darker vertical and horizontal bands associated with cells 6–8, 17–19, 28–30, 39–41) reflects the structure of inflectional tone-stress in the language discussed around Table 14, by which tone is often the same in 1 and 3 but different in 2.

Table 20 Cond. entropies of whole words (higher values correspond to darker shades, Min=0.04, Max=7.34, Average=4.11, Median=4.14)

At the same time, however, the orthogonality of the subsystems discussed in previous sections means that at the whole-word level, uncertainty increases to extremely high levels (4.11 bits on average) that are arguably incompatible with the Low Conditional Entropy Conjecture of Ackerman and Malouf (2013) and with the system’s naturalistic acquisition by infants. The overall conclusion, thus, is that a look at individual layers (in other words segmentation) is more realistic as a model of how these systems are learned and structured in speakers’ minds, and yields more insights into this system than a strictly whole-word perspective. This conclusion is shared with the previous literature that has looked at bipartite systems like Navajo or distantly-related Chatino verbs (Beniamine et al., 2017; Beniamine, 2018:106–122). In a quadripartite system like the present Pame one, the problems this literature has observed, related mostly to data sparsity, would only be exacerbated. Given the large number of classes or alternations in the different layers (22 prefixal classes, 93 stem classes, 44 tone-stress classes, and 39 suffixal classes in my dataset), and given the orthogonality of these various classes, the number of possible options at the whole-word level is such (22*93*44*39=3.5 million) that any database is guaranteed to miss existent combinations, so the results in Table 20 need to be taken with a grain of salt (it is revealing that 214 distinct classes are found among VeLePa’s 216 verbs at the whole-word level). Similarly, speakers and learners of the language are also virtually guaranteed to have never encountered some existent whole-word-level classes or alternations. Decomposition, in other words solving the PCFP separately in different layers, alleviates this problem.
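The arithmetic behind this sparsity argument can be spelled out in a short sketch. It assumes, as the text does, that class membership in the four layers is fully orthogonal; the variable names are hypothetical.

```python
from math import prod

# Number of classes per layer, as reported in the text.
LAYER_CLASSES = {"prefix": 22, "stem": 93, "tone-stress": 44, "suffix": 39}
N_VERBS = 216

# Upper bound on whole-word classes if the four layers combine freely.
POSSIBLE = prod(LAYER_CLASSES.values())

# Largest share of that space that a 216-verb lexicon could ever attest.
COVERAGE = N_VERBS / POSSIBLE

print(POSSIBLE)            # 3510936
print(f"{COVERAGE:.6%}")
```

Even if every one of the 216 verbs instantiated a different combination, the lexicon would cover well under 0.01% of the roughly 3.5 million logically possible whole-word classes.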

6 Lexeme and value identification problem

The PCFP, i.e. predicting the form of some cells on the basis of other cells, is far from the only challenge that language users face when learning or making use of their inflectional systems. In fact, because it assumes a perfect knowledge of all of the language’s patterns and classes, PCFP-derived measures can sometimes be somewhat counterintuitive. As Boyé and Schalchli (2019) explain, a completely irregular suppletive verb can be taken to be the most predictable one in a language. This is generally because, for (conditional) entropy computations, stable/invariant segments (which are taken to be the lexical root) are discarded to keep variant segments only, so-called ‘distinguishers’ in the terminology of Stump and Finkel (2013), which are then taken to be the relevant grammatical inflections that express morphosyntactic values.

Consider the example in Table 21. In a regular verb like ‘belittle’, most segmental material is invariant, i.e. morphosyntactically uninformative and glossed over (i.e. not part of the ‘distinguisher’). In a suppletive verb like ‘exit’, by contrast, most segments become ‘distinguishers’ due to the use of different roots in different parts of the paradigm. As a result, distinguishers unmistakably (but maybe also trivially) identify the lemma they belong to in these cases, thus affording perfect predictive relations to the other cells.

Table 21 Three full forms and distinguishers from two Central Pame verbs (present tense)
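The effect described here can be reproduced with a deliberately crude segmentation. The sketch below is a simplification of the notion of ‘distinguisher’, not Stump and Finkel's (2013) procedure: it only strips the longest prefix and suffix shared by all forms of a lexeme. The forms are invented, tone and multi-character segments are ignored, and the function name distinguishers is an assumption.

```python
import os

def distinguishers(forms):
    """Strip the longest prefix and suffix shared by all forms of a lexeme."""
    # os.path.commonprefix compares strings character by character.
    prefix = os.path.commonprefix(forms)
    rest = [f[len(prefix):] for f in forms]
    # The common suffix is the common prefix of the reversed remainders.
    suffix = os.path.commonprefix([r[::-1] for r in rest])[::-1]
    return [r[:len(r) - len(suffix)] for r in rest]

# Hypothetical regular verb: most segmental material is invariant.
REGULAR = distinguishers(["lako", "kiko", "wako"])
# Hypothetical suppletive verb: nothing is shared across the paradigm.
SUPPLETIVE = distinguishers(["lane", "kilo", "wapi"])

print(REGULAR)     # ['la', 'ki', 'wa']
print(SUPPLETIVE)  # ['lane', 'kilo', 'wapi']
```

In the regular verb only the short prefixes survive as distinguishers, whereas in the suppletive verb the whole forms do, which is why they trivially identify their lexeme and make every other cell predictable.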

Given a form from a previously unobserved lexeme (e.g. a 1SG láppo), however, the language user has no a priori way to know which of its segments change or stay constant, either across the paradigm, or with respect to another word form. Only if they know that they have observed every possible class and alternation pattern in the language can they make the inference that láppo must behave like lakkò, since its form is compatible with l_ but not with lanné_eiŋ. Needless to say, given the Zipfian nature of linguistic input (see Fig. 1), language learners can hardly ever be sure they have encountered every pattern in the language, particularly in a system like the presently analyzed one.

Besides the PCFP, thus, other measures and challenges of inflectional system use have been proposed in the literature. Erdmann et al. (2020), for example, discuss the Paradigm Identification Problem, i.e. the challenge that language users face to learn which of the words they come across in their input (e.g. tannéhei and tiʎhyèn) belong to the same lexeme, and which words (e.g. takkòn and tiʎhyèn) belong to the same cell. In some languages both problems can be relatively straightforward. Focusing on the latter, i.e. inferring a form’s morphosyntactic value(s), this can be easy, for example, if the sentential context provides robust cues. In German and other non-pro-drop languages, for example, subject pronouns are compulsory, which makes it easier to learn the person and number values associated with every verbal form. Another trait that facilitates inference of which words belong to the same cell is to have a morphological affix or other trait common to all of them (e.g. all 1PL verbal forms in Spanish end in /mos/). The same applies to inferring which forms belong to the same lexeme. If a language has long roots, easily segmentable boundaries between roots and affixes, and little or no suppletion, the word forms themselves will call attention to their lexical-semantic unity.

In Central Pame, however, pronouns are freely omitted, and most of the other facilitating ‘canonical’ (Corbett, 2007) morphological configurations are also not found. Due to inflection class differences, forms expressing the same value often look completely different (e.g. tóppes, lanã́ʔãĩ, tiʔyæ̂, taddoà are all 1SG.PRS forms, tótsʔo, kippyàʔaik, tiʔyæ̂, lawâonʔ, taheóiŋ are all 2SG.PRS forms, nokò, nigyæ̂, tasséheiŋ, nannéheiŋ, kowwà, ndaddyóʔ are all 1SG.PST forms, etc.). The challenge is similarly demanding with respect to lexemes. Due to multiple cases of suppletion and quite radical stem alternations, forms from the same lexeme can look very different (e.g. kiwyài, woŋgèiŋ, tambài, ndakèiŋ all belong to the same verb ‘cry’, lanĩã́, wáʔãõn, tattæ̃́ʔæ̃ all belong to the same verb ‘speak’, and kikkyòn, wattòi, and lhòŋ all belong to the paradigm of ‘protect’, which is not even suppletive). Because of this, it must be a significant challenge in Central Pame to acquire the knowledge of features, values, and lexemes that the PCFP presupposes. In this section I want to briefly assess how challenging different form-to-meaning relations are in the language.

One of the most pervasive characteristics of Central Pame is the fact that lexical and grammatical information are often provided together across the length of the word. In a canonical prefixing language, grammatical information is provided first (i.e. at word beginnings) and lexical information last (i.e. at the end of words). In a canonical suffixing language, lexical information comes first and grammatical information follows at the end. Quantitative measures (conditional entropies, see Table 22) show that neither is the case in Central Pame, where significant amounts of lexical and grammatical information are given at almost every position along the word.

Table 22 Uncertainty (conditional entropy) of predicting meaning from word-form beginnings (left) and from word endings (right). Darker shades signal greater uncertainty
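The kind of measure reported in Table 22 can be sketched as a conditional entropy of meaning given a stretch of the word form. The snippet below is a simplified illustration, not the procedure behind Table 22: the four forms are invented and deliberately instantiate a canonical prefixing system for contrast, characters stand in for segments, and the function name uncertainty is an assumption.

```python
from collections import Counter
from math import log2

# Hypothetical (form, lexeme, cell) triples; forms are invented.
FORMS = [
    ("tako", "A", "1SG"), ("kiko", "A", "2SG"),
    ("tapi", "B", "1SG"), ("kipi", "B", "2SG"),
]

def uncertainty(forms, label, key):
    """H(label | key(form)) in bits; label is 1 (lexeme) or 2 (cell)."""
    joint = Counter((key(f[0]), f[label]) for f in forms)
    marginal = Counter(key(f[0]) for f in forms)
    total = len(forms)
    return sum(
        (n / total) * log2(marginal[k] / n) for (k, _), n in joint.items()
    )

def first_two(word):
    return word[:2]

def last_two(word):
    return word[-2:]

H_CELL_FROM_START = uncertainty(FORMS, 2, first_two)
H_LEXEME_FROM_START = uncertainty(FORMS, 1, first_two)
H_CELL_FROM_END = uncertainty(FORMS, 2, last_two)
H_LEXEME_FROM_END = uncertainty(FORMS, 1, last_two)

print(H_CELL_FROM_START, H_LEXEME_FROM_START)  # 0.0 1.0
print(H_CELL_FROM_END, H_LEXEME_FROM_END)      # 1.0 0.0
```

In this canonical toy system word beginnings fully identify the cell and say nothing about the lexeme, and word endings do the opposite. The Central Pame pattern described in the text departs from this clean division, with both kinds of information spread over almost every position.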

Despite this particularity, the tendency still holds that penultimate and antepenultimate segments provide more lexical information than second and third segments, and, conversely, that the first and second segments provide somewhat more grammatical information than the final or second-to-last segments. Furthermore, as we saw in Sect. 5, word beginnings (i.e. prefixes) are specialized in providing largely person and TAM information, while the stem codas provide number and clusivity information almost exclusively. Despite this tendency, the system is far from being a separative one where different syntagmatic positions are specialized in different agreement features.

Although this exceeds the scope and goals of the present research, it might be interesting to comment on possibilities for further research regarding the syntagmatic expression of different meanings (see also Herce et al., 2023). Given the cross-linguistic rarity of inflectional systems like the Central Pame one (i.e. with infixes, transfixes, circumfixes, distributed exponence, etc.), the question arises whether the separability of lexical and grammatical meaning (or different types of grammatical meaning) might confer processing or learning benefits, much like ease of processing has been claimed to explain the prevalence of suffixation relative to prefixation (Cutler et al., 1985, St. Clair et al., 2009, Berg, 2022). If lexical vs grammatical specialization of morphemes and slots is cognitively advantageous, this would explain why inflectional systems like the Central Pame one are uncommon.

7 Conclusion

This paper has presented a quantitative analysis of Central Pame verbal inflection and is accompanied by a freely-downloadable inflected lexicon, expressly adapted for computational use, with 12528 word forms, representing the complete subject and tense inflectional paradigms of 216 verbs. The first interesting aspect of this inflectional system revolves around its notable complexity, which stems from a large number of inflection classes and irregulars, as well as from morphophonological processes that hamper form-to-form morphological predictive relations from surface forms. The second main interest of this inflectional system concerns multi-layered morphology, as the language makes concurrent use of prefixes, stem alternations, tone-stress contrasts, and suffixes to express the relevant inflectional values: person-number and clusivity of the subject and six TAMs. All of these layers show a high degree of complexity, each being organized into dozens of inflection classes, which raises the question of how language users and learners of Central Pame can master the system. A comparatively large inflected lexicon like the one I present here (VeLePa), which includes cell and lemma frequency estimates, contributes to the documentation of endangered linguistic complexity and has the capacity to stimulate further research into the limits of linguistic cognition, paradigm architecture, the Paradigm Cell Filling Problem, and the related challenges that speakers of many highly-inflecting languages face.

The second part of the paper made use of Information-Theoretic and Graph-Theoretic methods to explore the morphological complexity of the system. For every layer, cell-to-cell predictive relations were explored to identify the main quantitative patterns and generalizations available to speakers, morphological overlaps between classes, their hierarchical clustering, etc. The main finding is that, although notable differences exist between inflectional subcomponents, their complexity is invariably high. An analysis of whole-word predictability uninformed by segmentation, however, would yield levels of complexity which are unreasonably high (i.e. almost every lexeme and cell is morphologically unique, and conditional entropy is through the roof). The conclusion is that decomposition and segmentation are useful to both linguists and language users in multi-layered inflectional systems like the Central Pame one.

Future research could move from paradigmatic predictability to the exploration of syntagmatic and between-layer predictability in the language. How much segmentation helps is heavily dependent on whether the morphology of one layer predicts the morphology of another or whether morphological classification (i.e. inflection class membership) in different layers is mutually predictive. Unlike in paradigmatic predictability, linearization (i.e. the fact that prefixes precede stems, which in turn precede suffixes in speech) might play an important role. Conducting comparative cross-linguistic research on multi-layer inflectional systems emerges as a further promising avenue for future research. The present VeLePa inflected lexicon also offers the possibility to incorporate more and more diversely organized minority languages from culturally diverse societies into Natural Language Processing research, and into resources like Unimorph (McCarthy et al., 2020). On the documentation side, compiling larger and ever-expanding inflected lexicons of endangered languages is ultimately the way to move towards collaborations and cross-fertilization between domains of inquiry that would benefit enormously from paying closer attention to each other.