Realistic data and paradigms: the paradigm cell finding problem


Since Blevins (2006), there has been a shift in morphological frameworks away from what he called a constructive perspective towards an abstractive perspective based on the data directly available to speakers (i.e. whole words).

This evolution towards word-based morphology is part of a more general anticonstructionist movement in the social sciences, characterised by the quote about constructive approaches in (1), cited by Blevins et al. (2016a):

  1. (1)

    The main fallacy in this kind of thinking is that the reductionist hypothesis does not by any means imply a “constructionist” one: The ability to reduce everything to simple fundamental laws does not imply the ability to start from those laws and reconstruct the universe. In fact, the more the elementary particle physicists tell us about the nature of the fundamental laws, the less relevance they seem to have to the very real problems of the rest of science, much less to those of society (Anderson 1972).

In this paper, we elaborate on Blevins (2006) to define a realistic perspective on the use of morphological data, and we illustrate its place in the emergence of both inflectional and derivational paradigms, using French verbs and French Ethnics.

  1. 1.

    “In the ancient model the primary insight is not that words can be split into formatives, but they can be located in paradigms. They are not wholes composed of simple parts, but are themselves the parts within a complex whole.”

  2. 2.

    In theory, realistic data should start with utterances in context but, with morphology in mind, we consider a stage where the speaker is capable of breaking down utterances into words.

  3. 3.

    Paradigm Cell Filling/Finding Problem, see Ackerman et al. (2009) and Sect. 3 for explicit definitions.

  4. 4.

    Even though the rules of referral that Stump (2001) uses alongside his rules of exponence are in fact paradigmatic relations.

  5. 5.

    At this level, Robins (1959) considers ‘Item & Process’ and ‘Item & Arrangement’ as variants.

  6. 6.

    The syntagmatic models cited above use formal paradigms, but to define sets of morphosyntactic cells associated with lexemes rather than to relate word-forms to one another.

  7. 7.

    We consider that the units of the syntactic frameworks and lexeme-based theories should be derived from primary data.

  8. 8.

    In fact, we do not argue that exponents should be excluded from morphological analyses, only that they should be abstracted from the data rather than presupposed.

  9. 9.

    A misleading but practical simplification for the purpose of their demonstration.

  10. 10.

    SdeWaC (Faaß and Eckart 2013) is an 880M word German corpus based on deWaC (Baroni et al. 2009), a 1700M word German corpus constructed from the web.

  11. 11.

    Zipf (1932).

  12. 12.

    See Henri (2010, Chap. 4, 125–150) for a detailed description of Mauritian verbal inflection.

  13. 13.

    The inflected variants counted here are based purely on forms, as SdeWaC gives no information on the morphosyntactic properties of the noun forms. A form occupying several syncretic inflectional cells is counted only once. German noun declension has 8 morphosyntactic cells, but many German nouns do not have 4 variants. The figure is reproduced as is from the article, with a log scale on the y axis.

  14. 14.

    Lexique3 is a French lexicon containing 135,000 word entries (55,000 lemmas) giving the frequencies for every word/lemma in Frantext and in a set of subtitles. See New (2006) for a complete description.

  15. 15.

    FrWaC is a 1600M word French corpus constructed from the web (Baroni et al. 2009).

  16. 16.

    There are three major differences between the accounts: i) Blevins et al. count inflected variants (the same form in 3 cells counts for 1) while Bonami and Beniamine count cell-forms (the same form in 3 cells counts for 3); ii) noun compounding is very productive in German, leading to many more hapax legomena than among French verbs; iii) Bonami and Beniamine restrict their count of verb forms to the 6847 verbs documented in the Lefff lexicon (Sagot 2010), filtering out the huge amount of noise in FrWaC, while Blevins et al. say nothing about filtering.

  17. 17.

    See Sect. 3.3 for an outline of the proposed adaptations.

  18. 18.

    Henri (2010, pp. 139–140) uses the typology of dialogical moves of Henri et al. (2008), which defines the counter-oriented move as follows: the Speaker’s take-up is in opposition to the Addressee’s move or the situation.

  19. 19.

    If the features could combine freely, the MSP would consist of the 8 following cells:

    • ne +, pc +, co +; ne +, pc +, co -; ne +, pc -, co +; ne -, pc +, co +
      ne +, pc -, co -; ne -, pc -, co +; ne -, pc +, co -; ne -, pc -, co -
  20. 20.

    In the Bonami et al. (2011) sample of 2079 verbs, 30% have only one form and 70% have two different forms.

  21. 21.

    Boyé and Schalchli (2016) term this a cell paradigm. Here we adopt the term optimal morphomic paradigm to emphasize the fact that this is a generalization of the morphomic paradigms of lexemes.

  22. 22.

    In line with speakers’ intuitions about the meaning of inflected forms.

  23. 23.

    A sub-paradigm is a part of a paradigm concerned with a subset of inflectional attributes.

  24. 24.

    The 1st person has no high honorific grade and no contrast between low and mid grade.

  25. 25.

    The German adjectives also have a predicative form which we left aside here. It would make for an additional cell containing gut in both the morphosyntactic and the morphomic paradigm.

  26. 26.

    Here, we do not take into account the formal Dativ-e forms which would allow for a fifth form. The OMP calculations would be similar and the resulting OMP would have an additional cell as shown below.

    1. (i)
  27. 27.

    Because of the nature of intersective syncretism, the shadings in this table do not directly correspond to the shadings in Table 19.

  28. 28.

    Like Bonami and Beniamine (2015), we distinguish homophonous forms located in different cells. We refer to these as different cell-forms. We also use co-pair to designate co-forms corresponding to a pair of cells.

  29. 29.

    The method presented here uses only cotextual information but it could be applied in a similar way with contextual information.

  30. 30.

    Overabundance is the opposite of syncretism. It occurs when several inflected forms occupy the same inflectional cell for a given lexeme.

  31. 31.

    Including 845 overabundant forms.

  32. 32.

    We use the GRACE format recommended by Rajman et al. (1997) for the tagging of French conjugation.

  33. 33.

    This is the number of times the cell-form appeared in the sample, different from the Lexique3 frequency and variable from one sample to another.

  34. 34.

    See Fig. 9.

  35. 35.

    While French conjugation distinguishes 51 cells for verbs with a past participle targeted by gender/number agreement (masculine singular, feminine singular, masculine plural, feminine plural), some intransitive verbs possess only the masculine singular form and can present a full paradigm with 48 co-forms.

  36. 36.

    In this case, we could unify C1 and C2 or C2 and C3. When such a situation arises, we arbitrarily choose one of the largest unifiable sets.

  37. 37.

    As noted by Boyé (2016), a large number of co-pairs is also crucial for the emergence of the right predictions between forms.

  38. 38.

    Grevisse and Goosse (2007, §898, 1107–1108).

  39. 39.

    The minimal number of co-pairs necessary to distinguish the 57 inflectional classes defined by Stump and Finkel (2013).

  40. 40.

    The data presented here is deliberately restricted to the citation forms of each lexeme to emphasize the importance of a paradigmatic approach to derivation.

  41. 41.

    Here we do not consider inflection, but even if we did, it would change little in the analysis. Number is neutralized for the nouns and for the adjectives. Both nouns and adjectives can display number contrasts in French (e.g. œil ‘eye’: sg œil vs pl yeux; normal ‘normal’: sg normal vs pl normaux), but none of the Ethnics do. Following Bonami et al. (2004), we consider liaison contrasts between singulars and plurals to be related to differences in their linking elements (Ø vs /z/) rather than in their inflected forms. Gender would only introduce one more cell, as shown below. So, in fact, taking into account only the Ethnic and the country is almost the same as considering the complete set of cell-forms, especially when the relation between cells Sync2 and Sync3 is deemed to be inflectional.

    1. (i)
  42. 42.

    The entry point of all morphological research—morpheme-based as well as lexeme-based—is the word. In the first case, the analysis aims at breaking the word into morphemes and organizing them in a tree-like structure (...) In the second case, the analysis looks to describe the relations of forms and meanings between words.

  43. 43.

    The size of corpora and the number of examples are crucial factors for morphological studies and (...) to quote a well-known motto of corpus linguistics: More data is better data (...). More to the point, the extensive approach of morphology tries to base morphological analyses on the greatest possible number of examples considering that, in fact, the quantity of the examples taken into account directly affects the quality of the analyses (...). The accuracy of the description for the phenomena and the processes observed depends on that quantity. Analyses based on large numbers of examples are more precise and give a better understanding of less common data and more marginal phenomena.

  44. 44.

    Like English, French uses different allomorphs for this suffix (e.g. voir ‘see’, visible ‘visible’ and soudre ‘solve’, soluble ‘solvable’); we included all the allomorphs in our data.
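
The two counting conventions contrasted in footnote 16 (inflected variants vs cell-forms) can be illustrated with a minimal Python sketch. The toy triples, the cell labels and the function names are assumptions made for illustration only; they are not taken from the corpora or lexicons cited in the notes.

```python
from collections import defaultdict

# Hypothetical toy data: (lexeme, cell, form) triples for one French verb.
# Three cells share the same form, as in the syncretism case of footnote 16.
TOY_TRIPLES = [
    ("PARLER", "prs.1sg", "paʁl"),
    ("PARLER", "prs.3sg", "paʁl"),
    ("PARLER", "prs.3pl", "paʁl"),
    ("PARLER", "prs.2pl", "paʁle"),
]


def count_inflected_variants(triples):
    """Count distinct forms per lexeme: a form shared by 3 cells counts for 1."""
    forms = defaultdict(set)
    for lexeme, _cell, form in triples:
        forms[lexeme].add(form)
    return {lexeme: len(fs) for lexeme, fs in forms.items()}


def count_cell_forms(triples):
    """Count distinct (cell, form) pairs per lexeme: a form shared by 3 cells counts for 3."""
    pairs = defaultdict(set)
    for lexeme, cell, form in triples:
        pairs[lexeme].add((cell, form))
    return {lexeme: len(ps) for lexeme, ps in pairs.items()}


variants = count_inflected_variants(TOY_TRIPLES)
cell_forms = count_cell_forms(TOY_TRIPLES)
```

On this toy paradigm, the first convention yields 2 inflected variants and the second yields 4 cell-forms, which is why counts obtained under the two conventions are not directly comparable.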

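The free combination of the three binary features mentioned in footnote 19 can be enumerated mechanically. The sketch below only reuses the feature labels ne, pc and co from that footnote; the function name and the dictionary representation of a cell are assumptions made for illustration.

```python
from itertools import product

# Feature labels as given in footnote 19; each is assumed to be binary (+ or -).
FEATURES = ("ne", "pc", "co")
VALUES = ("+", "-")


def free_combinations(features=FEATURES, values=VALUES):
    """Return every cell obtained by combining the feature values freely."""
    return [dict(zip(features, combo)) for combo in product(values, repeat=len(features))]


cells = free_combinations()
for cell in cells:
    # Print each cell in the notation used in the footnote, e.g. "ne +, pc -, co -".
    print(", ".join(f"{name} {value}" for name, value in cell.items()))
```

Three binary features give 2 × 2 × 2 = 8 cells, matching the count stated in the footnote.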

