The advent and fall of a vocabulary learning bias from communicative efficiency

Biosemiosis is a process of choice-making between simultaneously alternative options. It is well-known that, when sufficiently young children encounter a new word, they tend to interpret it as pointing to a meaning that does not have a word yet in their lexicon rather than to a meaning that already has a word attached. In previous research, the strategy was shown to be optimal from an information theoretic standpoint. In that framework, interpretation is hypothesized to be driven by the minimization of a cost function: the option of least communication cost is chosen. However, the information theoretic model employed in that research neither explains the weakening of that vocabulary learning bias in older children or polylinguals nor reproduces Zipf's meaning-frequency law, namely the non-linear relationship between the number of meanings of a word and its frequency. Here we consider a generalization of the model that is channeled to reproduce that law. The analysis of the new model reveals regions of the phase space where the bias disappears consistently with the weakening or loss of the bias in older children or polylinguals. The model is abstract enough to support future research on other levels of life that are relevant to biosemiotics. In the deep learning era, the model is a transparent low-dimensional tool for future experimental research and illustrates the predictive power of a theoretical framework originally designed to shed light on the origins of Zipf's rank-frequency law.


Introduction
Biosemiotics can be defined as a science of signs in living systems (Kull, 1999, p. 386). Here we join the effort of developing such a science. Focusing on the problem of "learning" new signs, we hope to contribute (i) to place choice at the core of semiotic theory of learning (Kull, 2018) and (ii) to make biosemiotics compatible with the information theoretic perspective that is regarded as currently dominant in physics, chemistry, and molecular biology (Deacon, 2015).
Languages use words to convey information. From a semantic perspective, words stand for meanings (Fromkin et al., 2014). Correlates of word meaning have been investigated in other species (e.g. Hobaiter and Byrne, 2014; Genty and Zuberbühler, 2014; Moore, 2014). From a neurobiological perspective, words can be seen as the counterparts of cell assemblies with distinct cortical topographies (Pulvermuller, 2001; Pulvermüller, 2013). From a formal standpoint, the essence of that research is some binding between a sign or a form, e.g., a word or an ape gesture, and a counterpart, e.g. a 'meaning' or an assembly of cortical cells. Mathematically, that binding can be formalized as a bipartite graph where vertices are forms and their counterparts (Fig. 1). Such an abstract setting allows for a powerful exploration of natural systems across levels of life, from the mapping of animal vocal or gestural behaviors (Fig. 2 (a)) into their "meanings" down to the mapping from codons into amino acids (Fig. 2 (b)), while allowing for a comparison against "artificial" coding systems such as the Morse code (Fig. 2 (c)) or those emerging in artificial naming games (Hurford, 1989; Steels, 1996). In that setting, almost connectedness has been hypothesized to be the mathematical condition required for the emergence of a rudimentary form of syntax and symbolic reference (Ferrer-i-Cancho et al., 2005; Ferrer-i-Cancho, 2006). By symbolic reference, we mean here Deacon's revision of Peirce's view (Deacon, 1997). The almost connectedness condition is met when it is possible to reach practically any other vertex of the network by starting a walk from any possible vertex (as in Fig. 1 (a)-(b) but not in Figs. 1 (c)-(d)).
Since the pioneering research of G. K. Zipf (1949), statistical laws of language have been interpreted as manifestations of the minimization of cognitive costs (Zipf, 1949; Ellis and Hitchcock, 1986; Ferrer-i-Cancho and Díaz-Guilera, 2007; Gustison et al., 2016; Ferrer-i-Cancho et al., 2019). Zipf argued that the law of abbreviation, the tendency of more frequent words to be shorter, resulted from a minimization of a cost function involving, for every word, its frequency, its "mass" and its "distance", which in turn implies the minimization of the size of words (Zipf, 1949, p. 59). Recently, it has been shown mathematically that the minimization of the average of the length of words (the mean code length in the language of information theory) predicts a correlation between frequency and duration that cannot be positive, extending and generalizing previous results from information theory (Ferrer-i-Cancho et al., 2019). The framework addresses the general problem of assigning codes as short as possible to counterparts represented by distinct numbers while warranting certain constraints, e.g., that every number will receive a distinct code (non-singular coding in the language of information theory). If the counterparts are word types from a vocabulary, it predicts the law of abbreviation as it occurs in the vast majority of languages (Bentz and Ferrer-i-Cancho, 2016). If these counterparts are meanings, it predicts that more frequent meanings should tend to be assigned smaller codes (e.g., shorter words) as found in real experiments (Kanwal et al., 2017; Brochhagen, 2021). Table 1 summarizes these and other predictions of compression.

Table 1 The application of the scientific method in quantitative linguistics (italics) with various concrete examples (roman). α is the exponent of Zipf's rank-frequency law (Zipf, 1949). The prediction that is the target of the current article is shown in boldface.

A family of probabilistic models
The bipartite graph of form-counterpart associations is the skeleton (Figs. 1 and 2) on which a family of models of communication has been built (Ferrer-i-Cancho and Díaz-Guilera, 2007; Ferrer-i-Cancho and Vitevitch, 2018). The target of the first of these models (Ferrer-i-Cancho and Sole, 2003) was Zipf's rank-frequency law, which defines the relationship between the frequency of a word f and its rank i, approximately as
$$f \propto i^{-\alpha}.$$
These early models were aimed at shedding light on mainly three questions:
1. The origins of this law (Ferrer-i-Cancho and Sole, 2003; Ferrer-i-Cancho, 2005b).
2. The range of variation of α in human language (Ferrer-i-Cancho, 2005a, 2006).
3. The relationship between α and the syntactic and referential complexity of a communication system (Ferrer-i-Cancho et al., 2005; Ferrer-i-Cancho, 2006).
The main assumption of these models is that word frequency is an epiphenomenon of the structure of the skeleton or the probability of the meanings. Following the metaphor of the skeleton, the models are bodies whose flesh consists of probabilities that are calculated from the skeleton. The first models defined p(s_i|r_j), the probability that a speaker produces s_i given a counterpart r_j, as the same for all words connected to r_j. In the language of mathematics,
$$p(s_i \mid r_j) = \frac{a_{ij}}{\omega_j}, \tag{1}$$
where a_ij is a boolean (0 or 1) that indicates if s_i and r_j are connected and ω_j is the degree of r_j, namely the number of connections of r_j with forms, i.e.
$$\omega_j = \sum_{i} a_{ij}.$$
These models are often portrayed as models of the assignment of meanings to forms (Futrell, 2020; Piantadosi, 2014) but this description falls short because:
- They are indeed models of production as they define the probability of producing a form given some counterparts (as in Eq. 1) or simply the marginal probability of a form. The claim that theories of language production or discourse do not explain the law (Piantadosi, 2014) has no basis and raises the question of which theories of language production are deemed acceptable.
- They are also models of understanding, as they define symmetric conditional probabilities such as p(r_j|s_i), the probability that a listener interprets r_j when receiving s_i.
- The models are flexible. In addition to "meaning", other counterparts were deemed possible from their birth. See for instance the use of the term "stimuli" (e.g. Ferrer-i-Cancho and Díaz-Guilera, 2007) as a replacement for meaning that was borrowed from neurolinguistics (Pulvermuller, 2001).
- The models fit in the distributional semantics framework (Lund and Burgess, 1996) for two reasons: their flexibility, as counterparts can be dimensions in some hidden space, and also because a form is represented as a vector of its joint or conditional probabilities with "counterparts", which is inferred from the network structure, as we have already explained (Ferrer-i-Cancho and Vitevitch, 2018).
Contrary to the conclusions of Piantadosi (2014), there are derivations of Zipf's law that do account for psychological processes of word production, especially the intentionality of choosing words in order to convey a desired meaning.
The family of models assumes that the skeleton that determines all the probabilities, the bipartite graph, is shaped by a combination of the minimization of the entropy (or surprisal) of words (H) and the maximization of the mutual information between words and meanings (I), two principles that are cognitively motivated and that capture the speaker's and the listener's requirements (Ferrer-i-Cancho, 2018). When only the entropy of words is minimized, configurations where only one form is linked, as in Fig. 1 (d), are predicted. When only the mutual information between forms and counterparts is maximized, one-to-one mappings between forms and counterparts are predicted (when the number of forms and counterparts is the same), as in Fig. 1 (c) or Fig. 2 (d). Real language is argued to be in-between these two extreme configurations (Ferrer-i-Cancho and Díaz-Guilera, 2007). Such a trade-off between simplicity (Zipf's unification) and effective communication (Zipf's diversification) is also found in information theoretic models of communication based on the information bottleneck approach (see Zaslavsky et al. (2021) and references therein).
In quantitative linguistics, scientific theory is not possible without taking into consideration language laws (Köhler, 1987; Debowski, 2020). Laws are seen as manifestations of principles (also referred to as "requirements" by Köhler (1987)), which are key components of explanations of linguistic phenomena. As part of the scientific method cycle, novel predictions are a key aim (Altmann, 1993) and key to the validation and refinement of theory (Bunge, 2001). Table 1 synthesizes this general view as chains of the form: laws, principles that are inferred from them, and predictions that are made from those principles, giving concrete examples from previous research.
Although one of the initial goals of the family of models was to shed light on the origins of Zipf's law for word frequencies, a member of the family of models turned out to generate a novel prediction on vocabulary learning in children and the tendency of words to contrast in meaning (Ferrer-i-Cancho, 2017a): when encountering a new word, children tend to infer that it refers to a concept that does not have a word attached to it (Markman and Wachtel, 1988; Merriman and Bowman, 1989; Clark, 1993). The finding is cross-linguistically robust: it has been found in children speaking English (Markman and Wachtel, 1988), Canadian French (Nicoladis and Laurent, 2020), Japanese (Haryu, 1991), Mandarin Chinese (Byers-Heinlein and Werker, 2013; Hung et al., 2015) and Korean (Eun-Nam, 2017). These languages correspond to four distinct linguistic families (Indo-European, Japonic, Sino-Tibetan, Koreanic). Furthermore, the finding has also been replicated in adults (Hendrickson and Perfors, 2019; Yurovsky and Yu, 2008) and other species (Kaminski et al., 2004). This phenomenon is an example of biosemiosis, namely a process of choice-making between simultaneously alternative options (Kull, 2018, p. 454).
Fig. 3 Strategies for linking a new word to a meaning. Strategy a consists of linking a word to a free meaning, namely an unlinked meaning. Strategy b consists of linking a word to a meaning that is already linked. We assume that the meaning that is already linked is connected to a single word of degree µ_k. Two simplifying assumptions are considered. (a) Counterpart degrees do not exceed one, implying µ_k ≥ 1. (b) Vertex degrees do not exceed one, implying µ_k = 1.

As an explanation for vocabulary learning, the information theoretic model suffers from some limitations that motivate the present article. The first one is that the vocabulary learning bias weakens in older children (Kalashnikova et al., 2016; Yildiz, 2020) or in polylinguals (Houston-Price et al., 2010; Kalashnikova et al., 2015), while the current version of the model predicts the vocabulary learning bias only provided that mutual information maximization is not neglected (Ferrer-i-Cancho, 2017a). The second limitation is inherited from the family of models, where the definition of the probabilities over the bipartite graph skeleton leads to a linear relationship between the frequency of a form and its number of counterparts (Ferrer-i-Cancho and Vitevitch, 2018). However, this is inconsistent with Zipf's prediction, namely that the number of meanings µ of a word of frequency f should follow (Zipf, 1945)
$$\mu \propto f^{\delta} \tag{2}$$
with δ = 0.5. Eq. 2 is known as Zipf's meaning-frequency law (Zipf, 1949). To overcome such a limitation, Ferrer-i-Cancho and Vitevitch (2018) proposed different ways of modifying the definition of the probabilities from the skeleton. Here we borrow a proposal of defining the joint probability of a form and its counterpart as
$$p(s_i, r_j) \propto a_{ij} (\mu_i \omega_j)^{\phi}, \tag{3}$$
where φ is a parameter of the model and µ_i and ω_j are, respectively, the degree (number of connections) of the form s_i and the counterpart r_j. Previous research on vocabulary learning in children with these models (Ferrer-i-Cancho, 2017a) assumed φ = 0, which leads to δ = 1 (Ferrer-i-Cancho, 2016b). When φ = 1, the system is channeled to reproduce Zipf's meaning-frequency law, i.e. Eq. 2 with δ = 0.5 (Ferrer-i-Cancho and Vitevitch, 2018).
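The link between φ and δ can be checked numerically on a toy skeleton. The following is a minimal sketch, assuming the joint probability is proportional to a_ij (µ_i ω_j)^φ as described in Section 2; the function names and the toy matrix are hypothetical and only serve as an illustration. When every counterpart has degree one, the marginal probability of a form is proportional to µ_i^(φ+1), so the number of meanings grows as the frequency to the power 1/(φ+1), i.e. δ = 1 for φ = 0 and δ = 0.5 for φ = 1.

```python
def joint_probability(A, phi):
    """Return p(s_i, r_j) proportional to a_ij * (mu_i * omega_j) ** phi (phi >= 0)."""
    mu = [sum(row) for row in A]            # form degrees (row sums)
    omega = [sum(col) for col in zip(*A)]   # counterpart degrees (column sums)
    weights = [[(mu[i] * omega[j]) ** phi if a else 0.0
                for j, a in enumerate(row)]
               for i, row in enumerate(A)]
    total = sum(sum(row) for row in weights)
    return [[w / total for w in row] for row in weights]


def form_probabilities(A, phi):
    """Marginal p(s_i), obtained by summing the joint probability over counterparts."""
    return [sum(row) for row in joint_probability(A, phi)]


# Hypothetical toy skeleton: 3 forms of degree 4, 2 and 1 over 7 counterparts,
# each counterpart linked to exactly one form.
A = [
    [1, 1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 1],
]

for phi in (0, 1):
    p = form_probabilities(A, phi)
    # Probabilities relative to the least connected form.
    print(phi, [round(x / p[-1], 6) for x in p])
```

With degrees 4, 2 and 1, the relative form probabilities are in the ratio 4:2:1 for φ = 0 (linear in the degree) and 16:4:1 for φ = 1 (quadratic in the degree).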

Overview of the present article
It has been argued that there cannot be meaning without interpretation (Eco, 1986). As Kull (2020) puts it, "Interpretation (which is the same as primitive decisionmaking) assumes that there exists a choice between two or more options. The options can be described as different codes applicable simultaneously in the same situation." The main aim of this article is to shed light on the choice between strategy a, i.e. attaching the new form to a counterpart that is unlinked, and strategy b, i.e. attaching the new form to a counterpart that is already linked (Fig. 3). The remainder of the article is organized as follows. Section 2 considers a model of a communication system that has three components:
1. A skeleton that is defined by a binary matrix A that indicates the form-counterpart connections.
2. A flesh that is defined over the skeleton with Eq. 3.
3. A cost function that defines the cost of communication as
$$\Omega(\lambda) = -\lambda I + (1 - \lambda) H, \tag{4}$$
where λ is a parameter that regulates the weight of mutual information (I) maximization and word entropy (H) minimization such that 0 ≤ λ ≤ 1. I and H are inferred from matrix A and Eq. 3 (further details are given in Section 2).
Section 2 also introduces ∆, i.e. the difference in the cost of communication between strategy a and strategy b according to Ω (Fig. 3). ∆ < 0 indicates that the cost of communication of strategy a is lower than that of b. Our main hypothesis is that interpretation is driven by the Ω cost function and that a receiver will choose the option that minimizes the resulting Ω. By doing this, we are challenging the longstanding and limiting belief that information theory is dissociated from semiotics and not concerned about meaning (e.g. Deacon, 2015). This article is just one counterexample (see also Zaslavsky et al. (2018)). Information theory, as any abstract powerful mathematical tool, can serve applications that do not assume meaning (or meaning-making processes), as in the original setting of telecommunication where it was developed by Shannon, as well as others that do, although they were not his primary concern for historical and sociological reasons. In general, the formula of ∆ is complex and the analysis of the conditions where a is advantageous (namely ∆ < 0) requires making some simplifying assumptions. If φ = 0, then one obtains that (Ferrer-i-Cancho, 2017a)
$$\Delta = -\lambda \frac{(\omega_j + 1)\log(\omega_j + 1) - \omega_j \log \omega_j}{M + 1}, \tag{5}$$
where M is the number of edges in the skeleton and ω_j is the degree of the already linked counterpart that is selected in strategy b (Fig. 3). Eq. 5 indicates that strategy a will be advantageous provided that mutual information maximization matters (i.e. λ > 0) and that its advantage will increase as mutual information maximization becomes more important (i.e. for larger λ), as the linked counterpart has more connections (i.e. larger ω_j) or as the skeleton has fewer connections (i.e. smaller M). To be able to analyze the case φ > 0, we will examine two classes of skeleta that are presented next.
Counterpart degrees do not exceed one. In this class, the degrees of counterparts are restricted to not exceed one, namely a counterpart can only be disconnected or connected to just one form. If meanings are taken as counterparts, this class matches the view that "no two words ever have exactly the same meaning" (Fromkin et al., 2014, p. 256), based on the notion of absolute synonymy (Dangli and Abazaj, 2009). This class also mirrors the linguistic principle that any two words should contrast in meaning (Clark, 1987). Alternatively, if synonyms are deemed real to some extent, this class may capture early stages of language development in children or early stages in the evolution of languages where synonyms have not been learned or developed. From a theoretical standpoint, this class is required by the maximization of the mutual information between forms and counterparts when the number of forms does not exceed that of counterparts (Ferrer-i-Cancho and Vitevitch, 2018). We use µ_k to refer to the degree of the word that is connected to the meaning selected in strategy b (Fig. 3). We will show that, in this class, ∆ is determined by λ, φ, µ_k and the degree distribution of forms, namely the vector of form degrees µ = (µ_1, ..., µ_i, ..., µ_n).
Vertex degrees do not exceed one. In this class, the degree of any vertex is restricted to not exceed one, namely a form (or a meaning) can only be disconnected or connected to just one counterpart (just one form). This class is narrower than the previous one because it imposes that degrees do not exceed one both for forms and counterparts. Words lack homonymy (or polysemy). We believe that this class would correspond to even earlier stages of language development in children (where children have learned at most one meaning of a word) or earlier stages in the evolution of languages (where the communication system has not developed any homonymy). From a theoretical standpoint, that class is a requirement of maximizing mutual information between forms and counterparts when n = m (Ferrer-i-Cancho and Vitevitch, 2018). We will show that ∆ is determined just by λ, φ and M, the number of links of the bipartite skeleton.
Notice that meanings with synonyms have been found in chimpanzee gestures (Hobaiter and Byrne, 2014), which suggests that the two classes above do not capture the current state of the development of form-counterpart mappings in adults of other species. Section 2 presents the formulae of ∆ for each class. Section 3 uses these formulae to explore the conditions that determine when strategy a is more advantageous, namely ∆ < 0, for each of the two classes of skeleta above, which correspond to different stages of the development of language in children. While the condition φ = 0 implies that strategy a is always advantageous when λ > 0, we find regions of the space of parameters where this is not the case when φ > 0 and λ > 0. In the more restrictive class, where vertex degrees do not exceed one, we find a region where a is not advantageous when λ is sufficiently small and M is sufficiently large. The size of that region increases as φ increases. From a complementary perspective, we find a region where a is not advantageous (∆ ≥ 0) when λ is sufficiently small and φ is sufficiently large; the size of the region increases as M increases. As M is expected to be larger in older children or in polylinguals (if the forms of each language are mixed in the same skeleton), the model predicts the weakening of the bias in older children and polylinguals (Liittschwager and Markman, 1994; Kalashnikova et al., 2016; Yildiz, 2020; Houston-Price et al., 2010; Kalashnikova et al., 2015, 2019). To ease the exploration of the phase space for the class where the degrees of counterparts do not exceed one, we will assume that word frequencies follow Zipf's rank-frequency law. Again, regions where a is not advantageous (∆ ≥ 0) also appear but the conditions for the emergence of these regions are more complex. Our preliminary analyses suggest that the bias should weaken in older children even for this class.
Section 4 discusses the findings, suggests future research directions and reviews the research program in light of the scientific method.

The mathematical model
Below we give more details about the model that we use to investigate the learning of new words and outline the arguments that lead from Eq. 3 to concrete formulae of ∆. Section 2.1 just presents the concrete formulae of ∆ for each of the two classes of skeleta. Full details are given in Appendix A. The model has four components that we review next.
Skeleton (A = {a_ij}). A bipartite graph that defines the associations between n forms and m counterparts and that is defined by an adjacency matrix A = {a_ij}.
Flesh (p(s_i, r_j)). The flesh consists of a definition of p(s_i, r_j), the joint probability of a form (or word) and a counterpart (or meaning), and a series of probability definitions stemming from it. Probabilities depart from previous work (Ferrer-i-Cancho and Sole, 2003; Ferrer-i-Cancho, 2005b) by the addition of the parameter φ. Eq. 3 defines p(s_i, r_j) as proportional to the product of the degrees of the form and the counterpart to the power of φ, which is a parameter of the model. By normalization, namely
$$\sum_{i=1}^{n} \sum_{j=1}^{m} p(s_i, r_j) = 1,$$
Eq. 3 leads to
$$p(s_i, r_j) = \frac{1}{M_\phi} a_{ij} (\mu_i \omega_j)^{\phi},$$
where
$$M_\phi = \sum_{i=1}^{n} \sum_{j=1}^{m} a_{ij} (\mu_i \omega_j)^{\phi}.$$
From these expressions, the marginal probabilities of a form p(s_i) and a counterpart p(r_j) are obtained easily thanks to
$$p(s_i) = \sum_{j=1}^{m} p(s_i, r_j), \qquad p(r_j) = \sum_{i=1}^{n} p(s_i, r_j).$$
The cost of communication (Ω). The cost function is initially defined in Eq. 4 as in previous research (e.g. Ferrer-i-Cancho and Díaz-Guilera, 2007). In more detail,
$$\Omega(\lambda) = -\lambda I(S, R) + (1 - \lambda) H(S),$$
where I(S, R) is the mutual information between forms from a repertoire S and counterparts from a repertoire R, and H(S) is the entropy (or surprisal) of forms from a repertoire S. Knowing that I(S, R) = H(S) + H(R) − H(S, R) (Cover and Thomas, 2006), the final expression for the cost function in this article is
$$\Omega(\lambda) = (1 - 2\lambda) H(S) - \lambda H(R) + \lambda H(S, R).$$
The entropies H(S), H(R) and H(S, R) are easy to calculate applying the definitions of p(s_i), p(r_j) and p(s_i, r_j), respectively.
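The chain from skeleton to flesh to cost can be sketched in a few lines. The sketch below assumes that the cost is Ω(λ) = −λ I(S, R) + (1 − λ) H(S), which is equivalent to (1 − 2λ) H(S) − λ H(R) + λ H(S, R), and that the joint probability is proportional to a_ij (µ_i ω_j)^φ; the function names and the two toy skeleta are hypothetical. It reproduces the two extreme configurations mentioned in the introduction: a single linked form yields H(S) = 0, and a one-to-one mapping over three forms yields I(S, R) = log 3.

```python
from math import log


def joint_probability(A, phi):
    """p(s_i, r_j) proportional to a_ij * (mu_i * omega_j) ** phi, normalized."""
    mu = [sum(row) for row in A]
    omega = [sum(col) for col in zip(*A)]
    w = [[(mu[i] * omega[j]) ** phi if a else 0.0 for j, a in enumerate(row)]
         for i, row in enumerate(A)]
    total = sum(sum(row) for row in w)
    return [[x / total for x in row] for row in w]


def entropy(probs):
    """Shannon entropy in nats; zero-probability events contribute nothing."""
    return -sum(p * log(p) for p in probs if p > 0)


def entropies(A, phi):
    """Return H(S), H(R) and H(S, R) for skeleton A and parameter phi."""
    P = joint_probability(A, phi)
    H_S = entropy([sum(row) for row in P])        # marginal over forms
    H_R = entropy([sum(col) for col in zip(*P)])  # marginal over counterparts
    H_SR = entropy([p for row in P for p in row])
    return H_S, H_R, H_SR


def cost(A, lam, phi):
    """Omega(lambda) = (1 - 2 lambda) H(S) - lambda H(R) + lambda H(S, R)."""
    H_S, H_R, H_SR = entropies(A, phi)
    return (1 - 2 * lam) * H_S - lam * H_R + lam * H_SR


# Hypothetical extreme skeleta with 3 forms and 3 counterparts.
single_form = [[1, 1, 1], [0, 0, 0], [0, 0, 0]]  # only one form is linked
one_to_one = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]   # one-to-one mapping

print(cost(single_form, 0.0, 0))  # only entropy minimization matters
print(cost(one_to_one, 1.0, 0))   # only mutual information maximization matters
```

Natural logarithms are used here; a different base only rescales Ω and does not change which configuration has the lower cost.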
The difference in the cost of learning a new word (∆). There are two possible strategies to determine the counterpart with which a new form (a previously unlinked form) should connect (Fig. 3):
a. Connect the new form to a counterpart that is not already connected to any other forms.
b. Connect the new form to a counterpart that is connected to at least one other form.
The question we intend to answer is "when does strategy a result in a smaller cost than strategy b?" Or, in the terminology of child language research, "for which strategy is the assumption of mutual exclusivity more advantageous?" To answer these questions, we define ∆ as the difference between the cost of each strategy. More precisely,
$$\Delta(\lambda) = \Omega_a(\lambda) - \Omega_b(\lambda),$$
where Ω_a(λ) and Ω_b(λ) are the new values of Ω when a new link is created using strategy a or b, respectively. Then, our research question becomes "When is ∆ < 0?". Formulae for Ω_a(λ) and Ω_b(λ) are derived in two steps. First, analyzing a general problem, i.e. Ω′, the new value of Ω after producing a single mutation in A (Appendix A.2). Second, deriving expressions for the case where that mutation results from linking a new form (an unlinked form) to a counterpart that can be linked or unlinked (Appendix A.3).
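The definition of ∆ can also be evaluated by brute force, without any closed form: add the new form to the skeleton under each strategy, recompute Ω and subtract. The following is a minimal sketch under the same assumptions as the model description (joint probability proportional to a_ij (µ_i ω_j)^φ, and Ω(λ) = (1 − 2λ) H(S) − λ H(R) + λ H(S, R)); the helper names, the argument names j_free and j_linked, and the toy skeleton are hypothetical. For φ = 0, working through the definitions gives ∆ = −λ [(ω_j + 1) log(ω_j + 1) − ω_j log ω_j] / (M + 1), which is used below as a consistency check and should be compared against Ferrer-i-Cancho (2017a).

```python
from math import log


def cost(A, lam, phi):
    """Omega(lambda) computed from skeleton A (list of 0/1 rows) and phi >= 0."""
    mu = [sum(row) for row in A]
    omega = [sum(col) for col in zip(*A)]
    w = [[(mu[i] * omega[j]) ** phi if a else 0.0 for j, a in enumerate(row)]
         for i, row in enumerate(A)]
    total = sum(sum(row) for row in w)
    P = [[x / total for x in row] for row in w]

    def H(probs):
        return -sum(p * log(p) for p in probs if p > 0)

    H_S = H([sum(row) for row in P])
    H_R = H([sum(col) for col in zip(*P)])
    H_SR = H([p for row in P for p in row])
    return (1 - 2 * lam) * H_S - lam * H_R + lam * H_SR


def delta(A, j_free, j_linked, lam, phi):
    """Omega_a - Omega_b: the new form is linked to counterpart j_free (strategy a)
    or to counterpart j_linked (strategy b)."""
    def with_new_form(j):
        new_row = [1 if col == j else 0 for col in range(len(A[0]))]
        return [list(row) for row in A] + [new_row]

    return cost(with_new_form(j_free), lam, phi) - cost(with_new_form(j_linked), lam, phi)


# Hypothetical skeleton with M = 4 links; counterpart 1 has degree 2, counterpart 3 is free.
A = [
    [1, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
]

lam = 0.4
numeric = delta(A, j_free=3, j_linked=1, lam=lam, phi=0)
closed_form = -lam * (3 * log(3) - 2 * log(2)) / 5  # omega_j = 2, M = 4
print(numeric, closed_form)
```

The same `delta` function accepts φ > 0 and skeleta outside the two classes studied here, which makes it a convenient cross-check for the closed-form expressions.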

∆ in two classes of skeleta
In previous work, the value of ∆ was already calculated for φ = 0, obtaining expressions equivalent to Eq. 5 (see Appendix A.3.1 for a derivation). The next sections just summarize the more complex formulae that are obtained for each class of skeleta for φ ≥ 0 (see Appendix A for details on the derivation).

Vertex degrees do not exceed one
Here forms and counterparts both either have a single connection or are disconnected. Mathematically, this can be expressed as µ_i ∈ {0, 1} for each i such that 1 ≤ i ≤ n and ω_j ∈ {0, 1} for each j such that 1 ≤ j ≤ m. Fig. 3 (b) offers a visual representation of a bipartite graph of this class. In case b, the counterpart we connect the new form to is connected to only one form (ω_j = 1) and that form is connected to only one counterpart (µ_k = 1). Under this class, ∆ takes a closed form that can be rewritten as a linear function of λ. Importantly, notice that this expression of ∆ is determined only by λ, φ and M (the total number of links in the model). See Appendix A.3.3 for thorough derivations.
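To make the dependence on λ, φ and M concrete, the sketch below evaluates ∆ for this class. The closed form in `delta_vertex_class` is re-derived here from the model's definitions (after strategy b, the two links of the shared counterpart have weight 2^φ and the remaining M − 1 links have weight 1); it is not copied from Appendix A.3.3 and should be read as an assumption to be checked against the source. The function names are hypothetical. Because ∆ is linear in λ, the boundary ∆ = 0 is a single value of λ for each pair (φ, M).

```python
from math import log


def delta_vertex_class(lam, phi, M):
    """Delta = Omega_a - Omega_b when all vertex degrees are at most one.

    Re-derived under the stated assumptions: after strategy a all M + 1 links
    have weight 1; after strategy b the normalization is M - 1 + 2 ** (phi + 1).
    """
    M_phi = M - 1 + 2 ** (phi + 1)
    c = 2 ** (phi + 1) * log(2) / M_phi
    return (1 - 2 * lam) * log((M + 1) / M_phi) + c * (phi - lam * (2 * phi + 1))


def lambda_threshold(phi, M):
    """Value of lambda at which Delta = 0, using linearity in lambda.

    Returns None if Delta does not depend on lambda.
    """
    intercept = delta_vertex_class(0.0, phi, M)
    slope = delta_vertex_class(1.0, phi, M) - intercept
    return None if slope == 0 else -intercept / slope


for phi, M in [(0, 100), (1, 10), (1, 100), (2, 100)]:
    print(phi, M, lambda_threshold(phi, M))
```

Under this derivation, strategy b is favored only for λ below the threshold, and the threshold grows with both φ and M, in line with the regions described in Section 3. For φ = 0 the expression reduces to −2λ log 2 / (M + 1), i.e. Eq. 5 with ω_j = 1.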

Counterpart degrees do not exceed one
This class of skeleta is a relaxation of the previous class. Counterparts are either connected to a single form or disconnected. Mathematically, ω_j ∈ {0, 1} for each j such that 1 ≤ j ≤ m. Fig. 3 (a) offers a visual representation of a bipartite graph of this class. The number of forms the counterpart in case b is connected to is still 1 (ω_j = 1) but this form may be connected to any number of counterparts; µ_k has to satisfy 1 ≤ µ_k ≤ m. Under this class, ∆ becomes Eq. 12, which depends on X(S, R) (Eq. 13) and M_φ (Eq. 14) and can also be expressed as a linear function of λ. Being a relaxation of the previous class, the resulting expressions of ∆ are more complex than those of the previous class, which are in turn more complex than those of the case φ = 0 (Eq. 5). See Appendix A.3.2 for further details on the derivation of ∆.
Notice that X(S, R) (Eq. 13) and M_φ (Eq. 14) are determined by the degrees of the forms (µ_i's). To explore the phase space with a realistic distribution of µ_i's, we assume, without any loss of generality, that the µ_i's are sorted decreasingly, i.e. µ_1 ≥ µ_2 ≥ ... ≥ µ_i ≥ µ_{i+1} ≥ ... ≥ µ_n. In addition, we assume
1. µ_n = 0, because we are investigating the problem of linking an unlinked form with counterparts.
2. µ_{n−1} = 1.
3. Form degrees are continuous.
4. The relationship between µ_i and its frequency rank is a right-truncated power law, i.e.
$$\mu_i \propto i^{-\tau} \tag{15}$$
for 1 ≤ i ≤ n − 1.
Appendix B shows that forms then follow Zipf's rank-frequency law, i.e.
$$p(s_i) \propto i^{-\alpha}$$
with α = τ(φ + 1).
The value of ∆ is determined by λ, φ, µ_k and the sequence of degrees of the forms, which we have parameterized with α and n. When τ = α/(φ + 1) = 0, namely when α = 0 or when φ → ∞, we recover the class where vertex degrees do not exceed one but with just one form that is unlinked.
A continuous approximation to the number of edges, M, is given by Eq. 16 (Appendix B). We aim to shed some light on the possible trajectory that children will describe on Fig. 4 as they become older. One expects that M tends to increase as children become older, due to word learning. It is easy to see that Eq. 16 predicts that, if φ and α remain constant, M is expected to increase as n increases (Fig. 4). Besides, when n remains constant, a reduction of α implies a reduction of M when φ = 0 but that effect vanishes for φ > 0 (Fig. 4). Obviously, n tends to increase as a child becomes older (Saxton, 2010) and thus children's trajectory will be from left to right in Fig. 4. As for the temporal evolution of α, there are two possibilities. Zipf's pioneering investigations suggest that α remains close to 1 over time in English children (Zipf, 1949, Chapter IV). In contrast, a wider study reported a tendency of α to decrease over time in sufficiently old children of different languages (Baixeries et al., 2013) but the study did not determine the actual number of children where that trend was statistically significant after controlling for multiple comparisons. Then children, as they become older, are likely to move either from left to right, keeping α constant, or from the upper-left corner (high α, low n) to the bottom-right corner (low α, high n) within each panel of Fig. 4. When φ is sufficiently large, the actual evolution of some children (decrease of α jointly with an increase of n) is dominated by the increase of M that the growth of n implies in the long run (Fig. 4). When exploring the space of parameters, we must ensure that µ_k does not exceed the maximum degree that n, φ and α yield, namely µ_k ≤ µ_1, where µ_1 is defined according to Eq. 15 with i = 1 (Eq. 17).
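The parameterization of the form degrees by α, φ and n can be illustrated as follows. This is a minimal sketch, assuming that the power law takes the form µ_i = c i^(−τ) with τ = α/(φ + 1) and that the constant c is fixed by the assumption µ_{n−1} = 1, which gives µ_i = ((n − 1)/i)^τ and hence µ_1 = (n − 1)^τ. The function names are hypothetical, and the number of edges is obtained by a plain sum over the degrees rather than by the continuous approximation of Appendix B.

```python
def form_degrees(n, alpha, phi):
    """Continuous form degrees mu_1 >= ... >= mu_{n-1} = 1, followed by mu_n = 0.

    Assumes mu_i = ((n - 1) / i) ** tau with tau = alpha / (phi + 1).
    """
    tau = alpha / (phi + 1)
    return [((n - 1) / i) ** tau for i in range(1, n)] + [0.0]


def number_of_edges(n, alpha, phi):
    """Total number of links M as the sum of form degrees (discrete sum)."""
    return sum(form_degrees(n, alpha, phi))


def max_degree(n, alpha, phi):
    """Largest admissible value of mu_k, i.e. mu_1."""
    return form_degrees(n, alpha, phi)[0]


for n in (10, 100, 1000):
    print(n, max_degree(n, 1.0, 1), number_of_edges(n, 1.0, 1))
```

Under these assumptions M grows with n when α and φ are held constant, and any value of µ_k explored in the phase space should stay at or below `max_degree(n, alpha, phi)`.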

Results
Here we will analyze ∆, which takes a negative value when strategy a (linking a new form to an unlinked counterpart) is more advantageous than strategy b (linking a new form to an already connected counterpart), and a positive value otherwise. |∆| indicates the strength of the bias: towards strategy a if ∆ < 0 and towards strategy b if ∆ > 0. Therefore, when ∆ < 0, the smaller the value of ∆, the higher the bias for strategy a whereas when ∆ > 0, the greater the value of ∆, the higher the bias for strategy b. Each class of skeleta is analyzed separately, beginning with the most restrictive class.

Vertex degrees do not exceed one
In this class of skeleta, corresponding to younger children, ∆ depends only on φ, M and λ. We will explore the phase space with the help of two-dimensional heatmaps of ∆ where the x-axis is always λ and the y-axis is M or φ. Figs. 5 and 6 reveal regions where strategy a is more advantageous (red) and regions where b is more advantageous (blue) according to ∆. The extreme situation is found when φ = 0, where a single red region covers practically all space except for λ = 0 (Fig. 5, top-left), as expected from previous work (Ferrer-i-Cancho, 2017a) and Eq. 5. Figs. 7 and 8 summarize these findings, displaying the curve that defines the boundary between strategies a and b (∆ = 0). Figs. 7 and 8 show that strategy b is optimal only if λ is sufficiently low, namely when the weight of entropy minimization is sufficiently high compared to that of mutual information maximization. Fig. 7 shows that the larger the value of λ, the larger the number of links (M) that is required for strategy b to be optimal. Fig. 7 also indicates that the larger the value of φ, the broader the blue region where b is optimal. From a symmetric perspective, Fig. 8 shows that the larger the value of λ, the larger the value of φ that is required for strategy b to be optimal and also that the larger the number of links (M), the broader the blue region where b is optimal.

Counterpart degrees do not exceed one
For this class of skeleta, corresponding to older children, we have assumed that word frequencies follow Zipf's rank-frequency law, namely the relationship between the probability of a form (the number of counterparts connected to each form) and its frequency rank follows a right-truncated power-law with exponent α (Section 2). Then ∆ depends only on α (the exponent of the right-truncated power law), n (the number of forms), µ k (the degree of the form linked to the counterpart in strategy b as shown in Fig. 3), φ and λ. We will explore the phase space with the help of two-dimensional heatmaps of ∆ where the x-axis is always λ and the y-axis is µ k , α or n. While in the class where vertex degrees do not exceed one we have found only one blue region (a region where ∆ > 0 meaning that b is more advantageous), this class yields up to two distinct blue regions located in opposite corners of the heatmap while keeping always a red region as show in Figs. 10, 12 and 14 for φ = 1 from different perspectives. For the sake of brevity, this section only presents heatmaps of ∆ for φ = 0 or φ = 1 (see Appendix C for the remainder). A summary of exploration of the parameter space follows.
Heatmaps of ∆ as a function of λ and µ k . The heatmaps of ∆ for different combinations of parameters in Figs. 9, 10, 16, 17, 18 and 19 are summarized in Fig. 11, showing the frontiers between regions (∆ = 0). Notice how, for φ = 0, strategy a is optimal for all values of λ > 0, as one would expect from Eq. 5. The remainder of the figures show how the shape of the two areas changes with each of the parameters. For small n and α, a single blue region indicates that strategy b is more advantageous than a when λ is closer to 0 and µ k is higher. For higher n or α an additional blue region appears, indicating that strategy b is also optimal for high values of λ and low values of µ k .

Fig. 6 ∆, the difference between the cost of strategy a and strategy b, as a function of φ, the parameter that defines how the flesh of the model is obtained from the skeleton, and λ, the parameter that controls the balance between mutual information maximization and entropy minimization (Eq. 11). Red indicates that strategy a is more advantageous while blue indicates that b is more advantageous. The lighter the red, the stronger the bias for strategy a. The lighter the blue, the stronger the bias for strategy b.
Heatmaps of ∆ as a function of λ and α. The heatmaps of ∆ for different combinations of parameters in Figs. 12, 20, 21, 22 and 23 are summarized in Fig. 13, showing the frontiers between regions. There is a single region where strategy b is optimal for small values of µ k and φ, but for larger values a second blue region appears.
Heatmaps of ∆ as a function of λ and n. The heatmaps of ∆ for different combinations of parameters in Figs. 14, 24, 25, 26 and 27 are summarized in Fig. 15. Again, one or two blue regions appear depending on the combination of parameters. See Appendix D for the impact of using discrete form degrees on the results presented in this section.

[Figure caption, beginning missing: heatmaps of ∆ as a function of λ, the parameter that controls the balance between mutual information maximization and entropy minimization, when the degrees of counterparts do not exceed one (Eq. 11) and φ = 0. Red indicates that strategy a is more advantageous while blue indicates that b is more advantageous. The lighter the red, the stronger the bias for strategy a. The lighter the blue, the stronger the bias for strategy b. Each heatmap corresponds to a distinct combination of n and α. The heatmaps are arranged, from left to right, with α = 0.5, 1, 1.5 and, from top to bottom, with n = 10, 100, 1000 (panels (a) to (i)).]

Fig. 12 ∆, the difference between the cost of strategy a and strategy b, as a function of α, the exponent of the rank-frequency law, and λ, the parameter that controls the balance between mutual information maximization and entropy minimization, when the degrees of counterparts do not exceed one (Eq. 11) and φ = 1. Red indicates that strategy a is more advantageous while blue indicates that b is more advantageous. The lighter the red, the stronger the bias for strategy a. The lighter the blue, the stronger the bias for strategy b. Each heatmap corresponds to a distinct combination of n and µ k . The heatmaps are arranged, from left to right, with n = 10, 100, 1000 and, from top to bottom, with µ k = 1, 2, 4, 8 (panels (a) to (l)). Gray indicates regions where µ k exceeds the maximum degree according to other parameters (Eq. 17).
Fig. 14 ∆, the difference between the cost of strategy a and strategy b, as a function of n, the number of forms, and λ, the parameter that controls the balance between mutual information maximization and entropy minimization, when the degrees of counterparts do not exceed one (Eq. 11) and φ = 1. We are taking values of n from 10 onwards (instead of one onwards) to see more clearly the light regions that are reflected on the color scales. Red indicates that strategy a is more advantageous while blue indicates that b is more advantageous. The lighter the red, the stronger the bias for strategy a. The lighter the blue, the stronger the bias for strategy b.

Vocabulary learning
In previous research with φ = 0, we predicted that the vocabulary learning bias (strategy a) would be present provided that mutual information maximization is not disabled (λ > 0) (Ferrer-i-Cancho, 2017a), as shown in Eq. 5. However, the "decision" on whether to assign a new label to a linked or to an unlinked object is influenced by the age of a child and his/her degree of polylingualism. As for the effect of the latter, polylingual children tend to pick familiar objects more often than monolingual children, violating mutual exclusivity. This has been found for younger children below two years of age (17- , 2015). One possible explanation for this phenomenon is the lexicon structure hypothesis (Byers-Heinlein and Werker, 2013), which suggests that children who already have many multiple-word-to-single-object mappings may be more willing to suspend mutual exclusivity. As for the effect of age on monolingual children, the so-called mutual exclusivity bias has been shown to appear at an early age and, as time goes on, it is more easily suspended. Starting at 17 months old, children tend to look at a novel object rather than a familiar one when presented with a new word, while 16-month-olds do not show a preference (Halberda, 2003). Interestingly, in the same study, 14-month-olds systematically look at a familiar object instead of a newer one. Reliance on mutual exclusivity is shown to improve between 18 and 30 months (Bion et al., 2013). Starting at least at 24 months of age, children may suspend mutual exclusivity to learn a second label for an object (Liittschwager and Markman, 1994). In a more recent study, it has been shown that three-year-old children will suspend mutual exclusivity if there are enough social cues present (Yildiz, 2020). 
Four- to five-year-old children continue to apply mutual exclusivity to learn new words but are able to apply it flexibly, suspending it when given appropriate contextual information (Kalashnikova et al., 2016) in order to associate multiple labels to the same familiar object. As seen before, at 3 years of age both monolingual and polylingual children have a similar willingness to suspend mutual exclusivity (Nicoladis and Laurent, 2020; Frank and Poulin-Dubois, 2002), although polylinguals may still have a greater tendency to accept multiple labels for the same object (Kalashnikova et al., 2015).
Here we have made an important contribution with respect to the precursor of the current model (Ferrer-i-Cancho, 2017a): we have shown that the bias is not theoretically inevitable (even when λ > 0) according to a more realistic model. In a more complex setting, research on deep neural networks has shed light on the architectures, learning biases and pragmatic strategies that are required for the vocabulary learning bias to emerge (e.g. Gandhi and Lake, 2020; Gulordava et al., 2020). In section 3, we have discovered regions of the space of parameters where strategy a is not advantageous for two classes of skeleta. In the more restrictive class, where vertex degrees do not exceed one, as expected in the earliest stages of vocabulary learning in children, we have unveiled the existence of a region of the phase space where strategy a is not advantageous (Figs. 7 and 6). In the broader class of skeleta where the degree of counterparts does not exceed one, we have found up to two distinct regions where a is not advantageous (Figs. 11 and 13). Crucially, our model predicts that the bias should be lost in older children. The argument is as follows. Suppose a child who has not learned a word yet. Then his/her skeleton belongs to the class where vertex degrees do not exceed one. Then suppose that the child learns a new word. It could be that he/she learns it following strategy a or b. If he/she applies b, then the bias is gone at least for this word. Let us suppose that the child learns words adhering to strategy a for as long as possible. By doing this, he/she will increase the number of links (M) of the skeleton while keeping as invariant a one-to-one mapping between words and meanings (Figs. 1 (c) and 2 (d)), which satisfies that vertex degrees do not exceed one. Then Figs. 7 and 8 predict that the longer strategy a is kept (when φ > 0), the larger the region of the phase space where a is not advantageous. 
Namely, as time goes on, it will become increasingly difficult to keep a as the best option. Then it is not surprising that the bias weakens either in older children (e.g., Yildiz, 2020; Kalashnikova et al., 2016), as they are expected to have more links (larger M) because of their continued accretion of new words (Saxton, 2010), or in polylinguals (e.g., Nicoladis and Secco, 2000; Greene et al., 2013), where the mapping of words into meanings, combining all their languages, is expected to yield more links than in monolinguals. Polylinguals make use of code-mixing to compensate for lexical gaps, as reported from one-year-olds onward (Nicoladis and Secco, 2000) as well as in older children (five-year-olds) (Greene et al., 2013). As a result, the bipartite skeleton of a polylingual integrates the words and associations in all the languages spoken and thus polylinguals are expected to have a larger value of M. Children who know more translation equivalents (words from different languages but with the same meaning) adhere to mutual exclusivity less than other children (Byers-Heinlein and Werker, 2013). Therefore, our theoretical framework provides an explanation for the lexicon structure hypothesis (Byers-Heinlein and Werker, 2013), while shedding light on the possible origin of the mechanism, which is not the fact that there are already synonyms but rather the large number of links (Fig. 8) as well as the capacity of words of higher degree to attract more meanings, a consequence of Eq. 3 with φ > 0 in the vocabulary learning process (Fig. 3). Recall the stark contrast between Fig. 10 for φ = 1 and Fig. 9 for φ = 0, where such an attraction effect is missing. Our models offer a transparent theoretical tool to understand the failure of deep neural networks to reproduce the vocabulary learning bias (Gandhi and Lake, 2020): in the simpler form of the model (vertex degrees do not exceed one), one may ask whether the failure is due to an excessive φ (Fig. 7) or an excessive M (Fig. 8).
We have focused on the loss of the bias in older children. However, there is evidence that the bias is missing initially in children, by the age of 14 months (Halberda, 2003). We speculate that this could be related to very young children having lower values of λ or larger values of φ, as suggested by Figs. 7 and 6. This issue should be the subject of future research. Methods to estimate φ and λ in real speakers should be investigated.
Now we turn our attention to skeleta where only the degree of the counterparts does not exceed one, which we believe to be more appropriate for older children. Whereas φ, λ and M sufficed for the exploration of the phase space when vertex degrees do not exceed one, the exploration of that kind of skeleta involved many parameters: φ, λ, n, µ k and α. The more general class exhibits behaviors that we have already seen in the more restrictive class. While an increase in M implies a widening of the region where a is not advantageous in the more restrictive class, the more general class experiences an increase of M when n is increased but α and φ remain constant (Section 2.1.2). Consistently with the more restrictive class, such an increase of M leads to a growth of the regions where a is not advantageous, as can be seen in Figs. 16, 10, 17, 18 and 19 when selecting a column (thus fixing α and φ) and moving from the top to the bottom, increasing n. The challenge is that α may not remain constant in real children as they become older, and how to involve the remainder of the parameters in the argument. In fact, some of these parameters are known to be correlated with child's age:
- n tends to increase over time in children, as children are learning new words over time (Saxton, 2010). We assume that the loss of words can be neglected in children.
- M tends to increase over time in children. In this class of skeleta, the growth of M has two sources: the learning of new words as well as the learning of new meanings for existing words. We assume that the loss of connections can be neglected in children.
- The ambiguity of the words that children learn tends to increase over time (Casas et al., 2018). This does not imply that children are learning all the meanings of the word according to some online dictionary but rather that, as time goes on, children are able to handle words that have more meanings according to adult standards.
- α remains stable over time or tends to decrease over time in children depending on the individual (Baixeries et al., 2013; Zipf, 1949, Chapter IV).
For other parameters, we can just speculate on their evolution with child's age. The growth of M and the increase in the learning of ambiguous words over time lead one to expect that the maximum value of µ k will be larger in older children. It is hard to tell if older children will have a chance to encounter larger values of µ k . We do not know the value of λ in real language but the higher diversity of vocabulary in older children and adults (Baixeries et al., 2013) suggests that λ may tend to increase over time, because the lower the value of λ, the higher the pressure to minimize the entropy of words (Eq. 4), namely the higher the force towards unification in Zipf's view (Zipf, 1949). We do not know the real value of φ but a reasonable choice for adult language is φ = 1 (Ferrer-i-Cancho and Vitevitch, 2018). Given the complexity of the space of parameters in the more general class of skeleta where only the degrees of counterparts cannot exceed one, we cannot make predictions that are as strong as those stemming from the class where vertex degrees cannot exceed one. However, we wish to make some remarks suggesting that a weakening of the vocabulary learning bias is also expected in older children for this class (provided that φ > 0). The combination of increasing n and a value of α that is stable over time suggests a weakening of strategy a over time from different perspectives:
- Children evolve on a column of panels (constant α) of the matrix of panels in Figs. 16, 10, 17, 18 and 19, moving from the top (low n) to the bottom (large n). That trajectory implies an increase of the size of the blue region, where strategy a is not advantageous.
- We do not know the temporal evolution of µ k but once µ k is fixed, namely a row of panels is selected in Figs. 20, 12, 21, 22 and 23, children evolve from left (lower n) to right (higher n), which implies an increase of the size of the blue region where strategy a is not advantageous as children become older.
- Within each panel in Figs. 24, 14, 25, 26 and 27, an increase of n, as a result of vocabulary learning over time, implies a widening of the blue region.
In the preceding analysis we have assumed that α remains stable over time. We wish to speculate on the combination of increasing n and decreasing α as time goes on in certain children. In that case, children would evolve close to the diagonal of the matrix of panels, starting from the upper-right corner (low n, high α, panel (c)) towards the lower-left corner (high n, low α, panel (g)) in Figs. 16, 10, 17, 18 and 19, which implies an increase of the size of the blue region where strategy a is not advantageous. Recall that we have argued that a combined increase of n and decrease of α is likely to lead in the long run to an increase of M (Fig. 4). We suggest that the behavior "along the diagonal" of the matrix is an extension of the weakening of the bias when M is increased in the more restrictive class (Fig. 8).
In our exploration of the phase space for the class of skeleta where the degrees of counterparts do not exceed one, we assumed a right-truncated power-law with two parameters, α and n, as a model for Zipf's rank-frequency law. However, distributions giving a better fit have been considered (Li et al., 2010), and functions (distributions) capturing the shape of the law in what Piotrowski called saturated samples (Piotrowski and Spivak, 2007) should be considered in future research. Our exploration of the phase space was limited by a brute force approach neglecting the negative correlation between n and α that is expected in children where α and time are negatively correlated: as children become older, n increases as a result of word learning (Saxton, 2010) but α decreases (Baixeries et al., 2013). A more powerful exploration of the phase space could be performed with a realistic mathematical relationship for the expected correlation between n and α, which invites empirical research. Finally, there might be deeper and better ways of parameterizing the class of skeleta.

Biosemiotics
Biosemiotics is concerned with building bridges between biology, philosophy, linguistics, and the communication sciences, as announced on the front page of this journal https://www.springer.com/journal/12304. As far as we know, there is little research on the vocabulary learning bias in other species. Its confirmation in a domestic dog suggests that "the perceptual and cognitive mechanisms that may mediate the comprehension of speech were already in place before early humans began to talk" (Kaminski et al., 2004). We hypothesize that the cost function Ω captures the essence of these mechanisms. A promising target for future research is ape gestures, where there has been significant progress recently on their meaning (Hobaiter and Byrne, 2014). As far as we know, there is no research on that bias in other domains that also fall into the scope of biosemiotics, e.g., in unicellular organisms such as bacteria. Our research has established some mathematical foundations for research on the accretion and interpretation of signs across the living world, not only among great apes, a key problem in the research program of biosemiotics (Kull, 2018).
The remainder of the discussion section is devoted to examining general challenges that are shared by biosemiotics and quantitative linguistics, a field that, like biosemiotics, aspires to contribute to developing a science beyond human communication.

Science and its method
It has been argued that a problem of research on the rank-frequency law is "The absence of novel predictions... which has led to a very peculiar situation in the cognitive sciences, where we have a profusion of theories to explain an empirical phenomenon, yet very little attempt to distinguish those theories using scientific methods." (Piantadosi, 2014). As we have already shown the predictive power of a model whose original target was the rank-frequency law, here and in previous research (Ferrer-i-Cancho, 2017a), we take this criticism as an invitation to reflect on science and its method (Altmann, 1993; Bunge, 2001).

The generality of the patterns for theory construction
While in psycholinguistics and the cognitive sciences a major source of evidence is often experiments involving restricted tasks or sophisticated statistical analyses covering a handful of languages (typically English and a few other Indo-European languages), quantitative linguistics aims to build theory departing from statistical laws holding in a typologically wide range of languages (Köhler, 1987; Debowski, 2020), as reflected in Fig. 1. In addition, here we have investigated a specific vocabulary learning phenomenon that is, however, supported cross-linguistically (recall Section 1). A recent review on the efficiency of languages only pays attention to the law of abbreviation (Gibson et al., 2019), in contrast with the body of work that has been developed in the last decades linking laws with optimization principles (Fig. 1), suggesting that this law is the only general pattern of languages that is shaped by efficiency or that linguistic laws are secondary for deep theorizing on efficiency. In other domains of the cognitive sciences, the importance of scaling laws has been recognized (Chater and Brown, 1999; Kello et al., 2010; Baronchelli et al., 2013).

Novel predictions
In section 4.1, we have checked predictions of our information theoretic framework that match knowledge on the vocabulary learning bias from past research. Our theoretical framework allows the researcher to play the game of science in another direction: use the relevant parameters to guide the design of new experiments with children or adults where more detailed predictions of the theoretical framework can be tested. For children who have about the same n and α, and φ = 1, our model predicts that strategy a will be discarded if (Fig. 10)
(1) λ is low and µ k (Fig. 3) is large enough.
(2) λ is high and µ k is sufficiently low.
Interestingly, there is a red horizontal band in Fig. 10, and even for other values of φ such that φ ≠ 1 but keeping φ > 0 (Figs. 16, 17, 18, 19), indicating the existence of some value of µ k or a range of µ k where strategy a is always advantageous (notice, however, that when φ > 1, the band may become too narrow for an integer µ k to fit, as suggested by Figs. 31, 32 and 33 in Appendix D). Therefore the first concrete prediction is that, for a given child, there is likely to be some range or value of µ k where the bias (strategy a) will be observed. The second concrete prediction that can be made is on the conditions where the bias will not be observed. Although the true value of λ is not known yet, previous theoretical research with φ = 0 suggests that λ ≤ 1/2 in real language (Ferrer-i-Cancho and Sole, 2003; Ferrer-i-Cancho, 2005b, 2006, 2005a), which would imply that real speakers should satisfy only (1). Child or adult language researchers may design experiments where µ k is varied. If successful, that would confirm the lexicon structure hypothesis (Byers-Heinlein and Werker, 2013) while providing a deeper understanding. These are just examples of experiments that could be carried out.

Towards a mathematical theory of language efficiency
Our past and current research on efficiency is supported by a cost function and an (analytical or numerical) mathematical procedure that links the minimization of the cost function with the target phenomena, e.g., vocabulary learning, as in research on how pressure for efficiency gives rise to Zipf's rank-frequency law, the law of abbreviation or Menzerath's law (Ferrer-i-Cancho, 2005b; Gustison et al., 2016; Ferrer-i-Cancho et al., 2019). In the cognitive sciences, such a cost function and the mathematical linking argument are sometimes missing (e.g., Piantadosi et al., 2011) and neglected when reviewing how languages are shaped by efficiency (Gibson et al., 2019). A truly quantitative approach in the context of language efficiency is two-fold: it has to comprise both a quantitative description of the data and quantitative theorizing, i.e. it has to employ both statistical methods of analysis and mathematical methods to define the cost and how cost minimization leads to the expected phenomena. Our framework relies on standard information theory (Cover and Thomas, 2006) and its extensions (Ferrer-i-Cancho et al., 2019; Debowski, 2020). The psychological foundations of the information theoretic principles postulated in that framework and the relationships between them have already been reviewed (Ferrer-i-Cancho, 2018). How the so-called noisy-channel "theory" or noisy-channel hypothesis explains the results in (Piantadosi et al., 2011), others reviewed recently (Gibson et al., 2019) or language laws in a broad sense has not yet been shown, to our knowledge, with sufficiently detailed information theoretic arguments. Furthermore, the major conclusions of the statistical analysis of (Piantadosi et al., 2011) have recently been shown to change substantially after improving the methods: effects attributable to plain compression are stronger than previously reported (Meylan and Griffiths, 2021). 
Theory is crucial to reduce false positives and replication failures (Stewart and Plotkin, 2021). In addition, higher order compression can explain more parsimoniously phenomena that are central in noisy-channel "theorizing" (Ferrer-i-Cancho, 2017b).

The trade-off between parsimony and perfect fit.
Our emphasis is on generality and parsimony over perfect fit. Piantadosi (2014) emphasizes what models of Zipf's rank-frequency law apparently do not explain, while our emphasis is on what the models do explain and the many predictions they make (Table 1), in spite of their simple design. It is worth recalling a big lesson from machine learning, i.e. a perfect fit can be obtained simply by overfitting the data, and another big lesson from the philosophy of science to machine learning and AI: sophisticated models (especially deep learning ones) are in most cases black boxes that imitate complex behavior but neither explain nor yield understanding. In our theoretical framework, the principle of contrast (Clark, 1987) or the mutual exclusivity bias (Markman and Wachtel, 1988; Merriman and Bowman, 1989) are not principles per se (or core principles) but predictions of the principle of mutual information maximization involved in explaining the emergence of Zipf's rank-frequency law (Ferrer-i-Cancho and Sole, 2003; Ferrer-i-Cancho, 2005b) and word order patterns (Ferrer-i-Cancho, 2017b). Although there are computational models that are able to account for that vocabulary learning bias and other phenomena (Frank et al., 2009; Gulordava et al., 2020), ours is much simpler, transparent (in opposition to black box modeling) and, to the best of our knowledge, the first to predict that the bias will weaken over time, providing a preliminary understanding of why this could happen.

A The mathematical model in detail
This appendix is organized as follows. Section A.1 details the expressions for probabilities and entropies introduced in Section 2. Section A.2 addresses the general problem of the dynamic calculation of Ω (Eq. 8) when a cell of the adjacency matrix is mutated, deriving the formulae to update these entropies once a single mutation has taken place. Finally, Section A.3 applies these formulae to derive the expressions for ∆ presented in Section 2.1.

A.1 Probabilities and entropies
In section 2, we obtained an expression for the joint probability of a form and a counterpart (Eq. 6) and the corresponding normalization factor, $M_\phi$ (Eq. 7). Notice that $M_0$ is the number of edges of the bipartite graph, i.e. $M = M_0$. To ease the derivation of the marginal probabilities, we define $\mu_{\phi,i}$ and $\omega_{\phi,j}$ (Eqs. 18 and 19). Notice that $\mu_{\phi,i}$ and $\omega_{\phi,j}$ should not be confused with $\mu_i$ and $\omega_j$ (the degree of the form i and of the counterpart j, respectively). Indeed, $\mu_i = \mu_{0,i}$ and $\omega_j = \omega_{0,j}$. From the joint probability (Eq. 6), we obtain the marginal probabilities $p(s_i)$ and $p(r_j)$ (Eqs. 20 and 21). To obtain expressions for the entropies, we use the rule in Eq. 22, which gives H(S, R) from the joint probability (Eq. 6). Applying Eq. 20 and the rule in Eq. 22 gives H(S). By symmetry, equivalent formulae for H(R) can be derived easily using Eq. 21. Interestingly, when φ = 0, the entropies simplify as expected from previous work (Ferrer-i-Cancho, 2005b). Given the formulae for H(S, R), H(S) and H(R) above, the calculation of Ω(λ) (Eq. 9) is straightforward.
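The definitions and entropies of this section can be written out as follows. This is a reconstruction under an assumption, not a transcription: it takes the joint probability of Eq. 6 to be $p(s_i, r_j) = a_{ij}(\mu_i\omega_j)^\phi / M_\phi$ with $M_\phi = \sum_{i,j} a_{ij}(\mu_i\omega_j)^\phi$ (Eq. 7), and the equation numbers are inferred from the cross-references in the text. Here $E$ denotes the set of edges, n the number of forms and m the number of counterparts.

```latex
\begin{align*}
\mu_{\phi,i} &= \sum_{j=1}^{m} a_{ij}\,\omega_j^{\phi}, &
\omega_{\phi,j} &= \sum_{i=1}^{n} a_{ij}\,\mu_i^{\phi}
  && \text{(18, 19)} \\
p(s_i) &= \frac{\mu_i^{\phi}\,\mu_{\phi,i}}{M_\phi}, &
p(r_j) &= \frac{\omega_j^{\phi}\,\omega_{\phi,j}}{M_\phi}
  && \text{(20, 21)}
\end{align*}
% Rule for the entropy of a distribution x_i / T with T = \sum_i x_i (22):
\[
-\sum_i \frac{x_i}{T}\log\frac{x_i}{T} = \log T - \frac{1}{T}\sum_i x_i \log x_i
\]
\begin{align*}
H(S,R) &= \log M_\phi - \frac{\phi}{M_\phi}\sum_{(i,j)\in E} (\mu_i\omega_j)^{\phi}\log(\mu_i\omega_j) \\
H(S)   &= \log M_\phi - \frac{1}{M_\phi}\sum_{i=1}^{n} \mu_i^{\phi}\mu_{\phi,i}\log\!\left(\mu_i^{\phi}\mu_{\phi,i}\right) \\
H(R)   &= \log M_\phi - \frac{1}{M_\phi}\sum_{j=1}^{m} \omega_j^{\phi}\omega_{\phi,j}\log\!\left(\omega_j^{\phi}\omega_{\phi,j}\right)
\end{align*}
% Special case \phi = 0, where M_0 = M, \mu_{0,i} = \mu_i and \omega_{0,j} = \omega_j:
\begin{align*}
H(S,R) &= \log M, &
H(S) &= \log M - \frac{1}{M}\sum_{i=1}^{n}\mu_i\log\mu_i, &
H(R) &= \log M - \frac{1}{M}\sum_{j=1}^{m}\omega_j\log\omega_j
\end{align*}
```

With these forms, setting φ = 0 in the definitions recovers the degrees, consistent with the remark that $\mu_i = \mu_{0,i}$ and $\omega_j = \omega_{0,j}$.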

A.2 Change in entropies after a single mutation in the adjacency matrix
Here we investigate a general problem: the change in the entropies needed to calculate Ω when there is a single mutation in the cell (i, j) of the adjacency matrix, i.e. when a link between a form i and a counterpart j is added ($a_{ij}$ becomes 1) or deleted ($a_{ij}$ becomes 0). The goal of this analysis is to provide the mathematical foundations for research on the evolution of communication and, in particular, the problem of learning a new word, i.e. linking a form that was previously unlinked (Appendix A.3), which is a particular case of mutation where $a_{ij} = 0$ and $\mu_i = 0$ before the mutation ($a_{ij} = 1$ and $\mu_i = 1$ after the mutation). Firstly, we express the entropies compactly (Eqs. 23, 24 and 25) with $X(S, R) = \sum_{(i,j) \in E} x(s_i, r_j)$, where $x(s_i, r_j)$, $x(s_i)$ and $x(r_j)$ are defined in Eqs. 27, 28 and 29; for instance, $x(r_j) = \omega_j^\phi \omega_{\phi,j} \log(\omega_j^\phi \omega_{\phi,j})$.
We will use a prime mark to indicate the new value of a certain measure once a mutation has been produced in the adjacency matrix. Suppose that $a_{ij}$ mutates. Then the degrees of $s_i$ and $r_j$ become $\mu'_i$ and $\omega'_j$ (Eqs. 30 and 31). We define $\Gamma_S(i)$ as the set of neighbors of $s_i$ in the graph and, similarly, $\Gamma_R(j)$ as the set of neighbors of $r_j$ in the graph. Then $\mu_{\phi,k}$ can only change if $k = i$ or $k \in \Gamma_R(j)$ (recall Eq. 18) and $\omega_{\phi,l}$ can only change if $l = j$ or $l \in \Gamma_S(i)$ (Eq. 19). Then, for any k such that $1 \leq k \leq n$, $\mu'_{\phi,k}$ is given by Eq. 32. Likewise, for any l such that $1 \leq l \leq m$, $\omega'_{\phi,l}$ is given by Eq. 33. We then aim to calculate $M'_\phi$ and $X'(S, R)$ from $M_\phi$ and $X(S, R)$ (Eq. 7 and Eq. 23), respectively. Accordingly, we focus on the pairs $(s_k, r_l)$, shortly $(k, l)$, such that $\mu'_k \omega'_l = \mu_k \omega_l$ may not hold. These pairs belong to $E(i, j) \cup \{(i, j)\}$, where $E(i, j)$ is the set of edges having $s_i$ or $r_j$ at one of the ends. That is, $E(i, j)$ is the set of edges of the form $(i, l)$ where $l \in \Gamma_S(i)$ or $(k, j)$ where $k \in \Gamma_R(j)$. Then the new value of $M_\phi$, $M'_\phi$, is given by Eq. 34. Similarly, the new value of $X(S, R)$, $X'(S, R)$, is given by Eq. 35, where $x'(s_i, r_j)$ can be obtained by applying $\mu'_i$ and $\omega'_j$ (Eqs. 30 and 31) to $x(s_i, r_j)$ (Eq. 27). The value of $H'(S, R)$ is then obtained by applying $M'_\phi$ (Eq. 34) and $X'(S, R)$ (Eq. 35) to $H(S, R)$ (Eq. 23).
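As for $H'(S)$, notice that $x'(s_k)$ can only differ from $x(s_k)$ if $\mu_k$ or $\mu_{\phi,k}$ change, namely when $k = i$ or $k \in \Gamma_R(j)$. Therefore $X'(S)$ is given by Eq. 36 and, similarly, $X'(R)$ is given by Eq. 37. The value of $x'(s_i)$ can be obtained by applying $\mu'_i$ (Eq. 30) and $\mu'_{\phi,i}$ (Eq. 32) to $x(s_i)$ (Eq. 28).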
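The locality of the update can be illustrated with a small sketch for the normalization factor. The Python code below is only an illustration of the idea that, after adding a link (i, j), the only terms of M_φ that can change are those of links sharing form i or counterpart j. It assumes that M_φ is the sum of (µ_k ω_l)^φ over all links (a reading of Eq. 7); the function names are hypothetical and the degrees are recomputed from scratch for simplicity, whereas a real implementation would store them.

```python
def degrees(links):
    """Degrees of forms (mu) and counterparts (omega) for a set of (form, counterpart) links."""
    mu, omega = {}, {}
    for k, l in links:
        mu[k] = mu.get(k, 0) + 1
        omega[l] = omega.get(l, 0) + 1
    return mu, omega

def m_phi(links, phi):
    """Normalization factor computed from scratch (assumed form of Eq. 7)."""
    mu, omega = degrees(links)
    return sum((mu[k] * omega[l]) ** phi for k, l in links)

def m_phi_after_adding(links, m_old, i, j, phi):
    """Update M_phi after adding the link (i, j), which must not be in `links`.

    Only the links in E(i, j), i.e. those with form i or counterpart j at one
    end, contribute terms that change.
    """
    mu, omega = degrees(links)
    affected = [(k, l) for k, l in links if k == i or l == j]  # E(i, j)
    old_terms = sum((mu[k] * omega[l]) ** phi for k, l in affected)
    mu[i] = mu.get(i, 0) + 1        # new degree of form i
    omega[j] = omega.get(j, 0) + 1  # new degree of counterpart j
    new_terms = sum((mu[k] * omega[l]) ** phi for k, l in affected + [(i, j)])
    return m_old - old_terms + new_terms

links = {(0, 0), (0, 1), (1, 1), (2, 2)}
updated = m_phi_after_adding(links, m_phi(links, 1), i=3, j=1, phi=1)
print(updated, m_phi(links | {(3, 1)}, 1))  # both computations should agree
```

The same pattern (subtract the old terms of the affected links, add their new terms plus the term of the new link) would apply to X(S, R), with the summand replaced by x(s_k, r_l).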

A.3 Derivation of ∆
Following from the previous sections, we set off to obtain expressions for ∆ for each of the skeleton classes we set out to study. As before, we denote the value of a variable after applying either strategy with a prime mark, meaning that it is a modified value after a mutation in the adjacency matrix. We also use a subindex a or b to indicate the vocabulary learning strategy corresponding to the mutation. A value without prime mark then denotes the state of that variable before applying either strategy. Firstly, we aim to obtain an expression for ∆ that depends on the new values of the entropies after either strategy a or b has been chosen. Combining ∆(λ) (Eq. 10) with Ω(λ) (Eq. 9), one obtains ∆ as a function of the entropies after each strategy. The application of H(S, R) (Eq. 23), H(S) (Eq. 24) and H(R) (Eq. 25) then yields Eq. 38, which expresses ∆ in terms of $M'_\phi$, $X'(S, R)$, $X'(S)$ and $X'(R)$ for each strategy. To obtain generic expressions for $M'_\phi$, $X'(S, R)$, $X'(S)$ and $X'(R)$ via Eqs. 34, 35, 36 and 37, we define mathematically the state of the bipartite matrix before and after applying either strategy a or b with the following restrictions:
- $a_{ij,a} = a_{ij,b} = 0$. Form i and counterpart j are initially unconnected.
- $\mu'_{i,a} = \mu'_{i,b} = 1$. Form i will have one connection afterwards.
- $\omega_{j,a} = 0$. In case a, counterpart j is initially disconnected.
- $\omega_{j,b} = \omega_j > 0$. In case b, counterpart j has initially at least one connection.
- $\omega'_{j,a} = 1$. In case a, counterpart j will have one connection afterwards.
- $\omega'_{j,b} = \omega_j + 1$. In case b, counterpart j will have one more connection afterwards.
- $\Gamma_{S,a}(i) = \Gamma_{S,b}(i) = \emptyset$. Form i has initially no neighbors.
- $\Gamma_{R,a}(j) = \emptyset$. In case a, counterpart j has initially no neighbors.
- $E_a(i, j) = \emptyset$. In case a, there are no links with i or j at one of their ends.
- $E_b(i, j) = \{(k, j) \mid k \in \Gamma_R(j)\}$. In case b, there are no links with i at one of their ends, only with j.
We can apply these restrictions to $x(s_i, r_j)$, $x(s_i)$ and $x(r_j)$ (Eqs. 27, 28 and 29) to obtain expressions of $x'_a(s_i)$, $x'_b(s_i)$, $x'_b(r_j)$ and $x'_b(s_i, r_j)$ that depend only on the initial values of $\omega_j$ and $\omega_{\phi,j}$ (Eqs. 39 to 44), among them

$$x'_a(r_j) = 0 \qquad (40)$$

$$x'_a(s_i, r_j) = 0 \qquad (43)$$

$$x'_b(s_i, r_j) = (\omega_j + 1)^\phi \log(\omega_j + 1). \qquad (44)$$
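As an illustration, Eqs. 40, 43 and 44 can be verified on a toy graph. This is a sketch under stated assumptions: the definitions $x(s_i, r_j) = a_{ij}(\mu_i\omega_j)^\phi \log(\mu_i\omega_j)$ and $x(r_j) = \omega_j^\phi\omega_{\phi,j}\log(\omega_j^\phi\omega_{\phi,j})$ are assumed forms of Eqs. 27 and 29 (not shown in this appendix), $0 \log 0$ is taken as $0$, and all function names are hypothetical.

```python
from math import log, isclose


def x_joint(a, phi, i, j):
    """Assumed form of x(s_i, r_j): a_ij (mu_i omega_j)^phi log(mu_i omega_j)."""
    if not a[i][j]:
        return 0.0
    d = sum(a[i]) * sum(row[j] for row in a)  # mu_i * omega_j
    return d ** phi * log(d)


def x_counterpart(a, phi, j):
    """Assumed form of x(r_j), with the convention 0 log 0 = 0."""
    omega_j = sum(row[j] for row in a)
    if omega_j == 0:
        return 0.0
    omega_phi_j = sum(sum(row) ** phi for row in a if row[j])
    p = omega_j ** phi * omega_phi_j
    return p * log(p)


def link(a, i, j):
    """Return a copy of the adjacency matrix with form i linked to counterpart j."""
    b = [row[:] for row in a]
    b[i][j] = 1
    return b


# Toy graph: forms 0 and 1 are linked to counterpart 0 (omega_0 = 2);
# form 2 and counterparts 1 and 2 are unlinked.
a = [[1, 0, 0],
     [1, 0, 0],
     [0, 0, 0]]
phi = 1.5
after_a = link(a, 2, 1)  # strategy a: new form -> unlinked counterpart
after_b = link(a, 2, 0)  # strategy b: new form -> linked counterpart

x_a_r = x_counterpart(after_a, phi, 1)  # Eq. 40
x_a_sr = x_joint(after_a, phi, 2, 1)    # Eq. 43
x_b_sr = x_joint(after_b, phi, 2, 0)    # Eq. 44
expected_b = (2 + 1) ** phi * log(2 + 1)
```

With these definitions, `x_a_r` and `x_a_sr` vanish because every degree involved equals one, and `x_b_sr` agrees with `expected_b` (which `isclose` can confirm), that is, with the right-hand side of Eq. 44 for $\omega_j = 2$.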
Additionally, for any form $s_k$ such that $k \in \Gamma_{R,b}(j)$ (that is, for every form that counterpart $j$ is connected to), we can also obtain expressions that depend only on the initial values of $\omega_j$, $\omega_{\phi,j}$, $\mu_k$ and $\mu_{\phi,k}$ using the same restrictions and equations (Eqs. 45, 46 and 47). Applying the restrictions to $M'_\phi$ (Eq. 34), we also obtain expressions that depend only on some initial values (Eqs. 48 and 49). Applying now the expressions for $x'_a(s_i, r_j)$ (Eq. 43), $x'_b(s_i, r_j)$ (Eq. 44), and $x_b(s_k, r_j)$ and $x'_b(s_k, r_j)$ (Eqs. 45 and 46) to $X'(S, R)$ (Eq. 35), along with the restrictions, we obtain $X'_a(S, R)$ and $X'_b(S, R)$ (Eqs. 50 and 51). Similarly, we apply $x'_a(s_i)$ (Eq. 39), $x'_b(s_i)$ (Eq. 41) and $x'_b(s_k)$ (Eq. 47) to $X'(S)$ (Eq. 36) as well as the restrictions and obtain $X'_a(S) = X(S)$ (Eq. 52) and $X'_b(S)$ (Eq. 53), which adds to $X(S)$ the term $\phi(\omega_j + 1)^\phi \log(\omega_j + 1)$ contributed by form $i$ and the changes in $x(s_k)$ for $k \in \Gamma_R(j)$. We apply $x'_a(r_j)$ (Eq. 40) and $x'_b(r_j)$ (Eq. 42) to $X'(R)$ (Eq. 37) along with the restrictions and obtain $X'_a(R)$ and $X'_b(R)$ (Eqs. 54 and 55). At this point we could attempt to build an expression for $\Delta$ for the most general case. However, this expression would be extremely complex. Instead, we study the expression of $\Delta$ in three simplifying conditions: the case $\phi = 0$ and the two classes of skeleta.
A.3.1 The case φ = 0
The condition $\phi = 0$ corresponds to a model that is a precursor of the current model (Ferrer-i-Cancho, 2017a), and that we use to ensure that our general expressions are correct. We apply $\phi = 0$ to the expressions in Section A.3. $M'_a$ and $M'_b$ (Eqs. 48 and 49) both simplify to Eq. 56. $X'_a(S, R)$ and $X'_b(S, R)$ (Eqs. 50 and 51) simplify to Eqs. 57 and 58. $X'_a(S)$ and $X'_b(S)$ (Eqs. 52 and 53) both simplify to Eq. 59. $X'_a(R)$ and $X'_b(R)$ (Eqs. 54 and 55) simplify to Eqs. 60 and 61. The application of Eqs. 56, 57, 58, 59, 60 and 61 into the expression of $\Delta$ (Eq. 38) results in the expression for $\Delta$ (Eq. 5) presented in Section 1.

A.3.2 Counterpart degrees do not exceed one
In this case we assume that $\omega_j \in \{0, 1\}$ for every $r_j$ and further simplify the expressions from A.3 under this assumption. This is the most relaxed of the conditions and so these expressions remain fairly complex. $M'_{\phi,a}$ and $M'_{\phi,b}$ (Eqs. 48 and 49) simplify to Eqs. 62 and 63. $X'_a(S, R)$ and $X'_b(S, R)$ (Eqs. 50 and 51) simplify to $X'_a(S, R) = X(S, R)$ (Eq. 64) and Eq. 65. $X'_a(S)$ and $X'_b(S)$ (Eqs. 52 and 53) simplify to Eqs. 67 and 68, and $X'_a(R)$ and $X'_b(R)$ (Eqs. 54 and 55) simplify to Eqs. 69 and 70.
The result on $X(R)$ deserves a brief explanation as it is not straightforward. Firstly, we apply the definition of $x(r_j)$ (Eq. 29) to that of $X(R)$ (Eq. 26). As the degree of every connected counterpart is one, $\omega_j = 1$ and $\omega_{\phi,j} = \mu^\phi_{i \leftarrow j}$, where $i \leftarrow j$ indicates that we refer to the form $i$ that counterpart $j$ is connected to (see Eq. 19). This turns $X(R)$ into a summation over counterparts of a function of $\mu_{i \leftarrow j}$ only. In order to change the summation over each $j$ (every counterpart) to a summation over each $i$ (every form), we must take into account that, when summing over $j$, each form $i$ is accounted for a total of $\mu_i$ times. Therefore, we need to multiply by $\mu_i$ for the summations to be equivalent, as otherwise we would be accounting for each form $i$ only once. This leads to a summation over forms and eventually to Eq. 71 thanks to Eq. 66. The application of Eqs. 62, 63, 64, 65, 67, 68, 69 and 70 into the expression of $\Delta$ (Eq. 38) results in the expression for $\Delta$ (Eq. 12) presented in Section 1. Applying the two extreme values of $\lambda$, i.e. $\lambda = 0$ and $\lambda = 1$, to that equation yields the expressions of $\Delta$ at each extreme.
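The change of summation index described above can be written out step by step. The following is a hedged reconstruction rather than the printed equations: it assumes that Eq. 29 has the form $x(r_j) = \omega_j^\phi \omega_{\phi,j} \log(\omega_j^\phi \omega_{\phi,j})$ and uses the convention $0 \log 0 = 0$ for unlinked vertices, so the intermediate lines and the final form should be checked against Eqs. 66 and 71.

```latex
\begin{align}
X(R) &= \sum_{j=1}^{m} x(r_j)
      = \sum_{j=1}^{m} \omega_j^{\phi}\,\omega_{\phi,j}
        \log\!\left(\omega_j^{\phi}\,\omega_{\phi,j}\right)
      && \text{(assumed form of Eq. 29)} \\
     &= \sum_{j:\,\omega_j = 1} \mu_{i \leftarrow j}^{\phi}
        \log \mu_{i \leftarrow j}^{\phi}
      && \text{(since } \omega_j = 1,\ \omega_{\phi,j} = \mu_{i \leftarrow j}^{\phi}\text{)} \\
     &= \sum_{i=1}^{n} \mu_i \cdot \mu_i^{\phi} \log \mu_i^{\phi}
      && \text{(form } i \text{ is reached from } \mu_i \text{ counterparts)} \\
     &= \phi \sum_{i=1}^{n} \mu_i^{\phi+1} \log \mu_i .
\end{align}
```

The third line is where the factor $\mu_i$ enters: every counterpart linked to form $i$ contributes the same term, and there are exactly $\mu_i$ of them.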

A.3.3 Vertex degrees do not exceed one
As seen in Section 2.1, for this class we are working under the two conditions that $\omega_j \in \{0, 1\}$ for every $r_j$ and $\mu_i \in \{0, 1\}$ for every $s_i$. We can simplify the expressions from A.3. $M'_{\phi,a}$ and $M'_{\phi,b}$ (Eqs. 62 and 63) simplify to Eqs. 72 and 73, where $M_\phi = M_0 = M$, the number of edges in the bipartite graph. $X'_a(S, R)$ and $X'_b(S, R)$ (Eqs. 64 and 65) simplify to Eqs. 74 and 75. $X'_a(S)$ and $X'_b(S)$ (Eqs. 67 and 68) simplify to Eqs. 76 and 77. $X'_a(R)$ and $X'_b(R)$ (Eqs. 69 and 70) simplify to $X'_a(R) = 0$ (Eq. 78) and Eq. 79. Combining Eqs. 72, 73, 74, 75, 76, 77, 78 and 79 into the equation for $\Delta$ (Eq. 38) results in the expression for $\Delta$ (Eq. 11) presented in Section 1. Applying the extreme values, i.e. $\lambda = 0$ and $\lambda = 1$, to this equation yields the expressions of $\Delta$ at each extreme.
Inserting the previous results into the definition of $p(s_i)$ when $\omega_j \le 1$, we obtain an expression for $p(s_i)$, to which a continuous approximation to vertex degrees and the number of edges is applied. Thanks to well-known integral bounds (Cormen et al., 1990, pp. 50-51), the resulting sum can be bounded, as $\tau \ge 0$ by definition. The cases $\tau \neq 1$ and $\tau = 1$ are treated separately. Combining the results above, one obtains one expression for $\tau \neq 1$ and another for $\tau = 1$.
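The kind of integral bound invoked here can be illustrated numerically. The sketch below uses the standard bounds for a monotonically decreasing summand, applied to the generic sum $\sum_{i=1}^{n} i^{-\tau}$; whether this is the exact sum bounded in the derivation is an assumption, and the function names are hypothetical. It also shows why $\tau = 1$ needs separate treatment: the antiderivative of $x^{-\tau}$ is a logarithm only in that case.

```python
from math import log


def integral_power(tau, lo, hi):
    """Integral of x**(-tau) over [lo, hi], for 0 < lo <= hi."""
    if tau == 1:
        return log(hi / lo)  # logarithmic case
    return (hi ** (1 - tau) - lo ** (1 - tau)) / (1 - tau)


def sum_with_bounds(n, tau):
    """Return (lower, total, upper) for total = sum_{i=1}^{n} i**(-tau), tau >= 0.

    For a non-increasing summand f:
        integral_1^{n+1} f  <=  sum_{i=1}^{n} f(i)  <=  f(1) + integral_1^{n} f
    """
    total = sum(i ** (-tau) for i in range(1, n + 1))
    lower = integral_power(tau, 1, n + 1)
    upper = 1 + integral_power(tau, 1, n)
    return lower, total, upper
```

The condition $\tau \ge 0$ matters because it makes $i^{-\tau}$ non-increasing in $i$, which is what the bounds require.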

C Complementary heatmaps for other values of φ
In Section 3, heatmaps were used to analyze the values that ∆ takes for distinct sets of parameters. For the class of skeleta where counterpart degrees do not exceed one, only heatmaps corresponding to φ = 0 (Fig. 9) and φ = 1 (Figs. 10, 12 and 14) were presented. The summary figures presented in that same section (Figs. 11, 13 and 15)

D Complementary figures with discrete degrees
To investigate the class of skeleta such that the degree of counterparts does not exceed one, we have assumed that the relationship between the degree of a vertex and its rank follows a power-law (Eq. 15). For the plots of the regions where strategy a is advantageous, we have assumed, for simplicity, that the degree of a form is a continuous variable. As form degrees are actually discrete in the model, here we show the impact of rounding form degrees defined by Eq. 15 to the nearest integer in previous figures.
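The discretization step can be sketched as follows. This is an illustration, not the authors' implementation: Eq. 15 is not reproduced in this appendix, so the power law is assumed to have the form $\mu_i = c\, i^{-\tau}$ with a hypothetical constant `c`, and ties are assumed to round half up.

```python
from math import floor


def power_law_degrees(c, tau, n):
    """Continuous form degrees mu_i = c * i**(-tau) for ranks i = 1..n (assumed form)."""
    return [c / i ** tau for i in range(1, n + 1)]


def round_nearest(x):
    """Round to the nearest integer, with ties going up (an assumption)."""
    return int(floor(x + 0.5))


def discretized_degrees(c, tau, n):
    """Form degrees after rounding the power law to the nearest integer."""
    return [round_nearest(mu) for mu in power_law_degrees(c, tau, n)]


continuous = power_law_degrees(10, 1, 5)
discrete = discretized_degrees(10, 1, 5)
```

Python's built-in `round` is avoided because it rounds ties to the nearest even integer. Note that, for large $\tau$ or small `c`, many ranks collapse onto the same few integers after rounding.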
The correspondence between the figures in this appendix with rounded form degrees and the figures in other sections is as follows. Figs. 28, 29, 30, 31, 32 and 33 are equivalent to Figs. 9, 16, 10, 17, 18 and 19, respectively. These are the figures where λ is on the x-axis and µ k on the y-axis of the heatmap. Fig. 34, which summarizes the boundaries of the heatmaps, corresponds to Fig. 11 after discretization. Figs. 35, 36, 37, 38 and 39 are equivalent to Figs. 20, 12, 21, 22 and 23, respectively. In these figures, α is placed on the y-axis instead. Fig. 40 summarizes the boundaries and is the discretized version of Fig. 13. Finally, Figs. 41, 42, 43, 44 and 45 are equivalent to Figs. 24, 14, 25, 26 and 27, respectively. This set places n on the y-axis. The boundaries in these last discretized figures are summarized by Fig. 46, which corresponds to Fig. 15.
We have presented two kinds of figures: heatmaps showing the value of ∆ and figures summarizing the boundaries between regions where ∆ > 0 and ∆ < 0. Interestingly, the discretization does not change the presence of regions where ∆ < 0 and ∆ > 0 and, in general, it does not change the shape of the regions in a qualitative sense, except in some cases where remarkable distortions appear (e.g., Figs. 32 and 33 have one or very few integer values on the y-axis for certain combinations of parameters, forming one-dimensional bands that do not change over that axis; see also the distorted shapes in Fig. 38 and especially Fig. 45). In contrast,