Similarity-Reduced Diversities: the Effective Entropy and the Reduced Entropy

The paper presents and analyzes the properties of a new diversity index, the effective entropy, which lowers Shannon entropy by taking into account the presence of similarities between items. Similarities decrease exponentially with the item dissimilarities, with a freely adjustable discriminability parameter controlling various diversity regimes separated by phase transitions. Effective entropies are determined iteratively, and turn out to be concave and subadditive, in contrast to the reduced entropy, proposed in Ecology for similar purposes. Two data sets are used to illustrate the formalism, and underline the role played by the dissimilarity types.

The effective entropy is based upon the frequencies and dissimilarities between items, as well as on a freely adjustable parameter controlling for the discriminability between similar items. Its construction can be motivated by a simple perceptual rationale: if two items are similar enough, and if the discriminability parameter is low enough, then one item can wrongly be perceived as another, hence lowering the diversity of the item collection. Further elaborating on this idea leads to a statistical mechanical framework (minimization of the free energy) whose associated formalism turns out to overlap in large parts with the rate-distortion setup of information theory (e.g., Cover & Thomas, 2006) or with regularized optimal transportation (e.g., Cuturi, 2013), albeit with a completely different motivation and purpose.
The resulting effective entropy can be computed iteratively in general (Section 2, Theorems 1 and 2), and sometimes in a single step for particular cases. One of the salient features of the framework is the appearance of phase transitions when lowering the discriminability parameter, from the low-temperature regime, where all the items are perceived, at least partially, to the high-temperature regime, where only the dominating items persist (Section 3, Theorems 3, 4 and 5).
Two examples (copper concentration data, Section 3.1; world cities, Section 4.1) illustrate the formalism and its versatility, as well as the influence of the nature (city-block, Euclidean, ultrametric) of the item dissimilarities (Theorems 6 and 7). Similarities can also be considered directly, as long as they are non-negative, symmetric, and endowed with a unit dominating diagonal.
The effective entropy (6) turns out to provide an (apparently close) lower bound to another logarithmic measure of diversity (12) considered in Ecology and computable in a single step, referred to in this paper as the reduced entropy. The presence of entirely similar items with distinct labels should not increase the diversity, a property automatically satisfied by the effective and the reduced entropies, in contrast to Shannon entropy. Also, both satisfy the so-called monotonicity and modularity properties (Theorem 8). Both take on their maximum values for non-uniform item distributions in general (Theorem 10), and converge to Shannon entropy in the naive limit of identity similarities, as expected.
Yet, the concavity property, asserting that the diversity of the whole should not be lower than the average diversity of its parts, is in general satisfied by the effective entropy only (Theorem 9). Also, like Shannon entropy, the joint effective entropy is maximum for independent distributions (Theorem 11), a property violated again in general for the reduced entropy. Remedying the formal defects of the latter provided the initial motivation of the present study.

The Formalism
Basic ingredients consist of $n$ items, objects or species, with relative frequencies $f_i > 0$ normalized to $\sum_{i=1}^n f_i = 1$. In addition, the differences between items are specified by a finite $n \times n$ matrix of dissimilarities $D = (d_{ij})$ obeying

$d_{ij} \ge 0 \;(1a), \qquad d_{ii} = 0 \;(1b), \qquad d_{ij} = d_{ji} \;(1c), \qquad d_{ij} = 0 \Rightarrow d_{ik} = d_{jk} \text{ for all } k \;(1d),$

where (1d) defines an even or semi-proper dissimilarity (see, e.g., Critchley & Fichet, 1994).

Transition Matrices and Percept Weights
Let $z_{ij}$ denote the probability that item $i$ is perceived or received as item $j$. According to the context, the pair $i$-$j$ can be referred to as the stimulus-percept, stimulus-response, origin-destination, input-output, symbol-reproduction, or source-estimate pair. By construction, $z_{ij} \ge 0$ and $z_{i\bullet} = \sum_j z_{ij} = 1$ (where "$\bullet$" denotes summation over the replaced index). The set of all such membership or transition matrices $Z = (z_{ij}) \in \mathbb{R}^{n \times n}$ will be denoted by $\mathcal{Z}$.
The weight of percept $j$ is obtained as

$\rho_j = \sum_i f_i\, z_{ij}, \qquad (2)$

and obeys $\rho_\bullet = 1$ by construction.

Energy, Mutual Information, and Free Energy
The average stimulus-percept dissimilarity, cost or energy is

$U[Z] = \sum_{ij} f_i\, z_{ij}\, d_{ij}, \qquad (3)$

whose minimizer is $Z = I_n = (\delta_{ij})$, the identity matrix (no stimulus-percept confusion).
By contrast, the confusion is maximum when the percept is independent of the stimulus, that is when the joint stimulus-percept distribution $f_i z_{ij}$ equals the corresponding independent distribution $f_i \rho_j$, or equivalently when $z_{ij} = \rho_j$ for the distribution $\rho$ in (2). Such an independent transition minimizes the stimulus-percept mutual information, which reads here as

$K[Z] = \sum_{ij} f_i\, z_{ij} \ln \frac{z_{ij}}{\rho_j} \;\ge\; 0. \qquad (4)$

The above antagonistic tendencies can be combined in the (dimensionless) free energy

$F[Z] = \beta\, U[Z] + K[Z], \qquad (5)$

where $\beta > 0$ is a free discriminability parameter, known in statistical mechanics as the inverse temperature $\beta = 1/T$. In the low-temperature limit $\beta \to \infty$, only the energy term contributes and the optimal transition matrix $\hat{Z}$ tends to $I_n$ (perfect discrimination). In the high-temperature limit $\beta \to 0$, only the entropy term contributes and $\hat{Z} \to \mathbf{1}_n \rho^\top$ (complete confusion). Note that in physics, the free energy is generally defined as $U + \beta^{-1} K$ instead of $F$ in (5). The latter choice, without incidence on the form of the optimal transition matrix $\hat{Z}$, has the advantage of making the effective entropy (defined below) dimensionless, and directly comparable to Shannon entropy. Also, $D$, $U$ and $\beta^{-1}$ have the dimensions of an energy; they can be made dimensionless by a proper rescaling of $D$ (see Section 3.1).
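To make these quantities concrete, here is a minimal numpy sketch (illustrative only, not the authors' code; the function name free_energy and the array conventions are assumptions) evaluating the energy, the mutual information and the free energy for a given transition matrix.

```python
# Illustrative numpy sketch (not the authors' code): energy U, mutual
# information K, and free energy F = beta*U + K for a given transition Z.
import numpy as np

def free_energy(f, D, Z, beta):
    """f: item weights (n,); D: dissimilarities (n, n); Z: row-stochastic (n, n)."""
    rho = f @ Z                                 # percept weights, Eq. (2)
    U = np.sum(f[:, None] * Z * D)              # average stimulus-percept dissimilarity, Eq. (3)
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = np.where(Z > 0, Z / rho[None, :], 1.0)
    K = np.sum(f[:, None] * Z * np.log(ratio))  # mutual information, Eq. (4)
    return beta * U + K                         # dimensionless free energy, Eq. (5)
```

With $Z = I_n$ the energy vanishes and $K$ equals Shannon entropy; with $z_{ij} = \rho_j$ the mutual information vanishes.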

Effective Entropy E and Reduced Entropy R
The effective entropy, whose study constitutes the main scope of the paper, is defined as the minimum free energy

$E(f) = \min_{Z \in \mathcal{Z}} F[Z] = F[\hat{Z}]. \qquad (6)$

The optimal transition $\hat{Z}$ minimizing $F[Z]$ is determined by the non-linear equation (see Theorem 1 below)

$\hat{z}_{ij} = \frac{\rho_j\, s_{ij}}{\tau_i}, \qquad \rho_j = \sum_i f_i\, \hat{z}_{ij}, \qquad (7)$

where

$\tau_i = \sum_j \rho_j\, s_{ij}. \qquad (8)$

The components of the $n \times n$ similarity matrix $S = (s_{ij})$ in (7) obey

$0 \le s_{ij} = s_{ji} \le s_{ii} = 1. \qquad (9)$

Similarities (9) are related to dissimilarities as $S = \exp(-\beta D)$, to be understood as the componentwise or Hadamard exponential. In terms of similarities, the free energy (5) reads, by direct substitution, as

$F[Z] = \sum_{ij} f_i\, z_{ij} \ln \frac{z_{ij}}{\rho_j\, s_{ij}}. \qquad (10)$

Let $\kappa(\rho\|f) = \sum_j \rho_j \ln(\rho_j/f_j) \ge 0$ denote the Kullback-Leibler divergence between percept and stimulus weights, and consider the functional

$G[Z] = F[Z] + \kappa(\rho\|f) = \sum_{ij} f_i\, z_{ij} \ln \frac{z_{ij}}{f_j\, s_{ij}}. \qquad (11)$

Its minimizer $Z_0$ and minimum value $R = G[Z_0]$, which we shall refer to as the reduced entropy, are readily found (in a single step) to be

$z_{0,ij} = \frac{f_j\, s_{ij}}{b_i} \quad (12a), \qquad R = -\sum_i f_i \ln b_i \quad (12b), \qquad \text{where} \quad b_i = \sum_j s_{ij}\, f_j$

is the banality of item $i$, measuring its average similarity to other items (Marcon, 2016), proposed by Leinster and Cobbold (2012), as well as by Ricotta and Szeidl (2006) in the variant $s_{ij} = 1 - d_{ij}/d_{\max}$. By construction,

$E \;\le\; R \;\le\; H, \qquad \text{where } H = -\sum_i f_i \ln f_i \text{ is Shannon entropy}. \qquad (13)$

The lower bound for $R$ follows from (11), and the upper bound for $R$ from $f_i \le b_i \le 1$. Also, substituting (7) in (10) yields the expression

$E = -\sum_i f_i \ln \hat{\tau}_i. \qquad (14)$

The normalization factor $\tau_i$ in (7) and (8) has a form similar to $b_i$ in (12): they both measure an average similarity, where the average is taken on $f$ for $b_i$, but on $\rho$ for $\tau_i$. One can refer to $b_i$ as the source or stimulus banality, and to $\tau_i$ as the outcome or percept banality.
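The reduced entropy is indeed computable in a single step, as the following illustrative sketch shows (reduced_entropy is a hypothetical helper name, not from the paper).

```python
# Illustrative sketch: Hadamard-exponential similarities, banalities and the
# single-step reduced entropy R = -sum_i f_i ln b_i.
import numpy as np

def reduced_entropy(f, D, beta):
    S = np.exp(-beta * D)          # componentwise exponential, Eq. (9)
    b = S @ f                      # banalities b_i = sum_j s_ij f_j, Eq. (12)
    return float(-(f * np.log(b)).sum()), b
```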

High- and Low-Temperature Similarities: Aggregation of Equivalent Items
In the high-temperature limit $\beta \to 0$, $S \to J_n$ (the $n \times n$ matrix filled with ones), and thus $b \to \mathbf{1}_n$ (the unit vector) and finally $E, R \to 0$. In the low-temperature limit $\beta \to \infty$, $s_{ij} \to 0$ for $i \ne j$, unless $d_{ij} = 0$: recall from (1d) that $D$ is supposed to be semi-proper only, which implies that the relation $i \sim j \Leftrightarrow d_{ij} = 0$ is an equivalence relation. As a result, $\lim_{\beta\to\infty} s_{ij} = 1$ if $i$ and $j$ belong to the same equivalence class $C$, and $\lim_{\beta\to\infty} s_{ij} = 0$ otherwise. Hence, $S = \bigoplus_C J_C$ (direct sum of matrices) is block diagonal, and

$\lim_{\beta\to\infty} R = -\sum_C F_C \ln F_C = H_{agg}, \qquad \text{where } F_C = \sum_{i \in C} f_i. \qquad (15)$

See the Appendix for a proof of (15). In other terms, in the low-temperature limit, the reduced entropy automatically aggregates equivalent items (i.e., items whose mutual dissimilarity is zero), yielding a corresponding entropy $H_{agg} \le H$. The same limit holds for the effective entropy. Similar considerations hold for the effective and reduced entropies at any temperature: items equivalent in the sense $d_{ij} = 0$ can first be aggregated into equivalence classes, yielding proper dissimilarities between classes, on which the above formalism can then be applied. From now on, we consider proper dissimilarities only, i.e., such that $d_{ij} > 0$ for $i \ne j$.
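The preliminary aggregation step can be sketched as follows (aggregate_equivalent is a hypothetical helper; the tolerance argument is an illustrative device, not part of the formalism).

```python
# Illustrative sketch: merge items at zero mutual dissimilarity into
# equivalence classes, summing their weights; the resulting class
# dissimilarities are then proper.
import numpy as np

def aggregate_equivalent(f, D, tol=0.0):
    reps = []                                        # one representative per class
    for i in range(len(f)):
        if not any(D[i, r] <= tol for r in reps):
            reps.append(i)
    classes = [np.flatnonzero(D[r] <= tol) for r in reps]
    f_agg = np.array([f[c].sum() for c in classes])  # class weights F_C
    D_agg = D[np.ix_(reps, reps)]                    # dissimilarities between class representatives
    return f_agg, D_agg
```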

Behavior and Computation of the Optimal Solution
Depending on $\beta$, the percept weights $\rho_j$ may be positive or zero, a circumstance causing the phase transitions observed in the case studies below. The effective variety

$\#\{ j \mid \rho_j > 0 \} = \sum_{j=1}^n I(\rho_j > 0) \qquad (16)$

counts the number of detected percepts, and ranges from 1 to $n$. Let $[n] = \{1, \ldots, n\}$ denote the collection of items, and for $A \subseteq [n]$, let $\mathcal{Z}_A$ be the set of transitions whose non-zero percept weights are exactly those of the set of occurring percepts $A$, that is $\rho_j > 0$ for $j \in A$, and $\rho_j = 0$ for $j \notin A$. Equivalently, $z_{\bullet j} > 0$ for $j \in A$, and $z_{\bullet j} = 0$ for $j \notin A$.
Theorem 1 (first-order condition) Define the sub-indicator

$c_j = \sum_i \frac{f_i\, s_{ij}}{\tau_i}. \qquad (17)$

A necessary and sufficient condition for $\hat{Z}$ to be the minimizer of $F[Z]$ is the first-order condition

$\hat{z}_{ij} = \frac{\rho_j\, s_{ij}}{\tau_i} \quad \text{and} \quad c_j \le 1 \text{ for all } j. \qquad (18)$

Also, $c_j = 1$ when $\rho_j > 0$.

In particular, the first identity in (18) is necessary and sufficient when all $\rho_j > 0$. Also, $c_j \rho_j = \rho_j$. The identity $c_j = 1$ is a necessary condition for the emergence of percept $j$, whence the name "sub-indicator".

Theorem 2 (iterative solution) The sequence of iterations

$\rho^{(t)}_j = \sum_i f_i\, z^{(t)}_{ij}, \qquad z^{(t+1)}_{ij} = \frac{\rho^{(t)}_j\, s_{ij}}{\sum_k \rho^{(t)}_k\, s_{ik}}, \qquad t = 0, 1, 2, \ldots \qquad (19)$

converges towards the unique minimizer (7) of $F[Z]$, provided $Z^{(0)} \in \mathcal{Z}_{[n]}$, the set of transitions with strictly positive column margins.
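A minimal numpy sketch of the iterations (19) follows (illustrative only; the starting point $Z^{(0)} = I_n$, i.e. $\rho^{(0)} = f$, and the stopping rule are implementation choices, not prescriptions of the paper).

```python
# Illustrative sketch of the iterations (19), tracking the percept weights rho.
import numpy as np

def effective_entropy(f, D, beta, tol=1e-12, max_iter=100_000):
    S = np.exp(-beta * D)                     # similarities
    rho = f.copy()                            # percept weights of Z^(0) = I
    for _ in range(max_iter):
        tau = S @ rho                         # percept banalities, Eq. (8)
        rho_new = rho * (S.T @ (f / tau))     # rho_j <- rho_j * sum_i f_i s_ij / tau_i
        if np.abs(rho_new - rho).max() < tol:
            rho = rho_new
            break
        rho = rho_new
    tau = S @ rho
    E = float(-(f * np.log(tau)).sum())       # effective entropy, Eq. (14)
    Z = S * rho[None, :] / tau[:, None]       # optimal transition, Eq. (7)
    return E, rho, Z
```

Percepts $j$ with $c_j < 1$ see their weights $\rho_j^{(t)}$ decay geometrically towards zero along the iterations, in accordance with Theorem 1.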

Example A: Copper Concentration Data
The R dataset chem{MASS} consists of 24 univariate copper concentrations $x_i$, on which dissimilarities are defined as $d_{ij} = |x_i - x_j|$. The data contain ties occurring twice ($x = 2.20, 2.40, 3.03$), three times ($x = 3.40$) and four times ($x = 3.70$). They must be preliminarily aggregated to define a proper dissimilarity (Section 2.4), resulting in a sample of $n = 16$ observations with non-uniform weights $f$, namely with values 1/24 for the unique values, 2/24 for the values occurring twice in the original dataset, 3/24 for the value occurring three times, and 4/24 for the value occurring four times.
Temperatures and dissimilarities always appear in the combination $\beta d_{ij}$. For the sake of comparison, the temperature scale will, here and in the sequel, be fixed by further dividing the dissimilarities by the quantity

$\Delta = \frac{1}{2} \sum_{ij} f_i\, f_j\, d_{ij}, \qquad (20)$

which corresponds to the inertia of the configuration when $D$ is squared Euclidean, and is referred to as Rao quadratic entropy in Ecology (see, e.g., Champely & Chessel, 2002; Pavoine et al., 2005, and references therein). Equivalently, the dissimilarity scale is normalized to $\Delta = 1$. Figure 1 depicts the effective variety (16), the effective entropy (6) and the reduced entropy (12b) as functions of the inverse temperature. As expected, the number of detected percepts tends to decrease with the temperature, yet with possible monotonicity violations (disappearances followed by re-emergences of percepts $\rho_j > 0$, as detailed in Figs. 2 and 3). Also, the effective entropy appears closely approximated by the more directly computable reduced entropy.
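The normalization amounts to the following short computation (illustrative sketch; the helper name is an assumption).

```python
# Illustrative sketch of the normalization: divide D by the Rao quadratic
# entropy Delta so that the rescaled configuration has Delta = 1.
import numpy as np

def normalize_dissimilarities(f, D):
    Delta = 0.5 * f @ D @ f     # Rao quadratic entropy, Eq. (20)
    return D / Delta
```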
The left panel of Fig. 2 depicts the rate distortion function of information theory. In this framework (Cover & Thomas, 2006, chapter 10), source i is sent through a communication channel, and decoded as j . The mutual information K or channel rate between sources and their reconstruction is a measure of their dependence, reaching its maximum H (for fixed f ) for the perfect transmission Z = I , in which case the distortion U is zero (Section 3.3). Conversely, a zero-rate or random channel (K = 0) generates a distortion of at least U = ∂ (Section 3.4).
The right panels of Figs. 2 and 3 depict the sub-indicators $c_j(\beta)$ (17) for $j = 1, \ldots, 16$. By construction, the effective variety is bounded above by the number of sub-indicators equal to one. Actually, the two quantities happen to numerically coincide in this case study, in the sense $\sum_{j=1}^n I(\rho_j > 10^{-5}) = \sum_{j=1}^n I(c_j > 1 - 10^{-5})$. Said otherwise, $\rho_j > 0$ iff $c_j = 1$, and the sub-indicator $c_j$ turns out to behave here as an indicator.

High- and Low-Temperature Regimes
The previous sections suggest the existence of two critical inverse temperatures $\beta_H$ and $\beta_L$, with $0 < \beta_H < \beta_L < \infty$, such that

1. for $\beta > \beta_L$, all $\rho_j > 0$ (that is, $\hat{Z} \in \mathcal{Z}_{[n]}$): low-temperature phase or regime;
2. for $\beta < \beta_H$, $\rho_j > 0$ iff $j$ belongs to some subset $A^\circ \subset [n]$, the dominating set: high-temperature regime;
3. for $\beta_H < \beta < \beta_L$, the set $A$ of occurring percepts with $\rho_j > 0$ varies, generally non-monotonically with $\beta$, from $A^\circ$ to $[n]$: medium-temperature regime.

Low-Temperature Regime
For $\beta \to \infty$, $\hat{Z} \to I$ and $\rho_j \to f_j > 0$. By continuity, for $\beta$ large enough, all sub-indicators (17) are equal to one, and the inverse similarity matrix, denoted $S^{-1} = (s^{ij})$, exists (recall that $\lim_{\beta\to\infty} S = I$ for $D$ proper).
Theorem 3 (low-temperature regime) Assume that $S^{-1} = (s^{ij})$ exists, and define the (possibly non-positive) signed weights by

$r_j = \sum_i f_i\, \frac{s^{ij}}{s^{i\bullet}}. \qquad (21)$

Then all percepts emerge (that is, $\rho_j > 0$ for all $j$) iff $r_j > 0$ for all $j$. In this case, the optimal solution (7), (8) is given by

$\rho_j = r_j, \qquad \tau_i = \frac{f_i}{s^{i\bullet}}, \qquad \hat{z}_{ij} = \frac{s^{i\bullet}\, s_{ij}\, r_j}{f_i}, \qquad E = -\sum_i f_i \ln \frac{f_i}{s^{i\bullet}}. \qquad (22)$

Note that $\sum_j r_j = 1$ always holds, whence the name signed weights. Thus the condition $\min_j r_j \ge 0$, characterizing the low-temperature regime, also implies $\max_j r_j \le 1$; see the left panels of Figs. 4, 7, 8, 9, 10, 11, 12, 13 and 14. The invertibility of $S$ for all $\beta > 0$ holds for a large class of dissimilarities, namely squared Euclidean proper dissimilarities (Theorem 6b). Yet, the above theorem does not rule out the possible coexistence of all percepts together with a non-invertible $S$, although such a case has not been met in our investigations.
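A sketch of the corresponding closed-form computation follows (illustrative only; returning None when some signed weight is non-positive is a convention of this sketch, not part of the theorem).

```python
# Illustrative sketch of the closed-form low-temperature solution of Theorem 3.
import numpy as np

def low_temperature_solution(f, D, beta):
    S = np.exp(-beta * D)
    S_inv = np.linalg.inv(S)
    row = S_inv.sum(axis=1)             # s^{i.}
    r = S_inv.T @ (f / row)             # signed weights, Eq. (21)
    if r.min() <= 0:
        return None                     # outside the low-temperature regime
    tau = f / row                       # percept banalities, Eq. (22)
    E = float(-(f * np.log(tau)).sum())
    return E, r                         # here rho_j = r_j, no iteration needed
```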

High-Temperature Regime
For $\beta = 0$, the free energy consists of $K[Z]$ only, which is minimized (and equal to zero) by the independent transition $z_{ij} = \rho_j$, where $\rho$ is an arbitrary distribution. Let us introduce the definitions

$\partial_j = \sum_i f_i\, d_{ij}, \qquad \bar{\partial} = \sum_j f_j\, \partial_j, \qquad \partial = \min_j \partial_j, \qquad A^\circ = \{ j \mid \partial_j = \partial \},$

appearing in the study of the high-temperature regime. The eccentricity $\partial_j$ measures the (weighted) average dissimilarity between item $j$ and the other items in the configuration $(f, D)$, and obeys $\partial_j = d_{jf} + \Delta$ when $D$ is squared Euclidean, where $d_{jf}$ is the dissimilarity between item $j$ and the barycenter of the configuration (Huygens principle). The average eccentricity $\bar{\partial}$ is twice the Rao entropy (20), and is equal to 2 under the normalization $\Delta = 1$. Medoids are items with minimal eccentricity $\partial$. The collection of medoids constitutes the dominating set $A^\circ$.

Fig. 4 Left: the low-temperature regime $\beta > \beta_L$ is characterized by $\min_j r_j \ge 0$, and hence $r_j = \rho_j$ (Theorem 3). Right: histogram of $\rho_j$ in the high-temperature limit: the mass is evenly concentrated on the dominating states $A^\circ = \{9, 10\}$, that is $\rho_9 = \rho_{10} = 0.5$.
In the copper concentration data, the dominating set $A^\circ$ contains the minimizers of the average absolute deviation $\partial_j = \sum_i f_i |x_i - x_j|$ (that is, the weighted medians), and consists of the two observations $j = 9$ and $j = 10$ (see Fig. 4). In the world cities dataset with geodesic distances (Section 4.1), $A^\circ$ consists of Berlin only.
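The eccentricities, average eccentricity and dominating set can be computed as in the following illustrative sketch (the tolerance argument is a device for numerical ties, not part of the definitions).

```python
# Illustrative sketch: eccentricities, average eccentricity (twice Delta) and
# the dominating set of medoids governing the high-temperature regime.
import numpy as np

def dominating_set(f, D, tol=1e-12):
    ecc = D @ f                                  # eccentricities sum_i f_i d_ij (D symmetric)
    ecc_bar = f @ ecc                            # average eccentricity = 2 * Delta
    A = np.flatnonzero(ecc <= ecc.min() + tol)   # medoids = items of minimal eccentricity
    return A, ecc, ecc_bar
```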
For $\beta \to 0$, the energy becomes $U[Z] = \sum_j \rho_j \partial_j$, which is minimum iff the non-zero components or support $\mathrm{supp}(\rho)$ of $\rho$ belong to the dominating set $A^\circ$. This dominating set persists in the high-temperature regime $0 < \beta \ll 1$:

Theorem 4 (high-temperature regime) For $0 < \beta \ll 1$, the free energy is minimum for a transition of the form (24), whose percept weights are supported on the dominating set $A^\circ$. Also, the effective and reduced entropies obey

$E(\beta) = \beta\, \partial + o(\beta), \qquad R(\beta) = \beta\, \bar{\partial} + o(\beta). \qquad (25)$

The dominating set is stable in the sense $c_j(\beta{=}0) = 1$ for all $j$, with derivatives $c_j'(\beta{=}0) = 0$ for $j \in A^\circ$ and $c_j'(\beta{=}0) < 0$ for $j \notin A^\circ$, which implies (Theorem 1) $\rho_j(\beta) = 0$ for $0 < \beta \ll 1$ and $j \notin A^\circ$.
Inequality $\partial \le \bar{\partial}$ reflects the general relation $E(\beta) \le R(\beta)$. Also, when $A^\circ$ contains a single element, the relation $E(\beta) = \beta\, \partial$ is exact in the high-temperature regime.

Medium-Temperature Regime
The medium-temperature regime, neither characterized by the presence of all percepts nor by the sole presence of the dominating percepts, is arguably the most convoluted: by increasing the temperature, percepts can disappear, then reappear later (Figs. 1, 3 and 5). However, fairly simple and revealing relations still exist for the percept banalities $\tau_i(f)$ in (8), considered as functions of $f$ (that is, for $S$ fixed and evaluated on the optimal transition $\hat{Z}(f)$):

Theorem 5 (derivatives of the percept banalities) For all $\beta > 0$, the derivatives of the percept banalities with respect to the item weights satisfy the identities (26).

The identities (26) make it possible to prove the concavity of the effective entropy for all $\beta > 0$ (Theorem 9). Arguably relevant in further perturbation studies, they are not further interpreted nor investigated here.

Example B: World Cities
Sparsely populated places, spatially close to more populated places, tend to be overlooked and designated by the latter, a mechanism captured by the present formalism.
The populations $N_i$, latitudes $\phi_i$ and longitudes $\theta_i$ of $n = 30$ world cities have been extracted from the (now outdated) R dataset world.cities{maps}. The selected sample contains the five most populated cities of each of the six continents. Relative weights are $f_i = N_i/N_\bullet$, and geodesic distances $d^{geo}_{ij}$ are given by the length of the great-circle arc (central angle) joining cities $i$ and $j$ on the unit sphere.
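A sketch of the distance computation follows, assuming the standard great-circle (central-angle) formula on the unit sphere, consistent with the Earth-radius factor of 6371 km used in Section 4.2 below.

```python
# Illustrative sketch: geodesic dissimilarities (central angles, in radians)
# from latitudes and longitudes given in degrees.
import numpy as np

def geodesic_dissimilarities(lat_deg, lon_deg):
    phi, theta = np.radians(lat_deg), np.radians(lon_deg)
    cos_d = (np.sin(phi)[:, None] * np.sin(phi)[None, :]
             + np.cos(phi)[:, None] * np.cos(phi)[None, :]
               * np.cos(theta[:, None] - theta[None, :]))
    return np.arccos(np.clip(cos_d, -1.0, 1.0))
```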

Fig. 5 For decreasing temperature (that is, for increasing $\beta$, from the left to the right panels), more percepts tend to emerge. Right: in the low-temperature limit, the histogram of the percepts coincides with the original copper concentration data, that is $\rho_j = f_j$.

The left panel of Fig. 6 depicts the corresponding cylindrical projection, and the right panel depicts the dendrogram resulting from the single-linkage hierarchical ascending classification (HAC) applied on the dissimilarities $D_{geo}$. Figures 7, 8, 9, 10, 11 and 12 give, keeping the city weights $f$ unchanged, the behavior of the signed weights $r$ and percept weights $\rho$ for the various regimes, as well as the effective and reduced entropies and the effective variety. They do so, respectively, for the dissimilarities $D_{geo}$, their square, their cube, their square root, $D_{ultra}$ (the ultrametric dissimilarity obtained from the dendrogram of Fig. 6), and for random dissimilarities $D_{random}$ (whose univariate coordinates are independently drawn as the square of a Student variate with three degrees of freedom).

The understanding of the class of exponential similarity matrices of the form $S = \exp(-\beta D)$ (componentwise or Hadamard exponential), where $\beta > 0$ and $D$ is a dissimilarity matrix, benefits from a wealth of acute studies in matrix analysis (see, e.g., Horn & Johnson, 1991; Critchley & Fichet, 1994; Martínez et al., 1994; Bapat & Raghavan, 1997; Deza & Laurent, 1997; Reams, 1999; Bavaud, 2011; Dellacherie et al., 2014, and references therein). Theorem 6 summarizes some salient features from these sources:

Theorem 6 (exponential similarity matrices) Let $S = \exp(-\beta D)$. Then

6a) $S$ is positive semi-definite (p.s.d.) for all $\beta > 0$ iff $D$ is squared Euclidean.
6b) $S$ is positive definite (p.d.) for all $\beta > 0$, and hence invertible, iff $D$ is squared Euclidean and proper.
6c) if $D$ is ultrametric (and hence squared Euclidean) and proper, then $S^{-1} = (s^{ij})$ is, for all $\beta > 0$, a strictly diagonally dominant Stieltjes matrix, that is, a p.d. matrix with $s^{ij} \le 0$ for $i \ne j$ and $s^{i\bullet} > 0$.

6d) proper ultrametric dissimilarities possess an inverse $D^{-1} = (d^{ij})$.

Recall that a dissimilarity $D$ is squared Euclidean if $d_{ij} = \|x_i - x_j\|^2$ for some vectors $x_i \in \mathbb{R}^p$. The dissimilarities $D_{geo}$ and $\sqrt{D_{geo}}$ (Hadamard square root) are squared Euclidean (see, e.g., Critchley & Fichet, 1994), in contrast to $D_{geo2}$, $D_{geo3}$ or $D_{random}$, which are not. The difference between the reduced and effective entropies appears smaller for squared Euclidean dissimilarities, and still smaller for ultrametric dissimilarities, which are investigated next.
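A quick numerical check of Theorem 6a can be sketched as follows, on hypothetical random test points: the Hadamard exponential of a squared Euclidean dissimilarity stays p.s.d. for every tested $\beta$, whereas that of its Hadamard square (analogous to $D_{geo2}$) need not.

```python
# Illustrative check of Theorem 6a: minimum eigenvalues of exp(-beta*D) for a
# squared Euclidean D (expected >= 0 up to rounding) and of exp(-beta*D**2)
# (may turn negative, since the Hadamard square is not squared Euclidean).
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 2))
D = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)     # squared Euclidean dissimilarities
for beta in (0.5, 2.0, 10.0):
    print(beta,
          np.linalg.eigvalsh(np.exp(-beta * D)).min(),
          np.linalg.eigvalsh(np.exp(-beta * D ** 2)).min())
```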

Ultrametric (Dis)similarities
The single-linkage ultrametric dissimilarity $D_{ultra}$ extracted from the dendrogram in Fig. 6 is squared Euclidean too (see, e.g., Critchley & Fichet, 1994), and yields a strictly ultrametric similarity satisfying $s_{ij} \ge \min(s_{ik}, s_{jk})$, as well as $s_{ii} > s_{ij}$ for all $i \ne j$.
From Theorem 6d, the normalized circumweights defined by

$g_i = \frac{d^{i\bullet}}{d^{\bullet\bullet}}, \qquad \text{where } D^{-1}_{ultra} = (d^{ij}), \qquad (28)$

constitute a proper distribution. In other terms, there exists, for each proper ultrametric dissimilarity, a weight distribution $g$ such that the barycenter of the configuration is at equal distance $\Delta_g = 1/(2 d^{\bullet\bullet})$ from all items, which lie on the circumcircle centered on the barycenter. This vividly illustrates the poor low-dimensional compressibility of ultrametric configurations when performing multidimensional scaling. This distance turns out to be 0.683 in the world cities example (see the right panel of Fig. 6), corresponding to a geodesic distance of about $0.683 \times 6371 \cong 4351$ km. When endowed with circumweights, all items dominate by construction in the high-temperature regime (i.e., $A^\circ = [n]$), but they do not necessarily appear in the medium-temperature regime (Fig. 13). As pointed out by Pavoine et al. (2005), ultrametric dissimilarities grant that the maximum value of Rao quadratic entropy $\Delta(f)$ is precisely attained for $f = g$, where all items or species are present (i.e., $\mathrm{supp}(g) = [n]$), as expected from a decent measure of biodiversity.

Fig. 13 Same legend as Fig. 7, for similarities $S = \exp(-\beta D_{ultra})$ and item weights given by the circumweights $g$ in (28).

Figure 14 depicts the case of equidistant ultrametric dissimilarities $d_{ij} = d > 0$ for $i \ne j$, for which the critical temperatures can be explicitly computed in terms of the ordered frequencies $f_{[1]} \ge \cdots \ge f_{[n]}$ and of $s(\beta) = \exp(-\beta d / \Delta)$.

Further Properties as Measures of Diversity
The exponential of the entropy $\exp(H)$ can be interpreted as a measure of the effective number of alternatives, taking on its maximum value $n$ for the uniform distribution $f_i = 1/n$. Similarly, the reduced entropy $R$ has been introduced as a similarity-dependent measure of ecological diversity, where $\exp(R)$ (denoted ${}^1\!D^S(f)$ by Leinster and Cobbold, 2012) can be interpreted as an effective, similarity-reduced number of alternatives. However, neither $R(f)$ nor $E(f)$ take on their maximum at the uniform distribution in general (see Theorem 10 below). The larger the similarity between items, the smaller the corresponding diversity should be. This monotonicity property is satisfied by the reduced and effective entropies: if $\tilde{s}_{ij} \ge s_{ij}$ for all $i, j$, then $\tilde{R} \le R$ (from (12)) and $\tilde{E} \le E$ (from (10)).
Another property shared by the reduced and effective entropies is the so-called modularity (Leinster & Cobbold, 2012), to be compared with the decomposition of Shannon entropy under aggregation: let $X$ be a categorical variable, and let $G$ be a coarser variable resulting from the aggregation of some categories of $X$, that is $H(G|X) = 0$. Then $H(X) = H(G) + H(X|G) = H(G) + \sum_g \pi_g H(X|g)$. Here we have:

Theorem 8 (modularity) Let the $n$ items be partitioned into $m$ communities or groups $g = 1, \ldots, m$, in such a way that the similarity between two items belonging to different groups is zero. Then

$R = H(G) + \sum_g \pi_g\, R_g, \qquad E = H(G) + \sum_g \pi_g\, E_g,$

where $\pi_g = \sum_{i \in g} f_i$ is the relative weight of group $g$, $H(G) = -\sum_g \pi_g \ln \pi_g$, and the quantities $R_g$ and $E_g$ denote the reduced and effective entropies within group $g$ only.
What if the rather restrictive conditions of the previous theorem, namely strict partitioning of items into groups (that is, $H(G|X) = 0$) and zero inter-group similarity (that is, $S = \bigoplus_g S_g$), are lifted? The question directly points to the issue of concavity, a desirable property of diversity measures, which says that the diversity of the whole is not less than the average diversity of its parts: such is, notably, the case for the variance (the total variance is bounded below by the within-groups variance, the difference being the between-groups variance), and for Shannon entropy ($H(X) \ge H(X|G)$).
The effective entropy turns out to be concave as well (Theorem 9 below). Let $f^g_i$ (obeying $f^g_i \ge 0$ and $f^g_\bullet = 1$) denote the relative weight of item $i$ in group $g = 1, \ldots, m$, let $\pi_g \ge 0$ with $\pi_\bullet = 1$ denote the relative importance of group $g$, and let $f_i = \sum_g \pi_g f^g_i$ denote the item weights of the whole. Also, let $F[f, Z]$ denote the free energy (10) considered as a function of the two independent variables $f$ and $Z$, and finally let $E(f) = \min_{Z \in \mathcal{Z}} F[f, Z]$.

Theorem 9 (concavity of the effective entropy) The effective entropy is concave in $f$, that is

$E\Big(\sum_g \pi_g f^g\Big) \;\ge\; \sum_g \pi_g\, E(f^g).$

As the proof in the Appendix makes clear, virtually no condition on $S$ in (10) is required to ensure the validity of Theorem 9. By contrast, the concavity of the reduced entropy $R(f)$, although supposedly simpler to manipulate than $E(f)$, seems harder to establish: without additional conditions on $S$, $R(f)$ is not concave in general, as shown by a counter-example with $n = 3$.

In the naive approach, the uniform distribution $f_i = 1/n$ is well known to maximize Shannon entropy $H(f)$. This is not the case anymore for the effective entropy, whose non-uniform maximizing distribution $f^\bullet$ can be explicitly computed for low temperatures:

Theorem 10 (maximum effective entropy) Let $\beta$ be large enough so that $S$ is invertible with $s^{i\bullet} > 0$ for all $i$. Then the item distribution $f^\bullet$ with components $f^\bullet_i = s^{i\bullet}/s^{\bullet\bullet}$ maximizes the effective entropy, with $E(f^\bullet) = \ln s^{\bullet\bullet}$.

Thus $\exp(E(f^\bullet)) = s^{\bullet\bullet}$ measures, for $\beta$ large, the maximum number of effective items, and tends to $n$, as it must, in the naive limit $\beta \to \infty$. By contrast, $f^\bullet$ is a stationary point of the reduced entropy $R(f)$, whose lack of concavity cannot however exclude the presence of multiple extrema in general.
Finally, we address the question of the subadditivity of the effective and reduced entropies, which expresses for Shannon entropy as the independence bound $H(X, Y) \le H(X) + H(Y)$, where equality holds iff $X$ and $Y$ are independent. Here $X$ can be thought of as a categorical variable whose modalities $i = 1, \ldots, n$ are similar in part, as expressed by the $n \times n$ similarity matrix $s^X_{ij}$. Likewise, $Y$ is a categorical variable with modalities $k = 1, \ldots, m$ and associated $m \times m$ similarity matrix $s^Y_{kl}$. The weights of the cross-modalities are $f^{XY}_{(ik)}$, with $f^X_i = f^{XY}_{(i\bullet)}$ and $f^Y_k = f^{XY}_{(\bullet k)}$, and we assume the similarities between cross-modalities to be defined as

$s^{XY}_{(ik)(jl)} = s^X_{ij}\, s^Y_{kl},$

which is equivalent to the definition $d^{XY}_{(ik)(jl)} = d^X_{ij} + d^Y_{kl}$.

Theorem 11 (subadditivity) Consider a bivariate distribution $f^{XY}$ with margins $f^X$ and $f^Y$, and let $\hat{f}^{XY}_{(ik)} = f^X_i f^Y_k$ be the corresponding independent distribution. Then

$E(\hat{f}^{XY}) = E(f^X) + E(f^Y).$

Furthermore, $E(f^{XY}) \le E(\hat{f}^{XY})$, which makes $E$ subadditive.
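The Kronecker construction of cross-similarities can be sketched as follows, with small hypothetical similarity matrices and margins chosen purely for illustration: on the independent distribution the banalities factorize, so the reduced entropy (and, by Theorem 11, the effective entropy) is additive there.

```python
# Illustrative sketch: cross-similarities as a Kronecker product, and a check
# of the additivity of the reduced entropy on the independent distribution.
import numpy as np

def reduced_entropy(f, S):
    return float(-(f * np.log(S @ f)).sum())

S_X = np.array([[1.0, 0.6], [0.6, 1.0]])   # hypothetical 2x2 similarities
S_Y = np.array([[1.0, 0.3], [0.3, 1.0]])
f_X, f_Y = np.array([0.7, 0.3]), np.array([0.6, 0.4])

S_XY = np.kron(S_X, S_Y)        # s_{(ik)(jl)} = s^X_ij * s^Y_kl
f_ind = np.kron(f_X, f_Y)       # independent joint weights
print(reduced_entropy(f_ind, S_XY),
      reduced_entropy(f_X, S_X) + reduced_entropy(f_Y, S_Y))   # equal up to rounding
```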
As the proof in the Appendix makes clear, subadditivity does not hold for the reduced entropy, which, after its lack of concavity (Theorem 9), suffers here a second setback. For a minimal example, consider $n = m = 2$ and weights .42, .28, .18, .12.

Conclusion
This paper has introduced and analyzed the properties of a new diversity index, the effective entropy. It is based upon a formalism which is both tractable and non-trivial, and it is no coincidence that parts of this formalism also serve in statistical mechanics, statistics, operations research and information theory for the exposition and solution of their own relevant issues. The effective entropy appears to satisfy many properties expected from a diversity index, and handles the presence of item (dis)similarities in a systematic, controlled way. It remedies the formal deficiencies of the reduced entropy, for which it provides a lower bound, and provides an explicit mechanism of diversity reduction due to the confusion between close items. For a configuration of weighted items with given pair dissimilarities, the effective and the reduced entropies constitute a one-parameter diversity family indexed by a discriminability or inverse temperature parameter. Both entropies converge towards Shannon entropy in the low-temperature limit, provided items are pairwise distinct. In the high-temperature limit, the behavior of the effective and reduced entropies is governed by the minimum eccentricity, respectively the average eccentricity, that is (up to a factor of two) the Rao quadratic entropy.
Identifying the exact conditions on the (dis)similarities making the reduced entropy concave or subadditive constitutes an obvious future challenge. Determining a well-behaved "power" generalization of the presently proposed "logarithmic" effective entropy, in the spirit of Rényi or Tsallis entropies, constitutes another.

Appendix: Proofs

Proof of (15) Within each equivalence class $C$, the free energy is minimum under independence (within $C$), that is for $z_{ij} = a_j$; the identity $\rho_j = \sum_i f_i z_{ij} = \sum_{i \in C} f_i a_j = F_C a_j$ then entails (15).

Proof of Theorem 1 Differentiating $F[Z] + \sum_i \lambda_i (1 - z_{i\bullet})$ with respect to $z_{ij}$ (where $\lambda_i$ is the Lagrange multiplier associated with the constraint $z_{i\bullet} = 1$) yields the first-order condition $\beta d_{ij} + \ln(z_{ij}/\rho_j) = \mathrm{const}_i$, whose row-normalized solution is (7) with (8); note that the non-negativity of $Z$ is automatically ensured. The second derivative (see, e.g., Bavaud, 2009) can be shown to be positive semi-definite, with zero eigenvalues for eigenvectors parallel to $z_{ij}$ but positive eigenvalues otherwise, implying the strict convexity of $F[Z]$ under admissible variations $z_{ij} \to z_{ij} + \varepsilon_{ij}$ with $\varepsilon_{i\bullet} = 0$, and thus the uniqueness of the optimal solution in the interior of the convex set $\mathcal{Z}$. Summing the first identity in (7) over $\sum_i f_i \cdots$ yields $\rho_j = c_j \rho_j$, that is $c_j = 1$ for $\rho_j > 0$.
The above derivation is incomplete at the boundary of $\mathcal{Z}$, when $\rho_j = 0$ for some $j$ (implying $z_{ij} = 0$ for all $i$). Here the condition $c_j \le 1$ can be derived (with some additional work) by starting from the Kuhn-Tucker conditions of convex optimization, or directly as follows: adding mass on percept $j$ can be performed by transformations $\tilde{z}_{ij}$ depending on parameters $\gamma = (\gamma_i)$ with $\gamma_i \in (0, 1)$, whose free-energy variation $\zeta(\gamma)$ is convex in $\gamma$. Hence $\rho_j = 0$ minimizes the free energy if $\min_\gamma \zeta(\gamma) = \zeta(\gamma^0) \ge 0$. One finds $\gamma^0_i = f_i s_{ij}/(\tau_i c_j)$ and $\zeta(\gamma^0) = -\ln c_j$, and the latter condition becomes $c_j \le 1$.
Proof of Theorem 2 Consider the functional $M[Z, \rho] = \sum_{ij} f_i z_{ij} (\beta d_{ij} + \ln \frac{z_{ij}}{\rho_j})$, where $Z$ is a transition matrix and $\rho$ some distribution. Then (a) $\rho^{(t)} = \arg\min_\rho M[Z^{(t)}, \rho]$ and (b) $Z^{(t+1)} = \arg\min_Z M[Z, \rho^{(t)}]$, so that the iterations (19) alternately minimize $M$, whose value decreases at each step towards $\min_Z F[Z]$. Applying $\sum_i f_i \cdots$ to the second identity in (19) yields $\lim_{t\to\infty} \rho^{(t+1)}_j/\rho^{(t)}_j = c_j$. Thus $c_j < 1$ implies $\rho_j = \lim_{t\to\infty} \rho^{(t)}_j = 0$, in accordance with Theorem 1. The condition $Z^{(0)} \in \mathcal{Z}_{[n]}$ is crucial, since restricting $Z^{(0)}$ to $\mathcal{Z}_A$, say, where $A$ is a strict subset of $[n]$, that is starting with a $Z^{(0)}$ with column margins $z^{(0)}_{\bullet j} = 0$ for $j \notin A$, will automatically generate subsequent transitions $Z^{(t)} \in \mathcal{Z}_A$, converging to the optimal solution of the restricted problem, corresponding to a so-called metastable solution in statistical mechanics, here betrayed by the possible presence of some $c_j > 1$ for $j \notin A$, in violation of Theorem 1.

The iterative process (19) has been rediscovered several times in various disciplines. Known as the Blahut-Arimoto algorithm in information theory, it constitutes a variant of the EM algorithm of Statistics, which (among other purposes) serves to identify the optimal soft clustering minimizing the functional $F_{cluster} = \beta \sum_{i=1}^n \sum_{g=1}^m f_i z_{ig} d_{ig} + K[Z]$. Here $z_{ig} \ge 0$ (with $z_{i\bullet} = 1$) is the membership probability of item $i$ in cluster $g = 1, \ldots, m$, and $d_{ig}$ is the dissimilarity between $i$ and the optimal cluster representative, which turns out to be its weighted barycenter for squared Euclidean dissimilarities (see, e.g., Bavaud, 2009, and references therein). In contrast to problem (6), the cluster representatives $g$ are not constrained to belong to $[n]$: for $m \ge n$, the minimum free energy $F_{cluster}$ provides a lower bound for the effective entropy (6).
Proof of Theorem 3 By Theorem 1, $c_j = 1$ whenever $\rho_j > 0$, thus $S\gamma = \mathbf{1}$ where $\gamma_i = f_i/\tau_i$, by (17). Multiplying by $S^{-1}$ yields $f_i/\tau_i = s^{i\bullet}$, and hence $\rho = S^{-1}\tau$, that is $\rho_j = \sum_i f_i s^{ij}/s^{i\bullet}$, which is (21). The third and fourth identities of (22) are obtained by simple substitution from (7), resp. (14). The identities $\sum_j \hat{z}_{ij} = 1$ and $\sum_i f_i \hat{z}_{ij} = \rho_j$ are straightforward to demonstrate, showing (21) and (22) to yield the optimal solution, provided $\min_j r_j \ge 0$.

Proof of Theorem 4 Expanding the sub-indicators $c_j(\beta)$ around $\beta = 0$ yields (24) and demonstrates the stability of the solution. The high-temperature expansions (25), as well as the exactness of the relation $E = \beta\, \partial$ when $A^\circ$ consists of a single element, follow analogously from (10).

Proof of Theorem 10
For $f = f^\bullet$, one verifies in (21) that $r_j = f^\bullet_j > 0$, hence $\hat{\rho} = f^\bullet$ by (22). As a result, (8) or (22) yield $\hat{\tau}_i = 1/s^{\bullet\bullet}$ for all $i$, so that $E(f^\bullet) = \ln s^{\bullet\bullet}$ by (14), and the derivatives $\partial E/\partial f_i = -\ln \hat{\tau}_i - 1$ are constant in $i$: $f^\bullet$ is thus a stationary point of $E(f)$, and a maximum by concavity (Theorem 9).

Proof of Theorem 11 For the independent distribution $\hat{f}^{XY}$, the optimal transition (7) is simply $\hat{Z}^{XY} = \hat{Z}^X \otimes \hat{Z}^Y$, where $\hat{Z}^X$ and $\hat{Z}^Y$ are the corresponding univariate optimal transitions. In particular, $\hat{\rho}^{XY}_{(jl)} = \rho^X_j \rho^Y_l$, $\hat{\tau}^{XY}_{(ik)} = \tau^X_i \tau^Y_k$, and thus $E(\hat{f}^{XY}) = E(f^X) + E(f^Y)$. Consider now a perturbation of the form $f^{XY}_{(ik)} = \hat{f}^{XY}_{(ik)} + \varepsilon\, h_{(ik)}$, where $\varepsilon$ is small and $h$ obeys $h_{(i\bullet)} = 0$ and $h_{(\bullet k)} = 0$ to preserve the margins. Taylor expansion around $\varepsilon = 0$ yields

$E(f^{XY}) = E(\hat{f}^{XY}) + \varepsilon \sum_{ik} \hat{E}_{(ik)}\, h_{(ik)} + \frac{\varepsilon^2}{2} \sum_{ik,\,jl} \hat{E}_{(ik)(jl)}\, h_{(ik)} h_{(jl)} + o(\varepsilon^2),$

where $\hat{E}_{(ik)}$ and $\hat{E}_{(ik)(jl)}$ are the corresponding derivatives (A.6) and (A.7) evaluated at $\hat{f}^{XY}$. One has $\hat{E}_{(ik)} = -\ln \hat{\tau}^{XY}_{(ik)} - 1 = -\ln \tau^X_i - \ln \tau^Y_k - 1$, and the contribution of this first-order term is zero in view of the conditions imposed on $h$. The contribution of the second-order term is negative in view of the negative semi-definiteness of $\hat{E}_{(ik)(jl)}$ established in the proof of Theorem 9, and finally $E(f^{XY}) \le E(\hat{f}^{XY})$, showing $\hat{f}^{XY}$ to be a local maximum of the effective entropy within the set of bivariate distributions with fixed margins. The latter set being convex, this local maximum is the global maximum in view of the concavity of $E(f)$.
In particular,ρ XY (j l) = ρ X j ρ Y l ,τ XY (ik) = τ X i τ Y k and thus E f XY = E f X + E f Y . Consider now a perturbation of the form f XY (ik) =f XY (ik) + εh (ik) , where is small and h obeys h (i•) = 0 and h (•k) = 0 to preserve the margins. Taylor expansion around = 0 yields whereÊ (ik) andÊ (ik)(j l) are the corresponding derivatives (A.6) and (A.7) evaluated at f XY . One hasÊ (ik) = − lnτ XY (ik) − 1 = − ln τ X i − ln τ Y k − 1, and the contribution of this first order term is zero in view of the conditions imposed on h. The contribution of the second order term is negative in view of the negative semi-definiteness ofÊ (ik)(j l) established in the proof of Theorem 9, and finally E f XY showingf XY to be a local maximum of the effective entropy within the set of bivariate distributions with fixed margins. The latter set being convex, this local maximum is the global maximum in view of the concavity of E(f ).