Introduction

In the absence of a quantitative definition, papers on collective behavior (including this one) often begin by listing well-known examples of collective systems, like fish schools or bird flocks. This gives a useful reference point for the reader, but offers little guidance on what to consider “collective” in other systems and behaviors. Even a canonical example of collective behavior like a fish school may vary in the degree of coordinated movement over time and transition between periods of ordered movement and disordered aggregation (Tunstrøm et al. 2013). Moreover, different parts of the same school may be more or less locally coordinated, or preferentially coordinated with only subsets of the larger group (for example, in mixed-species assemblies; Ward et al. 2018; Gil et al. 2018). Schools can also vary widely in size across and within species and environments. Millions of sardines moving together may be clearly collective, but a school of two is less clearly so.

Rather than searching for a heuristic distinction between “collective” and “not collective,” this paper investigates a principled measure of one of its defining characteristics: coordination. Typically coordination is measured in a system- and behavior-dependent way. For example, the average alignment of the headings of all the fish in a group provides a useful order parameter that indicates coordinated movement when high, and disordered aggregation when low (see, e.g., Couzin et al. 2002; Tunstrøm et al. 2013). While this same order parameter can be useful in other systems exhibiting collective movement, such as locusts (e.g., Buhl et al. 2006), it would have less utility for describing the degree of coordinated behavior in the nest-site selection process of honeybees (Seeley and Visscher 2004), bridge formation (Reid et al. 2015) and foraging decisions (Greene and Gordon 2007) in ants, social conflict policing in macaques (Flack et al. 2006), quorum sensing in bacteria (Papenfort and Bassler 2016), or neuronal avalanches in slices of neocortex (Beggs and Plenz 2003). System-specific measures are useful in their relevant context, but they make quantitative comparisons of coordination across systems, or even between behaviors within the same system, difficult.

Instead, in this paper we explore a system-independent measure of coordinated behavior based on a dimensionless information-theoretic measure of dependence. This measure quantifies the relative degree of statistical dependence shared by a set of elements (individuals) in any system, allowing the degree of macroscopic coordination to be quantified and compared across systems of any size. We demonstrate the practical utility of this measure in a classic model of collective behavior. Moreover, we provide a method that uses this measure to find the natural decompositions of a system into its most coordinated components. These decompositions provide mesoscale descriptions of the system that may offer a useful basis on which to make inferences about intermediate-scale social forces governing large-scale group behavior. Finally, we demonstrate the application of this method to both simulated and empirically recorded systems to show its utility in practice.

Results

Redundancy as a measure of coordination

Let \(S = \{1, 2, \ldots , n\}\) be the indices of a set of random variables, \(\{X_{i}\}_{i \in S}\), which in general may be neither identically distributed nor independent. In the context of a fish school or a bird flock, this could be the set of all the velocity vectors of the individuals in the group; for neurons, this could be the state of each neuron (firing or silent). In general, it could be any heterogeneous assemblage of the microscopic observables of a system. If we were asked to faithfully record the current state of the whole group, one strategy would be to simply write down a description of each element separately. One of the foundational results from information theory is that no lossless description of a random variable can be shorter on average than the tight lower bound given by its entropy (Shannon 1948). Thus a description of the system given by recording every element separately would require on average a minimum of \(\sum _{i \in S} H(X_i)\) bits, where \(H(X_i)\) is the entropy of \(X_i\).

Alternatively, another strategy would be to instead write down a shared (or ‘joint’) description of all elements at once. A joint description can capitalize on the dependencies among a set of variables to reduce the overall description length needed. For example, to characterize the state of both a lamp and the light switch that controls it, one could simply record the on/off state of one of the two components. Knowing the state of either the switch or the lamp automatically tells us the state of the other, under perfect operating conditions. For less than perfect operating conditions, it will be necessary to include additional information about the state of the other component, but only as frequently as the light switch fails to determine the state of the lamp. In either case, the joint entropy of the lamp and the light switch together determines the lower bound on the lossless joint description of the system. Thus the smallest lossless joint description requires \(H(\{X_i\}_{i \in S})\) bits on average, where we are guaranteed that \(H(\{X_{i}\}_{i \in S}) \le \sum _{i \in S} H(X_i)\).

In fact, the only way in which the joint description is as costly as the sum of the individual (or ‘marginal’) descriptions is if all \(X_i\)’s are independent. The difference between the marginal and joint descriptions, given by

$$\begin{aligned} I(\{X_{i}\}_{i \in S})&= \sum _{i \in S} H(X_i) - H(\{X_{i}\}_{i \in S}), \end{aligned}$$
(1)

gives us a natural measure of how much we reduce the fundamental representation cost by using a joint, rather than a marginal, description. Another way to think about Eq. 1 is as a measure of redundancy: the amount of information that is made redundant (unnecessary) when describing \(\{X_{i}\}_{i \in S}\) as a whole rather than by parts. A similar interpretation can be found in Watanabe’s (1960) original investigation of Eq. 1 as a general measure of multivariate correlation (also called “total correlation”).

Notably, redundancy in the absolute sense given by Eq. 1 scales in magnitude with the size of the system. For example, if we take n identical copies of the same random variable, X, then we have \(I(\{X_{i}\}_{i \in S}) = (n-1) H(X)\). This is a useful property for a measure of collective behavior, in the sense that just two or three of something behaving similarly is less “collective” than hundreds or thousands. On the other hand, the \(H(X)\) term indicates that this measure also scales with the magnitude of the individual variability in behavior (Fig. 1, left). This is orthogonal to what is typically meant by “collective.” A school of fish swimming slowly or quickly through the coral of a reef ought to be “collective” to the same degree provided their movement decisions depend on one another to the same degree, rather than depending additionally on the range and variability of individual decisions that could be made. To reflect this invariance to the magnitude of individual variability, it is useful to consider instead the relative redundancy (normalized total correlation), i.e.,

$$\begin{aligned} r = \frac{I(\{X_{i}\}_{i \in S})}{\sum _{i \in S} H(X_i)} = 1 - \frac{H(\{X_{i}\}_{i \in S})}{\sum _{i \in S} H(X_i)} = 1 - s, \end{aligned}$$
(2)

where s is then the proportion of non-redundant, or incompressible, information in the set. Using the same example as before, for n identical copies of X, \(r = 1 - \frac{1}{n}\), which is invariant to \(H(X)\), while still increasing with n (Fig. 1, right).
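
To make Eqs. 1 and 2 concrete, the following minimal sketch (not part of the original analysis; the failure probability and the choice of bits as units are illustrative) computes the total correlation and relative redundancy of the lamp/light-switch pair described above.

```python
# Illustrative sketch of Eqs. 1 and 2 for the lamp/light-switch example.
# The failure probability p_fail is a made-up value for illustration.
import numpy as np

def entropy(p):
    """Shannon entropy (bits) of a probability vector."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

p_fail = 0.05  # probability that the lamp's state does not match the switch
# Joint distribution over (switch, lamp), each in {off, on}.
joint = np.array([[0.5 * (1 - p_fail), 0.5 * p_fail],
                  [0.5 * p_fail,       0.5 * (1 - p_fail)]])

H_switch = entropy(joint.sum(axis=1))  # marginal entropy of the switch (~1 bit)
H_lamp   = entropy(joint.sum(axis=0))  # marginal entropy of the lamp (~1 bit)
H_joint  = entropy(joint.ravel())      # joint entropy of the pair (~1.29 bits)

total_correlation   = H_switch + H_lamp - H_joint               # Eq. 1 (~0.71 bits)
relative_redundancy = total_correlation / (H_switch + H_lamp)   # Eq. 2 (~0.36)
print(total_correlation, relative_redundancy)
```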

Fig. 1

(Left) Schematic description of a system, \(\{X_{i}\}_{i \in S}\), by its total correlation (y-axis), measuring dependence, and the total marginal entropy of its elements (x-axis). The 1–1 line and shaded area above are not achievable. Biological system examples may include starling flocks or fish schools (dynamic and dependent); whirligig beetle rafts (static and dependent); fish schools in a swarm (disordered) state (static and independent); and swarms of gnats (dynamic and independent). (Right) Feasible (white) and infeasible (shaded) redundancies for systems of a given size, n. The upper bound is given by a system in which every element is perfectly dependent on every other element (so knowing the state of one element is as good as knowing the state of every element in the system). The lower bound is zero, which occurs when all elements are independent

In general, the upper bound of relative redundancy for a fixed n is invariant to rescaling of the individual entropies, but sensitive to variability in the set of entropies. To see this, note that \(H(\{X_{i}\}_{i \in S}) \ge \max _{i \in S} H(X_i)\), so that

$$\begin{aligned} 0 \le \frac{I(\{X_{i}\}_{i \in S})}{\sum _{i \in S} H(X_i)} \le 1 - \frac{\max _{i \in S} H(X_i)}{\sum _{i \in S} H(X_i)} < 1, \end{aligned}$$
(3)

for any set of \(X_i\) (i.e., not necessarily all identical as in the prior example). Then rescaling all \(H(X_i)\) by a constant factor does not change the upper bound, and the upper bound is closest to 1 when all \(H(X_i)\) are equal. This last property also fits the intuitive definition of “collective,” in the sense that elements of a system behaving similarly should have similar variability in their individual behaviors.

To summarize, relative redundancy has the following properties useful for measuring coordination in collective behavior:

  1. It increases the more the behavior of any one element in the system is informative about the behavior of all the other elements in the system.

  2. Its upper bound increases as the number of individual elements in the system increases (yet remains on a zero-to-one scale).

  3. It increases with increasing similarity in the variability of individual behavior.

  4. It is invariant to the total amount of individual variability within the system.

As an example, swarms of gnats forming large mating groups would likely score low on this measure of collectivity (provided the microscopic property being measured is individual movement). While gnats within the swarm may have similar levels of variability in their velocities, their movements are relatively independent. In comparison, large groups of fireflies flashing in unison (provided the microscopic property measured is the on/off state of the firefly’s bioluminescent abdomen) should score high on the relative redundancy scale, regardless of species variability in the frequency of flashing. Relative redundancy should also give a graded distinction between “shoaling” and “schooling” in fish, based on the degree of coordinated movement behavior within the group (resulting in low and high relative redundancy, respectively).

Practical application

Computing relative redundancy in practice is challenging. Estimating the mutual information between just two variables (equivalently, the \(n=2\) case for Eq. 1), or the entropy of a single variable, runs into sampling problems and issues of estimator bias (Paninski 2003). While there may be no universal solution, for systems with continuous microscopic properties (the quantities of each element of the system for which we would like to measure coordination across the system), we can still make progress by maximizing a lower bound on redundancy instead.

First, for continuous random variables that are marginally Gaussian with system-wide correlation matrix \(P_S\), the Gaussian mutual information,

$$\begin{aligned} I_G(\{X_{i}\}_{i \in S}) = -\frac{1}{2} \log \det (P_S), \end{aligned}$$
(4)

is a lower bound on the total mutual information (Foster and Grassberger 2011; Kraskov et al. 2004). Since the marginals are continuous and Gaussian, each element has differential entropy

$$\begin{aligned} h_G(X_i) = \frac{1}{2} \log \left[ (2\pi e)^{k_i} \det (K_i) \right] , \end{aligned}$$
(5)

where \(K_i\) is the covariance matrix of \(X_i\), and \(k_i\) is the number of variates of element i. Unfortunately, while \(I_G(\cdot )\) is nonnegative, the differential entropy \(h_G(\cdot )\) can be positive or negative. Fortunately, for an arbitrarily precise \(\alpha\)-bit quantization of \(X_i\), its discrete entropy is approximated by \(h(X_i) + \alpha\) (see Theorem 8.3.1 in Cover and Thomas 2006). Since the choice of \(\alpha\) is arbitrary, we can choose it such that the differential entropies for the system are all positive. The choice of quantization cancels out in the numerator and only affects the denominator, giving

$$\begin{aligned} r&\ge \frac{I_G(\{X_{i}\}_{i \in S})}{\alpha + \sum _{i \in S} h_G(X_i)}, \end{aligned}$$
(6)

which is simple to compute in practice. However, since the quantization level, \(\alpha\), changes the scaling, when making cross-system comparisons one must be sure to compute redundancy using the same \(\alpha\) across all systems.

In general, when the random variables comprising the system are not marginally Gaussian, this lower bound can still be helpful. By substituting rank-transformed variables \(G_i\) for \(X_i\) in the numerator, where each \(G_i\) is constructed to be marginally Gaussian distributed, the numerator remains a useful lower bound on the total correlation among the \(X_i\) (by extension of Foster and Grassberger 2011; Kraskov et al. 2004, to the multivariate case). This essentially measures the strength of any monotonic pairwise relationships among the system elements. The Gaussian differential entropies in the denominator are also upper bounds on the differential entropies of any continuous \(X_i\) with the same means and (co)variances. Thus redundancy is lower bounded by these two quantities for any continuous \(X_i\). Better or possibly even exact estimates of r may be possible depending on the system and microscopic variables at play; in any case, Eq. (2) still gives the correct system-independent blueprint for measuring coordination.
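
As an illustration of how this bound can be computed, the sketch below assumes one continuous scalar observable per element per time step; the normal-scores transform, the value of \(\alpha\), and the example data are illustrative choices rather than a reference implementation.

```python
# A minimal sketch of the Gaussian lower bound on relative redundancy,
# Eqs. (4)-(6), assuming one continuous scalar observable per element.
# The normal-scores transform, alpha, and test data are illustrative.
import numpy as np
from scipy.stats import norm, rankdata

def gaussian_redundancy_bound(X, alpha=20.0):
    """Lower bound on r for data X of shape (T, n): T observations, n elements.

    alpha is the quantization constant (in bits); it must make the
    denominator positive and be held fixed across systems being compared.
    """
    T, n = X.shape
    # Normal-scores (rank) transform: each column becomes marginally Gaussian
    # while preserving monotone dependencies (numerator of Eq. 6).
    G = norm.ppf((rankdata(X, axis=0) - 0.5) / T)
    # Gaussian total correlation, Eq. (4), in bits.
    _, logdet = np.linalg.slogdet(np.corrcoef(G, rowvar=False))
    I_G = -0.5 * logdet / np.log(2)
    # Gaussian marginal differential entropies of the original variables,
    # Eq. (5) with k_i = 1, in bits.
    h_G = 0.5 * np.log2(2 * np.pi * np.e * X.var(axis=0, ddof=1))
    # Lower bound on relative redundancy, Eq. (6).
    return I_G / (alpha + h_G.sum())

# Example: 10 noisy observations of one shared latent signal.
rng = np.random.default_rng(0)
z = rng.normal(size=(1000, 1))
X = z + 0.5 * rng.normal(size=(1000, 10))
print(gaussian_redundancy_bound(X))   # a conservative lower bound on r
```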

Fig. 2

Example of redundancy as a measure of coordination in a system. (Top left) Redundancy for Vicsek model simulations (black dots) with \(n = 50, d = 0.2, L=1\), and \(\eta\) (noise in individual headings) varying along the x-axis. Simulations were run for 500 iterations to reach steady state, then redundancy was computed based on the subsequent 1000 iterations. Estimated mode(s) of simulation redundancy distributions at each noise level are shown (purple points); lines connect adjacent modes based on a threshold distance linking criterion. The x-axis scale is linear to the right of \(\eta = 1\) (dashed vertical line), and \(\log _{10}\) to the left. Qualitative descriptors of the system state, from disordered to dynamic order to coherent order are annotated on the plot. (Bottom left) Time-averaged mean alignment (red), the typical order parameter for the Vicsek model, as a function of noise, \(\eta\). The standard deviation in mean alignment over time is also shown in blue (corresponding to blue axis at right). (Right) Snapshots of simulated systems in the coherent order (high redundancy), dynamic order (intermediate redundancy), and disordered (low redundancy) states. Agent positions (black points) and headings (black lines) shown in simulated two-dimensional space with periodic boundary conditions

As a simple numerical application using the above redundancy bound, Fig. 2 explores the Vicsek et al. (1995) model of collective motion with alignment only, i.e.,

$$\begin{aligned} \theta _i(t+1)&= \bar{\theta }_i(t) + \epsilon _i(t), \end{aligned}$$
(7)
$$\begin{aligned} \mathbf {x}_i(t+1)&= \mathbf {x}_i(t) + \mathbf {v}_i(t) \Delta t, \end{aligned}$$
(8)

where \(\mathbf {x}_i(t)\) is the position of individual i at discrete time step t, \(\mathbf {v}_i(t)\) is individual i’s velocity at time t, given by its heading, \(\theta _i(t)\), and a constant speed, c (fixed at 0.03 to match Vicsek et al. 1995), \(\bar{\theta }_i(t)\) is the angular average heading of i and all neighbors within a distance d at time t, and \(\epsilon _i(t)\) is drawn i.i.d. from a uniform distribution on the interval \(\left[ -\eta /2, \eta /2 \right]\). In this well-studied system, redundancy (Fig. 2, Top left) shows the same phase transition from disorder to order when varying the noise parameter \(\eta\) as seen in the system-specific order parameter of average alignment (Fig. 2, Bottom left). Interestingly, it also shows an apparently discontinuous transition with a bistable region in the ordered regime, which to our knowledge has not been reported before. This appears to distinguish between “dynamic order” (in which there are still fluctuations in average alignment over time across the group) and “coherent order” (in which the group is almost always aligned). A detailed investigation of this transition is beyond the scope of this study and is left for future work. However, based on a visual inspection of the emergent dynamics, it seems likely that the observed discontinuous transition is related to the orientation correlation length exceeding the finite system size, whereas the bistability emerges from different spatial configurations exhibiting either coherent or dynamic order at the same noise values.
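
For reference, a minimal sketch of this alignment-only model is given below (it is not the authors’ simulation code; the defaults follow the parameter values quoted above, while the noise value, number of steps, and burn-in handling are illustrative). The resulting heading time series can then be passed to a redundancy estimator such as the Gaussian bound sketched earlier.

```python
# A minimal sketch (not the authors' code) of the alignment-only Vicsek
# model of Eqs. (7)-(8). Defaults follow the parameters quoted in the text
# (n = 50, d = 0.2, L = 1, c = 0.03); eta and the number of steps are
# illustrative.
import numpy as np

def vicsek(n=50, d=0.2, L=1.0, c=0.03, eta=0.5, steps=1500, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.uniform(0, L, size=(n, 2))           # positions
    theta = rng.uniform(-np.pi, np.pi, size=n)   # headings
    headings = np.empty((steps, n))
    for t in range(steps):
        # Pairwise displacements under periodic boundary conditions.
        dx = x[:, None, :] - x[None, :, :]
        dx -= L * np.round(dx / L)
        adj = ((dx ** 2).sum(-1) <= d ** 2).astype(float)  # neighbors, incl. self
        # Angular average heading over each neighborhood, plus noise (Eq. 7).
        mean_theta = np.arctan2(adj @ np.sin(theta), adj @ np.cos(theta))
        theta = mean_theta + rng.uniform(-eta / 2, eta / 2, size=n)
        # Position update at constant speed c (Eq. 8).
        x = (x + c * np.column_stack((np.cos(theta), np.sin(theta)))) % L
        headings[t] = theta
    return headings

headings = vicsek()   # shape (steps, n); discard a burn-in period before analysis
```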

Redundancy partitioning for system structure

While relative redundancy (resp. incompressibility) can be used to compare the degree of collectivity exhibited by very different systems, it can also be used to characterize the dependency structure within a given system. Writing the relative redundancy as a function of a subset of the system, \(A \subseteq S\), we have

$$\begin{aligned} r(A)&= 1 - \frac{H(\{X_{i}\}_{i \in A})}{\sum _{i \in A} H(X_i)}. \end{aligned}$$
(9)

What divisions of a system maximize the relative redundancy of each subset?

To make this question concrete, let \(\widehat{S}\) be a set of indices for a collection of subsets of S, which we will refer to as the components of system S. That is, let \(\widehat{S} = \{1, 2, \ldots , m\}\), where typically \(m \le n\), and introduce a probabilistic assignment p(j|i), \(\forall (i,j) \in (S, \widehat{S})\), which can be read as the probability that element i belongs to component j. Then the expected quality of an assignment to a given component is

$$\begin{aligned} \mathbb {E}\left[ r(A)|j\right]&= \sum _{A \in \mathcal {P}(S)} r(A) p(A|j), \end{aligned}$$
(10)

where \(\mathcal {P}(S)\) is the power set (set of all subsets) of S, and

$$\begin{aligned} p(A|j)&= \prod _{i \in A} p(j|i) \prod _{i \in {A}^{\mathsf {c}}} \left[ 1 - p(j|i)\right] , \end{aligned}$$
(11)

is the probability of subset A given the assignments of elements to component j, by a simple counting argument. Treating the quality of each component equally, the expected quality over all components is then

$$\begin{aligned} \mathbb {E}\left[ r(A)\right]&= \frac{1}{m}\sum _{j \in \widehat{S}} \mathbb {E}\left[ r(A)|j\right] . \end{aligned}$$
(12)

Note that the redundancy of any individual element, i.e., \(r\left( \{1\}\right)\), is equal to zero according to Eq. 9. For continuity, we define the redundancy of the empty set, \(r\left( \{\}\right)\), to be zero. A visual example of dividing a system into different numbers of components and measuring component redundancy is illustrated in Fig. 3.
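
As a concrete, brute-force illustration of Eqs. (10)–(12) for small n, the expected component redundancy of a probabilistic assignment can be computed by enumerating the power set directly; in the sketch below, the toy redundancy function and the example hard assignment are made up for illustration and are not part of the analysis.

```python
# Exact (power-set) evaluation of Eqs. (10)-(12) for small n; the toy
# redundancy function r_of and the example assignment are illustrative.
from itertools import combinations
import numpy as np

def expected_component_redundancy(P, r_of):
    """P: assignment matrix of shape (n, m) with P[i, j] = p(j | i)."""
    n, m = P.shape
    total = 0.0
    for j in range(m):
        exp_rj = 0.0
        for size in range(n + 1):
            for A in combinations(range(n), size):
                inside = list(A)
                outside = [i for i in range(n) if i not in A]
                # Probability of subset A given assignments to component j (Eq. 11).
                pA = np.prod(P[inside, j]) * np.prod(1.0 - P[outside, j])
                exp_rj += r_of(A) * pA              # expectation of Eq. (10)
        total += exp_rj
    return total / m                                # equal weighting, Eq. (12)

# Toy example: 4 elements, hard assignment into components {0, 1} and {2, 3},
# with a made-up r(A) that rewards keeping {0, 1} or {2, 3} together.
def r_of(A):
    A = set(A)
    if A in ({0, 1}, {2, 3}):
        return 0.8
    return 0.3 if len(A) > 1 else 0.0   # singletons and the empty set score zero

P = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
print(expected_component_redundancy(P, r_of))   # 0.8
```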

Fig. 3

(Left) Example probabilistic assignments of \(n = 5\) variables (i.e., \(X_1, X_2, \ldots , X_5\)), to \(m = 1, 2, \text {and}\,3\) components. Probabilistic assignments of \(i \in S\) to \(j \in \widehat{S}\), written p(j|i), are shown as matrices of dimension n rows by m columns. Each cell is shaded according to the probability of assignment, ranging from 0 to 1 (white to black), with each row summing to 1. The \(m = 2\) case illustrates a “soft assignment,” in which there are assignment probabilities between 0 and 1. The \(m = 3\) case illustrates a “hard assignment,” in which each assignment probability is only either 0 or 1. (Right) The “quality” of the j-th component is measured in terms of its expected redundancy: \(\mathbb {E}\left[ r(A) | j \right]\). The expectation is over the distribution of possible sets, p(A|j), which is a function of the probabilistic assignments p(j|i) (see Eq. 11)

Rate-distortion theory

While this gives us a natural way to evaluate the quality of a given assignment, it does not immediately provide us with a way to find such an assignment. Instead, we draw inspiration from the information-theoretic treatment of compression given by rate-distortion theory (see Shannon 1959; Cover and Thomas 2006). Classical rate-distortion theory addresses the following problem: given a source (random variable) X, a measure of distortion, d, and an allowable level of average distortion D, determine the minimum amount of information necessary for a compressed description of X that introduces an average distortion no more than D. I.e.,

$$\begin{aligned} R(D) = \min _{p(\hat{x}|x)\,:\,\mathbb {E}d(x,\hat{x})\,\le \,D} I(X;\widehat{X}), \end{aligned}$$
(13)

where the rate, R(D), equals the minimum amount of information (measured in bits per symbol, hence “rate”) needed for average distortion D. In this case, the rate measures the information, \(I(X;\widehat{X})\), that the compressed representation, \(\widehat{X}\), needs to keep about the source, X, where

$$\begin{aligned} I(X;\widehat{X})&= \sum _{x,\hat{x}} p(x,\hat{x}) \log \frac{p(x,\hat{x})}{p(x)p(\hat{x})} \end{aligned}$$
(14)

is the mutual information between X and \(\widehat{X}\). The lower the rate, the better the compression, but (depending on the source and the distortion measure) the higher the average distortion introduced. Surprisingly, not only can the rate-distortion curve be characterized numerically in general, but the minimal compressed representation of X can also be found via a simple, iterative, alternating minimization algorithm (Blahut 1972; Arimoto 1972).
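
For readers unfamiliar with this machinery, the sketch below shows the classic Blahut–Arimoto fixed-point iteration for Eq. (13) on a toy source (the binary source, Hamming distortion, and \(\beta\) value are illustrative); the redundancy-partitioning update derived next follows the same alternating pattern.

```python
# A minimal sketch of the classic Blahut-Arimoto iteration for Eq. (13).
# The source distribution, distortion matrix, and beta are illustrative.
import numpy as np

def blahut_arimoto(p_x, D, beta, iters=200):
    """p_x: source distribution (length nx); D: distortion matrix (nx, nxhat)."""
    nx, nxhat = D.shape
    q = np.full(nxhat, 1.0 / nxhat)           # marginal over reproduction symbols
    for _ in range(iters):
        # Optimal conditional p(xhat | x) given the current marginal q.
        w = q[None, :] * np.exp(-beta * D)
        p_cond = w / w.sum(axis=1, keepdims=True)
        # Updated marginal given the conditional.
        q = p_x @ p_cond
    rate = np.sum(p_x[:, None] * p_cond * np.log2(p_cond / q[None, :]))
    distortion = np.sum(p_x[:, None] * p_cond * D)
    return rate, distortion

# Uniform binary source with Hamming distortion.
print(blahut_arimoto(np.array([0.5, 0.5]),
                     np.array([[0.0, 1.0], [1.0, 0.0]]), beta=3.0))
```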

Redundancy partitioning

Though there are important differences from rate-distortion theory (discussed in “Appendix 1”), we can similarly frame the problem of finding structure based on redundancy as a compression problem. Here, we wish to find the assignment of elements of S to components of \(\widehat{S}\) that achieves an average redundancy no less than \(r^*\), and otherwise preserves as little about the original identities of the elements as possible. I.e.,

$$\begin{aligned} R(r^*) = \min _{p(j|i)\,:\,\mathbb {E}\left[ r(A)\right] \,\ge \,r^*} I(S;\widehat{S}), \end{aligned}$$
(15)

where p(j|i) is further required to be nonnegative and sum to one. This is not a standard rate-distortion problem, but we can use many of the same ideas developed by Blahut (1972) and Arimoto (1972) in their original numerical algorithms for deriving a practical solution. We give a brief account of this derivation here; see “Appendix 1” for a complete account.

Introducing Lagrange multipliers to enforce the constraints \(\sum _{j \in \widehat{S}} p(j|i) = 1\) (non-negativity will be enforced by the form of the solution), the variational problem becomes

$$\begin{aligned} L\left[ p(j|i) \right]&= I(S;\widehat{S}) - \beta \sum _{j \in \widehat{S}, A \in \mathcal {P}(S)} r(A) p(A|j) + \sum _{i \in S} \lambda (i) \sum _{j \in \widehat{S}} p(j|i), \end{aligned}$$
(16)

where \(\beta\), the Lagrange multiplier for the average redundancy constraint, absorbs the 1/m term. Taking the derivative with respect to a particular \(p(j'|i')\), we have

$$\begin{aligned} \frac{\partial }{\partial p(j'|i')} L\left[ p(j|i) \right]&= p(i') \log \frac{p(j'|i')}{p(j')} - \beta \sum _{j \in \widehat{S}, A \in \mathcal {P}(S)} r(A) \frac{\partial p(A|j)}{\partial p(j'|i')} + \lambda (i'), \end{aligned}$$
(17)

where

$$\begin{aligned} \frac{\partial p(A|j)}{\partial p(j'|i')}&= \left\{ \begin{array}{ll} 0 &{}\quad \text{ if } \,\, j \ne j', \\ f_{i'}(A|j') &{}\quad \text{ if } \,\, j = j', i' \in A, \\ -f_{i'}(A|j') &{}\quad \text{ if } \,\, j = j', i' \in {A}^{\mathsf {c}}, \end{array} \right. \end{aligned}$$
(18)

and

$$\begin{aligned} f_i(A|j)&= \prod _{k \in A \setminus \{i\}} p(j|k) \prod _{k \in {A}^{\mathsf {c}} \setminus \{i\}} \big [1 - p(j|k)\big ], \end{aligned}$$
(19)

where \(A \setminus \{i\}\) is the relative complement of the singleton set \(\{i\}\) with respect to A.

Then setting \(\partial L / \partial p(j'|i') = 0\) and splitting the sum over \(\mathcal {P}(S)\) into terms with and without \(i' \in A\), we have

$$\begin{aligned} \begin{aligned} p(i') \log \frac{p(j'|i')}{p(j')}&= \beta \sum _{\{A \in \mathcal {P}(S)\,:\,i' \in A \}} r(A) f_{i'}(A|j') \\&\quad - \beta \sum _{\{A \in \mathcal {P}(S)\,:\,i' \in {A}^{\mathsf {c}} \}} r(A) f_{i'}(A|j') \\&\quad - \lambda (i'). \end{aligned} \end{aligned}$$
(20)

Let

$$\begin{aligned} d(i,j)&= \frac{1}{p(i)} \sum _{\{A \in \mathcal {P}(S)\,:\,i \in A \}} r(A) f_{i}(A|j), \end{aligned}$$
(21)

and define \(d_{\mathsf {c}}(i,j)\) to be identical except substituting \(i \in {A}^{\mathsf {c}}\) for \(i \in A\). Lastly, let \(\Delta d(i, j) = d(i,j) - d_{\mathsf {c}}(i,j)\). Then, dividing through by \(p(i')\) and substituting, we have,

$$\begin{aligned} \log \frac{p(j'|i')}{p(j')}&= \beta \Delta d(i',j') - \frac{\lambda (i')}{p(i')}. \end{aligned}$$
(22)

Finally, substituting \(\log \mu (i') = \lambda (i') / p(i')\) and solving for \(p(j'|i')\),

$$\begin{aligned} p(j'|i')&= \frac{p(j')}{\mu (i')} e^{\beta \Delta d(i',j')}. \end{aligned}$$
(23)

Enforcing the constraint that \(\sum _{j \in \widehat{S}} p(j|i') = 1\) and simplifying notation, we have

$$\begin{aligned} p(j|i)&= \frac{p(j) e^{\beta \Delta d(i,j)}}{\sum _{j' \in \widehat{S}} p(j') e^{\beta \Delta d(i,j')}}. \end{aligned}$$
(24)

Before moving on, it is worth noting that \(\Delta d(i,j)\) has a simple and intuitive interpretation. It is the expected difference in the redundancy of component j when i is included versus when it is excluded, weighted by the relative importance of i.

Note that p(j) and p(A|j) depend on the choice of p(j|i). The final algorithm,

$$\begin{aligned} \left\{ \begin{array}{rl} p_t(j|i) &{}= \frac{p_t(j) e^{\beta \Delta d(i,j)}}{\sum _{j' \in \widehat{S}} p_t(j') e^{\beta \Delta d(i,j')}}, \\ p_{t+1}(j) &{}= \sum _{i \in S} p_t(j|i) p(i), \\ p_{t+1}(A|j) &{}= \prod _{i \in A} p_t(j|i) \prod _{i \in {A}^{\mathsf {c}}} \big [1 - p_t(j|i) \big ],\\ \end{array} \right. \end{aligned}$$
(25)

follows a similar alternating minimization scheme to the one developed by Blahut and Arimoto and generalized by Csiszár and Tusnády (1984), albeit with only local optimality guarantees, similar to Tishby et al. (1999) and Banerjee et al. (2005). See "Appendix 1" and Fig. 8 for a complete derivation and description of the algorithm.
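
A minimal, exact (power-set) sketch of the iteration in Eq. (25) for small n is given below. It assumes uniform element weights \(p(i) = 1/n\), reuses a callable r_of(A) for the component redundancy of Eq. (9) (for example, the Gaussian bound above), and uses an illustrative random initialization; it is not the authors’ reference implementation, which is linked at the end of this section.

```python
# Exact small-n sketch of the redundancy-partitioning iteration, Eq. (25).
# Assumes uniform weights p(i) = 1/n; r_of(A) returns the redundancy of a
# subset of element indices (Eq. 9). Initialization, iteration count, and
# the numerical stabilization are illustrative implementation choices.
from itertools import combinations
import numpy as np

def redundancy_partition(r_of, n, m, beta=200.0, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    p_i = np.full(n, 1.0 / n)                        # element weights p(i)
    P = rng.dirichlet(np.ones(m), size=n)            # p(j | i), shape (n, m)
    p_j = P.T @ p_i                                  # component marginals p(j)
    subsets = [frozenset(A) for size in range(n + 1)
               for A in combinations(range(n), size)]
    r_vals = {A: r_of(A) for A in subsets}

    def delta_d(P):
        # Delta d(i, j): expected gain in r for component j from including i.
        dd = np.zeros((n, m))
        for i in range(n):
            rest = [k for k in range(n) if k != i]
            for j in range(m):
                for A in subsets:
                    # f_i(A | j), Eq. (19): product over all elements except i.
                    f = np.prod([P[k, j] if k in A else 1.0 - P[k, j] for k in rest])
                    dd[i, j] += (1.0 if i in A else -1.0) * r_vals[A] * f / p_i[i]
        return dd

    for _ in range(iters):
        dd = delta_d(P)
        # Eqs. (24)-(25), stabilized by subtracting the per-row maximum.
        logw = np.log(p_j + 1e-300)[None, :] + beta * (dd - dd.max(axis=1, keepdims=True))
        W = np.exp(logw)
        P = W / W.sum(axis=1, keepdims=True)
        p_j = P.T @ p_i
    return P
```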

One immediate issue is the \(2^n\) scaling of the number of subsets of S as n (the number of elements of S) increases. First, it is worth noting that there are non-trivial collective systems of empirical interest even for small n. Exact computation is feasible up to around \(n \approx 15\) even on consumer hardware, which is relevant for many experimental systems (as in, e.g., Miller and Gerlai 2007; Katz et al. 2011; Jolles et al. 2018). Second, for larger systems, Monte Carlo estimation of \(\Delta d(i,j)\) can be readily employed, e.g., for K samples,

$$\begin{aligned} \begin{array}{rll} \widehat{d}(i,j) &{}= \displaystyle \frac{1}{p(i) K} \sum _{k = 1}^K r\big (A_{ij} \cup \{i\}\big ),\\ \widehat{d}_{\mathsf {c}}(i,j) &{}= \displaystyle \frac{1}{p(i) K} \sum _{k = 1}^K r\big (A_{ij} \setminus \{i\}\big ), &{}\quad \text {where}\,A_{ij} \sim f_i(\cdot |j). \end{array} \end{aligned}$$
(26)
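
A sketch of this estimator is shown below (the sample count K and the uniform weighting \(p(i) = 1/n\) are illustrative); subsets are sampled from \(f_i(\cdot |j)\) by including each other element k independently with probability p(j|k).

```python
# A minimal sketch of the Monte Carlo estimator of Eq. (26). K and the
# uniform p(i) = 1/n weighting are illustrative choices.
import numpy as np

def delta_d_mc(r_of, P, i, j, K=200, rng=None):
    """Estimate Delta d(i, j) given the assignment matrix P[i, j] = p(j | i)."""
    rng = np.random.default_rng() if rng is None else rng
    n = P.shape[0]
    p_i = 1.0 / n
    others = np.array([k for k in range(n) if k != i])
    est = 0.0
    for _ in range(K):
        # Sample A ~ f_i(. | j): include each k != i with probability p(j | k).
        include = rng.random(others.size) < P[others, j]
        A = frozenset(int(k) for k in others[include])
        est += r_of(A | {i}) - r_of(A)      # r(A with i) minus r(A without i)
    return est / (K * p_i)
```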

For large systems in particular, initializing near good solutions may be helpful. In many systems we may expect elements to be spatially or temporally dependent, and we can use that prior knowledge to initialize reasonable clusters. However, the preliminary results given in the next section do not employ any such strategy; we simply run the algorithm many times from different initial conditions and select the best solution generated. Finally, although we omit the exposition here, in the “hard-partition” limit (as \(\beta \rightarrow \infty\)), p(j|i) becomes a delta function, meaning that no sampling is necessary and we need only consider adding or dropping each element from each component on each iteration. When using the Gaussian bound on redundancy introduced in the "Practical application" section, this can be accomplished in \(O(n^4)\) (or \(O(n^3)\) with some decrease in numerical precision). Our open-source implementation of this algorithm is available by request or online at https://github.com/crtwomey/sscs.

Experiments

Simulation experiments

We tested the proposed algorithm on two sets of data: simulations of schooling groups, and empirical data collected from the movements of schooling fish in a lab environment. The former allow us to control the dependency structure of the system, while the latter allows us to demonstrate applicability to empirical systems. Simulations used a simple model of coordinated movement based on attraction, alignment, and repulsion social forces (based on Romanczuk et al. 2012; Romanczuk and Schimansky-Geier 2012; a description of the model and additional information on the simulation conditions can be found in Appendix 2). Position and velocity data for independent groups of size \(n = 5,\,10,\) and 20 were generated for high \((\eta = 0.2)\) and low \((\eta = 0.15)\) noise conditions.

Empirical experiments

Movement data for fish come from videos originally recorded by Katz et al. (2011). In that work, groups of 10, 30, and 70 golden shiners (Notemigonus crysoleucas) were purchased from Anderson Farms (www.andersonminnows.com) and filmed in a \(1.2 \times 2.1\,\mathrm {m}\) tank with an overhead camera. Videos were then corrected for lens distortion, and fish were tracked using the same custom in-house software developed by Haishan Wu and used in Rosenthal et al. (2015). The software begins by detecting all individuals in each frame, then links individuals across frames to form tracks. All tracks were manually corrected to ensure accuracy. Individual positions and velocities were estimated from these tracks using a \(3^\mathrm{rd}\)-order Savitzky–Golay filter (Savitzky and Golay 1964; similar to, e.g., Harpaz et al. 2017) with a 7-frame smoothing window (videos were recorded at 30 fps). Interactions between fish are time-dependent; for the results presented here we simply chose a fixed window of \(\pm \, 15\,\mathrm {s}\) surrounding a given time t to estimate the dependency structure of the group. An optimal choice of time window is left for future work.
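
For reference, the smoothing and differentiation step can be written compactly; the sketch below assumes a hypothetical array `tracks` of shape (frames, fish, 2) holding the tracked coordinates at 30 fps (the file name and data layout are illustrative stand-ins, not part of the original pipeline).

```python
# A minimal sketch of the velocity estimation step: 3rd-order, 7-frame
# Savitzky-Golay smoothing and differentiation. The array `tracks` and its
# layout (frames, fish, 2) are hypothetical stand-ins for the tracking output.
import numpy as np
from scipy.signal import savgol_filter

fps = 30.0
tracks = np.load("tracks.npy")   # hypothetical tracked positions, shape (T, n, 2)
positions = savgol_filter(tracks, window_length=7, polyorder=3, axis=0)
velocities = savgol_filter(tracks, window_length=7, polyorder=3,
                           deriv=1, delta=1.0 / fps, axis=0)
```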

Experimental results

The algorithm outlined in the "Redundancy partitioning" section requires specifying the number of components and a parameter, \(\beta\), which controls the relative importance of maximizing the average redundancy of the components as opposed to maximally compressing the original set of system elements. While it will be interesting to investigate the ‘soft-partitioning’ aspect of this approach in future work, here we simply consider the hard assignment case, which requires only that \(\beta\) is large. Figure 4 (Right) illustrates this point, showing the stabilization of average component redundancy for \(\beta > 5\). We found that \(\beta = 200\) was sufficient to recover hard assignments in all cases tested here. Since relative redundancy ranges between 0 and 1 for any dataset, these parameter values should generalize well to other systems, leaving the method free of parameter fine-tuning.

Fig. 4

Algorithm implementation and parameter sensitivity. (Left) Comparison of exact and Monte Carlo estimates of \(\Delta d(i,j)\), for groups of size 5 and 10, for low and high noise conditions. Note that the number of components can be greater than the number of elements; in this case some components will be empty, i.e., have no elements assigned to them. (Right) Impact of the choice of \(\beta\) on the average redundancy of the recovered components for a simulated group of size 10, high noise condition, searching for 5 components. Dotted line shows the mean of the solutions for \(\beta > 5\)

To validate that the Monte Carlo estimate of \(\Delta d(i,j)\) employed is effective, we compared its behavior to exact computations of \(\Delta d(i,j)\) for small system sizes (simulated groups of size 5 and 10). We ran each version of the algorithm for up to 10 components and took the best (maximum) average component redundancy achieved over 100 random initializations of the assignment matrix p(j|i). Figure 4 (Left) shows that the results are in good agreement, and where there are discrepancies they tend to favor the Monte Carlo method, in that it recovers solutions with higher average redundancy.

Next, we tested the algorithm on simulated data in which the dependency structure of the simulated groups was known, using the hard partitioning variant of the algorithm for computational efficiency. For each test, we computed the maximum average component redundancy recovered for up to 10 components, again using 100 random initializations of the assignment matrix for each computation. In all cases, partitioning decreases the average redundancy of the system as the number of components increases (Fig. 5). However, the magnitude of the change in average redundancy (or ‘\(\Delta\) average redundancy’) from m to \(m-1\) components is informative of the system’s dependency structure. Small values of \(\Delta\) average redundancy occur when subdividing the system has a comparatively minor impact on average redundancy, which should be expected when partitioning relatively independent parts of the system. In comparison, a large increase in \(\Delta\) average redundancy appears to occur when a strongly interacting component is split. This can be seen by comparing the \(\Delta\) average redundancy curves for each group size between instances of a single group in the system (Fig. 5, left) or two independent, non-interacting groups in the same system (Fig. 5, middle). The \(\Delta\) average component redundancies for systems containing only a single group have either no or only shallow local minima followed by at most small increases. In comparison, \(\Delta\) average redundancies for systems with two non-interacting groups, in matched-size pairs of groups of 5, 10, and 20, have comparatively deep local minima first occurring at 2 components for \(n =\) 5 and 10, and at 4 components for \(n = 20\), followed directly by relatively large increases in \(\Delta\) average redundancy. At the point preceding each of these transitions from low to high \(\Delta\) average redundancy, the two non-interacting groups are assigned to separate components by the algorithm, and in the \(n = 20\) case the two groups are further subdivided into two spatially assorted components each. Finally, the \(\Delta\) average redundancies for a system of three non-interacting groups of mixed sizes 5, 10, and 20 were computed, with local minima first occurring at 3 and 4 components for the high and low noise conditions, respectively (Fig. 5, right), followed by large increases in \(\Delta\) average redundancy. Taken together, this is evidence that the transitions from low to high \(\Delta\) average component redundancy recovered by the algorithm reflect the dependency structure of the underlying system. It suggests that these features may be useful in identifying relevant structure in other systems, even those with less extreme dependency structures.

Fig. 5

Partitioning results for simulations of 1, 2, and 3 independent (non-interacting) groups. Top row shows average redundancy as a function of the number of components (m). Bottom row shows the magnitude of the change in average redundancy between m and \(m-1\) components (larger values are larger decreases). (Left) For single cohesive groups of size \(n =\) 5, 10, or 20, the average redundancy (y-axis) of the components decreases approximately monotonically as the number of components increases. The \(n = 10\) group has shallow local minima in \(\Delta\) average redundancy at \(m = 2\) and 4. (Center) For two non-interacting groups of the same size, the average redundancy approximately plateaus at two components for \(n = 5\) and 10. For \(n = 20\), the first minimum in \(\Delta\) average redundancy is achieved at \(m = 4\). (Right) A mixed (varying in group size) collection of three non-interacting groups, with sizes 5, 10, and 20, first plateaus in average redundancy at three or four components, depending on the noise \((\eta )\) used in the simulation. For comparison, the left two plots show results for \(\eta = 0.2\) (the ‘high’ noise condition)

Figure 6 illustrates the iterative generation of assignments by the algorithm in the mixed three-group (high noise) case. Assignments change and harden until they converge on a (locally) maximal average redundancy partition of the system’s elements (Left). The assignments of system elements to components generated by the algorithm correspond one-to-one with the original, non-interacting set of three groups (of sizes 5, 10, and 20) comprising the whole system (of total size 35). Positions of the elements of the system and their velocity vectors are shown for one time point, colored by the component they were assigned to (which corresponds to their original group), in Fig. 6 (Right). Note that, while the snapshot shown in Fig. 6 was chosen to show the three distinct groups, at many points in the simulation the positions, velocities, or both, overlapped between the three groups. The algorithm is able to recover the independent groups in the system without using spatial position information, based on coordination in individual velocities alone.

Fig. 6

Generation of assignments by the average redundancy partitioning algorithm for a mixture of three non-interacting simulated groups, based on coordination in individual velocities. (Left) Assignments generated by the proposed sequential algorithm for three components after initialization \((t = 0)\), 1 iteration \((t = 1)\), and 10 iterations \((t = 10)\), at top, middle, and bottom, respectively. The color scale indicates the probability of assigning a member of a group (column) to a particular component (row), where low to high probability is coded dark to light (color scale top right). Original groupings of the system into its three non-interacting subsets are indicated on the x-axis. (Right) Two-dimensional positions (arbitrary units) of simulated system at one time point, color-coded by final component assignment; velocity vectors indicated by line segments. The algorithm correctly separates each subgroup based on coordination in velocities alone, without reference to spatial position

Finally, we applied the algorithm to empirical data collected on fish schools to validate that the method is able to recover sensible results for strongly interacting groups and from non-simulated data. Figure 7 shows that for fish, groups of size 10 interact strongly enough (in at least the one instance tested here) to be considered one coherent unit, while groups of size 30 are already large enough to have subsets that interact more strongly with one another than with the rest of the group (e.g., the local minimum in \(\Delta\) average redundancy at \(m = 5\) components; Fig. 7, middle). The component assignments and positions for the school of 30 fish at the \(m = 5\) local minimum are shown in Fig. 7 (Right) at a single time point. The subdivisions of the system show strong spatial assortment, with a stratification of the group from front to back. As in the simulation case, here we use only coordination in individual velocities to determine partitions, so this spatial assortment is a consequence of similar behavior as opposed to some criterion based on proximity. Further work is needed to investigate the duration of substructure in fish schools, as well as the emergence and disappearance of components over time.

Fig. 7

Coordinated substructure for empirical fish schools. (Left) Average component redundancy as a function of the number of components, for fish groups of size \(n = 10\), 30, and 70. (Middle) Magnitude of the change in average component redundancy between m and \(m-1\) components. The \(n = 30\) school (orange lines) has a local minimum at \(m = 5\) components. (Right) Example partitioning of a group of size 30 fish into five components (shown as different colors). Dots indicate the positions of the fish (swimming freely in a large \(1.2\,\mathrm {m} \times 2.1\,\mathrm {m}\) arena) relative to the school mean position. Line segments indicate historical positions of each individual 2 s into the past

Discussion

Redundancy as a measure of coordination

Collective behavior is an emergent property of the actions and interactions of a system’s constituents. One of its characteristic features is a high degree of coordination among the individual elements of a system. In this work, we explored an information-theoretic measure of coordination defined by relative redundancy, or one minus the ratio of the maximally compressed description of a system to its uncompressed description. This quantity can be compared between systems of any size and total variability, and in this sense it functions as a system-independent measure of coordination. As a numerical example, we showed that redundancy in a simple model of collective motion captures the classic transition from order to disorder in this system as the noise in individual headings increases and, interestingly, also identifies an apparently discontinuous transition from “coherent order” to “dynamic order.” While these results should be investigated more systematically in future work, they demonstrate the practical utility of this measure and suggest it may shed new light on even classic models of collective behavior.

Still, redundancy as a universal measure of coordination is challenging to compute in general. While the Gaussian bound introduced in "Practical application" section is useful in practice, it has limited applicability. It is not appropriate for discrete systems or mixed measurements of continuous and discrete individual properties. One potentially promising direction for future work is to better constrain empirical estimates of redundancy by leveraging the rapidly growing body of research devoted to inferring and modeling individual-level interaction rules in a wide range of natural systems (see, e.g., Ballerini et al. 2008; Lukeman et al. 2010; Nagy et al. 2010; Katz et al. 2011; Herbert-Read et al. 2011; Bialek et al. 2012; Strandburg-Peshkin et al. 2013; Rosenthal et al. 2015; Harpaz et al. 2017; Torney et al. 2018; Hein et al. 2018; Sosna et al. 2019). Accurate probabilistic models of individual behavior can be used to estimate the time-varying relative redundancy of empirically recorded configurations of individuals. This could be useful for improving system-specific estimators of redundancy, which may be of particular importance when making cross-species comparisons; for example, to understand the evolution of collective behavior.

Finally, while coordinated behavior is central to what it means to be collective, it is not the only important property of collective systems. In the Vicsek model of collective motion explored in Fig. 2, coordination is highest in the “coherent order” state, in particular when the entire system is locked into a single stable heading. Collective behavior in living systems can rarely afford to be so rigid; animal groups must respond and adapt appropriately to their environment. The investigation by Daniels et al. (2016) of amplification, i.e., the extent to which individuals within the group can affect group-level properties, is of particular interest in this regard.

Redundancy partitioning for system structure

There is a wide range of both general-purpose clustering algorithms (see Jain 2010; Xu and Tian 2015) and network community detection methods (see Fortunato 2010), owing to a diversity of plausible clustering and community detection criteria. The justification for the average relative redundancy criterion presented here stems from its principled approach to the specific problem of quantifying coordination and its demonstrated ability to identify dependent structure in collective systems. It is specific in scope and not intended as a drop-in replacement for other clustering methods for arbitrary similarity matrices.

This approach to understanding the structure of collective systems also differs from methods concerned with the inference of individual interaction networks. For one, this method makes no attempt to construct such a network. Useful information-theoretic methods based on, e.g., estimating the transfer entropy (Lizier and Rubinov 2012) or causation entropy (Lord et al. 2016) between and among system elements can be used for this purpose. Similarly, when individuals in a group each need to remember their own representation of within-group interactions, as in macaques, biologically plausible interaction representations can be inferred based on a sparse coding principle (Daniels et al. 2012). Instead, this approach simply attempts to identify the maximally coordinated components of a system, which offers a natural mesoscopic locus of analysis for the full system’s behavior. It could then be interesting to study the network of transfer or causation entropy between coordinated components, for instance, though this is made potentially more challenging by the possibly only short-term persistence of any given component.

There are many questions left for future work. First, the identification of transitions from low to high \(\Delta\) average redundancy with increasing number of components is only a heuristic. In some cases there may be no local minima, or there may be multiple, in which case there may be more than one useful decomposition of the group. In other cases it may be more appropriate to divide the group into a given number of components regardless of the existence or position of a minimum. Further theoretical work is needed on the significance of plateaus in the average redundancy plot; we present only empirical evidence of their utility here. Second, an investigation of these features as a function of the time window chosen for computing the dependency structure may be important for understanding how the dependency structure of the group scales with time. It might be expected that on short time scales, for many systems, only very local interactions will matter, requiring many components, while on longer time scales the system may be best represented as a single component.

It may also be important to investigate the algorithm presented here in the context of generating a soft-partitioning of a system’s elements into partially overlapping components. Using intermediate values of \(\beta\) may allow the algorithm to find better average redundancy solutions ‘in-between’ m and \(m+1\) components, in which assignments may be shared among components. At the same time, since optimal sets of components are not guaranteed to be unique, it may be important to explore the set of equally (or nearly equally) optimal solutions as an ensemble of equivalent descriptions of a system. Moreover, exploring the range of solutions as the number of components varies may reveal whether or not the system exhibits some form of hierarchical structure. In hierarchical systems we would expect components to be successively subdivided as the number of components increases.

One practical application of this method could be to the principled identification of a “group” in fission-fusion systems where this is an amorphous, time-varying concept. Another potential application of the method may be to long time-series, where the dependency structure itself is dynamic. Characterizing the natural decompositions of a system as a function of time may reveal important time-dependent mesoscopic features. How does the natural number of components of a system fluctuate in time, and how long do components persist? How do they interact as a function of time? These questions are central to the study of collective systems and may benefit from the quantitative approach to measuring coordination and identifying group structure introduced here.