Redundancy as a measure of coordination
Let \(S = \{1, 2, \ldots , n\}\) be the indices of a set of random variables, \(\{X_{i}\}_{i \in S}\), which in general may be neither identically distributed nor independent. In the context of a fish school or a bird flock, this could be the set of all the velocity vectors of the individuals in the group; for neurons, this could be the state of each neuron (firing or silent). In general, it could be any heterogeneous assemblage of the microscopic observables of a system. If we were asked to faithfully record the current state of the whole group, one strategy would be to simply write down a description of each element separately. One of the foundational results from information theory is that no lossless description of a random variable can be shorter on average than the tight lower bound given by its entropy (Shannon 1948). Thus a description of the system given by recording every element separately would require on average a minimum of \(\sum _{i \in S} H(X_i)\) bits, where \(H(X_i)\) is the entropy of \(X_i\).
Alternatively, another strategy would be to instead write down a shared (or ‘joint’) description of all elements at once. A joint description can capitalize on the dependencies among a set of variables to reduce the overall description length needed. For example, to characterize the state of both a lamp and the light switch that controls it, one could simply record the on/off state of one of the two components. Knowing the state of either the switch or the lamp automatically tells us the state of the other, under perfect operating conditions. For less than perfect operating conditions, it will be necessary to include additional information about the state of the other component, but only as frequently as the light switch fails to determine the state of the lamp. In either case, the joint entropy of the lamp and the light switch together determines the lower bound on the lossless joint description of the system. Thus the smallest lossless joint description requires \(H(\{X_i\}_{i \in S})\) bits on average, where we are guaranteed that \(H(\{X_{i}\}_{i \in S}) \le \sum _{i \in S} H(X_i)\).
In fact, the only way in which the joint description is as costly as the sum of the individual (or ‘marginal’) descriptions is if all \(X_i\)’s are independent. The difference between the marginal and joint descriptions, given by
$$\begin{aligned} I(\{X_{i}\}_{i \in S})&= \sum _{i \in S} H(X_i) - H(\{X_{i}\}_{i \in S}), \end{aligned}$$
(1)
gives us a natural measure of how much we reduce the fundamental representation cost by using a joint, rather than a marginal, description. Another way to think about Eq. 1 is as a measure of redundancy: the amount of information that is made redundant (unnecessary) when describing \(\{X_{i}\}_{i \in S}\) as a whole rather than by parts. A similar interpretation can be found in Watanabe (1960)’s original investigation of Eq. 1 as a general measure of multivariate correlation (also called “total correlation”).Footnote 1
Notably, redundancy in the absolute sense given by Eq. 1 scales in magnitude with the size of the system. For example, if we take n identical copiesFootnote 2 of the same random variable, X, then we have \(I(\{X_{i}\}_{i \in S}) = (n-1) H(X)\). This is a useful property for a measure of collective behavior, in the sense that just two or three of something behaving similarly is less “collective” than hundreds or thousands. On the other hand, the \(H(X)\) term indicates that absolute redundancy also scales with the magnitude of the individual variability in behavior (Fig. 1, left). This is orthogonal to what is typically meant by “collective.” A school of fish swimming slowly or quickly through the coral of a reef ought to be “collective” to the same degree provided their movement decisions depend on one another to the same degree; the measure should not depend additionally on the range and variability of the individual decisions that could be made. To reflect this invariance to the magnitude of individual variability, it is useful to consider instead the relative redundancy (normalized total correlation), i.e.,
$$\begin{aligned} r = \frac{I(\{X_{i}\}_{i \in S})}{\sum _{i \in S} H(X_i)} = 1 - \frac{H(\{X_{i}\}_{i \in S})}{\sum _{i \in S} H(X_i)} = 1 - s, \end{aligned}$$
(2)
where s is then the proportion of non-redundant, or incompressible, information in the set. Using the same example as before, for n identical copies of X, \(r = 1 - \frac{1}{n}\), which is invariant to \(H(X)\), while still increasing with n (Fig. 1, right).
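As a concrete (and deliberately trivial) illustration of Eq. 2, the following Python sketch estimates r from samples of n identical copies of a discrete source; the alphabet size, sample count, and plug-in entropy estimator are illustrative choices of ours, not part of the original analysis.

```python
import numpy as np

def entropy_bits(counts):
    """Plug-in Shannon entropy (bits) of an empirical distribution given by counts."""
    p = counts / counts.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

rng = np.random.default_rng(0)
n, T = 4, 100_000                # n copies of the source, T samples
x = rng.integers(0, 8, size=T)   # uniform 3-bit source, H(X) = 3 bits
copies = np.tile(x, (n, 1))      # n identical copies of X

# sum of marginal entropies
h_marginal = sum(entropy_bits(np.bincount(c)) for c in copies)

# joint entropy: every joint state here is determined by x alone
_, joint_counts = np.unique(copies.T, axis=0, return_counts=True)
h_joint = entropy_bits(joint_counts)

r = 1 - h_joint / h_marginal
print(r, 1 - 1 / n)              # both ≈ 0.75, regardless of H(X)
```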
In general, the upper bound of relative redundancy for a fixed n is invariant to rescaling of the individual entropies, but sensitive to variability among them. To see this, note that \(H(\{X_{i}\}_{i \in S}) \ge \max _{i \in S} H(X_i)\), so that
$$\begin{aligned} 0 \le \frac{I(\{X_{i}\}_{i \in S})}{\sum _{i \in S} H(X_i)} \le 1 - \frac{\max _{i \in S} H(X_i)}{\sum _{i \in S} H(X_i)} < 1, \end{aligned}$$
(3)
for any set of \(X_i\) (i.e., not necessarily all identical as in the prior example). Then rescaling all \(H(X_i)\) by a constant factor does not change the upper bound, and the upper bound is closest to 1 when all \(H(X_i)\) are equal. This last property also fits the intuitive definition of “collective,” in the sense that elements of a system behaving similarly should have similar variability in their individual behaviors.
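For example (an illustrative case, not taken from the original), two variables with \(H(X_1) = 1\) and \(H(X_2) = 3\) bits give an upper bound of \(1 - 3/4 = 0.25\) in Eq. 3, whereas two variables with 2 bits each allow \(1 - 2/4 = 0.5\); multiplying all entropies by a common factor leaves either bound unchanged.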
To summarize, relative redundancy has the following properties useful for measuring coordination in collective behavior:
1. It increases the more the behavior of any one element in the system is informative about the behavior of all the other elements in the system.
2. Its upper bound increases as the number of individual elements in the system increases (yet remains on a zero-to-one scale).
3. It increases with increasing similarity in the variability of individual behavior.
4. It is invariant to the total amount of individual variability within the system.
As an example, swarms of gnats forming large mating groups would likely score low on this measure of collectivity (provided the microscopic property being measured is individual movement). While gnats within the swarm may have similar levels of variability in their velocities, their movements are relatively independent. In comparison, large groups of fireflies flashing in unison (provided the microscopic property measured is the on / off state of the firefly’s bioluminescent abdomen) should score high on the relative redundancy scale, regardless of species variability in the frequency of flashing. Relative redundancy should also give a graded distinction between “shoaling” and “schooling” in fish, based on the degree of coordinated movement behavior within the group (resulting in low and high relative redundancy, respectively).
Practical application
Computing relative redundancy in practice is challenging. Estimating the mutual information between just two variables (equivalently, the \(n=2\) case for Eq. 1), or the entropy of a single variable, runs into sampling problems and issues of estimator bias (Paninski 2003). While there may be no universal solution, for systems with continuous microscopic properties (the quantities of each element of the system for which we would like to measure coordination across the system), we can still make progress by maximizing a lower bound on redundancy instead.
First, for continuous random variables that are marginally Gaussian with system-wide correlation matrix \(P_S\), the Gaussian mutual information,
$$\begin{aligned} I_G(\{X_{i}\}_{i \in S}) = -\frac{1}{2} \log \det (P_S), \end{aligned}$$
(4)
is a lower bound on the total mutual information (Foster and Grassberger 2011; Kraskov et al. 2004). Since the marginals are continuous and Gaussian, each element has differential entropy
$$\begin{aligned} h_G(X_i) = \frac{1}{2} \log \left[ (2\pi e)^{k_i} \det (K_i) \right] , \end{aligned}$$
(5)
where \(K_i\) is the covariance matrix of \(X_i\), and \(k_i\) is the number of variates of element i. Unfortunately, while \(I_G(\cdot )\) is nonnegative, the differential entropy \(h_G(\cdot )\) can be positive or negative. Fortunately, for an arbitrarily precise \(\alpha\)-bit quantization of \(X_i\), its discrete entropy is approximated by \(h(X_i) + \alpha\) (see Theorem 8.3.1 in Cover and Thomas 2006). Since the choice of \(\alpha\) is arbitrary, we can choose it such that the resulting quantized entropies for the system are all positive. The choice of quantization cancels out in the numerator and only affects the denominator, giving
$$\begin{aligned} r&\ge \frac{I_G(\{X_{i}\}_{i \in S})}{\alpha + \sum _{i \in S} h_G(X_i)}, \end{aligned}$$
(6)
which is simple to compute in practice. However, since the quantization level, \(\alpha\), changes the scaling, when making cross-system comparisons one must be sure to compute redundancy using the same \(\alpha\) across all systems.
In general, when the random variables comprising the system are not marginally Gaussian, this lower bound can still be helpful. By substituting rank-transformed variables \(G_i\) for \(X_i\) in the numerator, where each \(G_i\) is constructed to be marginally Gaussian distributed, the numerator remains a useful lower bound on the total correlation among the \(X_i\) (by extension of Foster and Grassberger 2011; Kraskov et al. 2004, to the multivariate case). In essence, this measures the strength of any monotonic pairwise relationships among the system elements. The Gaussian differential entropies in the denominator are also upper bounds on the differential entropies of any continuous \(X_i\) with the same means and (co)variances. Thus redundancy is lower bounded by these two quantities for any continuous \(X_i\). Better or possibly even exact estimates of r may be possible depending on the system and microscopic variables at play; in any case, Eq. (2) still gives the correct system-independent blueprint for measuring coordination.
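A minimal sketch of this bound in Python is given below, assuming samples arranged as a time-by-elements array, univariate elements, and plug-in correlation and variance estimates; the rank-Gaussianization step, the value of \(\alpha\), and the function names are our illustrative choices rather than the authors' released implementation.

```python
import numpy as np
from scipy.stats import norm, rankdata

def rank_gaussianize(x):
    """Map each column of x (shape T x n) to a standard-normal marginal via its ranks."""
    ranks = rankdata(x, axis=0)
    return norm.ppf(ranks / (x.shape[0] + 1))

def redundancy_lower_bound(x, alpha=20.0):
    """Lower bound on relative redundancy (Eq. 6) from samples x of shape (T, n),
    treating each column as one univariate element of the system."""
    g = rank_gaussianize(x)
    corr = np.corrcoef(g, rowvar=False)          # system-wide correlation matrix P_S
    _, logdet = np.linalg.slogdet(corr)
    i_gauss = -0.5 * logdet / np.log(2)          # Eq. 4, in bits
    # Eq. 5 (univariate case), using the empirical variance of each original column
    h_gauss = 0.5 * np.log2(2 * np.pi * np.e * x.var(axis=0, ddof=1))
    # alpha must be large enough to keep the denominator positive, and must be
    # held fixed when comparing different systems
    return i_gauss / (alpha + h_gauss.sum())     # Eq. 6

# toy usage: two strongly coupled columns plus an independent one
rng = np.random.default_rng(1)
z = rng.normal(size=(5000, 1))
x = np.hstack([z, z + 0.1 * rng.normal(size=(5000, 1)), rng.normal(size=(5000, 1))])
print(redundancy_lower_bound(x))
```

As noted above, the only requirements on \(\alpha\) are that it keeps the denominator positive and that the same value is used for every system being compared.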
As a simple numerical application using the above redundancy bound, Fig. 2 explores the Vicsek et al. (1995) model of collective motion with alignment only, i.e.,
$$\begin{aligned} \theta _i(t+1)&= \bar{\theta }_i(t) + \epsilon _i(t), \end{aligned}$$
(7)
$$\begin{aligned} \mathbf {x}_i(t+1)&= \mathbf {x}_i(t) + \mathbf {v}_i(t) \Delta t, \end{aligned}$$
(8)
where \(\mathbf {x}_i(t)\) is the position of individual i at discrete time step t, \(\mathbf {v}_i(t)\) is individual i's velocity at time t, given by its heading, \(\theta _i(t)\), and a constant speed, c (fixed at 0.03 to match Vicsek et al. 1995), \(\bar{\theta }_i(t)\) is the angular average heading of i and all neighbors within a distance d at time t, and \(\epsilon _i(t)\) is drawn i.i.d. from a uniform distribution on the interval \(\left[ -\eta /2, \eta /2 \right]\). In this well-studied system, redundancy (Fig. 2, Top left) shows the same phase transition from disorder to order when varying the noise parameter \(\eta\) as seen in the system-specific order parameter of average alignment (Fig. 2, Bottom left). Interestingly, it also shows an apparently discontinuous transition with a bistable region in the ordered regime, which to our knowledge has not been reported before. This appears to distinguish between “dynamic order” (in which there are still fluctuations in average alignment over time across the group) and “coherent order” (in which the group is almost always aligned). A detailed investigation of this transition is beyond the scope of this study and is left for future work. However, based on a visual inspection of the emergent dynamics, it seems likely that the discontinuous transition is related to the correlation range of the orientation exceeding the finite system size, whereas the bistability emerges from different spatial configurations exhibiting either coherent or dynamic order at the same noise values.
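For readers who want to reproduce the qualitative behavior, the sketch below implements Eqs. 7-8 under common but here-assumed choices (periodic boundaries, box size, interaction radius d = 1, group size, and noise level are ours); it is not the authors' simulation code. The recorded headings can then be scored with a redundancy estimator such as the Gaussian-bound sketch above.

```python
import numpy as np

def vicsek_step(pos, theta, eta, d=1.0, c=0.03, box=5.0, rng=None):
    """One update of Eqs. 7-8: align with all neighbors within radius d (self included),
    add uniform angular noise on [-eta/2, eta/2], then move at constant speed c."""
    rng = np.random.default_rng() if rng is None else rng
    # pairwise displacements under periodic boundary conditions
    delta = pos[:, None, :] - pos[None, :, :]
    delta -= box * np.round(delta / box)
    neighbors = ((delta ** 2).sum(-1) <= d ** 2).astype(float)
    # angular average of the headings of i and its neighbors
    theta_bar = np.arctan2(neighbors @ np.sin(theta), neighbors @ np.cos(theta))
    theta_new = theta_bar + rng.uniform(-eta / 2, eta / 2, size=theta.shape)  # Eq. 7
    vel = c * np.stack([np.cos(theta_new), np.sin(theta_new)], axis=-1)
    return (pos + vel) % box, theta_new                                       # Eq. 8, dt = 1

# toy run: record headings over time; their coordination can then be scored with a
# redundancy estimator such as the Gaussian-bound sketch above
rng = np.random.default_rng(0)
n, steps = 40, 2000
pos = rng.uniform(0.0, 5.0, size=(n, 2))
theta = rng.uniform(-np.pi, np.pi, size=n)
headings = []
for _ in range(steps):
    pos, theta = vicsek_step(pos, theta, eta=0.5, rng=rng)
    headings.append(theta.copy())
headings = np.array(headings)   # shape (steps, n): one column per individual
```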
Redundancy partitioning for system structure
While relative redundancy (resp. incompressibility) can be used to compare the degree of collectivity exhibited by very different systems, it can also be used to characterize the dependency structure within a given system. Writing the relative redundancy as a function of a subset of the system, \(A \subseteq S\), we have
$$\begin{aligned} r(A)&= 1 - \frac{H(\{X_{i}\}_{i \in A})}{\sum _{i \in A} H(X_i)}. \end{aligned}$$
(9)
What divisions of a system maximize the relative redundancy of each subset?
To make this question concrete, let \(\widehat{S}\) be a set of indices for a collection of subsets of S, which we will refer to as the components of system S. That is, let \(\widehat{S} = \{1, 2, \ldots , m\}\), where typicallyFootnote 3 \(m \le n\), and introduce a probabilistic assignment p(j|i), \(\forall (i,j) \in (S, \widehat{S})\),Footnote 4 which can be read as the probability that element i belongs to component j. Then the expected quality of an assignment to a given component is
$$\begin{aligned} \mathbb {E}\left[ r(A)|j\right]&= \sum _{A \in \mathcal {P}(S)} r(A) p(A|j), \end{aligned}$$
(10)
where \(\mathcal {P}(S)\) is the power set (set of all subsets) of S, and
$$\begin{aligned} p(A|j)&= \prod _{i \in A} p(j|i) \prod _{i \in {A}^{\mathsf {c}}} \left[ 1 - p(j|i)\right] , \end{aligned}$$
(11)
is the probability of subset A given the assignments of elements to component j, by a simple counting argument.Footnote 5 Treating the quality of each component equally, the expected quality over all components is then
$$\begin{aligned} \mathbb {E}\left[ r(A)\right]&= \frac{1}{m}\sum _{j \in \widehat{S}} \mathbb {E}\left[ r(A)|j\right] . \end{aligned}$$
(12)
Note that the redundancy of any individual element, i.e., \(r\left( \{1\}\right)\), is equal to zero according to Eq. 9. For continuity, we define the redundancy of the empty set, \(r\left( \{\}\right)\), to be zero. A visual example of dividing a system into different numbers of components and measuring component redundancy is illustrated in Fig. 3.
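For small n, Eqs. 10-12 can be evaluated exactly by enumerating the power set, as in the sketch below; the assignment-matrix layout and the representation of r(A) as a user-supplied callable (returning zero for the empty set and singletons, and otherwise computed from Eq. 9, e.g., via the Gaussian bound above) are our own conventions.

```python
import itertools
import numpy as np

def p_A_given_j(A, j, p_assign):
    """Eq. 11: probability of subset A under component j's assignment probabilities."""
    n = p_assign.shape[0]
    in_A = np.zeros(n, dtype=bool)
    in_A[list(A)] = True
    return float(np.prod(np.where(in_A, p_assign[:, j], 1.0 - p_assign[:, j])))

def expected_component_redundancy(p_assign, r_of):
    """Eqs. 10 and 12: average over components of E[r(A) | j], enumerating the
    power set of S.  p_assign[i, j] = p(j|i); r_of(A) returns r(A) for a tuple A
    of element indices (zero for the empty set and singletons)."""
    n, m = p_assign.shape
    total = 0.0
    for j in range(m):
        for size in range(n + 1):
            for A in itertools.combinations(range(n), size):
                total += r_of(A) * p_A_given_j(A, j, p_assign)
    return total / m
```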
Rate-distortion theory
While this gives us a natural way to evaluate the quality of a given assignment, it does not immediately provide us with a way to find such an assignment. Instead, we draw inspiration from the information-theoretic treatment of compression given by rate-distortion theory (see Shannon 1959; Cover and Thomas 2006). Classical rate-distortion theory addresses the following problem: given a source (random variable) X, a measure of distortion, d, and an allowable level of average distortion D, determine the minimum amount of information necessary for a compressed description of X that introduces an average distortion no more than D. I.e.,
$$\begin{aligned} R(D) = \min _{p(\hat{x}|x)\,:\,\mathbb {E}d(x,\hat{x})\,\le \,D} I(X;\widehat{X}), \end{aligned}$$
(13)
where the rate, R(D), equals the minimum amount of information (measured in bits per symbol, hence “rate”) needed for average distortion D. In this case, the rate measures the information, \(I(X;\widehat{X})\), that the compressed representation, \(\widehat{X}\), needs to keep about the source, X, where
$$\begin{aligned} I(X;\widehat{X})&= \sum _{x,\hat{x}} p(x,\hat{x}) \log \frac{p(x,\hat{x})}{p(x)p(\hat{x})} \end{aligned}$$
(14)
is the mutual information between X and \(\widehat{X}\). The lower the rate, the better the compression, but (depending on the source and the distortion measure) the higher the average distortion introduced. Surprisingly, not only can the rate-distortion curve be characterized numerically in general, but the minimal compressed representation of X can also be found via a simple, iterative, alternating minimization algorithm (Blahut 1972; Arimoto 1972).
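For context, the textbook Blahut-Arimoto iteration for a discrete source looks roughly as follows; this is the standard algorithm sketched under a fixed trade-off parameter beta, not the modified procedure derived below, and the function name and toy example are ours.

```python
import numpy as np

def blahut_arimoto(p_x, dist, beta, n_iter=200):
    """One point on the rate-distortion curve for a discrete source.
    p_x: source distribution; dist[x, xhat]: distortion; beta: trade-off slope."""
    _, n_xhat = dist.shape
    q_xhat = np.full(n_xhat, 1.0 / n_xhat)
    for _ in range(n_iter):
        # conditional that minimizes the Lagrangian given the current marginal
        w = q_xhat[None, :] * np.exp(-beta * dist)
        q_cond = w / w.sum(axis=1, keepdims=True)
        # marginal induced by the updated conditional
        q_xhat = p_x @ q_cond
    rate = np.sum(p_x[:, None] * q_cond * np.log2(q_cond / q_xhat[None, :]))
    distortion = np.sum(p_x[:, None] * q_cond * dist)
    return rate, distortion

# usage: fair binary source with Hamming distortion
print(blahut_arimoto(np.array([0.5, 0.5]), 1.0 - np.eye(2), beta=3.0))
```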
Redundancy partitioning
Though there are important differences from rate-distortion theory (discussed in “Appendix 1”), we can similarly frame the problem of finding structure based on redundancy as a compression problem. Here, we wish to find the assignment of elements of S to components of \(\widehat{S}\) that achieves an average redundancy no less than \(r^*\), and otherwise preserves as little about the original identities of the elements as possible. I.e.,
$$\begin{aligned} R(r^*) = \min _{p(j|i)\,:\,\mathbb {E}\left[ r(A)\right] \,\ge \,r^*} I(S;\widehat{S}), \end{aligned}$$
(15)
where p(j|i) is further required to be nonnegative and sum to one. This is not a standard rate-distortion problem, but we can use many of the same ideas developed by Blahut (1972) and Arimoto (1972) in their original numerical algorithms for deriving a practical solution. We give a brief account of this derivation here; see “Appendix 1” for a complete account.
Introducing Lagrange multipliers \(\lambda (i)\) to enforce the constraints \(\sum _{j \in \widehat{S}} p(j|i) = 1\) (non-negativity will be enforced by the form of the solution), the variational problem becomes
$$\begin{aligned} L\left[ p(j|i) \right]&= I(S;\widehat{S}) - \beta \sum _{j \in \widehat{S}, A \in \mathcal {P}(S)} r(A) p(A|j) + \sum _{i \in S} \lambda (i) \sum _{j \in \widehat{S}} p(j|i), \end{aligned}$$
(16)
where \(\beta\), the Lagrange multiplier for the average redundancy constraint, absorbs the 1/m term. Taking the derivative with respect to a particular \(j'\) and \(i'\), we have
$$\begin{aligned} \frac{\partial }{\partial p(j'|i')} L\left[ p(j|i) \right]&= p(i') \log \frac{p(j'|i')}{p(j')} - \beta \sum _{j \in \widehat{S}, A \in \mathcal {P}(S)} r(A) \frac{\partial p(A|j)}{\partial p(j'|i')} + \lambda (i'), \end{aligned}$$
(17)
where
$$\begin{aligned} \frac{\partial p(A|j)}{\partial p(j'|i')}&= \left\{ \begin{array}{ll} 0 &{}\quad \text{ if } \,\, j \ne j', \\ f_{i'}(A|j') &{}\quad \text{ if } \,\, j = j', i' \in A, \\ -f_{i'}(A|j') &{}\quad \text{ if } \,\, j = j', i' \in {A}^{\mathsf {c}}, \end{array} \right. \end{aligned}$$
(18)
and
$$\begin{aligned} f_i(A|j)&= \prod _{k \in A \setminus \{i\}} p(j|k) \prod _{k \in {A}^{\mathsf {c}} \setminus \{i\}} \big [1 - p(j|k)\big ], \end{aligned}$$
(19)
where \(A \setminus \{i\}\) is the relative complement of the singleton set \(\{i\}\) with respect to A.
Then setting \(\partial L / \partial p(j'|i') = 0\) and splitting the sum over \(\mathcal {P}(S)\) into terms with and without \(i' \in A\), we have
$$\begin{aligned} \begin{aligned} p(i') \log \frac{p(j'|i')}{p(j')}&= \beta \sum _{\{A \in \mathcal {P}(S)\,:\,i' \in A \}} r(A) f_{i'}(A|j') \\&\quad - \beta \sum _{\{A \in \mathcal {P}(S)\,:\,i' \in {A}^{\mathsf {c}} \}} r(A) f_{i'}(A|j') \\&\quad - \lambda (i'). \end{aligned} \end{aligned}$$
(20)
Let
$$\begin{aligned} d(i,j)&= \frac{1}{p(i)} \sum _{\{A \in \mathcal {P}(S)\,:\,i \in A \}} r(A) f_{i}(A|j), \end{aligned}$$
(21)
and define \(d_{\mathsf {c}}(i,j)\) to be identical except substituting \(i \in {A}^{\mathsf {c}}\) for \(i \in A\). Lastly, let \(\Delta d(i, j) = d(i,j) - d_{\mathsf {c}}(i,j)\). Then, dividing through by \(p(i')\) and substituting, we have,
$$\begin{aligned} \log \frac{p(j'|i')}{p(j')}&= \beta \Delta d(i',j') - \frac{\lambda (i')}{p(i')}. \end{aligned}$$
(22)
Finally, substituting \(\log \mu (i') = \lambda (i') / p(i')\) and solving for \(p(j'|i')\),
$$\begin{aligned} p(j'|i')&= \frac{p(j')}{\mu (i')} e^{\beta \Delta d(i',j')}. \end{aligned}$$
(23)
Enforcing the constraint that \(\sum _{j \in \widehat{S}} p(j|i') = 1\) and simplifying notation, we have
$$\begin{aligned} p(j|i)&= \frac{p(j) e^{\beta \Delta d(i,j)}}{\sum _{j' \in \widehat{S}} p(j') e^{\beta \Delta d(i,j')}}. \end{aligned}$$
(24)
Before moving on, it is worth noting that \(\Delta d(i,j)\) has a simple and intuitive interpretation. It is the difference in redundancy for component j when i is included versus when it is excluded, weighted by the relative importance of i.
Note that p(j) and p(A|j) depend on the choice of p(j|i). The final algorithm,
$$\begin{aligned} \left\{ \begin{array}{rl} p_t(j|i) &{}= \frac{p_t(j) e^{\beta \Delta d(i,j)}}{\sum _{j' \in \widehat{S}} p_t(j') e^{\beta \Delta d(i,j')}}, \\ p_{t+1}(j) &{}= \sum _{i \in S} p_t(j|i) p(i), \\ p_{t+1}(A|j) &{}= \prod _{i \in A} p_t(j|i) \prod _{i \in {A}^{\mathsf {c}}} \big [1 - p_t(j|i) \big ],\\ \end{array} \right. \end{aligned}$$
(25)
follows a similar alternating minimization scheme to the one developed by Blahut and Arimoto and generalized by Csiszár and Tusnády (1984), albeit with only local optimality guarantees, similar to Tishby et al. (1999) and Banerjee et al. (2005). See "Appendix 1" and Fig. 8 for a complete derivation and description of the algorithm.
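A compact sketch of one pass of these updates, using exact power-set evaluation of \(\Delta d(i,j)\) for small n, might look as follows; the data structures, the treatment of p(i) as a user-supplied weight vector, and all function names are our own choices and may differ from the released implementation.

```python
import itertools
import numpy as np

def delta_d(i, j, p_assign, r_of, p_i):
    """Eq. 21 and its complement, evaluated exactly by enumerating subsets of S without i."""
    n = p_assign.shape[0]
    others = [k for k in range(n) if k != i]
    d_in = d_out = 0.0
    for size in range(n):
        for A in itertools.combinations(others, size):
            in_A = np.zeros(n, dtype=bool)
            in_A[list(A)] = True
            f = np.prod(np.where(in_A[others], p_assign[others, j],
                                 1.0 - p_assign[others, j]))    # Eq. 19
            d_in += r_of(tuple(sorted(A + (i,)))) * f           # subsets containing i
            d_out += r_of(A) * f                                # subsets excluding i
    return (d_in - d_out) / p_i[i]

def partition_step(p_assign, p_i, r_of, beta):
    """One pass of the alternating updates in Eq. 25 (soft assignments, Eq. 24)."""
    n, m = p_assign.shape
    p_j = p_assign.T @ p_i                                      # current p(j)
    new = np.empty_like(p_assign)
    for i in range(n):
        logits = np.array([np.log(p_j[j] + 1e-300)
                           + beta * delta_d(i, j, p_assign, r_of, p_i)
                           for j in range(m)])
        logits -= logits.max()                                  # numerical stability
        new[i] = np.exp(logits)
        new[i] /= new[i].sum()
    return new
```

In practice, `partition_step` would be iterated until the assignment matrix stops changing, with p(j) and p(A|j) recomputed from the current p(j|i) on each pass as in Eq. 25.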
One immediate issue is the \(2^n\) scaling of the number of subsets of S as n (the number of elements of S) increases. First, it is worth noting that there are non-trivial collective systems of empirical interest even for small n; exact computation is feasible up to around \(n \approx 15\) on current consumer hardware, which is relevant for many experimental systems (as in, e.g., Miller and Gerlai 2007; Katz et al. 2011; Jolles et al. 2018). Second, for larger systems, Monte Carlo estimation of \(\Delta d(i,j)\) can be readily employed, e.g., for K samples,
$$\begin{aligned} \begin{array}{rll} \widehat{d}(i,j) &{}= \displaystyle \frac{1}{p(i) K} \sum _{k = 1}^K r\big (A_{ij} \cup \{i\}\big ),\\ \widehat{d}_{\mathsf {c}}(i,j) &{}= \displaystyle \frac{1}{p(i) K} \sum _{k = 1}^K r\big (A_{ij} \setminus \{i\}\big ), &{}\quad \text {where}\,A_{ij} \sim f_i(\cdot |j). \end{array} \end{aligned}$$
(26)
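A direct Monte Carlo version of this estimator (Eq. 26) simply draws subsets by including each element \(k \ne i\) independently with probability p(j|k); the sketch below assumes the same conventions as the sketch above and a sample count K of our choosing.

```python
import numpy as np

def delta_d_mc(i, j, p_assign, r_of, p_i, K=200, rng=None):
    """Monte Carlo estimate of delta d(i, j) via Eq. 26: sample subsets A ~ f_i(.|j)."""
    rng = np.random.default_rng() if rng is None else rng
    n = p_assign.shape[0]
    d_in = d_out = 0.0
    for _ in range(K):
        include = rng.random(n) < p_assign[:, j]   # include each k with prob. p(j|k)
        include[i] = False                         # A is drawn over S without i
        A = tuple(np.flatnonzero(include))
        d_in += r_of(tuple(sorted(A + (i,))))      # r(A with i added)
        d_out += r_of(A)                           # r(A with i removed) = r(A)
    return (d_in - d_out) / (p_i[i] * K)
```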
For large systems in particular, initializing near good solutions may be helpful. In many systems we may expect elements to be spatially or temporally dependent, and we can use that prior knowledge to initialize reasonable clusters. However, the preliminary results given in the next section do not employ any such strategy; we simply run the algorithm many times from many different initial conditions and select the best solution generated. Finally, although we omit the exposition here, in the “hard-partition” limit (as \(\beta \rightarrow \infty\)), p(j|i) becomes a delta function, meaning that no sampling is necessary and we need only consider adding or dropping each element from each component on each iteration. When using the Gaussian bound on redundancy introduced in the "Practical application" section, this can be accomplished in \(O(n^4)\) (or \(O(n^3)\) with some decrease in numerical precision). Our open-source implementation of this algorithm is available by request or online at https://github.com/crtwomey/sscs.
Experiments
Simulation experiments
We tested the proposed algorithm on two sets of data: simulations of schooling groups, and empirical data collected from the movements of schooling fish in a lab environment. The former allow us to control the dependency structure of the system, while the latter allows us to demonstrate applicability to empirical systems. Simulations used a simple model of coordinated movement based on attraction, alignment, and repulsion social forces (based on Romanczuk et al. 2012; Romanczuk and Schimansky-Geier 2012; a description of the model and additional information on the simulation conditions can be found in Appendix 2). Position and velocity data for independent groups of size \(n = 5,\,10,\) and 20 were generated for high \((\eta = 0.2)\) and low \((\eta = 0.15)\) noise conditions.
Empirical experiments
Movement data for fish come from videos originally recorded by Katz et al. (2011). In that work, groups of 10, 30, and 70 golden shiners (Notemigonus crysoleucas) were purchased from Anderson Farms (www.andersonminnows.com) and filmed in a \(1.2 \times 2.1\,\mathrm {m}\) tank with an overhead camera. Videos were then corrected for lens distortion, and fish were tracked using the custom in-house software developed by Haishan Wu and used in Rosenthal et al. (2015). The software begins by detecting all individuals in each frame, then links individuals across frames to form tracks. All tracks were manually corrected to ensure accuracy. Individual positions and velocities were estimated from these tracks using a \(3^\mathrm{rd}\)-order Savitzky–Golay filter (Savitzky and Golay 1964; similar to, e.g., Harpaz et al. 2017) with a 7-frame smoothing window (videos were recorded at 30 fps). Interactions between fish are time-dependent; for the results presented here we simply chose a fixed window of \(\pm \, 15\,\mathrm {s}\) surrounding a given time t to estimate the dependency structure of the group. An optimal choice of time window is left for future work.
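As a small illustration of this preprocessing step (assuming tracked positions stored as a time-by-fish-by-2 array; the function and argument names are ours), velocities can be obtained with SciPy's Savitzky-Golay filter:

```python
from scipy.signal import savgol_filter

def smoothed_velocity(positions, fps=30, window=7, polyorder=3):
    """Estimate velocities from tracked positions of shape (T, n_fish, 2) using a
    3rd-order Savitzky-Golay filter with a 7-frame window, differentiating along time."""
    return savgol_filter(positions, window_length=window, polyorder=polyorder,
                         deriv=1, delta=1.0 / fps, axis=0)
```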
Experimental results
The algorithm outlined in the "Redundancy partitioning" section requires specifying the number of components and a parameter, \(\beta\), which controls the relative importance of maximizing the average redundancy of the components as opposed to maximally compressing the original set of system elements. While it will be interesting to investigate the ‘soft-partitioning’ aspect of this approach in future work, here we simply consider the hard assignment case, which requires only that \(\beta\) be large. Figure 4 (Right) illustrates this point, showing the stabilization of average component redundancy for \(\beta > 5\). We found that \(\beta = 200\) was sufficient to recover hard assignments in all cases tested here.Footnote 6 Since relative redundancy ranges between 0 and 1 for any dataset, these parameter values should generalize well to other systems and leave the method free of parameter fine-tuning.
To validate that the Monte Carlo estimate of \(\Delta d(i,j)\) employed here is effective, we compared its behavior to exact computations of \(\Delta d(i,j)\) for small system sizes (simulated groups of size 5 and 10). We ran each version of the algorithm for up to 10 components and took the best (maximum) average component redundancy achieved over 100 random initializations of the assignment matrix p(j|i). Figure 4 (Left) shows that the results are in good agreement, and where there are discrepancies they tend to favor the Monte Carlo method, in that it recovers solutions with higher average redundancy.
Next, we tested the algorithm on simulated data in which the dependency structure of the simulated groups was known, using the hard partitioning variant of the algorithm for computational efficiency. For each test, we computed the maximum average component redundancy recovered for up to 10 components, again using 100 random initializations of the assignment matrix for each computation. In all cases partitioning decreases the average redundancy of the system with increasing number of components (Fig. 5).Footnote 7 However, the magnitude of the change in average redundancy (or ‘\(\Delta\) average redundancy’) from m to \(m-1\) components is informative of the system's dependency structure. Small values of \(\Delta\) average redundancy occur when subdividing the system has a comparatively minor impact on average redundancy, which should be expected when partitioning relatively independent parts of the system. In comparison, a large increase in \(\Delta\) average redundancy appears to occur when a strongly interacting component is split. This can be seen by comparing the \(\Delta\) average redundancy curves for each group size between instances of a single group in the system (Fig. 5, Left) and two independent, non-interacting groups in the same system (Fig. 5, Middle). The \(\Delta\) average component redundancies for systems containing only a single group have either no or only shallow local minima, followed by at most small increases. In comparison, the \(\Delta\) average redundancies for systems with two non-interacting groups, in matched pairs of groups of size 5, 10, and 20, have comparatively deep local minima first occurring at 2 components for \(n =\) 5 and 10, and at 4 components for \(n = 20\), followed directly by relatively large increases in \(\Delta\) average redundancy. At the point preceding each of these transitions from low to high \(\Delta\) average redundancy, the two non-interacting groups are assigned to separate components by the algorithm, and in the \(n = 20\) case the two groups are further subdivided into two spatially assorted components each. Finally, the \(\Delta\) average redundancies for a system of three non-interacting groups of mixed sizes 5, 10, and 20 were computed, with local minima first occurring at 3 and 4 components for the high and low noise conditions, respectively (Fig. 5, Right), followed by large increases in \(\Delta\) average redundancy.Footnote 8 Taken together, this is evidence that the transition from low to high \(\Delta\) average component redundancy recovered by the algorithm reflects the dependency structure of the underlying system. It suggests that these features may be useful in identifying relevant structure in other systems, even those with less extreme dependency structures.
Figure 6 illustrates the iterative generation of assignments for the algorithm in the mixed three group (high noise) case. Assignments change and harden until they converge on a (local) maximal average redundancy partition of the system’s elements (Left). The assignments generated by the algorithm of system elements to components correspond one-to-one with the original, non-interacting set of three groups (of sizes 5, 10, and 20) comprising the whole system (of total size 35). Positions of the elements of the system and their velocity vectors are shown for one time point, colored by the component they were assigned to (which corresponds to their original group), in Fig. 6 (Left). Note that, while the snapshot shown in Fig. 6 was chosen to show the three distinct groups, at many points in the simulation the positions, velocities, or both, overlapped between the three groups. The algorithm is able to recover the independent groups in the system without using spatial position information, based on coordination in individual velocities alone.
Finally, we applied the algorithm to empirical data collected on fish schools to validate that the method recovers sensible results for strongly interacting groups and for non-simulated data. Figure 7 shows that, for fish, groups of size 10 interact strongly enough (in at least the one instance tested here) to be considered one coherent unit, while groups of size 30 are already large enough to contain subsets that interact more strongly with one another than with the rest of the group (e.g., the local minimum in \(\Delta\) average redundancy at \(m = 5\) components; Fig. 7, Middle). The component assignments at the \(m = 5\) local minimum and the fish positions for the school of 30 are shown in Fig. 7 (Right) at a single time point. The subdivisions of the system show strong spatial assortment, with a stratification of the group from front to back. As in the simulation case, here we use only coordination in individual velocities to determine partitions, so this spatial assortment is a consequence of similar behavior as opposed to some criterion based on proximity. Further work is needed to investigate the duration of substructure in fish schools, as well as the emergence and disappearance of components over time.