1 Introduction

Political opinion polls capture how the opinions of people within a society regarding a certain topic, or their current voting preferences, are distributed. Individual opinions do not have to be constant, but rather are subject to change induced by impactful events or by the opinions of peers, a phenomenon formalized under the term conformity in Stangor (2015). There have been recent advances in simulating the process in which members of a society change their opinions; see, e.g., Banisch et al. (2011), Klimek et al. (2007), Misra (2012), Li et al. (2012), Nardini et al. (2008), Böhme and Gross (2012), Bolzern et al. (2017) and the review articles Anderson and Ye (2019), Xia et al. (2011), Castellano et al. (2009), Sîrbu et al. (2017). This is in part due to increasing computing power, which makes it feasible to run agent-based models that simulate the behaviour of members of a synthetic population, such as members of a society, on the microscale by emulating their decision-making rules. The agents are often treated as the nodes of a network, while an edge between two nodes means that these agents are neighbours of each other and thus influence each other’s respective opinions.

One is often not interested in modelling, or predicting, which person has which opinion, but rather, as in polls, what the percentage of each opinion within the society is. There is ample interest in deriving dynamics for the evolution of these percentages.

In this article, we present a framework for identifying the governing equations for the dynamics of opinion percentages for different types of networks; more precisely, we show how the governing equations can be inferred from data on the opinion percentages. To this end, we will emulate the decision-making process with a simple agent-based model (ABM) that is based on the assumption of conformity and inspired by the ABM in Misra (2012). Introductions to agent-based modelling in general can be found in Jennings et al. (1998) and Laubenbacher et al. (2009), and specifically to agent-based models for opinion dynamics in Banisch (2016).

The literature contains a variety of approaches for finding governing equations on the macrolevel (here, opinion percentages) based on microdynamics (here, agent-based model). However, most do not deal with opinion formation or voter models, but with models originating from the context of the natural sciences. There it is well known that the aggregation process from the micro- to the macrolevel typically leads to non-Markovian processes, i.e., finding the governing equations on the macrolevel requires the inclusion of memory, cf. the Mori–Zwanzig formalism (Zwanzig 2001; Lin and Lu 2019; Chorin et al. 2002). In the context of opinion formation, this aspect is hardly discussed at all. Banisch (2014) discusses the issue for agent-based models; he gives stochastic and combinatorial arguments for the appearance of memory with heterogeneous microstructure, but does not present any practical methods for finding appropriate governing equations for the macrodynamics. Several other authors discuss the micro-macroaggregation problem in opinion formation, e.g., via influence matrices between agents (Wu et al. 2018; Ravazzi et al. 2019; De et al. 2019), but ignore memory effects entirely. Others discuss memory effects, but only on the microlevel, e.g., Jedrzejewski and Sznajd-Weron (2018), Chen et al. (2018) (agents have memory), Moussaïd et al. (2013) (agents gain experience) or Boschia et al. (2019) (microdynamics depends on collective memory). Very few articles consider practical methods for finding governing equations on the macrolevel, e.g., by inferring them from microlevel simulation data, and those that do ignore memory effects, cf. Lu et al. (2019). Thus, there is a significant gap between Banisch’s insight that opinion aggregation introduces memory and its practical use for finding an appropriate description of the resulting macrodynamics.

This article aims at closing this gap by (1) utilizing techniques like the Mori–Zwanzig formalism and Takens’ well-known embedding theorem for showing that agent-based models for the microdynamics lead to memory effects on the macrolevel if the interaction between the agents is heterogeneous, while doing this in a way that allows for (2) proposing practical algorithmic techniques to learn governing equations for the macrodynamics including memory utilizing macroobservations of microlevel simulation data.

More precisely, we investigate complete and incomplete interaction networks: in complete networks, every agent interacts with all others (homogeneous interaction), while in incomplete networks there are subcommunities within the society that have few links between each other (heterogeneous interaction). As we will show, in the case of a complete network, one can identify a Markovian model for the macrodynamics of the opinion percentages using standard well-mixedness arguments known from mean-field approaches or population limits, e.g., for predator–prey models (Berryman 1992). However, the arguments used for that case do not hold when the network is not complete. We will show how to use information from the past (memory) via a kind of delay embedding of the dynamics to describe the evolution of opinion percentages in the general case.

The exact reason for the inclusion of memory will formally be derived in Sect. 2 by using the Mori–Zwanzig formalism (Zwanzig 2001; Lin and Lu 2019; Chorin et al. 2002). Inspired by problems in statistical physics, the Mori–Zwanzig formalism explains how in the case of only low-dimensional observations of a high-dimensional system being available, the evolution of these observations of the full system can be obtained by replacing the missing information of the full system by past information of these available observations. This is in light of the result of Takens (1981) that states that, under fairly generic assumptions, the delay embedding of the dynamics of an observable is diffeomorphic to the dynamics of the full system.

There are various techniques for the modelling of time-discrete dynamical systems which involve the memory of the system. An intuitive approach is given by higher-order Markov models (Raftery 1985; Tuyen 2018). These models are defined by transition probabilities between discrete states where each state represents a sequence of cells of a discretization of the state space with a given length (“memory depth”). Although these models can be powerful in investigating the long-term behaviour of the process by means of Markov state models for Markovian processes (Bowman et al. 2014), they suffer from two problems: the loss of accuracy caused by the discretization and a number of states that increases exponentially with the length of the sequences and the number of grid cells.

Another example is simplex projection as in Sugihara and May (1990) where, using Takens’ result, subsequent states of a system are predicted from the successors of past patterns that resemble its recent history. A more recent modelling technique is long short-term memory neural networks (LSTMs) (Hochreiter and Schmidhuber 1997; Pan and Duraisamy 2018), a subclass of recurrent neural networks specifically designed for the prediction of time series for which past information is vital. However, both these techniques provide little to no understanding of the dynamical rules of the system: simplex projection does not produce any model or dynamical law, but rather uses a procedure similar to the nearest neighbour classification algorithm (see, e.g., Devroye et al. (2013)). LSTMs, as most neural networks, typically have far too many parameters to admit interpretability. An additional means for forecasting of memory-dependent dynamical systems is the well-known class of autoregressive (AR) models (Brockwell and Davis 1991), which describes the evolution of a system by a linear combination of its most recent states. Additionally, there exist variants of these AR models that are sparse (Davis et al. 2012; Fujita et al. 2007) or nonlinear (Billings 2013) or comprise both aspects in application to a singular value decomposition of a data matrix (Brunton et al. 2016). As we will see, linear (Markovian) systems cannot describe the evolution of opinion percentages even in the simplest case, but simple polynomial terms are sufficient for fully connected networks. We shall address this point with nonlinear AR (NAR) models, as derived through the Mori–Zwanzig formalism.

In addition to the analysis of micro-macroaggregation for opinion formation, further novelty in our work lies in the methods we propose for learning NAR models from data, to describe the evolution of opinion percentages, and their theoretical justification. We will show that the prediction accuracy of the NAR models for the opinion percentages increases with larger memory depths. To this end, we will deploy methods from data-driven (sparse) system identification—as in dynamic mode decomposition (Schmid and Sesterhenn 2008; Tu et al. 2014; Jovanovic et al. 2013) or sparse identification of nonlinear dynamics (SINDy) (Brunton et al. 2016a)—to the field of opinion dynamics. More precisely, we will extend SINDy towards finding (sparse) NAR models to describe the evolution of opinion percentages. The new method is called “sparse identification of nonlinear autoregressive models” (SINAR), as it is technically a natural generalization of SINDy by including nonlinear memory terms. We will demonstrate that SINAR is well suited for our purposes in learning macroscopic opinion dynamics. A conceptually similar method has been introduced in Brunton et al. (2016) with Hankel alternative view of Koopman (HAVOK). It can be interpreted as a special case of SINAR.

Outline In Sect. 2, we start with outlining the opinion aggregation process and proceed with the derivation of NAR models for the evolution of observations through the Mori–Zwanzig formalism. Next, in Sect. 3, we present the SINAR method for estimating the coefficients in these NAR models from data. Last, we demonstrate how to apply SINAR for increasing the accuracy of prediction of opinion percentages in the case of incomplete interaction networks in Sect. 4.

2 Derivation of a Nonlinear Autoregressive Model Using the Mori–Zwanzig Formalism

Below, we will model the spread of opinions inside a closed society by an agent-based model. It will consist of a high number N of agents who change their opinions \(X_i\), \(i=1,\ldots , N\), within a finite set of M possible opinions over discrete time steps according to a rule that is based on the opinions of themselves and other agents. This rule will be Markovian, or memory-free, i.e., the changes of opinions are only influenced by opinions in the current time step. These dynamics will be called the microdynamics. The state of the microdynamics at time t is denoted by \(X_t = [(X_t)_1,\dots ,(X_t)_N]^T\). The respective state space is denoted by \(\mathbb {X}\) and has cardinality \(|\mathbb {X}| = M^N\).

We will only be able to observe the percentages of opinions, i.e., the ratios of those among all agents with each of the M opinions. In this article, we are interested in identifying the dynamical rules of the evolution of the percentages of opinions, which we call the macrodynamics. Identifying the dynamics of low-dimensional observations of a higher-dimensional system is a typical setup for the Mori–Zwanzig formalism (Zwanzig 2001; Chorin et al. 2002; Lin and Lu 2019). We will consider a general framework for this and show how it yields a nonlinear autoregressive model (Billings 2013) for the macrodynamics. Later on, we show how it can be applied to the specific case of the spread of opinions.
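As a small illustration of the observation step described above, the following sketch maps a microstate (one opinion per agent) to the vector of opinion percentages. The function name `opinion_percentages` and the labelling of the M opinions by the integers 0, ..., M-1 are assumptions made for this example only.

```python
import numpy as np

def opinion_percentages(X, M):
    """Map a microstate X (one opinion label per agent) to the M opinion shares."""
    X = np.asarray(X)
    counts = np.bincount(X, minlength=M)  # number of agents holding each opinion
    return counts / X.size                # shares sum to one

# Hypothetical microstate with N = 8 agents and M = 3 opinions labelled 0, 1, 2.
X_t = np.array([0, 2, 1, 1, 0, 1, 2, 1])
x_t = opinion_percentages(X_t, M=3)
# Two, four and two of the eight agents hold opinions 0, 1 and 2: shares 0.25, 0.5, 0.25.
print(x_t)
```

Note that the map is many-to-one: permuting the agents leaves the percentages unchanged, which is exactly the loss of information that the macrodynamics have to cope with.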

2.1 The Setting: Microdynamics and Projected Observations

First we assume that the microdynamics are Markovian (memory-free) and deterministic. We consider the dynamical system \(F:\mathbb {X}\rightarrow \mathbb {X}\) that governs the microdynamics

$$\begin{aligned} X_{t+1} = F(X_t) \in \mathbb {X}. \end{aligned}$$

Further, we denote the space of observations of the microdynamics (observables) by \(\mathbb {Y} \subseteq \mathbb {R}^m\) and by \(\mathcal {G} := \lbrace g : \mathbb {X}\rightarrow \mathbb {Y} \rbrace \) the set of functions that map states of the dynamical system (2.1) to \(\mathbb {Y}\). We suppose from here on that we do not have knowledge of the state of the microdynamics at any point in time, but instead only have the value of the fixed observable \(x = \xi (X) \in \mathbb {Y}\) which we call the accessible, or relevant, variables.

Additionally, we define the subspace \(\mathcal {H}\) of functions in \(\mathcal {G}\) that depend only on these relevant variables and map to \(\mathbb {Y}\) as \(\mathcal {H} := \lbrace h \in \mathcal {G}\mid \exists \tilde{h}:\xi (\mathbb {X}) \rightarrow \mathbb {Y} :\ h = \tilde{h}\circ \xi \rbrace \). Functions in \(\mathcal {H}\) still depend on \(X \in \mathbb {X}\), but the information of \(\xi (X)\) is enough to evaluate them. When we write h(x) for \(x \in \mathbb {Y}\), we abuse notation and mean \(\tilde{h}(x)\), i.e., \(h(X)\) for any X with \(\xi (X) = x\). An example is

$$\begin{aligned} \mathbb {X} = \mathbb {R}^2,\quad \xi (X) = X_1+X_2,\quad h(X_1,X_2) = (X_1+X_2)^2 = \xi (X)^2. \end{aligned}$$

In this case, it is enough to know the value of \(\xi (X)\) to evaluate h(X).

The goal is now to represent the evolution of the observations \(x_t = \xi (X_t)\) under the microdynamics with knowledge only about values of \(x_t\), but not of the states \(X_t\) of the microdynamics. As illustrated in the following diagram, instead of taking one step of the microdynamics and then evaluating \(\xi \), we only have access to the observation \(\xi (X)\) and want to evaluate \(\xi (F(X))\) under the premise that \(\xi (X) = x\).


To this end, we define a projection operator \(P:\mathcal {G}\rightarrow \mathcal {H}\) that maps a function depending on X to a function depending on \(\xi (X)\). We additionally define its complement \(Q := Id - P\). We assume from now on that the microdynamics are stationary with an F-invariant probability distribution \(\mu \) over \(\mathbb {X}\), so that when asking what g(X) is, we assume that \(X_t\) is distributed according to \(\mu \). We, of course, are interested in the case \(g = \xi \circ F\). We follow Lin and Lu (2019) until the end of Sect. 2.2 and define P as the orthogonal projection onto the span of a set of linearly independent functions from \(\mathcal {H}\). These functions are denoted by \(\varphi _1,\dots ,\varphi _L:\mathbb {Y}\rightarrow \mathbb {R}^{m}\) and form the columns of \(\varphi = [\varphi _1,\dots ,\varphi _L]\), so that

$$\begin{aligned} (Pg)(x) := \varphi (x) \langle \varphi ,\varphi \rangle ^{-1} \langle \varphi ,g\rangle \end{aligned}$$

where \(x \in \mathbb {Y}\) and the scalar product \(\langle \cdot ,\cdot \rangle \) is defined for matrix-valued functions \(f:\mathbb {X}\rightarrow \mathbb {R}^{m \times a}\) and \(g:\mathbb {X}\rightarrow \mathbb {R}^{m \times b}\) as

$$\begin{aligned} \langle f,g\rangle := \int _{\mathbb {X}} \underbrace{f(X)^T}_{\in \mathbb {R}^{a\times m}} \underbrace{g(X)}_{\in \mathbb {R}^{m \times b}} d\mu (X) \in \mathbb {R}^{a \times b}, \end{aligned}$$

which itself is matrix-valued. The term \(\langle \varphi ,\varphi \rangle \) is a mass matrix that ensures that P is an orthogonal projection. This orthogonal projection has the property that Pg is the closest function in \(span(\varphi )\) to g with respect to \(\langle \cdot , \cdot \rangle \).

Note that if \(\mathcal {H}\) is infinite-dimensional, one would need an infinite number of functions to yield that \(span(\varphi ) = \mathcal {H}\). In this case, the projection formalism is well defined if \(\mathcal {H}\) is closed. In practice, in this case for the computation that will follow one would choose a sufficiently rich finite set of functions so that \(span(\varphi )\) covers those parts of \(\mathcal {H}\) that are of interest.
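The projection (2.3) can be approximated from samples by replacing the integrals in the scalar product with empirical averages. The sketch below does this for the toy example \(\mathbb {X} = \mathbb {R}^2\), \(\xi (X) = X_1 + X_2\) from above. The standard normal sampling distribution standing in for \(\mu \), the test function `g` and the monomial basis are assumptions chosen for illustration, not part of the original setting.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200_000, 2))   # samples standing in for the measure mu

def xi(X):
    """Observable from the toy example: xi(X) = X_1 + X_2."""
    return X[:, 0] + X[:, 1]

def phi(x):
    """Assumed basis on the observation space: 1, x, x^2 (so L = 3, m = 1)."""
    return np.stack([np.ones_like(x), x, x**2], axis=1)   # shape (n, L)

def g(X):
    """Test function: a part depending only on xi(X) plus a part that does not."""
    return (X[:, 0] + X[:, 1])**2 + X[:, 0] - X[:, 1]

Phi = phi(xi(X))
gram = Phi.T @ Phi / len(X)    # empirical <phi, phi>, the mass matrix
rhs = Phi.T @ g(X) / len(X)    # empirical <phi, g>
coeffs = np.linalg.solve(gram, rhs)

def Pg(x):
    """Projected function (Pg)(x) = phi(x) <phi, phi>^{-1} <phi, g>."""
    return phi(np.atleast_1d(x)) @ coeffs

print(coeffs)
```

For this choice, \(X_1 - X_2\) is independent of \(\xi (X)\) and has zero mean, so the coefficients should be close to (0, 0, 1): the projection keeps \(\xi (X)^2\) and discards the part of g that the observable cannot resolve.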

2.2 Mori–Zwanzig Representation of the Macrodynamics

We will now show how to represent the evolution of the observations over time. With the Koopman operator (Koopman 1931) \(\mathcal {K}\) for the system (2.1), defined as the operator that maps a function \(g \in \mathcal {G}\) to \(g\circ F \in \mathcal {G}\), we consider the Dyson formula

$$\begin{aligned} \mathcal {K}^{t+1} = \sum _{k=0}^t \mathcal {K}^{t-k} P\mathcal {K} (Q\mathcal {K})^k + (Q\mathcal {K})^{t+1}. \end{aligned}$$

The Dyson formula describes a way to iteratively split up the application of the Koopman operator to a function g into parts \(P\mathcal {K}g\) and \(Q\mathcal {K}g\). Equation (2.4) yields, by application of both sides of the equation to \(\xi \) and evaluation at the initial value \(X_0\) of the microdynamics, that

$$\begin{aligned} x_{t+1} = \sum _{k=0}^t [P(\rho ^k\circ F)](x_{t-k}) + \rho ^{t+1}(X_0), \end{aligned}$$

where \(\rho ^k := (Q\mathcal {K})^k\xi \). The derivation of Eq.  (2.5) is explained in detail in Appendix A.1, together with interpretation of terms of its right-hand side.
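The Dyson formula is a purely algebraic identity, so it can be checked numerically. The sketch below assumes a finite-dimensional function space in which the Koopman operator and the projection are represented by matrices. The random matrices `K` and `B` are placeholders, not derived from any particular microdynamics, and the remainder term is taken to be the (t+1)-th power of \(Q\mathcal {K}\), consistent with the term \(\rho ^{t+1}\) in (2.5).

```python
import numpy as np

rng = np.random.default_rng(1)
n, r, t = 6, 2, 4                          # function-space dimension, rank of P, time index

K = 0.5 * rng.standard_normal((n, n))      # placeholder for the Koopman operator
B = rng.standard_normal((n, r))            # columns span the range of P
P = B @ np.linalg.solve(B.T @ B, B.T)      # orthogonal projector onto span(B)
Q = np.eye(n) - P                          # complement Q = Id - P

mpow = np.linalg.matrix_power
lhs = mpow(K, t + 1)
rhs = sum(mpow(K, t - k) @ P @ K @ mpow(Q @ K, k) for k in range(t + 1))
rhs = rhs + mpow(Q @ K, t + 1)             # remainder term (QK)^(t+1)

print(np.allclose(lhs, rhs))
```

The identity follows by induction from \(\mathcal {K} = P\mathcal {K} + Q\mathcal {K}\); nothing in the check uses orthogonality of P, only \(P + Q = Id\).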

Substituting the definition of P as the orthogonal projection onto basis functions as in (2.3), we obtain

$$\begin{aligned} \begin{aligned} P(\rho ^k\circ F)(x_{t-k}) =\varphi (x_{t-k})\langle \varphi ,\varphi \rangle ^{-1} \langle \varphi ,\rho ^k \circ F\rangle =: \varphi (x_{t-k}) h_k \in \mathbb {R}^m\\ \end{aligned} \end{aligned}$$

with vector-valued coefficients \(h_k = \langle \varphi ,\varphi \rangle ^{-1}\int _{\mathbb {X}}\varphi (\xi (X))^T\rho ^k(F(X))d\mu (X)\).

Finding a suitable approximation of the non-accessible noise term \(\rho ^{t+1}(X_0)\) in (2.5) is generally a non-trivial task and depends on properties of the microdynamics. Examples are discussed in Li and Chu (2017), Hijón et al. (2010), Kondrashov et al. (2015). From this point onwards, we will make the simplification of replacing \(\rho ^{t+1}(X_0)\) by a zero-mean stochastic noise term \(\varepsilon _{t+1}\in \mathbb {R}^m\). A typical practice is to let \(\varepsilon _{t+1}\) be a zero-mean Gaussian random variable as, e.g., in Lin and Lu (2019), Lei et al. (2016). With this, we obtain the macrodynamics

$$\begin{aligned} x_{t+1} = \sum _{k=0}^t \varphi (x_{t-k}) h_{k} + \varepsilon _{t+1}. \end{aligned}$$

As we can see, the evolution of the observations now depends on past terms, although the microdynamics are Markovian. For \(k > 0\), the terms \([P(\rho ^k\circ F)](x_{t-k})\) in Eq. (2.5) and \(\varphi (x_{t-k})\) in Eq. (2.7) are usually referred to as memory terms.

2.3 Macrodynamics as a Nonlinear Autoregressive Process

If it is reasonable to assume a sufficiently fast decay of the terms \(h_{k}\) with increasing k, the memory terms that lie far in the past have negligible influence (Horenko et al. 2007; Venkataramani et al. 2017; Chorin et al. 2000; Zhu et al. 2018). In light of (2.5) and (2.6), it is sufficient that the \(\rho ^k\) decay fast. To understand when this is the case, we recall \(\rho ^k = (Q\mathcal {K})^k\xi \) and assume that \(\mathrm {range}(P) \approx \mathcal {H}\), i.e., functions parametrized by \(\xi \) are well approximated by the chosen approximation space. Then, \(\rho ^k\) decays fast if \(Q\mathcal {K}\) has a small norm, which is the case if F mixes functions that are perpendicular to \(\mathcal {H}\) well. In other words, the dominant modes of \(\mathcal {K}\) should align well with the space \(\mathcal {H}\). For quantitative statements we refer to Zhu et al. (2018).

Thus, in order to obtain a feasible number of memory terms, from now on we approximate the dynamics by ending the sum in (2.7) with \(k = p-1\) instead of \(k = t\), i.e., by truncating the terms \(\varphi (x_{t-p})h_p,\dots ,\varphi (x_0)h_t\). Regarding the selection of an appropriate value for the memory depth p, there are various methods such as information criteria (Konishi and Kitagawa 2008; Aho et al. 2014) or the L-curve method (Hansen and O’Leary 1993). We have thus derived a nonlinear autoregressive model (NAR) (Billings 2013; An and Huang 1996) over x given by

$$\begin{aligned} x_{t+1} = \sum _{k=0}^{p-1} \varphi (x_{t-k}) h_{k} + \varepsilon _{t+1} \end{aligned}$$

with matrix-valued basis functions and vector-valued coefficients \(h_k\).

In Sect. 3, we will introduce a method that identifies coefficients for NAR models in a way that is motivated by system identification methods such as dynamic mode decomposition (Williams et al. 2014; Tu et al. 2014), extended dynamic mode decomposition (Williams et al. 2014) or sparse identification of nonlinear dynamics (Brunton et al. 2016a, b), see Fig. 1, where the dynamics are expressed with a vector of scalar-valued basis functions and a matrix-valued coefficient. Having selected the scalar-valued basis functions \(\tilde{\varphi }_1,\dots ,\tilde{\varphi }_K\) and denoting \(\tilde{\varphi } = [\tilde{\varphi }_1,\dots ,\tilde{\varphi }_K]^T:\mathbb {Y}\rightarrow \mathbb {R}^K\), we thus formulate the macrodynamics

$$\begin{aligned} x_{t+1} = \sum _{k=0}^{p-1} H_{k} \tilde{\varphi }(x_{t-k}) + \varepsilon _{t+1}, \end{aligned}$$

with \(H_k \in \mathbb {R}^{m\times K}\). Although this may seem like only a slight notational modification, the two formulations represent different model forms. While in (2.8) the dynamics are expressed using different basis functions and the same coefficients across all coordinates, we will now switch to the framework in (2.9) where we select scalar-valued basis functions \(\tilde{\varphi }_1,\dots ,\tilde{\varphi }_K\) which are used for each coordinate, while the coefficients for all coordinates can be different (the different rows of the \(H_k\)). In summary, for (2.8), one chooses L m-dimensional basis functions and finds L-dimensional coefficients, while for (2.9), one chooses K one-dimensional basis functions and finds \((m\times K)\)-dimensional coefficients.
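To make the model form (2.9) concrete, the following sketch evaluates one step of a NAR model with scalar-valued basis functions and matrix-valued coefficients. The basis `phi_tilde` (linear and squared coordinates), the coefficient matrices and the helper name `nar_step` are hypothetical choices for illustration.

```python
import numpy as np

def phi_tilde(x):
    """Assumed scalar-valued basis: x_1, ..., x_m, x_1^2, ..., x_m^2 (so K = 2m)."""
    return np.concatenate([x, x**2])

def nar_step(history, H, noise_std=0.0, rng=None):
    """One step of x_{t+1} = sum_k H_k phi_tilde(x_{t-k}) + eps_{t+1}.

    history: the p most recent observations, newest first (x_t, x_{t-1}, ...).
    H:       list of p coefficient matrices H_k, each of shape (m, K).
    """
    x_next = sum(H_k @ phi_tilde(x) for H_k, x in zip(H, history))
    if rng is not None and noise_std > 0.0:
        x_next = x_next + noise_std * rng.standard_normal(x_next.shape)
    return x_next

# Hypothetical example with m = 2 opinions and memory depth p = 2.
H0 = np.array([[1.0, 0.0, 0.0, 0.0],
               [0.0, 1.0, 0.0, 0.0]])
H1 = np.array([[0.0, 0.0, 0.1, 0.0],
               [0.0, 0.0, 0.0, -0.1]])
history = [np.array([0.4, 0.6]), np.array([0.5, 0.5])]   # x_t, x_{t-1}
x_next = nar_step(history, [H0, H1])
print(x_next)   # 0.4 + 0.1 * 0.25 and 0.6 - 0.1 * 0.25
```

Each row of a matrix \(H_k\) holds the coefficients for one coordinate of the observation, which is the freedom that distinguishes (2.9) from (2.8).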

Equation (2.9) is still consistent with the way we derive (2.8) through the Mori–Zwanzig formalism: basis functions are evaluated at observations made at distinct times, so that no terms with mixed delays occur. In Appendix A.2, we show how to choose basis functions and coefficients in each of the models to derive the equivalent dynamics. Please note that this does not mean that both model forms are always equivalent, as explained above. Rather, one can always choose \(\tilde{\varphi }\) in dependence on \(\varphi \), or vice versa, in a way that makes the dynamics equivalent.

2.4 Stochastic Microdynamics

Let us consider stochastic dynamics

$$\begin{aligned} X_{t+1} = F(X_t,\omega _t) \end{aligned}$$

where \(\omega _t \in \Omega \) is a random influence on F which is now defined as \(F:\mathbb {X}\times \Omega \rightarrow \mathbb {X}\). We will assume that the noise process \(\omega _t\), \(t\in \mathbb {N}\), is i.i.d. with law \(\mathbb {P}\). In this case, we only strive to forecast the expected macrodynamics, and define the (stochastic) Koopman operator as

$$\begin{aligned} (\mathcal {K} g)(X) = \mathbb {E}_{\mathbb {P}}[g(F(X,\omega ))]. \end{aligned}$$

The spaces \(\mathcal {G}\) and \(\mathcal {H}\), as well as the projection P, remain unchanged. Naturally, the derivation of the Mori–Zwanzig approximation requires the corresponding modifications. For example, the last step in (2.5) now has to be modified as:

$$\begin{aligned} \begin{aligned} \left[ P \mathcal {K}\rho ^k\right] (x_{t-k}) = \varphi (x_{t-k}) \langle \varphi ,\varphi \rangle ^{-1} \int _{\Omega }\int _{\mathbb {X}} \varphi (\xi (X))^T \rho ^k(F(X,\omega ))d\mu (X) d\mathbb {P}(\omega ). \end{aligned} \end{aligned}$$

We thus obtain the same structure of the macrodynamics as in (2.7), where for the computation of the coefficients \(h_k\) in (2.6) the expectation with respect to \(\mathbb {P}\) has to be added.
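The stochastic Koopman operator is an expectation over the noise and can therefore be estimated by Monte Carlo sampling. The sketch below uses a hypothetical one-dimensional map with additive standard normal noise and the function \(g(X) = X^2\), for which the expectation is known in closed form. None of these choices is taken from the opinion model.

```python
import numpy as np

def F(X, omega, a=0.5):
    """Hypothetical stochastic microdynamics: linear map plus additive noise."""
    return a * X + omega

def g(X):
    return X**2

def stochastic_koopman(g, F, X, n_samples=200_000, seed=0):
    """Monte Carlo estimate of (K g)(X) = E[g(F(X, omega))] with omega ~ N(0, 1)."""
    rng = np.random.default_rng(seed)
    omega = rng.standard_normal(n_samples)   # i.i.d. draws from the assumed noise law
    return g(F(X, omega)).mean()

estimate = stochastic_koopman(g, F, X=2.0)
exact = 0.5**2 * 2.0**2 + 1.0                # a^2 X^2 + Var(omega) for this toy example
print(estimate, exact)
```

The same averaging over noise realisations enters the coefficients \(h_k\) once the microdynamics are stochastic.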

3 Sparse Identification of Nonlinear Autoregressive Models (SINAR)

We propose here a method of data-based identification for the coefficients \(H_k\) in (2.9) that is an extension of the sparse identification of nonlinear dynamics (SINDy) algorithm from Brunton et al. (2016a), Brunton et al. (2016b), Kaiser et al. (2018). SINDy can be used to identify the governing equations of a Markovian (in our case, discrete-time) dynamical system

$$\begin{aligned} x_{t+1} = f(x_t) \in \mathbb {R}^m \end{aligned}$$

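from data

$$\begin{aligned} \mathbf{X} = [x_0,\dots ,x_{T-1}] \in \mathbb {R}^{m\times T},\qquad \mathbf{X} ' = [x_1,\dots ,x_T] \in \mathbb {R}^{m\times T}. \end{aligned}$$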

We will extend this method to non-Markovian systems by applying SINDy to an extended version of \(\mathbf{X} \), the Hankel matrix

$$\begin{aligned} \tilde{\mathbf{X }} = \left[ \begin{array}{cccc} x_{p-1} &{} \ldots &{} x_{T-1} \\ \vdots &{} &{} \vdots \\ x_0 &{}\ldots &{} x_{T-p} \end{array} \right] . \end{aligned}$$

In essence, this is the concept used for the Hankel alternative view of Koopman (HAVOK) analysis from Brunton et al. (2016), where an autoregressive model is identified on transformed coordinates obtained from a singular value decomposition of the Hankel matrix from a scalar-valued observation function to separate linear from nonlinear, or even chaotic, behaviour of a Markovian system. We, however, seek a formulation for the dynamics of multidimensional observations. In this section and by the choice of the name SINAR, we explicitly want to point out the connection of system identification methods for nonlinear Markovian systems to their counterparts for nonlinear non-Markovian systems (with finite memory these are NAR systems) that can be derived through the Mori–Zwanzig formalism from Sect. 2.

3.1 SINDy: A Short Summary

We start with a short description of SINDy (Brunton et al. 2016a). In SINDy, we try to approximate each coordinate of f by a linear combination of basis functions \(\theta _i:\mathbb {R}^{m}\rightarrow \mathbb {R}\) and define

$$\begin{aligned} \Theta (x) = \left[ \begin{array}{cccc} \theta _1(x)\\ \vdots \\ \theta _v(x)\\ \end{array} \right] , \qquad \Theta (\mathbf{X} ) = \left[ \begin{array}{cccc} \theta _1(x_0) &{} \dots &{} \theta _1(x_{T-1})\\ \vdots &{} &{} \vdots \\ \theta _v(x_0) &{} \dots &{} \theta _v(x_{T-1}) \end{array} \right] . \end{aligned}$$

To this end, we fit a sparse coefficient matrix \(\Xi \in \mathbb {R}^{m \times v}\) with rows \(\Xi _i\) to the data \(\mathbf{X} ,\mathbf{X} '\) by solving for every row \(\mathbf{X} '_i\) of \(\mathbf{X} '\),

$$\begin{aligned} \Xi _i = \mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{\Xi _i} \Vert \mathbf{X} _i' - \Xi _i \Theta (\mathbf{X} ) \Vert _F + \lambda \Vert \Xi _i \Vert _1. \end{aligned}$$

We then obtain the model

$$\begin{aligned} x_{t+1} \approx \Xi \Theta (x_t). \end{aligned}$$

In (3.2), we enforce a sparsity constraint using the LASSO regression algorithm (Tibshirani 1996) in which a regularization term is added onto the coefficient matrix, in order to only obtain the basis functions from \(\Theta \) that are dominant for the relation between \(x_{t+1}\) and \(\Theta (x_t)\).

The use of the 1-norm generates a sparse solution if we set \(\lambda > 0 \) appropriately. Sparse models will often be less accurate than non-sparse models. However, what we gain through a sparse right-hand side of (3.3) is a better interpretability of the model since only the dominant terms have been identified as influential to the dynamics. It is vital to set \(\lambda \) so that the loss of accuracy is minimal compared to the gain in interpretability.
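A minimal sketch of the sparse regression step in SINDy is given below for a scalar system (m = 1). It makes several assumptions that are not taken from the text: the 1-norm-penalised problem is solved with a plain proximal gradient iteration (ISTA) on the squared residual, scaled by the number of samples; the data are independent snapshot pairs instead of a single trajectory; and the map, the monomial library and the value of \(\lambda \) are illustrative.

```python
import numpy as np

def soft_threshold(z, tau):
    """Proximal operator of tau * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def lasso_ista(Theta, y, lam, n_iter=5000):
    """Minimise (1/(2n)) ||y - xi @ Theta||_2^2 + lam ||xi||_1 by proximal gradient steps.

    Theta: library evaluated on the data, shape (v, n); y: targets, shape (n,).
    """
    n = Theta.shape[1]
    G = Theta @ Theta.T / n                  # Gram matrix of the library
    b = Theta @ y / n
    step = 1.0 / np.linalg.eigvalsh(G)[-1]   # 1 / Lipschitz constant of the gradient
    xi = np.zeros(Theta.shape[0])
    for _ in range(n_iter):
        xi = soft_threshold(xi - step * (G @ xi - b), step * lam)
    return xi

def library(x):
    """Assumed basis functions theta_i: 1, x, x^2, x^3."""
    return np.vstack([np.ones_like(x), x, x**2, x**3])

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=2000)        # snapshots x_t
x_next = 1.0 - 1.4 * x**2 + 0.3 * x          # snapshots x_{t+1} of a hypothetical map
Xi = lasso_ista(library(x), x_next, lam=1e-3)
print(Xi)
```

With a small \(\lambda \), the coefficients of the three active terms are slightly shrunk towards zero, while the cubic term, which is absent from the map, is suppressed. Increasing \(\lambda \) trades accuracy for sparsity, as discussed above.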

SINDy is closely related to (the first step of) the method of dynamic mode decomposition (DMD) (Williams et al. 2014; Tu et al. 2014), which aims at finding a linear connection between \(x_t\) and \(x_{t+1}\). To this end, one solves

$$\begin{aligned} A = \mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{A} \Vert \mathbf{X} ' - A \mathbf{X} \Vert _F. \end{aligned}$$
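The least squares problem for A has the closed-form solution given by the pseudoinverse of the data matrix, which the following sketch illustrates on noise-free data from a hypothetical linear system. The matrix `A_true` and the trajectory length are arbitrary choices for this example.

```python
import numpy as np

A_true = np.array([[0.9, 0.1],
                   [-0.2, 0.8]])             # hypothetical linear dynamics
T = 50
traj = np.empty((2, T + 1))
traj[:, 0] = [1.0, 0.5]
for t in range(T):
    traj[:, t + 1] = A_true @ traj[:, t]

X = traj[:, :-1]                             # columns x_0, ..., x_{T-1}
X_prime = traj[:, 1:]                        # columns x_1, ..., x_T

A = X_prime @ np.linalg.pinv(X)              # minimiser of ||X' - A X||_F
print(A)
```

For noise-free data generated by a linear map and a data matrix of full row rank, the estimate coincides with the generating matrix up to rounding errors.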

Fig. 1 Relation between different system identification methods. All of them are based on solving a least squares problem with respect to transformations of past to future states. While the AR minimization problem can be seen as the DMD problem on delay-embedded states (as in Hankel-DMD in Arbabi and Mezic (2017)) and SINDy finds a nonlinear instead of a linear connection between states, SINAR finds a nonlinear connection between multiple past states and future ones. SINAR allows for imposing a sparsity constraint onto the determination of macromodels in the same fashion as is done in SINDy for Markovian systems. This has already been done in a special way in Brunton et al. (2016), which is a special case of SINAR

3.2 Extending SINDy to SINAR

When the dynamical model (3.1) is insufficient in the sense that \(x_{t+1}\) depends not only on \(x_t\) but on memory terms too, we can apply the SINDy algorithm to suitably transformed data to obtain a nonlinear autoregressive model as in (2.9) with sparse coefficients. That is, only a few basis functions should occur with nonzero coefficients. Selecting a memory depth p and denoting

$$\begin{aligned} \tilde{x}_t := \left[ \begin{array}{cccc} x_t\\ \vdots \\ x_{t-p+1}\\ \end{array} \right] \in \mathbb {R}^{mp}, \end{aligned}$$

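let us define as data matrices the Hankel matrix \(\tilde{\mathbf{X }}\) and the matching matrix of subsequent states \(\mathbf{X} '\),

$$\begin{aligned} \tilde{\mathbf{X }} = [\tilde{x}_{p-1},\dots ,\tilde{x}_{T-1}] \in \mathbb {R}^{mp\times (T-p+1)},\qquad \mathbf{X} ' = [x_p,\dots ,x_T] \in \mathbb {R}^{m\times (T-p+1)}. \end{aligned}$$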


Again, we choose basis functions

$$\begin{aligned} \tilde{\Theta }(\tilde{x}) = \left[ \begin{array}{cccc} \tilde{\theta }_1(\tilde{x})\\ \vdots \\ \tilde{\theta }_v(\tilde{x})\\ \end{array} \right] , \qquad \tilde{\Theta }(\tilde{\mathbf{X }}) = \left[ \begin{array}{cccc} \tilde{\theta }_1(\tilde{x}_{p-1}) &{} \dots &{} \tilde{\theta }_1(\tilde{x}_{T-1})\\ \vdots &{} &{} \vdots \\ \tilde{\theta }_v(\tilde{x}_{p-1}) &{} \dots &{} \tilde{\theta }_v(\tilde{x}_{T-1}) \end{array} \right] \end{aligned}$$

for example

$$\begin{aligned} \tilde{\Theta }(\tilde{x}_t) = [(x_t)_1^2,(x_t)_1 (x_t)_2,\dots , \sin ((x_{t-1})_1), \dots , (x_{t-2})_m (x_{t-3})_1]^T, \end{aligned}$$

and minimize for every row \(\tilde{\Xi }_i\) of \(\tilde{\Xi }\):

$$\begin{aligned} \tilde{\Xi }_i = \mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{\tilde{\Xi }_i} \Vert \mathbf{X} '_i - \tilde{\Xi }_i \tilde{\Theta }(\tilde{\mathbf{X }}) \Vert _F + \lambda \Vert \tilde{\Xi }_i \Vert _1. \end{aligned}$$

Then with the basis functions with nonzero coefficients in \(\tilde{\Xi }\in \mathbb {R}^{m\times v}\), we have derived a nonlinear autoregressive model that approximates the evolution of x:

$$\begin{aligned} x_{t+1} = \tilde{\Xi } \tilde{\Theta }(\tilde{x}_t) \in \mathbb {R}^{m}, \quad \text { or, equivalently, } \quad (x_{t+1})_i = \sum _{j=1}^{v} \tilde{\Xi }_{ij} \tilde{\theta }_j(\tilde{x}_t). \end{aligned}$$

By deleting all columns of \(\tilde{\Xi }\) that only contain zeros, which should be many if we enforce the sparsity constraint, we get a reduced matrix and thus a low number of terms on the right-hand side of (3.7). We have thus identified a sparse nonlinear autoregressive model so that we call this extension of SINDy sparse identification of nonlinear autoregressive models (SINAR). Note that for a memory depth of \(p = 1\), SINDy and SINAR are equivalent. Figure 1 shows the connections between several prominent methods for learning macrodynamics from microsimulation data in the Markovian and non-Markovian setting. Figure 2 further illustrates the different structures of SINDy and SINAR.
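The steps above (delay embedding, evaluation of a library on the delay vectors, sparse regression) can be sketched as follows. Two things are assumptions of this example and not the procedure of the text: the sparse regression is done by sequentially thresholded least squares, swapped in for the 1-norm-penalised problem (3.6) because it needs no tuning of an iterative solver, and the data come from the first coordinate of the classical Hénon map, which satisfies a scalar recursion with memory depth p = 2 (this is not the extended Hénon system of Appendix B).

```python
import numpy as np

def delay_vectors(x, p):
    """Hankel matrix with columns (x_t, ..., x_{t-p+1}) for t = p-1, ..., T-1; x has shape (m, T+1)."""
    T = x.shape[1] - 1
    return np.vstack([x[:, p - 1 - k : T - k] for k in range(p)])

def library(Xt):
    """Assumed library on delay vectors (a, b) = (x_t, x_{t-1}): 1, a, b, a^2, ab, b^2."""
    a, b = Xt
    return np.vstack([np.ones_like(a), a, b, a**2, a * b, b**2])

def stlsq(Theta, y, threshold, n_iter=10):
    """Sequentially thresholded least squares (stand-in for the LASSO step)."""
    xi = np.linalg.lstsq(Theta.T, y, rcond=None)[0]
    for _ in range(n_iter):
        small = np.abs(xi) < threshold
        xi[small] = 0.0                      # drop basis functions with small coefficients
        if (~small).any():                   # refit on the remaining basis functions
            xi[~small] = np.linalg.lstsq(Theta[~small].T, y, rcond=None)[0]
    return xi

# Microdynamics: Henon map; only the first coordinate is observed.
T, p = 1000, 2
state = np.zeros((2, T + 1))
for t in range(T):
    xt, yt = state[:, t]
    state[:, t + 1] = [1.0 - 1.4 * xt**2 + yt, 0.3 * xt]
x = state[:1]                                # observations, shape (1, T+1)

X_tilde = delay_vectors(x, p)                # shape (p, T-p+1)
X_prime = x[0, p:]                           # targets x_p, ..., x_T
Xi = stlsq(library(X_tilde), X_prime, threshold=0.05)
print(Xi)
```

Eliminating the hidden coordinate gives \(x_{t+1} = 1 - 1.4x_t^2 + 0.3x_{t-1}\), so the fit should retain only the constant, \(x_{t-1}\) and \(x_t^2\). A Markovian model in \(x_t\) alone cannot represent this recursion exactly.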

Fig. 2 Sketch of the SINDy algorithm (left) and SINAR (right). SINAR contains the additional step of creating a Hankel matrix

The choice of \(\tilde{\Theta }\) allows for an arbitrary functional dependence between the distinct time-delayed observables. We can recover the special structure used in the Mori–Zwanzig formalism (2.8) and (2.9) by a particular choice of basis, namely

$$\begin{aligned} \tilde{\Theta }(\tilde{x}_t) = [ \tilde{\varphi }_1(x_t), \ldots , \tilde{\varphi }_K(x_t), \ldots , \tilde{\varphi }_1(x_{t-p+1}), \ldots , \tilde{\varphi }_K(x_{t-p+1}) ]^T, \end{aligned}$$

with \(\tilde{\varphi }_1,\dots ,\tilde{\varphi }_K\) being scalar-valued functions as introduced in Sect. 2.3. Then we could directly estimate the coefficients \(H_k\) of the model (2.9)—which was derived through the Mori–Zwanzig formalism previously—from data, provided its distribution is approximately \(\mu \). Then \(\tilde{\Xi }\) has the block-wise form

$$\begin{aligned} \tilde{\Xi } = [H_0,\dots ,H_{p-1}] \in \mathbb {R}^{m\times pK} \end{aligned}$$

so that

$$\begin{aligned} \tilde{\Xi }\tilde{\Theta }(\tilde{x}_t) = \sum _{k=0}^{p-1}H_k \begin{bmatrix} \tilde{\varphi }_1(x_{t-k})\\ \vdots \\ \tilde{\varphi }_K(x_{t-k}) \end{bmatrix}. \end{aligned}$$
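The block-wise identity can be checked numerically. The sketch below uses random matrices \(H_k\) and, purely as an example, the five monomials up to degree two in two variables as \(\tilde{\varphi }_1,\dots ,\tilde{\varphi }_K\); all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
m, K, p = 2, 5, 3

def phi(x):
    """Example basis: linear and quadratic monomials of a 2-dimensional state."""
    x1, x2 = x
    return np.array([x1, x2, x1**2, x2**2, x1 * x2])

H = [rng.normal(size=(m, K)) for _ in range(p)]     # H_0, ..., H_{p-1}
Xi = np.hstack(H)                                   # shape (m, p*K)
history = [rng.random(m) for _ in range(p)]         # x_t, x_{t-1}, ..., x_{t-p+1}

Theta = np.concatenate([phi(x) for x in history])   # stacked basis, length p*K
lhs = Xi @ Theta
rhs = sum(H[k] @ phi(history[k]) for k in range(p))
```

Because each block of \(\tilde{\Theta }\) depends on a single delay, no products of different delays such as \((x_t)_1 (x_{t-1})_2\) appear.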

Of course, by choosing linear basis functions \(\tilde{\Theta }(\tilde{x}_t) = \tilde{x}_t\) and setting \(\lambda = 0\), one obtains a well-known linear autoregressive model (Brockwell and Davis 1991). Except for the sparsity term, the determination of model coefficients as in (3.6) is exactly the least squares method commonly used for the linear AR models. In Appendix A.3, we explain the structural equivalences and differences between SINDy, SINAR, DMD and AR models that are also sketched in Fig. 1.

The covariance of the noise term \(\varepsilon _{t+1}\) in (2.9) can be estimated in the common way for linear or nonlinear AR models (Brockwell and Davis 1991; Lin and Lu 2019) by calculating the statistical covariance between \(\mathbf{X} '\) and \(\tilde{\Xi } \tilde{\Theta }(\mathbf{X} )\) (see Appendix A.6 for more details on both statements).

In Appendix B, we apply SINAR to an extended Hénon system, a two-dimensional dynamical system that admits a global attractor, and inspect both its accuracy in short-term predictions and its capacity to reconstruct the original attractor. This is to illustrate basic properties of nonlinear autoregressive models for a simple system yielding complex dynamics.

4 Application to an Agent-Based Model for Opinion Dynamics

We will now consider a network-based model of agents that change their opinions on a topic based on the opinions of their neighbours in the network. Suppose we can only observe the percentages of agents inside the network that share each opinion, but not which agent exactly has which opinion, as in an anonymous opinion poll. Describing the evolution of these percentages can be approached by the Mori–Zwanzig formalism that we discussed in Sect. 2, since they are simply observations of hidden microdynamics. We will demonstrate the efficacy of NAR models in predicting the evolution of opinion percentages, compared with Markovian models. We use a time-discrete agent-based model (ABM), similar to the concept of modelling opinion changes in a population explained in Misra (2012). The ABM in Misra (2012), however, is time-continuous, while we use a time-discretized version of it. For the application of the Mori–Zwanzig formalism to time-continuous microdynamics, we refer the interested reader to the literature, e.g., Chorin et al. (2000, 2002).

4.1 Formulating the ABM

The ABM is given as follows: suppose there are N agents and each agent has exactly one out of M different opinions, denoted by \(1,\dots ,M\). The vector \(X_t\), which comes from

$$\begin{aligned} \mathbb {X} = \lbrace 1,\dots ,M \rbrace ^N, \end{aligned}$$

then represents the opinions of all agents at time t and \((X_t)_i\) denotes the opinion of agent i at time t. The neighbourhoods of all agents are represented by the symmetric adjacency matrix \(A \in \lbrace 0,1\rbrace ^{N\times N}\) where \(A_{ij} = 1\) means that agents i and j are neighbours of each other and \(A_{ij} = 0\) otherwise. Let \(N_i := \# (j : A_{ij} = 1)\) be the number of neighbours of agent i. The diagonal entries of A are set to 1, so that every agent is its own neighbour.

Let the procedure of opinion changing be given by the following rule: in every time step, every agent picks one of its neighbours in the network uniformly at random and adopts the opinion of this neighbour with adaption probability \(\alpha _{m'm''}\) where \(m'\) is the opinion of the agent and \(m''\) is the opinion of the selected neighbour. This results in the term

$$\begin{aligned} \mathbb {P}[(X_{t+1})_i = m'' | (X_t)_i = m'] = \alpha _{m'm''} \frac{\#( j: A_{ij} = 1 \text { and } (X_t)_j = m'')}{N_i} \text { for } m'\ne m'', \end{aligned}$$

which we denote by \(p_i^{t}(m',m'')\). The probability for an agent not to change its opinion thus is

$$\begin{aligned} p_i^t(m',m') = \mathbb {P}[(X_{t+1})_i = m' | (X_t)_i = m'] = 1-\sum _{m'' \ne m'} p_i^{t}(m',m''). \end{aligned}$$

In algorithmic form, the agent-based model is executed in the following way:

Algorithm 1

To clarify the notation, remember that \((X_t)_i\) and \((X_t)_j\) denote the opinions of agents i and j at time t. Hence, \(\alpha _{(X_t)_i(X_t)_j}\) is the adaption probability of opinion \((X_t)_j\) given that an agent has opinion \((X_t)_i\). Note that at each time t every agent is given the opportunity to change its opinion, and whether this happens is a probabilistic event depending only on the opinions at time t.
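The update rule described above can be sketched in Python as follows. This is an illustrative reading of the rule, not the authors' MATLAB implementation: the function name `abm_step` is hypothetical, and opinions are coded as \(0,\dots ,M-1\) instead of \(1,\dots ,M\) so that they can index the matrix of adaption probabilities directly.

```python
import numpy as np

def abm_step(X, A, alpha, rng):
    """One synchronous update of all agents.

    X     : integer array of length N with opinions in 0..M-1 (zero-based)
    A     : (N, N) adjacency matrix with ones on the diagonal
    alpha : (M, M) matrix of adaption probabilities, zero diagonal
    """
    X_new = X.copy()
    for i in range(X.shape[0]):
        neighbours = np.flatnonzero(A[i])
        j = rng.choice(neighbours)               # uniformly chosen neighbour
        if rng.random() < alpha[X[i], X[j]]:     # adopt with probability alpha
            X_new[i] = X[j]
    return X_new                                 # all decisions use opinions at time t

# Usage on a small complete network (toy sizes chosen for illustration).
rng = np.random.default_rng(0)
alpha = np.array([[0.0, 0.165, 0.03],
                  [0.03, 0.0, 0.165],
                  [0.165, 0.03, 0.0]])
A = np.ones((6, 6), dtype=int)
X0 = rng.integers(0, 3, size=6)
X1 = abm_step(X0, A, alpha, rng)
```

Writing into a copy `X_new` ensures that every agent reacts to the opinions at time t and not to updates made earlier in the same sweep.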

We can now state the so-defined microdynamics by

$$\begin{aligned} X_{t+1} = F(X_t,\omega _t) \end{aligned}$$

where at every time step, \(\omega _t\) denotes a tuple consisting of the N neighbours chosen by the respective agents plus N numbers \(u_i \sim \mathcal {U}[0,1]\) that are compared with the adaption probabilities \(\alpha _{(X_t)_i(X_t)_j}\) as in Algorithm 1. To be more precise, \(\omega _t\) has the form

$$\begin{aligned} \omega _t = [j_1,\dots ,j_N,u_1,\dots ,u_N],\text { } j_i \sim \mathcal {U}\lbrace j: A_{ij} = 1 \rbrace ,\text { } u_i \sim \mathcal {U}[0,1]. \end{aligned}$$

F then is given by

$$\begin{aligned} (X_{t+1})_i = F(X_t,\omega _t)_i = {\left\{ \begin{array}{ll} (X_t)_{j_i} &{} \text { if } u_i < \alpha _{(X_t)_i,(X_t)_{j_i}} \\ (X_t)_{i} &{} \, \text {otherwise}. \end{array}\right. } \end{aligned}$$

This way of stating the microdynamics seems complicated compared to the more intuitive option of denoting by \((\omega _t)_i\) the new opinion of the ith agent, distributed by \([p_i^t((X_t)_i,1),\dots ,p_i^t((X_t)_i,M)]\). However, this would mean that the distribution of \(\omega _t\) changes over time, since the \(p_i^t\) depend on \((X_t)_i\). For the Mori–Zwanzig formalism, this would prevent us from applying the procedure of skew–shift systems introduced in Sect. 2.4 where we drew all \(\omega _t\) a priori and thus independently of the \(X_t\). By using the notation of \(\omega _t\) denoting a tuple of neighbours \(j_i\) and random numbers \(u_i\) that are compared to the adaption coefficients, we can draw the whole sequence of \(\omega _t\) independently of the \(X_t\) and maintain consistency with the notation of skew–shift systems.
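The point of this formulation is that the whole random input can be generated before the simulation starts, after which F is a deterministic map. A minimal sketch, assuming zero-based opinions and hypothetical helper names `draw_omega` and `F`:

```python
import numpy as np

def draw_omega(A, T, rng):
    """Draw omega_0, ..., omega_{T-1} without looking at any state X_t."""
    N = A.shape[0]
    js = np.empty((T, N), dtype=int)
    for i in range(N):
        js[:, i] = rng.choice(np.flatnonzero(A[i]), size=T)  # neighbours j_i
    us = rng.random((T, N))                                   # numbers u_i
    return js, us

def F(X, j, u, alpha):
    """Deterministic update given omega_t = (j, u)."""
    adopt = u < alpha[X, X[j]]
    return np.where(adopt, X[j], X)

def run(X0, js, us, alpha):
    X = X0
    for j, u in zip(js, us):
        X = F(X, j, u, alpha)
    return X

rng = np.random.default_rng(1)
alpha = np.array([[0.0, 0.165, 0.03],
                  [0.03, 0.0, 0.165],
                  [0.165, 0.03, 0.0]])
A = np.ones((4, 4), dtype=int)
js, us = draw_omega(A, T=5, rng=rng)
X0 = np.array([0, 1, 2, 0])
end_a = run(X0, js, us, alpha)
end_b = run(X0, js, us, alpha)   # same omega sequence, same result
```

The distribution of `js` and `us` depends only on the network A, not on the opinions, which is the property needed for the skew–shift construction.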

4.2 Deducing Macrodynamics from the ABM

Closed-form macrodynamics.

We now define as the opinion percentages the function

$$\begin{aligned} \xi (X)= \frac{1}{N} \begin{bmatrix} \# X_i = 1\\ \vdots \\ \# X_i = M \end{bmatrix} \end{aligned}$$

and are interested in modelling how these percentages evolve over time. It turns out that for a complete network, i.e., \(A_{ij} = 1\) \(\forall i,j\), we can derive macrodynamics for the expected evolution of

$$\begin{aligned} x_{t} := \xi (X_{t}), \end{aligned}$$

that do not require memory terms. They are given by

$$\begin{aligned} \mathbb {E}[(x_{t+1})_{m'} \mid x_{t}] = (x_t)_{m'} + \sum _{m''\ne m'}(\alpha _{m''m'} - \alpha _{m'm''}) (x_t)_{m''}(x_t)_{m'} \text { for } m' = 1,\dots ,M. \end{aligned}$$

This equation can be derived as follows: in case of a complete network, \(p_i^t(m',m'') \equiv p^t(m',m'')\) is independent of i because the percentages of opinions among neighbours are equal for all agents since they all have the same neighbours. Then

$$\begin{aligned} p^t(m',m'') = \alpha _{m'm''} (x_t)_{m''}. \end{aligned}$$

In every time step, every agent with opinion \(m'\) chooses its opinion in the next time step with respective probabilities \(p^t(m',m'')\) for all opinions \(m'' \ne m'\) and probability \(1 - \sum _{m'' \ne m'} p^t(m',m'')\) for keeping opinion \(m'\). Since the number of these agents is given by \(N\cdot (x_t)_{m'}\), the expected absolute number of agents that change their opinion from \(m'\) to \(m''\) is given by

$$\begin{aligned} \begin{aligned}&\mathbb {E}[\#\text {Agents changing opinion from } m' \text { to } m''] \\&\quad = \sum _{i: (X_t)_i = m'} p^t(m',m'')\\&\quad = N\cdot (x_t)_{m'} \cdot p^t(m',m'') \\&\quad = N\cdot (x_t)_{m'} \cdot \alpha _{m'm''}\cdot (x_t)_{m''}. \end{aligned} \end{aligned}$$

Hence, from this term alone, the percentage \((x_t)_{m'}\) of \(m'\) is reduced by \(\frac{1}{N}\) times this term, which is \(\alpha _{m'm''} (x_t)_{m''} (x_t)_{m'} \). At the same time, every agent with opinion \(m''\) changes its opinion to \(m'\) with probability \(\alpha _{m''m'} (x_t)_{m'}\), so that by the analogous computation for \(\mathbb {E}[\#\text {Agents changing opinion from } m'' \text { to } m']\) the percentage of \(m'\) is increased by \(\alpha _{m''m'} (x_t)_{m''} (x_t)_{m'} \). The net change is the difference of these two terms, which is where the factor \((\alpha _{m''m'} - \alpha _{m'm''})\) comes from. As a consequence, for a complete network the expected evolution of x can be written in terms of x alone, without requiring additional information of the microstate X.
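The resulting expected update for a complete network is a gain-minus-loss balance that can be written in one line. The sketch below is illustrative; the function name `expected_step` is hypothetical, and the example values of \(\alpha \) and \(x_t\) are the ones used later in Sect. 4.3.1.

```python
import numpy as np

def expected_step(x, alpha):
    """E[x_{t+1} | x_t] for a complete network.

    Entry m' of (alpha.T - alpha) @ x equals
    sum over m'' of (alpha[m'', m'] - alpha[m', m'']) * x[m''].
    """
    return x + x * ((alpha.T - alpha) @ x)

alpha = np.array([[0.0, 0.165, 0.03],
                  [0.03, 0.0, 0.165],
                  [0.165, 0.03, 0.0]])
x = np.array([0.45, 0.1, 0.45])
y = expected_step(x, alpha)
```

Since the matrix `alpha.T - alpha` is antisymmetric, the gains and losses cancel in total and the entries of the result again sum to 1.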

Consequences of the Mori–Zwanzig formalism.

In the abstract language of the Mori–Zwanzig formalism from Sect. 2, the above means that

$$\begin{aligned} P\mathcal {K}\xi = \mathcal {K}\xi ,\qquad \text {thus }Q\mathcal {K}\xi = 0, \end{aligned}$$

because we can express \(\mathcal {K}\xi = \mathbb {E}[\xi \circ F]\) as a function of \(\xi \) directly by using (4.1). Let us now consider (2.5), where terms of the form

$$\begin{aligned} P\mathcal {K}\rho ^k \quad \text { with } \rho ^k = (Q\mathcal {K})^k\xi \end{aligned}$$

occur. Equation (4.2) yields for \(k > 0\) that \(\smash {\rho ^k = (Q\mathcal {K})^{k-1} (Q\mathcal {K}\xi ) = 0}\). In this way, we can see that memory terms are not required for the dynamics of \(\xi \) if the network is complete. However, this is generally not the case for incomplete networks, as demonstrated in detail in Banisch (2014): there, (4.2) is no longer valid, so that the \(\rho ^k\) do not vanish. In this case, by using as P the orthogonal projection onto basis functions, we were able to find approximate representations of the terms \(P(\rho ^k\circ F)\) in (2.5). Here lies another part of the value of the Mori–Zwanzig formalism: it establishes that the structure of the ensuing macrodynamics in (2.5) is additive, i.e., it can be written as a sum of transformations of memory terms of individual delays, as opposed to memory terms containing mixed delays (e.g., \(\psi _1(x_t)\psi _2(x_{t-1})\)). This guides our choice for a good approximation structure and reduces the number of potential basis functions from exponential in the delay depth p to linear.

For an incomplete network which is still sufficiently densely connected, we expect the microdynamics to be in expectation still close to that of a complete network. Thus, in such a case we expect \(Q\mathcal {K}\xi \approx 0\), even if (4.2) does not hold exactly. Consequently, assuming dense connectedness, the opinion percentages should allow for a closed-form description of their evolution with a small memory depth. In the following, we will use SINAR to identify NAR models of this form suggested by the Mori–Zwanzig formalism.

4.3 Recovering the Macrodynamics in Case of an Incomplete Network

We now create realizations of the ABM on networks with a total of N agents that consist of equally sized clusters. Inside the clusters, all agents are connected with each other. Edges between agents from different clusters exist, but are few: two agents from different clusters are connected with probability \(p_{between}\).
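Such a clustered adjacency matrix can be generated as in the following sketch. It assumes that N is divisible by the number of clusters and that each between-cluster link is drawn once and then symmetrized; the function name `cluster_network` is hypothetical.

```python
import numpy as np

def cluster_network(N, n_clusters, p_between, rng):
    """Symmetric 0/1 adjacency matrix with fully connected, equally sized clusters."""
    size = N // n_clusters                         # assumes N % n_clusters == 0
    labels = np.repeat(np.arange(n_clusters), size)
    same = labels[:, None] == labels[None, :]      # includes the diagonal
    upper = np.triu(rng.random((N, N)) < p_between, k=1)
    between = upper | upper.T                      # symmetrize random links
    return (same | between).astype(int)

rng = np.random.default_rng(0)
A_sparse = cluster_network(20, 2, 0.0, rng)    # two isolated clusters
A_mixed = cluster_network(20, 2, 0.3, rng)     # some links between clusters
A_full = cluster_network(20, 2, 1.0, rng)      # complete network
```

For \(p_{between} = 1\) the construction returns the complete network, and the diagonal is always 1 in line with the convention that every agent is its own neighbour. For \(N = 5000\) the dense random matrix is large, so a sparse representation may be preferable in practice.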

From the same initial state and with the same parameters, we create multiple realizations of the form \([X_0,\dots , X_T]\) of the ABM and deduce the percentages of opinions \([x_0,\dots ,x_T] = [\xi (X_0),\dots ,\xi (X_T)]\). We denote the realizations of the resulting macrodynamics by \(\mathbf{X} _1,\dots ,\mathbf{X} _r\) and divide these data into training data \(\mathbf{X} _1,\dots ,\mathbf{X} _{train}\) and validation data \(\mathbf{X} _{train+1},\dots ,\mathbf{X} _{r}\). Subsequently, we execute the SINAR method with different memory depths p on the training data. SINAR gives us NAR models that we use for the reconstruction of the validation data. The SINAR method can straightforwardly be applied to multiple trajectories by defining data matrices \(\mathbf{X} ' = [\mathbf{X} _1',\dots ,\mathbf{X} _{train}']\) and \(\tilde{\mathbf{X }} = [\tilde{\mathbf{X }}_1,\dots ,\tilde{\mathbf{X }}_{train}]\) in the notation of Sect. 3. We then compute the reconstruction errors of the validation data for each value of \(p = 1,\dots , p_{max}\). For the reconstruction, we divide each realization \(\mathbf{X} _i\) of the validation data into blocks of length \(l\ge p\). A block denotes l states \(\mathbf{x} ^{(j)}_i= [x_{jl},\dots ,x_{(j+1)l-1}]\), while the next block will be \(\mathbf{x} ^{(j+1)}_i= [x_{(j+1)l},\dots ,x_{(j+2)l-1}].\) We then compute a reconstruction \(\hat{\mathbf{x }}^{(j)}_i =[\hat{x}_{jl},\dots ,\hat{x}_{(j+1)l-1}]\) of this block with the NAR model obtained with SINAR, using the last p values of the previous block as starting values. We calculate the relative Euclidean error between reconstruction and data for each block by

$$\begin{aligned} err(\hat{\mathbf{x }}^{(j)}_i) = \frac{\Vert \mathbf{x} ^{(j)}_i - \hat{\mathbf{x }}^{(j)}_i\Vert _F}{\Vert \mathbf{x} ^{(j)}_i \Vert _F}. \end{aligned}$$

Afterwards, we take the mean over all \(err(\hat{\mathbf{x }}^{(j)}_i)\) to measure the performance of the NAR model.
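The block-wise evaluation can be sketched as follows for a single validation trajectory. The name `block_errors` and the `predict` interface are assumptions made for illustration, and the first block is skipped because no previous block is available to supply its starting values (the text does not say how the first block is treated).

```python
import numpy as np

def block_errors(x, predict, p, l):
    """Mean relative Frobenius error over blocks of length l.

    x       : (T+1, m) array of validation data
    predict : maps [x_t, x_{t-1}, ..., x_{t-p+1}] (newest first) to x_{t+1}
    """
    errs = []
    n_blocks = x.shape[0] // l
    for j in range(1, n_blocks):                   # block 0 has no predecessor
        hist = list(x[j * l - p : j * l][::-1])    # last p values of previous block
        recon = []
        for _ in range(l):
            nxt = predict(hist)
            recon.append(nxt)
            hist = [nxt] + hist[:-1]               # feed predictions back in
        block = x[j * l : (j + 1) * l]
        errs.append(np.linalg.norm(block - np.array(recon)) / np.linalg.norm(block))
    return float(np.mean(errs))

# Usage with toy data x_{t+1} = 0.9 x_t and two candidate models.
x = (0.9 ** np.arange(41)).reshape(-1, 1)
err_exact = block_errors(x, lambda h: 0.9 * h[0], p=1, l=10)
err_wrong = block_errors(x, lambda h: h[0], p=1, l=10)
```

Within a block the model only sees its own predictions, so the error reflects l-step prediction quality and not just one-step accuracy.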

Since the entries of \(\xi (X_t)\) always sum up to 1, information about the percentages of opinions \(1,\dots ,M-1\) immediately yields the percentage of opinion M so that we use SINAR to find an NAR model for the evolution of the percentages of the first \(M-1\) opinions only and omit the redundant information \(\xi (X)_M\). For the reconstruction error, we compare data about the percentages of only the first \(M-1\) opinions with their reconstructions. This NAR model does not necessarily ensure that the predicted first \(M-1\) percentages stay between 0 and 1 and their sum is at most 1. Since we make short-term predictions only, however, there will at most be only slight deviations from this property.

In the form of the diagram (2.2) from Sect. 2, the Mori–Zwanzig procedure applied to this concept can be described as

figure b

4.3.1 Case 1: A Complete Network

For \(p_{between} = 1\), the network is complete and there should be no improvement of the prediction by allowing memory terms.

We set \(N = 5000, T = 300\) and \(A_{ij} = 1\) \(\forall i,j\). The number of different opinions is \(M = 3\). As coefficients \(\alpha _{m'm''}\) we choose

$$\begin{aligned} \begin{bmatrix} \alpha _{11} &{}\quad \alpha _{12} &{}\quad \alpha _{13}\\ \alpha _{21} &{}\quad \alpha _{22} &{}\quad \alpha _{23}\\ \alpha _{31} &{}\quad \alpha _{32} &{}\quad \alpha _{33} \end{bmatrix} = \begin{bmatrix} 0 &{}\quad 0.165 &{}\quad 0.03\\ 0.03 &{}\quad 0 &{}\quad 0.165\\ 0.165 &{}\quad 0.03 &{}\quad 0 \end{bmatrix} \end{aligned}$$

As initial percentages we assign values to the \((X_0)_i\) so that \(\xi (X_0) = [0.45,0.1,0.45]^T\).

As the block length in the validation data, we use \(l = 40\). We can already write down the macrodynamics since they are given in (4.1) (see Appendix C.1 for details):

$$\begin{aligned} \begin{aligned} \mathbb {E}[(x_{t+1})_1 \mid x_t]&= (1+\alpha _{31} - \alpha _{13})(x_t)_1 + (\alpha _{13}-\alpha _{31}) (x_t)_1^2 + (\alpha _{21}-\alpha _{12}-\alpha _{31} + \alpha _{13}) (x_t)_1 (x_t)_2\\&= 1.135 (x_t)_1 - 0.135 (x_t)_1^2 - 0.27 (x_t)_1 (x_t)_2,\\ \mathbb {E}[(x_{t+1})_2 \mid x_t]&= (1+\alpha _{32} - \alpha _{23})(x_t)_2 + (\alpha _{23}-\alpha _{32}) (x_t)_2^2 + (\alpha _{12}-\alpha _{21}-\alpha _{32} + \alpha _{23}) (x_t)_1 (x_t)_2\\&= 0.865 (x_t)_2 + 0.135 (x_t)_2^2 + 0.27 (x_t)_1 (x_t)_2. \end{aligned} \end{aligned}$$
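As a brief sketch of the step from (4.1) to the first of these equations, one can eliminate \((x_t)_3\) using \((x_t)_3 = 1 - (x_t)_1 - (x_t)_2\); the second equation follows in the same way:

$$\begin{aligned} \begin{aligned} \mathbb {E}[(x_{t+1})_1 \mid x_t]&= (x_t)_1 + (\alpha _{21}-\alpha _{12}) (x_t)_2 (x_t)_1 + (\alpha _{31}-\alpha _{13}) (x_t)_3 (x_t)_1\\&= (x_t)_1 + (\alpha _{21}-\alpha _{12}) (x_t)_1 (x_t)_2 + (\alpha _{31}-\alpha _{13}) \bigl (1 - (x_t)_1 - (x_t)_2\bigr ) (x_t)_1\\&= (1+\alpha _{31} - \alpha _{13})(x_t)_1 + (\alpha _{13}-\alpha _{31}) (x_t)_1^2 + (\alpha _{21}-\alpha _{12}-\alpha _{31} + \alpha _{13}) (x_t)_1 (x_t)_2. \end{aligned} \end{aligned}$$

With the chosen coefficients, \(\alpha _{31}-\alpha _{13} = 0.135\) and \(\alpha _{21}-\alpha _{12}-\alpha _{31}+\alpha _{13} = -0.27\), which gives the numerical values stated above.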

Inspired by this structure, we choose as basis functions in SINAR

$$\begin{aligned}{}[\tilde{\varphi }_1, \dots , \tilde{\varphi }_K](x_t) = [(x_t)_1,(x_t)_2,(x_t)_1^2,(x_t)_2^2,(x_t)_1 (x_t)_2] \end{aligned}$$

so that

$$\begin{aligned} \begin{aligned} \tilde{\Theta }(\tilde{x}_t)&= [\underbrace{(x_t)_1,(x_t)_2,(x_t)_1^2,(x_t)_2^2,(x_t)_1 (x_t)_2}_{\text {Markovian terms as in (4.3)}},\ldots \\&\quad \underbrace{(x_{t-1})_1,(x_{t-1})_2,(x_{t-1})_1^2,(x_{t-1})_2^2,(x_{t-1})_1 (x_{t-1})_2,\dots }_{\text {Memory terms}}]^T. \end{aligned} \end{aligned}$$

Since (4.1), resp. (4.3), describe the expected evolution of the percentages and are thus in the form of deterministic models, we omit the noise term \(\varepsilon _{t+1}\) from (2.9) which we assumed to satisfy \(\mathbb {E}[\varepsilon _{t+1}] = 0\).

We create \(r=20\) realizations of which we use 12 for training and the others for validation. We set the sparsity parameter to \(\lambda = 0\) and to \(\lambda =0.05\) to test how the accuracy decreases with a sparser model. Since the macrodynamics (4.3) are Markovian, allowing memory terms yields no improvement in the prediction error on the validation data, for either the 40-step or the one-step prediction (Fig. 3). Note that the predictions with the sparse NAR model provide slightly better accuracy for large memory depths. This is because small nonzero coefficients for memory terms improve the fit of the training data, but cause errors in the prediction of the validation data, because the macrodynamics are Markovian. Through the enforced sparsity constraint, these nonzero coefficients for memory terms are cut off. The recovered sparse macrodynamics for \(p=1\) read

$$\begin{aligned} \begin{aligned} (x_{t+1})_1&= 1.1353 (x_t)_1 - 0.1351 (x_t)_1^2 - 0.2709 (x_t)_1 (x_t)_2\,,\\ (x_{t+1})_2&= 0.8655 (x_t)_2 + 0.1344 (x_t)_2^2 + 0.2699 (x_t)_1 (x_t)_2\,, \end{aligned} \end{aligned}$$

which is very close to the analytically derived macrodynamics (4.3).

Fig. 3. Results for the complete network. Top left: One realization of the microdynamics. Every column of the graphic represents the opinion of each of the 5000 agents at one point in time. Blue denotes opinion 1, green denotes opinion 2 and red denotes opinion 3. Top right: Corresponding realization of the macrodynamics \(\xi (X)\) that represent the percentages of opinions among all agents. We can observe oscillatory behaviour since agents with opinion 1 tend to change their opinion to 2 and analogously from 2 to 3 and from 3 to 1. Bottom: 40-step and one-step relative prediction errors of the NAR models determined by SINAR for different memory depths p with \(\lambda = 0\) and \(\lambda = 0.05\). As expected, the prediction error does not decrease with higher memory depth than \(p=1\)

4.3.2 Case 2: A Two-Cluster Network

We now construct a network with \(N = 5000\) agents, divided into two clusters of size 2500 each. We set \(p_{between} = 0.0001\). Again, \(M = 3\) and \(\alpha _{m'm''}\) are the same as in case 1. As the starting condition, we let opinions in the first cluster be distributed by [0.8, 0.1, 0.1] and in the second cluster by [0.1, 0.1, 0.8]. If the initial percentages in both clusters were equal, then the percentages in both clusters would evolve in parallel in a quite similar way, so that the macrodynamics would essentially be the same as in the complete network case. With the initial percentages being so different, it is possible that an opinion that is dominant in one cluster at one point in time but only sparsely represented in the other can become popular through the links between agents from different clusters. This causes the difference in behaviour of the evolution of percentages compared to the complete network.

Moreover, in order to derive the Markovian macrodynamics in Eq. (4.1), we needed the probabilities for an agent i to change its opinion \((X_t)_i\) at time t, which we denoted by \(p^t((X_t)_i,m'')\), to be independent of i. If the neighbourhoods of different agents are generally different from each other, this is no longer the case, especially if agents are distributed into different clusters where opinion percentages might be very different. Thus, we cannot derive Markovian macrodynamics for this case; in light of the Mori–Zwanzig formalism, we will need memory terms.

To show this, we create \(r = 20\) realizations of length \(T = 500\) and again use 12 for training and the remaining ones for validation. As block length, we choose \(l = 20\). Memory terms become immediately significant, as the error graphs illustrate (Fig. 4). We use the basis given in (4.4), which has the length 5p.

Fig. 4. Results for the two-cluster network. Top left: One realization of the microdynamics. Colours represent opinions as in Fig. 3. Top right: Corresponding realization of the macrodynamics \(\xi (X)\). Again there is oscillatory behaviour but also plateaus and short dips as in the red and green graphs at times 25 to 150. This is because at these times one opinion is dominant in one cluster, but not present in the other. Through the links between the clusters, an opinion that is not present in one cluster but dominant in the other can be revived, e.g., the blue opinion in the upper cluster. Bottom: 20-step and one-step relative prediction errors of the NAR models determined by SINAR for different memory depths p with \(\lambda = 0\) and \(\lambda = 0.05\). Memory terms yield a significant decrease in the prediction errors compared to Markovian predictions

The non-sparse and sparse solutions only deviate slightly from each other in their accuracy, but the sparse solution gives a significantly more compact model. For example, for \(p = 2\), we obtain for the coefficients \(\tilde{\Xi }\)

$$\begin{aligned} \lambda = 0: \tilde{\Xi }&= \begin{bmatrix} 2.04 &{} 0.03 &{} -0.07 &{} -0.08 &{} 0.02 &{} -1.05 &{} -0.02 &{} 0.07 &{} 0.07 &{} -0.02 \\ -0.05 &{} 1.88 &{} 0.00 &{} 0.11 &{} 0.06 &{} 0.06 &{} -0.89 &{} -0.01 &{} -0.12 &{} -0.05 \end{bmatrix}\\ \lambda = 0.05: \tilde{\Xi }&= \begin{bmatrix} 1.9691 &{} 0 &{} 0 &{} 0 &{} 0 &{} -0.9700 &{} 0 &{} 0 &{} 0 &{} 0 \\ 0 &{} 1.9662 &{} 0 &{} 0 &{} 0 &{} 0 &{} -0.9671 &{} 0 &{} 0 &{} 0 \end{bmatrix} \end{aligned}$$

so that for \(\lambda = 0.05\) the NAR model is given by

$$\begin{aligned} \begin{aligned} (x_{t+1})_1&= 1.9691 (x_t)_1 - 0.9700 (x_{t-1})_1\\ (x_{t+1})_2&= 1.9662 (x_t)_2 - 0.9671 (x_{t-1})_2. \end{aligned} \end{aligned}$$

For \(p=1\), the NAR model obtained with SINAR (\(\lambda = 0.05\)) is

$$\begin{aligned} \begin{aligned} (x_{t+1})_1&= 1.0094(x_t)_1 -0.053 (x_t)_1 (x_t)_2 \\ (x_{t+1})_2&= 0.9894(x_t)_2 + 0.0574(x_t)_1 (x_t)_2. \end{aligned} \end{aligned}$$

With \(\lambda = 0\), the obtained NAR model has other terms with nonzero coefficients, but these are small. In Fig. 5, an example for the predictions of opinion percentages in one block using the NAR models with \(p = 1,2\) and 10 is depicted and compared to the corresponding data. As the error graphs in Fig. 4 show already, the predicted percentages come closer to the percentages in the data with increasing memory depth. In order to illustrate why memory terms improve the prediction accuracy, let us imagine for now that there are no links between the clusters. Then, the evolutions of opinion percentages in both clusters run in parallel to each other and are Markovian as derived previously. The opinion percentages in the full network are then given by the averages of the cluster-wise percentages \(x_t^{(i)}\), i.e., \(x_t = \frac{1}{2}(x_t^{(1)} + x_t^{(2)})\). This means, if we know \(x_t\), then there are various options for what \(x_t^{(1)}\) and \(x_t^{(2)}\) can be, all of which might result in different values for \(x_{t+1}^{(1)}\) and \(x_{t+1}^{(2)}\) and thus \(x_{t+1}\). If we are additionally given \(x_{t-1}\), this might yield possible values for \(x_{t-1}^{(1)}\) and \(x_{t-1}^{(2)}\), which themselves make some of the candidates for \(x_{t}^{(1)}\) and \(x_{t}^{(2)}\) unlikely. Thus, through the information of memory terms we can restrict the options for what the percentages inside each cluster are. We illustrate this in more detail in Appendix C.2.
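This ambiguity can be made concrete with the expected dynamics (4.1) applied separately to two unlinked clusters. In the following sketch (illustrative names, expected values only, no sampling noise), two different cluster configurations share the same network-wide percentages \(x_t\) but lead to different \(x_{t+1}\):

```python
import numpy as np

alpha = np.array([[0.0, 0.165, 0.03],
                  [0.03, 0.0, 0.165],
                  [0.165, 0.03, 0.0]])

def expected_step(x):
    """Expected one-step update (4.1) inside one fully connected cluster."""
    return x + x * ((alpha.T - alpha) @ x)

def network_average(c1, c2):
    """Network-wide percentages for two equally sized clusters."""
    return 0.5 * (c1 + c2)

# Configuration A: very different clusters (initial condition of case 2).
a1, a2 = np.array([0.8, 0.1, 0.1]), np.array([0.1, 0.1, 0.8])
# Configuration B: both clusters already share the averaged percentages.
b1 = b2 = np.array([0.45, 0.1, 0.45])

now_A, now_B = network_average(a1, a2), network_average(b1, b2)
next_A = network_average(expected_step(a1), expected_step(a2))
next_B = network_average(expected_step(b1), expected_step(b2))
```

Both configurations give the same `now_*` vector, yet `next_A` and `next_B` differ, so \(x_t\) alone does not determine the expected \(x_{t+1}\).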

Fig. 5. Opinion percentages over one block of length 20 from the validation data and prediction evolutions with NAR models obtained with SINAR for \(p = 1,2\) and 10 and \(\lambda = 0\) (two-cluster network). Percentages from validation data are depicted with thin lines and predicted percentages with lines with crosses. With \(p=1\), the prediction accuracy is poor and improves drastically for \(p=2\). With \(p=10\), the predicted evolutions come even closer to the curves from the validation data

As a consequence of the links between the clusters, agents within one cluster generally do not have identical opinion change probabilities, since their neighbourhoods differ. This creates an additional need for memory terms, since a Markovian formulation can then not even be derived for the macrodynamics within a single cluster.

4.3.3 Case 3: A Five-Cluster Network

We repeat the same procedure as with the two-cluster network, but with five clusters of equal size 1000. Again, all agents within a cluster are connected with each other and \(p_{between} = 0.0001\). The \(\alpha _{m'm''}\) are identical to the ones used in the first two examples. As starting conditions we let opinions in the different clusters be drawn according to different distributions for each cluster. Those distributions are [0.8, 0.1, 0.1], [0.1, 0.1, 0.8], [0.1, 0.8, 0.1], [0.3, 0.4, 0.3] and [0.5, 0.3, 0.2]. The evolution of the opinion percentages is now much more irregular compared to the previous examples. The oscillatory behaviour is still present, but the amplitudes differ from time to time. Through the higher number of clusters, more randomness comes into the model since an opinion can be randomly spread from one cluster, where it is dominant, to another one, where it is not dominant, suddenly altering the evolution of percentages in this cluster and thus in the whole network.

Fig. 6. Results for the five-cluster network. Top left: One realization of the microdynamics. Every column of the graphic represents the opinion of each of the 5000 agents at one point in time. Top right: Corresponding realization of the macrodynamics \(\xi (X)\). The behaviour is much more complex than in the first two cases. Bottom: 20-step and one-step relative prediction errors of the NAR models determined by SINAR for different memory depths p with \(\lambda = 0\) and \(\lambda = 0.05\)

Fig. 7. Opinion percentages over one block of length 40 from the validation data and prediction evolutions with NAR models obtained with SINAR for \(p = 1,2\) and 20 and \(\lambda = 0\) (five-cluster network). Percentages from validation data are depicted with thin lines and predicted percentages with lines with crosses. As in the example with a two-cluster network, we can see what the error graphs in Fig. 6 indicate: the predicted evolutions are closer to the validation data with higher memory depth of the NAR model

We now show that, as with the two-cluster network, memory terms become important for predictions of the evolution of the macrodynamics. This is shown in Fig. 6. Again, the mean relative error per block converges with increasing p. While in the two-cluster network example the performance did not improve visibly with \(p > 10\), in this case we can get slightly lower errors for p approaching 20.

For \(p = 2\) and \(\lambda = 0.05\), we obtain the NAR model

$$\begin{aligned} \begin{aligned} (x_{t+1})_1&= 1.8745 (x_t)_1 - 0.8748 (x_{t-1})_1\\ (x_{t+1})_2&= 1.8672 (x_t)_2 - 0.8674 (x_{t-1})_2. \end{aligned} \end{aligned}$$

For \(p > 2\), the models show increasing complexity, e.g., for \(p=3\):

$$\begin{aligned} \begin{aligned} (x_{t+1})_1&= 1.4662 (x_t)_1 - 0.1188(x_t)_2 + 0.0552 (x_t)_1^2 + 0.1318 (x_t)_1 (x_t)_2 \\&\quad + 0.2309 (x_{t-1})_2 - 0.1899 (x_{t-1})_1(x_{t-1})_2 - 0.2021 (x_{t-1})_2^2 \\&\quad - 0.4658 (x_{t-2})_1 - 0.1060 (x_{t-2})_2 + 0.1206 (x_{t-2})_1^2 +0.0644 (x_{t-2})_2^2,\\ (x_{t+1})_2&= 1.3157 (x_t)_2 - 0.3161 (x_{t-2})_2. \end{aligned} \end{aligned}$$

Again, we show as an example the predictions of percentages for one block of length 40 with memory depths 1, 2 and 10 (Fig. 7). As in the example with the two-cluster network, we can see that a higher memory depth indeed increases the prediction accuracy for the evolution of the opinion percentages in the short term, i.e., for predictions of length 20 and 40, respectively. Moreover, enforcing the sparsity constraint with the parameter \(\lambda \) in SINAR set to 0.05 yields significantly sparser models, while the prediction accuracy suffers only slightly.

5 Discussion

In this article, we have summarized how the evolution of observations of a dynamical system can be derived through the Mori–Zwanzig formalism and how this can result in a nonlinear autoregressive model with memory. For the determination of model parameters, we have used methodology from data-driven system identification methods, inspired by SINDy (Brunton et al. 2016a). We could then extend SINDy to SINAR which identifies sparse nonlinear autoregressive (NAR) models from data, thus deploying a common system identification method for non-Markovian systems.

We applied this to an agent-based model (ABM) that simulates the dynamics of opinion changes in a population. Assuming that all agents are equally strongly influenced by all other agents in the population, we showed that memory terms are not necessary for the prediction of the percentages of opinions within the population. However, for incomplete networks, this is no longer the case. Our methodology enabled us to make more accurate predictions for the percentages of opinions among the agents when the population of agents was defined by clusters with little influence between them. Additionally, sparse models obtained from enforcing a sparsity constraint in the estimation of NAR models in SINAR gave almost equally good prediction accuracy as the non-sparse ones, while yielding far simpler models. In the context of opinion dynamics, such sparse models make it possible to identify more clearly which opinions impact which others and how.

The following challenges have yet to be addressed:

  • In our methodology, we have assumed a noise term resulting from Mori–Zwanzig that was zero mean. This allowed us to omit it when making predictions of the expected value of the opinion percentages. This simplifying assumption does not need to be true, and one could try to derive a more accurate representation for the noise term. As a result of this simplifying assumption, the NAR models we considered were deterministic, even for non-deterministic microdynamics. Introduction of explicit noise in the NAR models, e.g., by extending the approach outlined in Klus et al. (2020), could improve their (statistical) predictive capacities.

  • One could additionally choose a different projection P in the Mori–Zwanzig formalism. The choice of an orthogonal projection on a finite set of basis functions explicitly yielded an NAR model. The right projection for a given system could inspire an optimal choice of basis functions, e.g., such that the memory depth is minimal.

  • We have derived models that are stationary, i.e., do not change over time. Since the assumption of an equilibrium distribution over states of the microdynamics might not always hold, coefficients of the NAR model may become time-dependent. One could use a regime switching model as in Horenko (2011) that fixes coefficients for a time interval before changing them to other fixed values when the macrodynamics show certain behaviour, e.g., coefficients might be different depending on which opinion is dominating.

A MATLAB toolbox for the experiments done in Sect. 4 and Appendix B is provided at https://github.com/nwulkow/OpinionDyamicsModelling.