Entropic herding

Herding is a deterministic algorithm used to generate data points that can be regarded as random samples satisfying input moment conditions. The algorithm is based on a high-dimensional dynamical system and is rooted in the maximum entropy principle of statistical inference. We propose an extension, entropic herding, which generates a sequence of distributions instead of points. We derive entropic herding from an optimization problem obtained using the maximum entropy principle. Using the proposed entropic herding algorithm as a framework, we discuss a closer connection between herding and the maximum entropy principle. Specifically, we interpret the original herding algorithm as a tractable version of entropic herding, the ideal output distribution of which is mathematically represented, and we discuss how the complex behavior of the herding algorithm contributes to optimization. We argue that the proposed entropic herding algorithm extends herding to probabilistic modeling. In contrast to the original herding, entropic herding can generate a smooth distribution such that both efficient probability density calculation and sample generation become possible. To demonstrate the viability of these arguments, we conducted numerical experiments, including a comparison with other conventional methods, on both synthetic and real data.


Introduction
In a scientific study, we often collect data on the object of study, but perfect information about it cannot be obtained. Therefore, we often have to make a statistical inference based on the available data, assuming a certain background distribution. One way of drawing a statistical inference is to adopt the distribution with the largest uncertainty that is consistent with the available information. This is referred to as the maximum entropy principle (Jaynes, 1957) because entropy is often used to measure uncertainty.
Specifically, for a distribution that has a probability mass (or density) function p(x), the (differential) entropy is defined as

H(p) = −E_p[log p(x)],     (1)

where E_p denotes the expectation over p. Assume that the collected data are a set of averages µ_m of feature values ϕ_m(x) ∈ R for m ∈ M, where M is a set of indices. Then, the distribution p that the maximum entropy principle adopts is known as the Gibbs distribution:

p(x) = (1/Z_θ) exp( Σ_{m∈M} θ_m ϕ_m(x) ),     (2)

where θ is the vector consisting of the parameters θ_m for m ∈ M, and Z_θ is the coefficient that makes the total probability mass one. The parameters can be chosen by maximizing the likelihood of the data. However, if the distribution is defined on a high-dimensional or continuous space, parameter learning often becomes difficult because approximations of the averages over the distribution are required.
The herding algorithm (Welling, 2009b; Welling and Chen, 2010) is another method that can be used to avoid parameter learning. This algorithm, which is represented as a dynamical system in a high-dimensional space, deterministically generates an apparently random sequence of data, where the average of feature values converges to the input target value µ_m. The generated sequence is treated as a set of pseudo-samples generated from the background distribution in the following analysis. Thus, we can bypass the difficult step of parameter learning of the Gibbs distribution model. The paper by Welling (2009b) describes its motivation in relation to the maximum entropy principle.
In this paper, we propose an extension of herding called entropic herding. We use the proposed entropic herding as a framework to analyze the close connection between herding and the maximum entropy principle. Entropic herding outputs a sequence of distributions, instead of points, that greedily minimizes a target function obtained from an explicit representation of the maximum entropy principle. The target function is composed of an error term for the average feature values and an entropy term. We mathematically analyze the minimizer, which can be regarded as the ideal output of entropic herding. We also argue that the original herding is obtained as a tractable version of entropic herding and discuss the inherent characteristics of herding in relation to its complex behavior.
In addition, in Section 4.5, we discuss the advantages of the proposed entropic herding in contrast to the original algorithm, which we call point herding. The advantages include the smoothness of the output distribution, the availability of model validation based on the likelihood, and the feasibility of random sampling from the output distribution.
These arguments are further discussed through numerical examples in Section 5 using synthetic and real data.

Herding
Let X be the set of possible data points. Let us consider a set of feature functions ϕ_m : X → R indexed by m ∈ M, where M is the set of indices. Let M be the size of M, and let bold symbols denote M-dimensional vectors with indices in M. Let µ_m be the input target value of the feature mean, which can be obtained from the empirical mean of the features over the collected samples. Herding generates the sequence of points x^(T) ∈ X for T = 1, 2, ..., T_max according to the equations below:

x^(T) = argmax_{x∈X} Σ_{m∈M} w_m^(T−1) ϕ_m(x),     (3)

w_m^(T) = w_m^(T−1) + µ_m − ϕ_m(x^(T)),     (4)

where w_m^(T) are auxiliary weight variables, starting from the initial values w_m^(0). These equations can be regarded as a dynamical system for w^(T) in M-dimensional space. The average of the features over the generated sequence converges as

(1/T_max) Σ_{T=1}^{T_max} ϕ_m(x^(T)) → µ_m     (5)

at the rate of O(1/T_max) (Welling, 2009b). Therefore, the generated sequence, which reflects the given information, can be used as a set of pseudo-samples from the background distribution. This convergence is derived using the equation w_m^(T_max) = w_m^(0) + Σ_{T=1}^{T_max} (µ_m − ϕ_m(x^(T))) and the boundedness of w^(T_max). The equation is obtained by summing both sides of Eq. (4) for T = 1, ..., T_max. The boundedness of w^(T_max) is guaranteed under some mild assumptions by using the optimality of x^(T) in the maximization of Eq. (3) (Welling, 2009b).
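The two-step iteration of Eqs. (3) and (4) can be sketched in a few lines of code; the discrete state space, features, and target moments below are hypothetical choices for illustration, with the argmax taken by enumeration.

```python
import numpy as np

def point_herding(phi, mu, X, T_max, w0=None):
    """Point herding (Welling, 2009): deterministically generate pseudo-samples
    whose feature averages converge to the target moments mu."""
    w = np.zeros_like(mu) if w0 is None else w0.copy()
    samples = []
    for _ in range(T_max):
        # Step 1: pick the state maximizing the weighted feature sum (Eq. (3))
        x = max(X, key=lambda x: float(w @ phi(x)))
        # Step 2: move the weights by the residual moment error (Eq. (4))
        w = w + mu - phi(x)
        samples.append(x)
    return samples

# Toy example: two binary variables, features (x1, x2, x1*x2)
X = [(a, b) for a in (-1, 1) for b in (-1, 1)]
phi = lambda x: np.array([x[0], x[1], x[0] * x[1]], dtype=float)
mu = np.array([0.2, -0.4, 0.1])  # hypothetical achievable target moments

samples = point_herding(phi, mu, X, T_max=2000)
err = np.abs(np.mean([phi(x) for x in samples], axis=0) - mu)
```

Because the weights stay bounded, the empirical feature means approach µ at rate O(1/T_max), which is what the final error check reflects.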
The probabilistic model (2) can also be regarded as a Markov random field (MRF). Herding has been extended to the case in which only a subset of the variables in an MRF is observed (Welling, 2009a). In addition, it has been combined with the kernel method to generate a pseudo-sample sequence for an arbitrary distribution (Chen et al, 2010). The steps of herding can be applied to the update steps of Gibbs sampling (Chen et al, 2016), for which a convergence analysis is provided by Yamashita and Suzuki (2019).

Related work
Herding uses a weighted average of features with time-varying weight values. This technique, which can be called time-varying feature weighting, is also employed in other research areas. It is mainly used to mitigate the problem of local minima often encountered in optimization.
The maximum likelihood learning of the MRF (2) is often performed using the following gradient of the log-likelihood:

∂/∂θ_m [log-likelihood] = µ_m − E_{p_θ}[ϕ_m(x)].

The expectation in the second term can be approximated using Markov chain Monte Carlo (MCMC) methods (Hinton, 2002; Tieleman, 2008). However, MCMC is known to often suffer from slow mixing because of sharp local minima in the potential function E(x) = −log p(x) + const. In the case of the MRF (2), the potential function is E(x) = −Σ_{m∈M} θ_m ϕ_m(x) + const., which also has the form of a weighted sum of the features. The parameter θ_m changes over time during the learning process. Tieleman and Hinton (2009) demonstrated that the efficiency of learning can be improved by adding extra time variation to the parameters.

Combinatorial optimization is another example of the application of time-varying feature weighting. The Boolean satisfiability problem (SAT) is the problem of finding a binary assignment to a set of variables that satisfies a given Boolean formula. A Boolean formula can be represented in conjunctive normal form (CNF), that is, as a set of subformulae called clauses combined by logical conjunction. This problem can be regarded as the minimization of a weighted sum of features, where the features are defined by the truth values of the clauses in the CNF. Several methods (Cai and Su, 2011; Morris, 1993; Thornton, 2005; Wu and Wah, 2000) solve the SAT problem by repeatedly improving the assignment with respect to the target function and changing its weights when the process is trapped in a local minimum. Ercsey-Ravasz and Toroczkai (2011, 2012) proposed a continuous-time dynamical system for solving the SAT problem, which implements both the local improvement and the time variation of the weight values. This system has also been shown to be effective in solving MAX-SAT, the optimization version of the SAT problem (Molnár et al, 2018). As suggested by Welling (2009b), one advantage of using deterministic dynamics is the possibility of efficient physical implementation. Yin et al (2018) designed an electric circuit implementing the SAT-solving continuous-time dynamical system and evaluated its performance.

Entropic herding and its target function representing maximum entropy principle
We first present a summary of the proposed algorithm, which we refer to as entropic herding. Let us suppose the same situation as in Section 2.1.
For simplicity, we assume that X is continuous, although the arguments also hold when X is discrete. For a distribution q on X, let H(q) = −∫_X q(x) log q(x) dx be its differential entropy. Let ϕ_m be the feature functions and µ_m the input target values of the feature means. In addition, Λ_m and ε^(T) are provided to the algorithm as additional parameters. The proposed algorithm is an iterative process similar to the original herding, and it also maintains time-varying weights a_m for each feature. Each iteration of the proposed algorithm, indexed by T, is composed of two steps, as in herding: the first step solves an optimization problem, and the second updates the parameters based on the solution. The proposed algorithm outputs a sequence of distributions (r^(T)) instead of the points output by the original herding. The two steps in each time step, which will be derived later, are represented as follows:

r^(T) ≈ argmin_{q∈Q} ( −Σ_{m∈M} a_m^(T−1) η_m(q) − H(q) ),     (7)

a_m^(T) = (1 − ε^(T)) a_m^(T−1) + ε^(T) Λ_m (µ_m − η_m(r^(T))),     (8)

where Q is the set of candidate distributions for the output of each step and η_m(q) ≡ E_q[ϕ_m(x)] is the feature mean under the distribution q. The pseudocode of the algorithm is provided in Algorithm 1. As will be described later, we can modify the algorithm by changing the set of candidate distributions Q and the optimization method, allowing a suboptimal solution of Eq. (7).

Target function
The proposed procedure summarized above is derived from the minimization of a target function as presented below.
Let us consider the problem of minimizing the following function:

L(p) = Σ_{m∈M} (Λ_m/2) (η_m(p) − µ_m)² − H(p).     (9)

This problem minimizes the difference between the feature means over p and the target vector and simultaneously maximizes the entropy of p. Λ_m can be regarded as the weight for the moment condition η_m(p) = µ_m. When Λ_m → +∞, the problem becomes an entropy maximization problem that requires the solution to satisfy the moment condition exactly. Thus, this can be interpreted as a relaxed form of the maximum entropy principle. The function L is convex because the negated entropy function −H(p) is convex.
Let π_θ be the Gibbs distribution whose density function is defined by Eq. (2). Suppose that there exists θ* that satisfies the following equations:

θ*_m = Λ_m (µ_m − η_m(π_θ*))  for all m ∈ M.     (10)

The conditions for the existence of the solution are provided in Appendix C. The functional derivative δL/δp(x) of L at π_θ* is obtained as follows:

δL/δp(x) = Σ_{m∈M} Λ_m (η_m(π_θ*) − µ_m) ϕ_m(x) + log π_θ*(x) + 1 = 1 − log Z_θ*,

which is constant in x because of Eq. (10). Suppose that we perturb the distribution π_θ* to p. Then, because ∫_X (π_θ*(x) − p(x)) dx = 1 − 1 = 0, the inner product ⟨δL/δp, p − π_θ*⟩ becomes zero, where the inner product is defined as ⟨f, g⟩ = ∫_X f(x) g(x) dx. Therefore, π_θ* is the optimal distribution for the problem because of the convexity of L. If Λ_m is sufficiently large, the feature mean η_m(π_θ*) of the optimal distribution is close to the target value µ_m by Eq. (10).
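The fixed-point condition of Eq. (10) can be checked numerically on a small discrete space; the space, features, targets, and weights below are hypothetical, and a damped fixed-point iteration stands in for a generic solver.

```python
import numpy as np

# Toy discrete space and features (hypothetical illustration)
X = np.array([-1.0, 0.0, 1.0])
Phi = np.stack([X, X**2])          # features x and x^2, shape (2, 3)
mu = np.array([0.1, 0.6])          # target moments
Lam = np.array([20.0, 20.0])       # moment weights Λ_m

def gibbs(theta):
    """Gibbs distribution π_θ(x) ∝ exp(Σ_m θ_m ϕ_m(x)) on the discrete space."""
    logits = theta @ Phi
    p = np.exp(logits - logits.max())
    return p / p.sum()

# Damped fixed-point iteration for θ*_m = Λ_m (µ_m − η_m(π_θ*)) (Eq. (10))
theta = np.zeros(2)
for _ in range(5000):
    eta = Phi @ gibbs(theta)
    theta += 0.05 * (Lam * (mu - eta) - theta)

pi_star = gibbs(theta)
residual = theta - Lam * (mu - Phi @ pi_star)   # ≈ 0 at the fixed point
```

At the solution, the moment errors are of order θ*/Λ, so larger Λ pushes η_m(π_θ*) closer to µ_m, as stated above.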

Greedy minimization
Let us consider the following greedy construction of the solution of the minimization problem of L. Let r^(T) be a distribution that is selected at time T and let p be the distribution constructed as the weighted average of the sequence of distributions (r^(T)):

p(x) = Σ_{T=0}^{T_max} ρ^(T) r^(T)(x) / ( Σ_{T=0}^{T_max} ρ^(T) ),     (14)

where ρ^(T) are given as a fixed sequence of parameters. We greedily construct p by choosing r^(T) for each time step T = 1, ..., T_max so as to minimize the tentative loss value L^(T) ≡ L(p^(T)) for

p^(T)(x) = Σ_{t=0}^{T} ρ_{t,T} r^(t)(x),

where ρ_{t,T} = ρ^(t) / ( Σ_{t'=0}^{T} ρ^(t') ).

Algorithm 1 Pseudocode of the entropic herding
Input: a set of feature functions ϕ_m : X → R, target values of the feature means µ_m, weight parameters Λ_m, and a sequence of step sizes ε^(T)
1: choose an initial distribution r^(0) and initialize η_m(p^(0)) = η_m(r^(0)) and a_m^(0) = Λ_m(µ_m − η_m(p^(0)))
2: for T = 1, ..., T_max do
3: perform several optimization steps for min_{q∈Q} −Σ_{m∈M} a_m^(T−1) E_q[ϕ_m(x)] − H(q) (Eq. (7))
4: let r^(T) be the obtained distribution
5: update a_m^(T) according to Eq. (8)
6: end for

The feature mean η_m(p^(T)) at each step is also represented as a weighted sum:

η_m(p^(T)) = Σ_{t=0}^{T} ρ_{t,T} η_m(r^(t)).

For particular types of weight sequences (ρ^(T)), the feature means can be iteratively computed from the previous time step as follows:

η_m(p^(T)) = (1 − ε^(T)) η_m(p^(T−1)) + ε^(T) η_m(r^(T)).     (17)

For example, if the coefficients geometrically decay at the rate ρ ∈ (0, 1), as ρ^(T) = ρ^{T_max−T} for T > 0 and ρ^(0) = ρ^{T_max}/(1 − ρ), we can update the feature mean with ε^(T) = 1 − ρ, which is constant over T. If ρ^(T) is constant over T, we set ε^(T) = 1/(T+1). We approximate the entropy term H(p^(T)) by the weighted sum of the entropies of the components:

H̄^(T) = Σ_{t=0}^{T} ρ_{t,T} H(r^(t)).     (18)

Then, similarly, the update equation for H̄ is obtained as

H̄^(T) = (1 − ε^(T)) H̄^(T−1) + ε^(T) H(r^(T)).     (19)

The minimization problem to be solved at time T is given as follows:

minimize L̄^(T) ≡ Σ_{m∈M} (Λ_m/2) (η_m(p^(T)) − µ_m)² − H̄^(T).     (20)

From the convexity of the negated entropy, H̄^(T) is a lower bound of H(p^(T)); hence, L̄^(T) is an upper bound of the original target function L^(T). The step of Eq. (7) is derived from the greedy construction of the solution that minimizes L̄ as follows. By using Eqs. (17) and (19), the target function can be rewritten as

L̄^(T) = Σ_{m∈M} (Λ_m/2) ( (1 − ε^(T)) η_m(p^(T−1)) + ε^(T) η_m(r^(T)) − µ_m )² − (1 − ε^(T)) H̄^(T−1) − ε^(T) H(r^(T)).

For all T, let us define

a_m^(T) = Λ_m (µ_m − η_m(p^(T))).     (22)

Then, L̄^(T) also follows an update rule similar to Eq. (17) up to the linear term with respect to ε^(T):

L̄^(T) = L̄^(T−1) + ε^(T) ( −Σ_{m∈M} a_m^(T−1) (η_m(r^(T)) − η_m(p^(T−1))) − H(r^(T)) + H̄^(T−1) ) + O((ε^(T))²).

If ε^(T) is small, by neglecting the higher-order term O((ε^(T))²), the optimization problem at time T can be reduced to the minimization of F^(T−1)(r^(T)), where

F^(T−1)(q) = −Σ_{m∈M} a_m^(T−1) η_m(q) − H(q).

From Eqs. (17) and (22), the coefficients a_m^(T) follow the update rule

a_m^(T) = (1 − ε^(T)) a_m^(T−1) + ε^(T) Λ_m (µ_m − η_m(r^(T))),

which is equivalent to Eq.
(8). In summary, the proposed algorithm is derived from the greedy construction of the solution of the optimization problem min L(q): it selects a distribution r^(T) in each step by minimizing the tentative loss. The minimization can be reduced to the minimization of F^(T−1)(r^(T)), which has a parametrized form, and the algorithm updates the parameters by calculating their temporal differences. The algorithm thus derived, referred to as entropic herding, has a form similar to that of the original herding.
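The recursive bookkeeping above can be verified numerically: for the geometric weights ρ^(T) = ρ^{T_max−T}, the constant step size ε = 1 − ρ reproduces the direct weighted averages of Eqs. (15) and (18). The values below are synthetic stand-ins for η_m(r^(t)) and H(r^(t)).

```python
import numpy as np

rng = np.random.default_rng(0)
T_max, rho = 200, 0.95
eta_r = rng.normal(size=T_max + 1)  # stand-ins for η_m(r^(t)) of one feature
H_r = rng.normal(size=T_max + 1)    # stand-ins for the component entropies H(r^(t))

# Direct weighted averages with ρ^(t) = ρ^(T_max − t) for t > 0, ρ^(0) = ρ^T_max/(1 − ρ)
w = rho ** np.arange(T_max, -1, -1, dtype=float)
w[0] = rho ** T_max / (1 - rho)
rho_tT = w / w.sum()
eta_direct = rho_tT @ eta_r
H_direct = rho_tT @ H_r

# Recursive updates (Eqs. (17) and (19)) with constant step size ε^(T) = 1 − ρ
eps = 1 - rho
eta_p, H_bar = eta_r[0], H_r[0]
for t in range(1, T_max + 1):
    eta_p = (1 - eps) * eta_p + eps * eta_r[t]
    H_bar = (1 - eps) * H_bar + eps * H_r[t]
```

This is why the algorithm never needs to store the whole component sequence: the running statistics η_m(p^(T)) and H̄^(T) suffice.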

Entropy maximization in herding
Let us further study the relationship between entropic herding, original herding, and the maximum entropy principle.

Target function minimization and parameter learning of the Gibbs distribution
Figure 1 is a schematic of the relationship between the minimization of L and the parameter learning of the Gibbs distribution. Each probability distribution on X is represented as a filled circle in the figure.
Each moment condition can be represented as a vector µ, and the space of possible target vectors is projected onto the horizontal direction of the schematic. The vertical axis represents the entropy of the distribution. For each condition, the set of distributions satisfying the moment condition η_m(p) = µ_m for all m is denoted by a dotted line. The topmost point of the line is the distribution with the highest entropy value under the condition. In other words, for each moment condition µ_i, the Gibbs distribution π_{θ_i} is obtained by entropy maximization (red arrow) along the corresponding dotted line. The target distribution π_θ adopted by the maximum entropy principle is the solution of entropy maximization under the moment condition of interest. The topmost points for the various conditions form the family of Gibbs distributions represented by Eq. (2) and denoted by the black solid curve. Parameter learning finds the target along this curve (green arrow). In contrast, entropic herding finds the target directly by minimizing the target function L, which represents both the moment condition and entropy maximization (blue arrow). The search space, which is represented by Eq. (14), is different from the family of Gibbs distributions.
Let π_{a^(T−1)} be the Gibbs distribution defined by Eq. (2) with parameters θ = a^(T−1). The optimization problem of minimizing F^(T−1)(q) is equivalent to minimizing the Kullback–Leibler (KL) divergence KL(q ‖ π_{a^(T−1)}), the solution of which is given by q = π_{a^(T−1)}:

F^(T−1)(q) = −Σ_{m∈M} a_m^(T−1) η_m(q) − H(q) = KL(q ‖ π_{a^(T−1)}) − log Z_{a^(T−1)}.

However, the Gibbs distribution is known to cause difficulties in applications, such as in the calculation of expectations. Fortunately, a suboptimal distribution r^(T) may be sufficient for decreasing L, like the tractable version of herding introduced in (Welling, 2009a). Therefore, in practice, we can perform the optimization on a restricted set of distributions, as in variational inference, and allow a suboptimal solution of the optimization step (Eq. (7)). Let Q denote the set of candidate distributions. We refer to this tractable version of the proposed algorithm as tractable entropic herding, in contrast to the algorithm using the exact solution r^(T) = π_{a^(T−1)}.
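The claim that q = π_{a^(T−1)} minimizes F^(T−1) can be checked numerically on a toy discrete space; the features and weights below are random, hypothetical values.

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, M = 6, 3
Phi = rng.normal(size=(M, n_states))   # toy feature values ϕ_m(x)
a = rng.normal(size=M)                 # current weights a^(T−1)

def F(q):
    """F^(T−1)(q) = −Σ_m a_m η_m(q) − H(q), which equals KL(q ‖ π_a) − log Z_a."""
    return -(a @ (Phi @ q)) + np.sum(q * np.log(q))

# The exact minimizer is the Gibbs distribution π_a
logits = a @ Phi
pi_a = np.exp(logits - logits.max())
pi_a /= pi_a.sum()

# F at the Gibbs distribution is no larger than F at random distributions
random_qs = rng.dirichlet(np.ones(n_states), size=100)
gaps = np.array([F(q) - F(pi_a) for q in random_qs])
```

Because KL(q ‖ π_a) ≥ 0 with equality only at q = π_a, the minimum value of F is exactly −log Z_a, which the test also verifies.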

Point herding
Next, we show that the proposed algorithm can be interpreted as an extension of the original herding. Recall that, in the original herding described in Section 2.1, the optimization problem used to generate the sample x^(T) and the update rule of the weights w_m are represented by Eqs. (3) and (4), respectively.
Let us consider the proposed algorithm with the candidate distribution set Q restricted to point distributions. Then, the distribution r^(T) is the point distribution at x^(T), which is obtained by solving

x^(T) = argmax_{x∈X} Σ_{m∈M} a_m^(T−1) ϕ_m(x),

because the entropy term of the optimization problem Eq. (7) can be dropped. Let us further suppose that ε^(T) = 1/(T+1) and that Λ_m = Λ is the same for all features. We introduce the variable transformation of the weights w_m^(T) = (T+1) a_m^(T) / Λ. Then, the above optimization problem is equivalent to Eq. (3).
The update rule for w_m^(T) then follows from Eq. (8) as

w_m^(T) = w_m^(T−1) + µ_m − η_m(r^(T)).

Because η_m(r^(T)) = ϕ_m(x^(T)) holds, this update rule is also equivalent to Eq. (4). Therefore, the original herding can be interpreted as a tractable version of the proposed method obtained by restricting Q to the point distributions.
In the following, we refer to the original algorithm as point herding. In this case, the weights ρ^(T) in Eq. (14) are constant for all T because we set ε^(T) = 1/(T+1). This agrees with the fact that point herding puts equal weights on the outputs in the sequence when taking the average of the features, whose difference from the input moments is minimized (see Eq. (5)).

Fig. 1 Schematic of the relationship between the L minimization and the maximum entropy principle

Diversification of output distributions by dynamic mixing in herding
The minimizer of Eq. (9) is the Gibbs distribution π_θ*, where θ* is given by Eq. (10), but there is a gap between the distribution obtained from tractable entropic herding and this minimizer. This is because the algorithm solves the optimization problem with an approximate target function and with suboptimal greedy updates, as described in Sections 3.2 and 4.2. Here, we discuss the inherent characteristics of entropic herding that decrease this gap, which cannot be fully described as greedy optimization.
The target function for greedy optimization in (tractable) entropic herding is L̄ (Eq. (20)). In deriving L̄ from L, we introduced an approximation of the entropy term H(p^(T)) by its lower bound H̄^(T), which is the weighted average of the entropy values of the components, as defined in Eq. (18). The gap between H and H̄ can be calculated using the following equation:

H(p^(T)) − H̄^(T) = H_ρ^(T) − E_{x∼p^(T)}[H_c^(T)(x)],

where

H_ρ^(T) = −Σ_{t=0}^{T} ρ_{t,T} log ρ_{t,T},

H_c^(T)(x) = −Σ_{t=0}^{T} P(t' = t | x) log P(t' = t | x),  with  P(t' = t | x) = ρ_{t,T} r^(t)(x) / p^(T)(x).

If the coefficients ρ_{t,T} are fixed, the gap is determined by H_c^(T). If this gap is positive, we obtain an additional decrease in the true target function L compared with the approximate target L̄ of the greedy minimization.
Suppose that t' is a random variable drawn with probability ρ_{t',T} and that x is drawn from r^(t'). Then, H_c^(T)(x) can be interpreted as the entropy of the conditional distribution of t' given x. If the probability weights assigned to x by the components t, represented as ρ_{t,T} r^(t)(x), are unbalanced over t, then H_c^(T)(x) is small, so the gap H(p^(T)) − H̄^(T) is large. In contrast, if the r^(t) are identical for all t, so that r^(t) = p^(T), then P(t' = t | x) = ρ_{t,T} for all x. Then, the gap H(p^(T)) − H̄^(T) becomes zero, which is its minimum value. That is, we achieve an additional decrease in L if the generated sequence of distributions r^(t) is diverse. This can happen if the a^(t) are diverse, as the optimization problems in Eq. (7), defined by the parameters a^(t), then become diverse. In the original herding, the weight dynamics is weakly chaotic and thus can generate diverse samples. We also expect that the coupled system of Eqs. (7) and (8) in the high-dimensional space will exhibit complex behavior that achieves diversity in {a^(t)} and {r^(t)}. The extra decrease is bounded from above by H_ρ^(T), which usually increases as T increases. For example, if ρ_{t,T} is constant over t, then H_ρ^(T) = log(T+1). In summary, the proposed algorithm, which is a generalization of herding, minimizes the loss function L by the following two means: (a) Explicit optimization: greedy minimization of L̄, which is an upper bound of L, by solving Eq. (7) with the entropy term. (b) Implicit diversification: extra decrease of L due to the diversity of the components r^(T), achieved by the complex behavior of the coupled dynamics of the optimization and update steps (Eqs. (7) and (8), respectively).
The complex behavior of herding contributes to the optimization implicitly through an increase of the gap H − H̄ caused by the diversification of the samples. In addition, the proposed entropic herding can improve the output through explicit entropy maximization. Thus, we can generalize the concept of herding by regarding it as an iterative algorithm having the two components described above.
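The diversity bonus can be illustrated on a toy discrete mixture: the gap H(p^(T)) − H̄^(T) is positive for distinct components, bounded above by H_ρ^(T), and vanishes when all components coincide. The components below are random, for illustration only.

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

rng = np.random.default_rng(2)
n_states, n_comp = 8, 5
rho = np.full(n_comp, 1.0 / n_comp)       # equal mixture weights ρ_{t,T}

# Diverse components: mixture entropy exceeds the average component entropy
R = rng.dirichlet(np.ones(n_states), size=n_comp)   # rows are r^(t)
p_mix = rho @ R
gap_diverse = entropy(p_mix) - rho @ np.array([entropy(r) for r in R])

# Identical components: the gap H(p) − H̄ collapses to zero
R_same = np.tile(R[0], (n_comp, 1))
p_same = rho @ R_same
gap_same = entropy(p_same) - rho @ np.array([entropy(r) for r in R_same])
```

The gap equals the mutual information between the component index and the sample, which is why it is at most H_ρ^(T) = log(n_comp) for equal weights.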

Probabilistic modeling with entropic herding
The extension of herding to the proposed entropic herding expands its application as follows.
The first important difference introduced by the extension is that the output becomes a mixture of probability distributions. Thus, we can consider the use of entropic herding in probabilistic modeling. Let π_θ be the distribution obtained from the maximum entropy principle. For each time step T, the tentative distribution p^(T)(x) = Σ_{t=0}^{T} ρ_{t,T} r^(t)(x) is obtained by minimizing L, which represents the maximum entropy principle. Then, we can expect it to be close to π_θ and use it as the probabilistic model p_output derived from the data. It should be noted that the model thus obtained, p_output = p^(T), is different from π_θ: it includes the difference between π_θ and the minimizer π_θ* of L caused by the finiteness of the weights Λ_m in Eq. (9), and the difference between π_θ* and p_output caused by the inexact optimization. We can also obtain another model p_output by further aggregating the tentative distributions p^(T), expecting additional diversification. For example, we can use their average as the output probabilistic model. This is again a mixture of the output sequence (r^(T)) that differs from p^(T_max) only in the coefficients. A more specific method for output aggregation is described in Appendix A.
If Q is set appropriately and the number of components is not too large, the probability density function of the output p_output can be easily calculated. The availability of the density function value enables us to use likelihood-based model evaluation tools from statistics and machine learning, such as cross-validation. This can also be used to select the parameters of the algorithm, such as Λ_m.
If X is continuous, we cannot use point herding to model the probability density because the sample points can only represent a non-smooth mixture of delta functions. Even if X is discrete, when the number of samples T is much smaller than the number of possible states |X|, zero probability must be assigned to a large fraction of the states, which makes the calculated log-likelihood infinite. In addition, the probability mass can only take multiples of 1/T, which causes inaccuracy, especially for states with a small probability mass. Many samples are thus required to obtain probability mass values accurate enough for likelihood-based methods. In contrast, the output of entropic herding is sample efficient because each output component can assign non-zero probability mass to many states. This is demonstrated by the numerical examples in the next section.
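As a minimal sketch of likelihood-based evaluation, the density of a mixture-of-normals output (the form of Q used in the one-dimensional experiment below) can be evaluated in closed form; all component parameters here are hypothetical.

```python
import numpy as np

def log_density_mixture(x, weights, means, sds):
    """log p_output(x) for a mixture of one-dimensional normal components."""
    comp = (-0.5 * ((x[:, None] - means) / sds) ** 2
            - np.log(sds) - 0.5 * np.log(2 * np.pi))
    # log-sum-exp over components for numerical stability
    m = comp.max(axis=1, keepdims=True)
    return m[:, 0] + np.log(np.exp(comp - m) @ weights)

# Hypothetical entropic-herding output: three Gaussian components
weights = np.array([0.5, 0.3, 0.2])
means = np.array([-1.0, 0.0, 2.0])
sds = np.array([0.5, 1.0, 0.8])

x_val = np.array([-1.2, 0.3, 1.9])   # a toy validation set
ll = log_density_mixture(x_val, weights, means, sds).sum()
```

Every validation point receives a finite log-density, in contrast to the empirical distribution of point-herding samples, where unseen states would contribute −∞.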
Another important difference is that entropic herding explicitly optimizes the entropy term. Therefore, entropic herding can generate a distribution that has a higher entropy value than the point herding output. In addition, we can control how much the algorithm focuses on the entropy or on the moment information of the data by changing the parameters Λ_m in the target function. While point herding diversifies the sample points through its complex behavior, the optimization problem solved in each time step does not explicitly contain Λ_m, so the balance between entropy maximization and moment error minimization is not controlled.
Moreover, if we restrict the candidate distribution set Q to be appropriately simple, we can easily obtain any number of random samples from p_output by sampling from r^(t) for each randomly sampled index t. This sample generation process can also be parallelized because each sample is generated independently. For example, we can use independent random variables following normal distributions or Bernoulli distributions as Q, as shown in the numerical examples below.
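The two-stage sampling just described can be sketched as follows, assuming a hypothetical mixture of product-Bernoulli components over {−1, +1}^4 (the form of Q used for the Boltzmann machine experiment).

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical entropic-herding output: mixture of independent Bernoulli
# components; P[t, i] = Pr(x_i = +1) under component t
rho = np.array([0.25, 0.25, 0.5])        # mixture weights ρ_{t,T}
P = np.array([[0.9, 0.1, 0.5, 0.5],
              [0.2, 0.8, 0.5, 0.5],
              [0.5, 0.5, 0.9, 0.1]])

def sample(n):
    # stage 1: draw a component index t; stage 2: sample coordinates independently
    t = rng.choice(len(rho), size=n, p=rho)
    u = rng.random((n, P.shape[1]))
    return np.where(u < P[t], 1, -1)

xs = sample(20000)
emp = xs.mean(axis=0)          # empirical marginal means
exact = 2 * (rho @ P) - 1      # exact mixture means
```

Each draw is independent of the others, so the generation loop parallelizes trivially across samples.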
Theoretically, the exact solution of the problem in Eq. (7) is π_{a^(T−1)}. However, we can enjoy the advantages of entropic herding described above when Q is restricted to simple, analytic, and smooth distributions, even though the optimization is then suboptimal.
In summary, entropic herding combines the merits of herding with those of probabilistic modeling using a mixture of distributions.

Numerical Examples
In this section, we present numerical examples of entropic herding and its characteristics. We also present comparisons between entropic herding and several conventional methods. A detailed description of the experimental procedure is provided in Appendix A. The definitions of the detailed parameters not described here are also provided in the Appendix.

One-dimensional bimodal distribution
As the target distribution, we take a one-dimensional bimodal distribution p_bi with normalizing factor Z. The histogram of this distribution is shown by the orange bars in each panel of Fig. 3. For this distribution, we consider a set of four polynomial feature functions ϕ_m for herding. From the moment values µ_m = E_{p_bi}[ϕ_m(x)], the maximum entropy principle reproduces the target distribution p_bi. Using this feature set, we ran point herding and entropic herding. Figure 2 shows the time series of the generated sequences, and Figure 3 shows the histograms compared with the target distribution p_bi. For entropic herding, we set the candidate distribution set Q to be the set of distributions N(µ, σ²), where N(µ, σ²) denotes a one-dimensional normal distribution with mean µ and variance σ². Figures 2(a) and 3(a) show the results for point herding. We observe a periodic sequence, which causes a large difference in the histogram around x = 0. Point herding relies on the complex dynamics of a high-dimensional weight vector to achieve random-like behavior in a deterministic system. However, in this case, the dimension of the weight vector is only four, and this low dimensionality of the system can be a cause of the periodic output.
In contrast, Figs. 2(b) and 3(b) show the results for entropic herding. Although this trajectory is also periodic, the output in each step is a distribution with positive variance, so it can represent a more diverse distribution. The difference between the histograms is reduced, as shown in Fig. 3(b).
In addition, the periodic behavior of point herding can be mitigated by introducing a stochastic factor into the system. Figures 2(c) and 3(c) show the results of point herding with the stochastic Metropolis update (see Appendix A for details). We observed some improvement in the difference between the histograms. The stochastic update has the role of increasing the diversity of samples, as described in item (b) of Section 4.4. Thus, this algorithm is conceptually in line with entropic herding, although it is not included in the proposed mathematical formulation. Entropic herding, which generates a sequence of distributions, can also be combined with the stochastic update; the results of this combination are shown in Figs. 2(d) and 3(d). We can see that the periodic behavior is diversified. The improvement in the difference between the histograms is significant, especially for point herding (Figs. 3(a) and 3(c)).
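An experiment of this kind can be sketched end to end. The sketch below makes several assumptions not fixed in the text: the four polynomial features are taken to be ϕ_m(x) = x^m, the bimodal target is a hypothetical mixture 0.5·N(−1, 0.5²) + 0.5·N(+1, 0.5²) whose moments are available in closed form, and each step of Eq. (7) is solved by a coarse grid search over Gaussian candidates.

```python
import numpy as np

# Hypothetical bimodal target and its first four moments E[x^m]
m0, s0 = 1.0, 0.5
mu_tgt = np.array([0.0, m0**2 + s0**2, 0.0,
                   m0**4 + 6 * m0**2 * s0**2 + 3 * s0**4])
Lam = np.full(4, 20.0)

def eta(mu, sig):
    """Feature means E[x^m], m = 1..4, of the Gaussian N(mu, sig^2)."""
    return np.array([mu,
                     mu**2 + sig**2,
                     mu**3 + 3 * mu * sig**2,
                     mu**4 + 6 * mu**2 * sig**2 + 3 * sig**4])

# Candidate set Q: a grid of Gaussians; Eq. (7) is solved by grid search
mus, sigs = np.linspace(-3, 3, 121), np.linspace(0.05, 1.5, 30)
grid = np.array([[eta(m, s) for s in sigs] for m in mus])   # (121, 30, 4)
Hq = 0.5 * np.log(2 * np.pi * np.e * sigs**2)               # entropies over Q

eta_p = eta(0.0, 1.0)               # initial component r^(0) = N(0, 1)
a = Lam * (mu_tgt - eta_p)
for T in range(1, 3001):
    F = -(grid @ a) - Hq            # F^(T−1)(q) over the grid
    i, j = np.unravel_index(np.argmin(F), F.shape)
    eps = 1.0 / (T + 1)
    eta_p = (1 - eps) * eta_p + eps * grid[i, j]              # Eq. (17)
    a = (1 - eps) * a + eps * Lam * (mu_tgt - grid[i, j])     # Eq. (8)

err = np.abs(eta_p - mu_tgt)
```

The running feature means of the output mixture should approach the targets up to an O(1/Λ) bias plus the grid resolution, mirroring the accuracy–diversity tradeoff discussed in Section 5.3.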

Boltzmann machine
Next, we present a numerical example of entropic herding for a higher-dimensional distribution. We consider a Boltzmann machine with N = 10 variables. The state vector is x = (x_1, ..., x_N), where x_i ∈ {−1, +1} for all i ∈ {1, ..., N}. The target distribution is defined as follows:

p_BM(x) = (1/Z) exp( Σ_{i<j} W_ij x_i x_j ),     (47)

where W_ij are randomly drawn coupling weights and Z is the normalizing factor. For simplicity, we did not include bias terms. For this distribution, we use the set of feature functions {φ_ij | i < j} defined as

φ_ij(x) = x_i x_j.

The dimension of the weight vector is N(N−1)/2. With this feature set, we ran entropic herding and obtained an output sequence of 320 distributions. The input target values µ were obtained from the feature means E_{p_BM}[φ_ij] calculated with the definition in Eq. (47). We used the candidate distribution set Q defined as the set of product distributions

q(x) = Π_{i=1}^{N} p_i^{(1+x_i)/2} (1 − p_i)^{(1−x_i)/2},

where p_i ∈ [0, 1] are the parameters and x_1, ..., x_N are independent. We also generated 320 independently and identically distributed random samples from the target p_BM for comparison. Figure 4 is a scatter plot comparing the probability mass of each state between the output and the target distribution. Figure 4(a) shows the result for the empirical distribution of the random samples. The Boltzmann machine has 1024 states for N = 10, but we generated only 320 samples in this experiment. Therefore, the mass values of the empirical distribution are discrete, and many states have a mass of zero. We can observe a large difference in mass values between the empirical distribution and the target distribution.
In contrast, Fig. 4(b) shows the results of entropic herding. As in the above example with the bimodal distribution, each output of entropic herding can represent a more diverse distribution than a single sample. For most of the states that have large probability weights, the difference in mass values is within a factor of 1.5, which is much smaller than in the case of the random samples.
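The loop with product-Bernoulli candidates can be sketched in simplified form. The sketch swaps the pairwise features φ_ij of the experiment for first-order features ϕ_i(x) = x_i so that the per-step problem of Eq. (7) has the closed-form solution m_i = E[x_i] = tanh(a_i); the target means are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
N, T_max, Lam = 10, 2000, 50.0
mu_tgt = 0.8 * rng.uniform(-1, 1, size=N)   # hypothetical target means E[x_i]

def binary_entropy(m):
    """Entropy of x ∈ {−1, +1} with mean m, i.e., Pr(x = +1) = (1 + m)/2."""
    p = np.clip((1 + m) / 2, 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

a = np.zeros(N)
eta_p, H_bar = None, None
for T in range(1, T_max + 1):
    m = np.tanh(a)                 # exact minimizer of Eq. (7) over product Q
    H = binary_entropy(m).sum()
    eps = 1.0 / (T + 1)
    eta_p = m if eta_p is None else (1 - eps) * eta_p + eps * m
    H_bar = H if H_bar is None else (1 - eps) * H_bar + eps * H
    a = (1 - eps) * a + eps * Lam * (mu_tgt - m)   # Eq. (8)

err = np.abs(eta_p - mu_tgt)
```

The weighted feature means of the output mixture settle within roughly |a|/Λ of the targets, illustrating the relaxed moment matching of Eq. (9); the pairwise case of the experiment replaces the closed form with an inner mean-field optimization.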

Model selection
Using the Boltzmann machine above, we present a numerical example of the dependence of the output on the parameter choice and of model selection for entropic herding. In the experiment, Λ_m was proportional to a scalar parameter λ for all m.
Figure 5 shows the moment error and the entropy of the output distribution for different values of λ and numbers of samples. The error is measured by the sum of squared errors between the feature means of the output and those of the target distribution. The results for an identical number of samples are represented by points connected by black broken lines. By comparing these results, we observe that the error in the feature means mostly decreases with increasing λ. In contrast, we can obtain a more diverse distribution with a larger entropy value by decreasing λ. Therefore, there is a tradeoff between accuracy and diversity in the choice of parameter values.
We can compare the outputs for various parameters via the KL-divergence KL(p_BM ∥ p_output) between the target and output distributions, where the value p_output(x) can be easily evaluated for each x. Note that if we have a validation dataset instead of the target distribution, we can instead compare the negative log-likelihood on the validation set. Figure 6 shows the KL-divergence for various values of λ and numbers of samples. We observe that the optimal λ depends on the number of samples.
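As an illustration of how such a comparison can be carried out, the following is a minimal sketch (not the paper's implementation; the function names and the Bernoulli-type component parameterization are our own) that evaluates the mixture mass p_output(x) and KL(p_BM ∥ p_output) by enumerating the 2^N states of a small model:

```python
import numpy as np
from itertools import product

def component_mass(x, p):
    """Mass of one candidate distribution: independent x_i with P(x_i = +1) = p_i."""
    return np.prod(np.where(x == 1, p, 1.0 - p))

def mixture_mass(x, P):
    """p_output(x) = (1/T_max) * sum_T r^(T)(x), with P the list of component parameters."""
    return np.mean([component_mass(x, p) for p in P])

def kl_divergence(p_target, P, N):
    """KL(p_target || p_output) over the 2^N states of the {-1,+1}^N space."""
    kl = 0.0
    for x in product([-1, 1], repeat=N):
        x = np.array(x)
        pt = p_target(x)
        kl += pt * np.log(pt / mixture_mass(x, P))
    return kl
```

This exhaustive enumeration is only feasible for small N (for N = 10 there are 1024 states, as in the experiment); for larger models the KL-divergence would have to be estimated by sampling.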

UCI wine quality dataset
Finally, we present an example application of entropic herding to real data. We used the wine quality dataset (Cortez et al, 2009) from the UCI data repository. It is composed of 11 physicochemical features of 4898 wines. The wines are classified into red and white, which have different distributions of feature values. We applied some preprocessing to the data, including log-transformation and z-score standardization. A summary of the features and preprocessing methods is provided in Table 1. A simple model for this distribution is a multivariate normal distribution, defined as follows:

p(x) = (1/Z) exp( −(1/2) (x − µ)ᵀ W (x − µ) ),

where x = (x_1, ..., x_11) is the vector of feature values, and Z is the normalizing factor. The parameter W in this model can be easily estimated from the covariance matrix of the features. This model is unimodal and has a symmetry such that it is invariant under the transformation x ← −(x − µ) + µ.
However, as shown in Figs. 7(a) and 7(b), the distribution of this dataset is asymmetric, and bimodal distributions can be observed in the pair plots. To model such a distribution, we improve the model by introducing higher-order terms as follows:

p(x) = (1/Z) exp( Σ_m θ_m φ_m(x) ),   (52)

with the feature set given below. The direct parameter estimation for this model is difficult. However, entropic herding can be applied to draw inferences from the feature statistics of the data. We use the feature set {φ^(1)_i, φ^(2)_ij, φ^(3)_i, φ^(4)_i}, where each feature is defined as

φ^(1)_i(x) = x_i, φ^(2)_ij(x) = x_i x_j (i < j), φ^(3)_i(x) = x_i³, φ^(4)_i(x) = x_i⁴.

We added φ^(1)_i to control the mean of each variable. Using the maximum entropy principle with the moment values taken from an assumed background distribution Eq. (52), we reproduce the distribution in which the coefficients corresponding to φ^(1)_i are zero. We used the candidate distribution set Q, defined as the set of distributions in which x_1, ..., x_11 are independent with x_i ∼ N(µ_i, σ_i²), where µ_i ∈ R and σ_i > 0.01 for all i ∈ {1, ..., 11} are the parameters.

Fig. 4 Probability mass values of (a) the empirical distribution of random samples and (b) the output distribution of entropic herding for the Boltzmann machine, compared to the target distribution. Each state of the target Boltzmann machine is represented by a point. The states with zero probability mass are represented by a small value of 10⁻⁵ and by red plus signs. The vertical axis represents the probability mass of the obtained distribution; the horizontal axis represents that of the target distribution. A logarithmic scale is used for both axes. A point on the dashed diagonal line indicates that the two probability masses are identical; a point between the two dotted lines indicates that the difference in mass value is within a factor of 1.5. The result with λ = 13 was chosen by model validation.
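As a concrete illustration, the moment targets for this feature set could be estimated from a preprocessed data matrix as in the following sketch (a hypothetical helper of our own, assuming the feature definitions φ^(1)_i = x_i, φ^(2)_ij = x_i x_j for i < j, φ^(3)_i = x_i³, and φ^(4)_i = x_i⁴ stated in the appendix):

```python
import numpy as np

def feature_targets(X):
    """Empirical moment targets mu_m for the higher-order feature set.

    X: (n_samples, d) preprocessed data matrix.
    Returns the empirical means of x_i, x_i*x_j (i < j), x_i^3, and x_i^4.
    """
    d = X.shape[1]
    mu1 = X.mean(axis=0)                        # E[x_i]
    mu2 = {(i, j): np.mean(X[:, i] * X[:, j])   # E[x_i x_j], i < j
           for i in range(d) for j in range(i + 1, d)}
    mu3 = np.mean(X**3, axis=0)                 # E[x_i^3]
    mu4 = np.mean(X**4, axis=0)                 # E[x_i^4]
    return mu1, mu2, mu3, mu4
```

After z-score standardization, mu1 is approximately zero and the pairwise targets mu2 approximate the correlation matrix, which matches the remark in the appendix that the averages of the standardized feature values used as targets become zero.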
Figure 7 shows the pair plot of the distribution of the dataset and the distribution obtained from entropic herding. We picked three variables in the plot for ease of comparison; the plot for all variables is presented in Appendix B. We observe that the obtained distribution represents the characteristics of the dataset distribution well. In particular, the asymmetry and the two modes are reproduced. We can use the herding output as a probabilistic model. Figure 9 shows the negative log-likelihood of the validation data. We observe that the model corresponding to the true class of wine assigns larger likelihood values than the other model. We used these results for the classification of red and white wine, taking the difference in log-likelihoods as a score. The AUC for the validation set was 0.998. This score was close to the AUC values obtained using the log-likelihood of a fitted multivariate normal distribution (0.995) and linear logistic regression (0.998).
The simple analytic form of the output distribution can also be used for the probabilistic estimation of missing values. We generated a dataset with missing values by dropping x_4 from the validation set for white wine. The output distribution is p_output(x) = (1/T_max) Σ_{T=1}^{T_max} r^(T)(x), where r^(T) ∈ Q is given by the parameters (µ_i^(T), σ_i^(T)). Let r_i^(T) denote the marginal distribution of x_i for r^(T), which is a normal distribution. The conditional distribution of x_4 given the other variables is then expressed as follows:

p_output(x_4 | x_{−4}) = Σ_T w^(T) r_4^(T)(x_4), with w^(T) ∝ Π_{i≠4} r_i^(T)(x_i),

where x_{−4} and w^(T) denote the observed variables and the mixture weights, respectively. Figure 10(a) shows a violin plot of the conditional distribution for 50 randomly sampled data points. Figure 10(b) shows the results for the multivariate normal distribution. The standard deviations of the estimates shown in Fig. 10(b) are identical because they all come from the same multivariate normal distribution. Comparing these plots, we see that entropic herding benefits from being a more flexible model than the multivariate normal distribution. The dotted horizontal line shows the true value, and the short horizontal lines show the [10, 90] quantile range of the estimated distribution. We counted the number of data points whose true values lie in this range: the proportion was 79.7% for entropic herding and 51.1% for the multivariate normal distribution. We conclude that the estimation by entropic herding was better calibrated.
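The conditional computation above can be sketched as follows for a mixture of axis-aligned normal components: because each component factorizes, the conditional of one coordinate is again a normal mixture whose weights are the component likelihoods of the observed coordinates. The function below is an illustrative sketch of our own, not the authors' code:

```python
import numpy as np

def conditional_mixture(mus, sigmas, x_obs, j):
    """Parameters of p(x_j | x_obs) under a mixture of diagonal Gaussians.

    mus, sigmas: (T, d) arrays of component means and standard deviations.
    x_obs: (d,) observed vector; the entry x_obs[j] is ignored.
    Returns (weights, component means of x_j, component std devs of x_j).
    """
    mask = np.ones(mus.shape[1], dtype=bool)
    mask[j] = False
    # Log-likelihood of the observed coordinates under each component
    z = (x_obs[mask] - mus[:, mask]) / sigmas[:, mask]
    loglik = -0.5 * np.sum(z**2, axis=1) - np.sum(np.log(sigmas[:, mask]), axis=1)
    w = np.exp(loglik - loglik.max())   # subtract max for numerical stability
    w /= w.sum()
    return w, mus[:, j], sigmas[:, j]
```

The returned triple describes a one-dimensional normal mixture from which quantiles (such as the [10, 90] interval used in Fig. 10) can be computed or samples drawn.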

Discussion
The most significant difference between the proposed entropic herding and the original point herding is that entropic herding represents the output distribution as a mixture of probability distributions. Regarding applications, some of the desirable properties concerning density calculation and sampling discussed in Section 4.5 result from using a distribution mixture for the output. Many other probabilistic modeling methods also use distribution mixtures, such as the Gaussian mixture model and kernel density estimation (Parzen, 1962). All of these methods, including entropic herding, share the above characteristics.
The most important difference between entropic herding and these methods is that it does not require individual data points and only uses the aggregated moment information of the features. The use of aggregated information has recently attracted increasing attention (Law et al, 2018; Sheldon and Dietterich, 2011; Tanaka et al, 2019a,b; Zhang et al, 2020). Sometimes, we can only use aggregated information for privacy reasons. For example, statistics such as population density or traffic volume are often aggregated into mean values by spatial region, often with various granularities (Law et al, 2018; Tanaka et al, 2019a,b). In addition to data availability, features can be selected to avoid irrelevant information, depending on the focus of the study and the data quality. These advantages are common to entropic and point herding but are nonetheless distinctive when compared with other probabilistic modeling methods. Notably, kernel herding (Chen et al, 2010) is a prominent variant of herding with a convergence guarantee, but it does not share the aforementioned advantages because it requires individual data points to use features defined in the reproducing kernel Hilbert space.
The framework of entropic herding does not depend on the choice of the feature functions ϕ_m and the candidate distribution set Q. In practice, the requirement for Q is the availability of the calculation and optimization of the expectation E_q[ϕ_m(x)] and the entropy H(q) for q ∈ Q.

Table 1 Summary of the UCI wine quality dataset and the preprocessing applied. The columns "variable" and "name" show the correspondence between the indices and names of the features. The column "range" shows the minimum and maximum values in the dataset prior to preprocessing. The rightmost two columns represent the applied preprocessing; a checkmark indicates that the preprocessing in that column is applied. "Log-transformation" means the transformation x ← log₁₀ x. A checkmark with an asterisk indicates that there exist data such that log₁₀ x → −∞; we assigned −5.0 instead in these cases. A linear transformation is applied to each variable to make the average zero and the variance one, as represented in the column "z-score".
Fig. 8 Distribution of x_4 (horizontal axis) and x_8 (vertical axis) obtained by entropic herding for white wine. The width of each cell in the plot is 0.1. The red circles represent 20 distributions r^(T) randomly drawn from the output sequence; the sizes of the circles along the vertical and horizontal axes represent the standard deviations of x_4 and x_8, respectively.

In the numerical examples, we used distributions of independent random variables for Q for simplicity. In recent years, many methods have been developed for the generative expression of a probability distribution, for example, using neural networks (Goodfellow et al, 2014; Kingma and Welling, 2014) and decision trees (Criminisi, 2012). We expect that this study will serve as a theoretical framework for more advanced uses of these sophisticated generative models as mixture components.
Regarding computational efficiency, the most computationally intensive part of entropic herding is currently the optimization step. We must repeatedly solve the optimization problems generated from the weight dynamics. If the amount of the weight update in each step is small, we can assume that the problems in consecutive steps are similar. As described in the Appendix, we used gradient descent starting from the latest solution for the optimization in the experiments and demonstrated its feasibility. However, exploiting the characteristics of this repetitive optimization could yield further improvement.
The gradient descent in entropic herding is easily realizable if the analytic forms of the entropy and of the expectations of the features over the candidate distribution q ∈ Q are available. However, if the analytic form is not available, we must resort to more sophisticated optimization methods. The problem solved in the optimization step is equivalent to obtaining a distribution approximation by minimizing the KL-divergence (see Eq. (33)). This problem often appears in the field of machine learning, for example in variational Bayes inference. To extend the applications of herding, it can be combined with recent optimization techniques developed in this field.

Fig. 9 Negative log-likelihood for the validation data. The red crosses and green plus signs represent the red and white wines in the validation data, respectively. The horizontal and vertical axes represent the negative log-likelihood −log p_output(x) for the models obtained by entropic herding for red and white wines, respectively. The dashed line represents where the two negative log-likelihoods are identical.

Conclusion
In this paper, we proposed an algorithm called entropic herding as an extension of herding.
By using the proposed algorithm as a framework, we discussed the connection between herding and the maximum entropy principle. Specifically, entropic herding is based on the minimization of the target function L. This function, composed of the feature moment error and an entropy term, represents the maximum entropy principle. Herding minimizes this function in two ways. The first is the minimization of an upper bound of the target function by solving the optimization problem in each step. The second is the diversification of the samples through the complex dynamics of the high-dimensional weight vector.
We also studied the output of entropic herding through a mathematical analysis of the optimal distribution of this minimization problem.
We also clarified the differences between entropic and point herding for applications, both theoretically and numerically. We demonstrated that point herding can be extended by explicitly considering entropy maximization in the optimization step, using distributions rather than points as candidates. The output sequence of entropic herding is more efficient in the number of samples than that of point herding because each output is a distribution that can assign probability mass to many states. The output sequence of distributions can be used as a mixture distribution that allows independent sample generation. The mixture distribution also has an analytic form; therefore, model validation using the likelihood and inference through conditional distributions are also possible.
As discussed in Section 6, entropic herding allows flexibility in the choice of the feature set and the candidate distribution. We expect entropic herding to be used as a framework for developing effective algorithms based on distribution mixtures.
The feature value for a distribution p is defined as η_m(p) ≡ E_p[φ_m(x)]. After standardization, the averages of the feature values used as target values become zero, and we use the same Λ_m for all m; namely, we set Λ_m = λ instead of using Eq. (22). Note that, in terms of the unstandardized features, this is equivalent to Λ_m = λ/σ_m, which can also be obtained by substituting Λ_m = λ/σ_m into Eq. (22). Here, the parameter tuning of Λ_m is simplified to the optimization of the single global parameter λ. Note that the preprocessing simplifies parameter tuning and is not generally necessary; when we do not have information on the standard deviations, we can still use entropic herding by tuning each Λ_m individually.
The optimization problem in Eq. (7) was solved using the gradient method. A different parameterized candidate distribution set Q is used for each case, and the parameters are optimized by repeating small updates following the gradient of the target function. The optimization for each T is performed by iterating k_update optimization steps. The obtained state r^(T) is used as the initial state for the next optimization at T + 1. The amount of the update in each optimization step was modified using the Adam method (Kingma and Ba, 2015) to maintain the numerical stability of the procedure. The hyperparameters of Adam were set to (β_1, β_2, ε) = (0.8, 0.99, 10⁻⁸) (see (Kingma and Ba, 2015)). The gradient after the modification was multiplied by the learning rate, denoted by η_learn. At the beginning of the inner optimization loop for each T, the rolling means maintained by Adam were reset to their initial value of zero.
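To make the inner loop concrete, the following is a minimal sketch for a one-dimensional normal candidate family with features (x, x²). It uses plain gradient descent on (µ, l = log σ), whereas the experiments additionally apply the Adam update rule; the function name and hyperparameter values are our own:

```python
import numpy as np

def inner_optimize(a1, a2, mu, l, lr=0.05, k_update=2000):
    """Minimize F(q) = a1*E_q[x] + a2*E_q[x^2] - H(q) over q = N(mu, sigma^2).

    With E_q[x] = mu, E_q[x^2] = mu^2 + sigma^2, and H(q) = const + l,
    the gradients below follow directly; (mu, l) is warm-started from the
    previous step's solution, as in the repeated optimization of the text.
    """
    for _ in range(k_update):
        sigma = np.exp(l)
        g_mu = a1 + 2.0 * a2 * mu          # dF/dmu
        g_l = 2.0 * a2 * sigma**2 - 1.0    # dF/dl (chain rule through sigma = e^l)
        mu -= lr * g_mu
        l -= lr * g_l
    return mu, np.exp(l)
```

For a2 > 0 the stationary point is µ = −a1/(2 a2) and σ² = 1/(2 a2), which is the normal distribution of maximum entropy under the penalized first and second moments.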
In some cases, we used modified weight values depending on the optimization state. Instead of a_m^(T−1), we used weight values a′_m obtained by replacing r^(T−1) with q_cur in the weight update, where q_cur denotes the current state in the inner loop of the optimization. Note that q_cur differs from r^(T−1) except at the beginning of the inner loop. This modification was also used for numerical stability. The justification is as follows: expanding Eq. (23), collecting the terms dependent on r^(T), and neglecting the small higher-order term O((ε^(T))²), the minimization reduces to solving

r^(T) = arg min_{q∈Q} { Σ_{m∈M} a′_m η_m(q) − H(q) },

which is equivalent to Eq. (7) with a_m substituted by a′_m.
We introduced a stochastic jump in the optimization step in some cases. In this case, a jump is proposed with probability p_jump in each optimization step. A candidate distribution q ∈ Q is drawn randomly and accepted as the next state if it has a better target function value than the current state. The method of candidate generation is described for each problem in the following sections.
The amount of the update in each step, denoted by ε^(T), is set to the same value; namely, we set ε^(T) = ε_herding for each T. This means that the distribution weights ρ_{t,T} in Section 3.2 decay geometrically at the rate ρ = 1 − ε_herding.
To eliminate nonstationary behavior depending on the initial condition, we set a burn-in period in the algorithm, similar to conventional MCMC algorithms. After the herding run with T_max = T_burnin + T_output, the output sequence, except for the burn-in period, is aggregated into an output mixture distribution:

p_output(x) = (1/T_output) Σ_{T=T_burnin+1}^{T_max} r^(T)(x).

We implemented the method above using the automatic differentiation provided by the Theano (Bergstra et al, 2010) framework. The default settings for the above parameters are summarized in Table 2. The values in the table were used unless explicitly stated otherwise.

A.1 One-dimensional bimodal distribution
Using the Metropolis-Hastings method, 10000 samples were generated and used as the input. We used the candidate distribution set Q, defined as the set of normal distributions q = N(µ, σ²) with µ ∈ R and σ ≥ 0.01. The four feature means over the candidate distribution, denoted by η_i(q) = E_q[φ_i(x)] = E_q[xⁱ], are obtained as follows:

η_1(q) = µ, η_2(q) = µ² + σ², η_3(q) = µ³ + 3µσ², η_4(q) = µ⁴ + 6µ²σ² + 3σ⁴.

As these all have analytic expressions, the gradients with respect to the parameters can be easily obtained.
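These analytic moments can be checked numerically with a small sketch (a helper of our own; the third- and fourth-moment formulas are the standard Gaussian raw moments):

```python
import numpy as np

def gaussian_raw_moments(mu, sigma):
    """Raw moments E_q[x^k], k = 1..4, of q = N(mu, sigma^2)."""
    m1 = mu
    m2 = mu**2 + sigma**2
    m3 = mu**3 + 3.0 * mu * sigma**2
    m4 = mu**4 + 6.0 * mu**2 * sigma**2 + 3.0 * sigma**4
    return np.array([m1, m2, m3, m4])
```

For the standard normal (µ = 0, σ = 1) this returns (0, 1, 0, 3), matching the well-known values; gradients with respect to (µ, σ) follow by differentiating each polynomial.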
We applied variable transformation l = log σ and optimized the parameter set (µ, l) ∈ R × [log 0.01, +∞) in the algorithm.We reported the results for λ = 100 in this study.
For entropic herding with stochastic updates, p_jump = 0.1 is used. When a random jump is proposed, the candidate distribution is determined by keeping the current value of l and drawing µ randomly from [µ_min, µ_max], where µ_min and µ_max are the minimum and maximum values of µ that have appeared so far in the procedure.
Point herding with Metropolis updates was implemented for comparison. In this case, a random jump is proposed in each step (p_jump = 1). The candidate is accepted according to the Metropolis rule; that is, it is accepted with probability min(1, exp(−∆F)), where ∆F is the increase in the target function F_{a^(T−1)}(x). In this case, the modified weight values a′_m are not used.
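The acceptance rule can be sketched as follows (a hypothetical helper of our own; the paper's implementation details may differ):

```python
import numpy as np

def metropolis_accept(f_cur, f_cand, rng):
    """Metropolis rule: accept with probability min(1, exp(-dF)),
    where dF = f_cand - f_cur is the increase in the target function."""
    dF = f_cand - f_cur
    return dF <= 0 or rng.random() < np.exp(-dF)
```

A candidate that decreases the target function is always accepted, while an increase is accepted only with exponentially small probability.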

A.2 Boltzmann machine
The parameters W_ij (i < j) for p_BM were drawn from a normal distribution with mean zero and variance (0.2)²/N. We then added some structural interactions to increase nontrivial correlations: we assigned W_45 = 0 and W_{i(i+1)} = −0.3 for i ≠ 4. We obtained the values of µ_m and σ_m used for feature standardization by calculating the expectations following the definition of the target model p_BM.
We used the candidate distribution set Q, defined as the set of distributions in which x_1, ..., x_N are independent with P(x_i = +1) = p_i, where p_i ∈ [0, 1] are the parameters. The feature means over the candidate distribution have an analytic expression from which the gradient can be obtained easily:

E_q[x_i x_j] = (2p_i − 1)(2p_j − 1).

We applied the variable transformation s_i = log(p_i / (1 − p_i)) and optimized the parameter set (s_1, ..., s_N) ∈ R^N in the algorithm. This variable transformation relates the gradients as ∂f/∂s_i = (dp_i/ds_i) ∂f/∂p_i. However, the coefficient dp_i/ds_i = p_i(1 − p_i) can be very small if |s_i| is large. We dropped this coefficient, which does not affect the sign of the update, and used ∂f/∂p_i as the gradient.
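A sketch of these feature means for the independent ±1-valued candidate distribution (illustrative code of our own, with the convention E_q[x_i²] = 1 on the diagonal):

```python
import numpy as np

def feature_means(p):
    """E_q[x_i x_j] under independent x_i in {-1,+1} with P(x_i=+1) = p_i.

    Since E_q[x_i] = 2*p_i - 1 and the variables are independent,
    E_q[x_i x_j] = (2*p_i - 1)*(2*p_j - 1) for i != j.
    """
    m = 2.0 * p - 1.0           # E_q[x_i]
    eta = np.outer(m, m)        # E_q[x_i x_j] for i != j
    np.fill_diagonal(eta, 1.0)  # x_i^2 = 1 always
    return eta
```

Only the entries with i < j correspond to the features φ_ij used by the algorithm; the full symmetric matrix is returned for convenience.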
The candidate distribution for the random jump is obtained by setting the sign of each s_i randomly while keeping its absolute value |s_i|.

A.3 UCI wine quality data
The input data were split into training and validation sets. Twenty percent (20%) of the data for red and white wine each were randomly selected as the validation set. The remaining training set was used to obtain the feature statistics needed by the algorithm. After model validation, we reported the results for λ = 200 in this study.
The parameterization of the candidate distributions Q is similar to the case of the bimodal distribution above. We used the candidate distribution set Q, defined as the set of distributions in which x_1, ..., x_11 are independent with x_i ∼ N(µ_i, σ_i²), where µ_i ∈ R and σ_i > 0.01 for all i ∈ {1, ..., 11} are the parameters.
The means of the features φ^(1)_i, φ^(3)_i, and φ^(4)_i over the candidate distribution can be obtained in the same way as in the case of the bimodal distribution above. In addition, because the variables are independent, the mean of φ^(2)_ij is calculated as E_q[x_i x_j] = µ_i µ_j. These also have analytic expressions, allowing the gradients to be obtained easily.
The variable transformation l_i = log σ_i was applied in the optimization algorithm. The candidate distribution for the random jump is generated for each (µ_i, l_i) as in the case of the bimodal distribution above.
The plots of the distributions in Figs. 7, 8, and 10 are made from a sufficient number of random samples drawn from the output distribution p_output.
Let D ⊂ R^M be the set of parameter vectors θ such that the integrals in Eqs. (78) and (80) converge.
The conditions (83) are likely to be relaxed by decreasing α_m and increasing β_m. However, this is not always the case because A_m, the lower bound of the feature value on the face, may decrease when D is expanded. The conditions (83) will be satisfied if we can sufficiently expand D without changing the bounds A_m and B_m that appear in the conditions. Based on this strategy, the following condition ensures the existence of a solution for any µ ∈ R^M.
The following are simple cases of Theorem 3, in which both X and ϕ_m are bounded.

Corollary 4 If either
• X is discrete and finite, or
• X is a bounded closed subset of R^M and, for all m ∈ M, ϕ_m(x) is a bounded function on X,
then there exists a solution of Eq. (84) for any µ ∈ R^M.
Note that if X is unbounded, the distribution Eq. (80) is not well-defined for all θ ∈ R^M; that is, D ≠ R^M. Therefore, the existence of a solution of Eq. (10) is not guaranteed for all µ ∈ R^M, because a subset D′ ⊂ D satisfying the conditions of Theorem 1 for all µ ∈ R^M may not exist. However, we still have a chance to ensure the existence of a solution by explicitly obtaining D′, which may depend on µ, as illustrated in Example 1.

Fig. 2 Output sequences of (a) point herding and (b) entropic herding for the one-dimensional bimodal distribution p_bi. The horizontal axis represents the index of the step, denoted by t in the text. The vertical axis represents the space X = R and is divided into bins of width 0.1. The color represents the probability mass in each bin, with yellow corresponding to a larger mass. The probability mass values were normalized to [0, 1] in each plot before being colorized. Panels (c) and (d) show the output sequences of point herding and entropic herding with stochastic optimization steps, respectively.

Fig. 3 Histograms of the output distributions of (a) point herding and (b) entropic herding compared to the one-dimensional bimodal target distribution p_bi. The horizontal axis represents the space X = R and is divided into bins of width 0.1. The vertical axis represents the probability mass of each bin. Panels (c) and (d) show the output distributions of point herding and entropic herding with stochastic optimization steps, respectively.

Fig. 5 Moment error and entropy value of the output distribution p_output for various values of the weight parameter λ and output length T_output ∈ {20, 40, 80, 160, 320}. The mean values of 10 trials are shown. The horizontal axis represents the sum of squared errors of the features, defined as SSE = Σ_m (η_m(p_output) − µ_m)², on a logarithmic scale. The vertical axis represents the entropy value H(p_output). The horizontal dashed line represents the entropy of the target p_BM. The results for each λ are represented as points of the same color connected by lines. The points corresponding to identical T_output are connected by black lines.

Fig. 6 KL-divergence KL(p_BM ∥ p_output) for different values of the weight parameter λ and output length T_output ∈ {20, 40, 80, 160, 320}. The mean values of 10 trials are shown. The horizontal and vertical axes represent T_output and the KL-divergence, respectively, on logarithmic scales. The results for each λ are represented as points connected by lines.

Fig. 7 Pair plot of the distribution of the dataset and the distribution obtained from entropic herding. For ease of comparison, we selected three variables (x_1, x_4, x_8), shown from top to bottom and from left to right in each panel. The plots for all variables are presented in the Appendix. The plots on the diagonal represent the histograms of the variables; the vertical axis has its own scale of probability mass and does not correspond to the scale shown in the plot. The other plots represent the probability density with cells of width 0.5; the horizontal and vertical axes represent the variables corresponding to the rows and columns, respectively. The left ((a) and (c)) and right ((b) and (d)) panels correspond to red and white wine, respectively. The top ((a) and (b)) and bottom ((c) and (d)) panels show the dataset and entropic herding, respectively.

Fig. 10 Violin plots of the conditional distribution of x_4 given the other variables for 50 randomly sampled validation data points. The horizontal axis represents the data index. The vertical axis corresponds to the value of x_4. The dotted horizontal lines represent the true values in the data. The short horizontal lines show the [10, 90] quantile range of the estimated distribution. (a) The output distribution of entropic herding. (b) The multivariate normal distribution.

Fig. 11 For the red wine, the plot of the distributions of the dataset and the distribution obtained from entropic herding. (a) The dataset. (b) The distribution obtained by entropic herding. The meaning of the plot is the same as that of Fig. 7, but all variables in the model are displayed.

Fig. 12

Table 2 Default settings for each experiment in this paper. A description of these parameters is given in the text. A checkmark in the column a′_m indicates that the modified weight value (Eq. (62)) is used. The columns are ε_herding, T_output, T_burnin, η_learn, k_update, a′_m, and p_jump.